P2P

winter23

Peer to Peer: ILTA's Quarterly Magazine

Issue link: https://epubs.iltanet.org/i/1515316


challenges compared to languages with a more defined syntax structure. To address these challenges, FRONTEO has invested heavily in developing advanced AI technologies, including natural language processing (NLP) algorithms and machine learning models, that are specifically tailored to handle CJK data while remaining fully competitive with other AI models on the market in handling English.

Improving Japanese Parsing and Processing

Japanese language processing requires two core technologies. One is the decomposition of sentences into individual words and other morphemes (known as tokenization and morphological analysis); the other is the analysis of the resulting morphemes to derive meaning. Applying these technologies to Japanese is complicated by the fact that Japanese is a non-segmented language: it does not use word breaks to delineate individual concepts or idioms within a sentence. Without those word breaks, which make tokenizing Western languages comparatively straightforward, tokenization algorithms must be sophisticated, adapted to the specific linguistic content, and context-sensitive.

The accuracy of search and analysis is heavily influenced by the quality of tokenization. Poor tokenization can result in inaccurate search results, misinterpretation of text, and incorrect data processing. Consider the example "会社員です". Its interpretation depends on whether it is tokenized as [会社, 員, です], where 会社 means "company" and 員 means "member" or "employee," or as [会, 社員, です], where 会 means "meeting" or "gathering" and 社員 means "employee" or "staff member." The correct answer depends on the context and the intended meaning of the sentence. Each of these interpretations may be treated differently by AI when analyzing this concept and may affect search term results.
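To make the ambiguity concrete, the following is a minimal sketch of dictionary-based segmentation that lists every way the example phrase can be split into known words. The dictionary `TOY_DICTIONARY` and the function `segmentations` are hypothetical names introduced for illustration only; this is not FRONTEO's tokenizer, and a production morphological analyzer would also score the candidates using context rather than simply listing them.

```python
# Toy dictionary for illustration only (an assumption, not a real lexicon).
TOY_DICTIONARY = {"会", "会社", "社員", "員", "です"}


def segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        # One valid segmentation of the empty string: no tokens at all.
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in dictionary:
            # Segment the remainder recursively and prepend this word.
            for rest in segmentations(text[end:], dictionary):
                results.append([prefix] + rest)
    return results


candidates = segmentations("会社員です", TOY_DICTIONARY)
for candidate in candidates:
    print(" | ".join(candidate))
```

With this toy dictionary the phrase has two valid splits, [会, 社員, です] and [会社, 員, です], which is why a tokenizer needs context to choose between them.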
Furthermore, after the decomposition is complete, it is often difficult to evaluate the importance of single morphemes (such as "ha" and "ni") in the segmented text and to properly weigh their impact on the overall relevance of a particular document. In developing the "KIBIT" engine, FRONTEO's engineering team has leveraged over a decade of natural language processing experience to fine-tune its AI process to better handle tokenization output, and has expanded its customized CJK word dictionary to address this complexity directly and increase overall analysis accuracy.

In 2022, FRONTEO released an update to its KIBIT algorithms that automatically identifies and discards low-value single-character words during the machine learning process, improving metrics such as recall and precision on test data by approximately 7%. The same update brought several other improvements to the proprietary AI algorithms, not only in precision but also in computational speed, which has increased by a factor of 10, enabling AI analysis of multi-million-document sets to complete in a few hours at most. Figure 1 above shows the total improvement in the

Figure 1: Improvements in AI Precision Over the Last Year
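The idea of discarding low-value single-character words before learning can be sketched as follows. The article does not describe how KIBIT identifies such words automatically, so this example substitutes a fixed, hand-written set of common particles; `LOW_VALUE_PARTICLES` and `drop_low_value_tokens` are hypothetical names, and the token list is made-up sample input.

```python
# Hand-picked particles standing in for "low-value" words (an assumption;
# the article says KIBIT identifies them automatically, by unspecified means).
LOW_VALUE_PARTICLES = {"は", "に", "が", "を", "の"}


def drop_low_value_tokens(tokens, low_value=LOW_VALUE_PARTICLES):
    """Remove single-character tokens that appear in the low-value set."""
    return [t for t in tokens if not (len(t) == 1 and t in low_value)]


# Sample tokenizer output for a short sentence (illustrative only).
tokens = ["私", "は", "会社", "に", "行く"]
filtered = drop_low_value_tokens(tokens)
print(filtered)
```

Note that 私 is kept even though it is a single character: length alone does not make a token low-value, so the filter needs some additional signal of importance.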
