It’s true that Germans love their long words. Text processing pipelines, however, are less fond of them. The lack of Python NLP libraries adapted to German makes it difficult to analyze this kind of word properly.
Let us share with you our NLP tool to split word compounds. It will transform the AI market.
A compounding language is one that allows new words to be formed by joining existing words one after another into a single unit. Examples include German, Dutch, Korean, Norwegian and Swedish.
Compounds pose a challenge for language processing tools, since most of these tools rely on limited lexicons that cannot cover every possible inflection and have no knowledge of how compounds are formed. As a result, splitting these words turns out to be a crucial pre-processing step for high-quality lemmatization in these languages.
Accordingly, a decompounding tool is critical for search applications, where applying decompounding before indexing and searching can significantly increase recall.
There is, therefore, a need to decompose compounds so that lexical coverage increases and out-of-vocabulary terms are reduced. At Bitext, we developed a lexical analyzer that offers extensive support for languages such as German (including the Swiss variant), Dutch (including Belgian), Korean, Norwegian Bokmål, Norwegian Nynorsk and Swedish.
In German, for example, in order to split the word Abwasserbehandlungsanlagen, which means sewage treatment plants, our tool reverses the German rules for compounding and breaks the word into its basic components, Abwasser+Behandlung+s+Anlagen. Additionally, the tool can lemmatize the whole compound, Abwasserbehandlungsanlage, or each of its components, Abwasser + Behandlung + Anlage.
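The splitting step described above can be illustrated with a minimal sketch. This is not the Bitext Lexical Analyzer or its API: the tiny `LEXICON`, the `LINKERS` list and the `split_compound` function are all hypothetical names introduced here, and the greedy longest-match strategy is only one simple way to approach the problem, assuming a lexicon that maps surface forms to lemmas.

```python
# Toy lexicon (assumption): lowercase surface form -> lemma.
LEXICON = {
    "abwasser": "Abwasser",
    "behandlung": "Behandlung",
    "anlage": "Anlage",
    "anlagen": "Anlage",
}

# A few common German linking elements (assumption: not an exhaustive list).
LINKERS = ("s", "es", "n", "en", "er", "e")


def split_compound(word):
    """Split a compound into (surface, lemma) pairs.

    Linking elements are returned with lemma None.
    Returns None when the word cannot be fully covered by the lexicon.
    """
    def helper(rest):
        # Try the longest known head first, then recurse on the tail.
        for end in range(len(rest), 0, -1):
            head = rest[:end]
            if head not in LEXICON:
                continue
            part = (head, LEXICON[head])
            tail = rest[end:]
            if not tail:
                return [part]
            # Case 1: the next component follows directly.
            sub = helper(tail)
            if sub is not None:
                return [part] + sub
            # Case 2: a linking element sits between the two components.
            for linker in LINKERS:
                if tail.startswith(linker) and len(tail) > len(linker):
                    sub = helper(tail[len(linker):])
                    if sub is not None:
                        return [part, (linker, None)] + sub
        return None

    return helper(word.lower())


parts = split_compound("Abwasserbehandlungsanlagen")
surface = "+".join(s for s, _ in parts)
lemmas = [lemma for _, lemma in parts if lemma is not None]
print(surface)  # abwasser+behandlung+s+anlagen
print(lemmas)   # ['Abwasser', 'Behandlung', 'Anlage']
```

A production tool needs far more than this: a full lexicon, morphological rules for each language, and a way to rank competing splits. The sketch only shows why the linking element "s" has to be recognized separately from the components themselves.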
Let’s imagine the case for a search application: if there are several documents containing the word Abwasserbehandlungsanlage, and a user searches for Abwasserbehandlung (sewage treatment), those documents will not be returned.
However, if the compound is broken into the lemmas Abwasser, Behandlung and Anlage during indexing, and the query is decompounded in the same way, then searching for Abwasserbehandlung will return the right results.
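The effect on recall can be sketched with a toy inverted index. Everything here is illustrative: the `DECOMPOUNDED` table stands in for the output of a real decompounder, and the sample documents and the function names `terms`, `build_index` and `search` are assumptions made for this example rather than part of any Bitext product.

```python
from collections import defaultdict

# Hypothetical decompounder output (assumption): compound -> component lemmas.
DECOMPOUNDED = {
    "abwasserbehandlungsanlage": ["abwasser", "behandlung", "anlage"],
    "abwasserbehandlung": ["abwasser", "behandlung"],
}

# Sample documents (assumption), keyed by document id.
DOCS = {
    1: "Die Abwasserbehandlungsanlage wurde modernisiert",
    2: "Neue Anlage im Hafen",
}


def terms(token, decompound):
    """Return the index terms for one token, optionally decompounded."""
    token = token.lower()
    if decompound:
        return DECOMPOUNDED.get(token, [token])
    return [token]


def build_index(docs, decompound):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for term in terms(token, decompound):
                index[term].add(doc_id)
    return index


def search(index, query, decompound):
    """Return the ids of documents that contain every query term."""
    result = None
    for token in query.split():
        for term in terms(token, decompound):
            postings = index.get(term, set())
            result = postings if result is None else result & postings
    return result or set()


plain = search(build_index(DOCS, False), "Abwasserbehandlung", False)
split = search(build_index(DOCS, True), "Abwasserbehandlung", True)
print(plain)  # set()
print(split)  # {1}
```

Without decompounding, the query term never appears in the index as a whole token, so nothing is returned. With the same decompounding applied at indexing and query time, both query components match document 1.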
Moreover, the Bitext Lexical Analyzer is highly configurable, based on the desired application. Its main configurable aspects are:
Improve your solutions by adding NLP tools based on linguistic analysis to get the results you’re expecting. Why not see it for yourself? Try our decompounding tool through our API.
Are you missing any languages above? We love challenges, so just let us know and your wishes will come true.