
Decompounding German, Korean and More: a ‘Gesamt + Kunst + Werk’

It’s a true story that Germans love their long words. However, those long words are far less welcome in text processing. Few NLP libraries in Python are adapted to German, which makes it difficult to analyze this kind of word properly.

Let us share with you our NLP tool for splitting compound words. It will transform the AI market.

A compounding language is one that allows new words to be formed by joining existing words one after another into a single unit. Examples include German, Dutch, Korean, Norwegian and Swedish.

These languages pose a challenge for language processing tools, since most tools rely on limited lexicons that cannot cover every possible inflection and do not know how compounds are formed. As a result, splitting compounds is a crucial pre-processing step for high-quality lemmatization in these languages.

Accordingly, a decompounding tool is critical for search applications, where applying decompounding before indexing and searching can significantly increase recall.

There is, therefore, a need to decompose compounds so that lexical coverage increases and out-of-vocabulary terms are reduced. At Bitext, we developed a lexical analyzer that offers extensive support for languages such as German (including the Swiss variant), Dutch (including Belgian), Korean, Norwegian Bokmål, Norwegian Nynorsk and Swedish.

In German, for example, to split the word Abwasserbehandlungsanlagen, which means sewage treatment plants, our tool reverses the German rules for compounding and breaks the word into its basic components: Abwasser + Behandlung + s + Anlagen. Additionally, the tool can lemmatize the whole compound, Abwasserbehandlungsanlage, or each of its components, Abwasser + Behandlung + Anlage.
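As a rough illustration of the idea, and not the Bitext implementation, the sketch below splits a compound with a greedy, longest-prefix-first dictionary search that allows a linking element between parts. The lexicon, the lemma table, the linker list and the function name `split_compound` are all hypothetical and tiny, chosen just to cover the example above.

```python
from typing import List, Optional

# Hypothetical toy lexicon of lowercase lemmas (an assumption for this sketch;
# a real analyzer relies on a full morphological lexicon).
LEXICON = {"abwasser", "behandlung", "anlage"}
# Inflected form -> lemma, only for the plural used in the example.
LEMMAS = {"anlagen": "anlage"}
# Some German linking elements that may appear between components.
LINKERS = ("", "s", "es", "n", "en")


def _split(rest: str) -> Optional[List[str]]:
    """Recursively split `rest` into known lemmas, or return None."""
    if rest in LEXICON:
        return [rest]
    if rest in LEMMAS:
        return [LEMMAS[rest]]
    # Try the longest known prefix first, then shorter ones.
    for end in range(len(rest) - 1, 0, -1):
        head = rest[:end]
        if head not in LEXICON:
            continue
        for linker in LINKERS:
            if not rest.startswith(linker, end):
                continue
            tail = rest[end + len(linker):]
            if not tail:
                continue
            parts = _split(tail)
            if parts is not None:
                return [head] + parts
    return None


def split_compound(word: str) -> Optional[List[str]]:
    """Return the capitalized lemmas of a compound's parts, or None."""
    parts = _split(word.lower())
    if parts is None:
        return None
    return [part.capitalize() for part in parts]


print(split_compound("Abwasserbehandlungsanlagen"))
```

With this toy lexicon the call prints `['Abwasser', 'Behandlung', 'Anlage']`; the linking s is consumed and the plural Anlagen is reduced to its lemma. A production decompounder also has to rank competing splits, which this sketch does not attempt.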

Let’s imagine the case for a search application: if there are several documents containing the word Abwasserbehandlungsanlage, and a user searches for Abwasserbehandlung (sewage treatment), those documents will not be returned.

However, if during indexing the compound is broken into the lemmas Abwasser, Behandlung and Anlage, and the query is decompounded in the same way, then searching for Abwasserbehandlung will return the right results.
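The effect on recall can be sketched with a tiny inverted index. This is only an illustration under stated assumptions: the decompounder output is hard-coded in a hypothetical `DECOMPOUND` table, and `build_index` and `search` are made-up helper names, not part of any Bitext API.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical decompounder output, hard-coded for this illustration.
DECOMPOUND: Dict[str, List[str]] = {
    "abwasserbehandlungsanlage": ["abwasser", "behandlung", "anlage"],
    "abwasserbehandlung": ["abwasser", "behandlung"],
}


def terms(token: str, decompound: bool) -> List[str]:
    """Map a token to index terms, optionally splitting compounds."""
    token = token.lower()
    if decompound:
        return DECOMPOUND.get(token, [token])
    return [token]


def build_index(docs: Dict[int, str], decompound: bool) -> Dict[str, Set[int]]:
    """Build an inverted index: term -> set of document ids."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for term in terms(token, decompound):
                index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str, decompound: bool) -> Set[int]:
    """Return the ids of documents that contain every query term."""
    result = None
    for token in query.split():
        for term in terms(token, decompound):
            postings = index.get(term, set())
            result = set(postings) if result is None else result & postings
    return result if result is not None else set()


DOCS = {
    1: "Die Abwasserbehandlungsanlage ist neu",
    2: "Das Wetter ist gut",
}

plain_index = build_index(DOCS, decompound=False)
split_index = build_index(DOCS, decompound=True)

print(search(plain_index, "Abwasserbehandlung", decompound=False))
print(search(split_index, "Abwasserbehandlung", decompound=True))
```

Without decompounding, the query finds nothing because the index only holds the full compound. With decompounding applied to both documents and query, document 1 is returned.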

Moreover, the Bitext Lexical Analyzer is highly configurable, based on the desired application. Its main configurable aspects are:

  • Lexicalized compounds: by default, the decompounder will not split compounds such as Wörterbuch (dictionary) or Rindfleisch (beef), which are extremely common and have become lexicalized (they appear in dictionaries). However, in some applications, it may be useful to split these compounds, so our decompounder can optionally do this.
  • Case sensitivity: by default, the decompounder enforces case sensitivity (since German nouns are capitalized), but this may be disabled in cases where the input text is from an informal source.
  • Alternative spellings: in German, the digraph ss and the letter ß are interchangeable in some contexts but not in others. By default, the Lexical Analyzer will return the lemma using the same spelling present in the form (except for cases where this is not valid, such as ißt → essen). This can be configured as needed, such as for dialects like Swiss German, which entirely eliminates the use of ß.
  • Compounds in the different language variants: the analyzer can handle compounds that are specific to one variant, such as the Swiss German word Chuchichäschtli (kitchen cupboard), which is not used in Standard German.
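To make the first three options concrete, here is a minimal sketch of how such switches might be modelled. The class `DecompounderConfig`, its field names, the `analyze` function and the two-word lexicon are all hypothetical; they are not the Bitext Lexical Analyzer's actual interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecompounderConfig:
    """Hypothetical switches mirroring the options described above."""
    split_lexicalized: bool = False  # also split dictionary compounds
    case_sensitive: bool = True      # require German noun capitalization
    swiss_spelling: bool = False     # write lemmas with ss instead of ß


# Toy data, assumed for illustration only.
SPLITS = {
    "Wörterbuch": ["Wort", "Buch"],
    "Straßenkarte": ["Straße", "Karte"],
}
LEXICALIZED = {"Wörterbuch"}


def analyze(word: str, config: DecompounderConfig) -> List[str]:
    """Return the lemmas for `word` according to `config`."""
    # In case-insensitive mode, restore noun capitalization before lookup.
    key = word if config.case_sensitive else word.capitalize()
    if key not in SPLITS:
        parts = [word]  # unknown form: return it unchanged
    elif key in LEXICALIZED and not config.split_lexicalized:
        parts = [key]   # keep lexicalized compounds whole by default
    else:
        parts = SPLITS[key]
    if config.swiss_spelling:
        parts = [part.replace("ß", "ss") for part in parts]
    return parts


print(analyze("Wörterbuch", DecompounderConfig()))
print(analyze("Wörterbuch", DecompounderConfig(split_lexicalized=True)))
print(analyze("straßenkarte", DecompounderConfig(case_sensitive=False)))
print(analyze("Straßenkarte", DecompounderConfig(swiss_spelling=True)))
```

With the defaults, Wörterbuch stays whole; enabling `split_lexicalized` yields Wort and Buch. The lowercase input is only recognized when case sensitivity is turned off, and the Swiss option rewrites Straße as Strasse in the output.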

Improve your solutions by adding NLP tools based on linguistic analysis to get the results you’re expecting. Why not see it for yourself? Try our decompounding tool through our API.

Are you missing any languages above? We love challenges, so just let us know and your wishes will come true.

admin
