Lemmatization

Lemmatization vs Stemming

Almost all of us use a search engine in our daily work; it has become a key tool for getting tasks done. However, the amount of available data and resources grows by the minute, and providing high-quality results that match users' queries becomes ever more complex.

One of the issues that complicates the process is ambiguous words: terms that have different meanings depending on their function in the sentence. Let's look at an example:

  • Let’s take five minutes break in this meeting.
  • This vase made of glass can break easily.

In both sentences we have used the word break, but with different meanings: in the first it acts as a noun and means a pause, while in the second it acts as a verb and means to shatter.

When we search large databases for precise information, ambiguous words complicate the task, because the results retrieved by the search engine will include data containing the term “break” in both meanings, while we may only be interested in one of them. Some of the results will be relevant, but the rest are just noise that slows down our work.

Ambiguity might not be the top problem when searching in English, but it plays a major role in highly inflected languages, like French, Spanish or Polish.

These languages make heavy use of declension, inflecting adjectives, pronouns and nouns.

How does inflection affect search?

When a user enters a term, it needs to be normalized, both at query time and at index time, so that what the user is looking for can match a term contained in the database. There are two different approaches to normalizing a word:

  • Lemmatization: based on the word's usage in context, the machine looks up its appropriate dictionary form.
  • Stemming: characters are removed from the end of the word following language-specific rules.
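The contrast between the two approaches can be sketched in a few lines of code. Note that the suffix rules and the lemma dictionary below are toy inventions for illustration, not the rules of any real stemmer or lexicon; the point is that both the query and the index must pass through the same normalizer for terms to match.

```python
def stem(word):
    """Naive suffix-stripping stemmer: chop common English endings (toy rules)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A tiny hand-made lemma dictionary keyed on (surface form, part of speech).
LEMMAS = {
    ("break", "NOUN"): "break",     # "a five-minute break"
    ("break", "VERB"): "break",     # "glass can break"
    ("broke", "VERB"): "break",     # irregular past tense maps to its lemma
    ("meetings", "NOUN"): "meeting",
}

def lemmatize(word, pos):
    """Dictionary lookup that uses the word's role in the sentence."""
    return LEMMAS.get((word, pos), word)

# The stemmer handles regular inflection but misses irregular forms,
# which a lemma dictionary resolves correctly.
print(stem("meetings"))             # -> "meeting"
print(stem("broke"))                # -> "broke" (no suffix to strip)
print(lemmatize("broke", "VERB"))   # -> "break"
```

The irregular form "broke" shows the gap: suffix stripping cannot relate it to "break", while the dictionary lookup can, because lemmatization works from word knowledge rather than surface characters.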

In weakly inflected languages, the method chosen may not influence the quality of the results. But our internal research has shown that for highly inflected languages the chosen process determines the accuracy of the end results.

The main advantage of lemmatization is that it takes the context of the word into account to determine the meaning the user intends. This reduces noise and speeds up the user's task.

Consider what happens when we search for an ambiguous word in French: if we follow the stemming approach, the noise will be much higher.

In most cases, an ambiguous form derived from two different words will share a single stem, while its lemmas reveal the difference.
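A classic French illustration of this point: "avions" is both the plural of the noun avion (plane, as in "les avions") and a form of the verb avoir (to have, as in "nous avions", we had). The sketch below uses an invented one-rule stemmer and a toy dictionary purely to demonstrate the contrast.

```python
def naive_stem(word):
    """Strip a final 's' (toy French rule) -- collapses both readings together."""
    return word[:-1] if word.endswith("s") else word

# Toy lemma dictionary: the same surface form maps to two different lemmas
# depending on its part of speech in the sentence.
LEMMAS = {
    ("avions", "NOUN"): "avion",  # "les avions"  -> the planes
    ("avions", "VERB"): "avoir",  # "nous avions" -> we had
}

def lemmatize(word, pos):
    return LEMMAS.get((word, pos), word)

# Stemming puts both readings in one ambiguous bucket;
# lemmatization keeps them apart.
print(naive_stem("avions"))           # -> "avion" for BOTH readings
print(lemmatize("avions", "NOUN"))    # -> "avion"
print(lemmatize("avions", "VERB"))    # -> "avoir"
```

With stemming, a search for the verb reading retrieves every document about airplanes as well; with lemmatization, the two senses index to distinct lemmas and the noise disappears.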

If you are interested in additional multilingual examples, download our benchmark!
