Lemmatization vs Stemming

Almost all of us use a search engine in our daily work; it has become a key tool for getting our tasks done. However, the amount of data and resources available keeps growing, and providing high-quality results that match users’ queries becomes more complex.

One of the issues that complicate the process is ambiguous words. These terms have different meanings depending on their function in the sentence. Let’s see an example:

Let’s take a five-minute break in this meeting.
This vase made of glass can break easily.

In both sentences we have used the word “break”, but with different meanings: in the first it acts as a noun and means a pause, while in the second it acts as a verb and means to shatter.

When we are working with large databases to look for precise information, ambiguous words may complicate our search, because the results retrieved by the search engine will include data containing the term “break” in both meanings, while we may be interested in only one of them. Some results will be relevant to us, but the others are just noise that slows down our work.

Ambiguity might not be the top problem when searching in English, but it plays a major role in highly inflected languages such as French, Spanish or Polish.

These languages commonly use declension and inflect adjectives, pronouns and nouns.

How does inflection affect search?

When a term is entered into a search engine, it needs to be normalized, both at query time and at index time, so that what the user is looking for can match a term contained in the database. There are two different approaches to normalizing a word:

Lemmatization: based on the word’s usage, the machine looks up the appropriate dictionary form of the word.
Stemming: characters are removed from the end of the word by following language-specific rules.
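
The two approaches can be contrasted in a minimal Python sketch. Everything in it is an assumption made for illustration: the suffix list, the mini-lexicon and the function names `stem` and `lemmatize` are hypothetical and far smaller than what a real stemmer or lemmatizer would use.

```python
# Toy normalizers for illustration only; real systems rely on full
# language-specific rule sets and lexicons.

SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny rule list

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-lexicon: (surface form, part of speech) -> dictionary form.
LEXICON = {
    ("broke", "VERB"): "break",
    ("breaking", "VERB"): "break",
    ("breaks", "VERB"): "break",
    ("breaks", "NOUN"): "break",
    ("better", "ADJ"): "good",
}

def lemmatize(word: str, pos: str) -> str:
    """Look up the dictionary form of a word used with a given part of speech."""
    return LEXICON.get((word.lower(), pos), word.lower())

for word, pos in [("breaking", "VERB"), ("broke", "VERB"), ("better", "ADJ")]:
    print(word, "-> stem:", stem(word), "| lemma:", lemmatize(word, pos))
```

In this sketch the rule-based stemmer handles the regular form "breaking" but leaves the irregular forms "broke" and "better" untouched, while the lexicon lookup maps each of them to its dictionary form.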

In weakly inflected languages, the method chosen may not influence the quality of the results. But our internal research has shown that for highly inflected languages the chosen process determines the accuracy of the end results.

The main advantage of lemmatization is that it takes the context of the word into consideration to determine which meaning the user intends. This reduces noise and speeds up the user’s task.
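
As a rough sketch of how this noise reduction could play out, the snippet below indexes the two “break” sentences from the earlier example and applies the same normalization at index time and at query time. The hand-written part-of-speech tags, the document ids and the helper names (`stem_key`, `lemma_key`, `build_index`, `search`) are all assumptions made for illustration; a real pipeline would obtain the tags from a tagger and the lemmas from a lexicon.

```python
from collections import defaultdict

# Hypothetical hand-tagged documents: each token is (surface form, part of speech).
# A real pipeline would get the tags from a part-of-speech tagger.
DOCS = {
    "meeting": [("take", "VERB"), ("break", "NOUN"), ("meeting", "NOUN")],
    "vase": [("vase", "NOUN"), ("can", "AUX"), ("break", "VERB"), ("easily", "ADV")],
}

def stem_key(token):
    """Stem-style key: surface form only, with a trailing 's' stripped."""
    word, _pos = token
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def lemma_key(token):
    """Lemma-style key: dictionary form plus part of speech.
    Assumes the toy surface forms are already dictionary forms."""
    word, pos = token
    return (word.lower(), pos)

def build_index(docs, key_fn):
    """Index time: map each normalized key to the documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[key_fn(token)].add(doc_id)
    return index

def search(index, key_fn, query_token):
    """Query time: normalize the query with the same function, then look it up."""
    return sorted(index.get(key_fn(query_token), set()))

stem_index = build_index(DOCS, stem_key)
lemma_index = build_index(DOCS, lemma_key)
query = ("break", "NOUN")  # the user means a pause, not shattering

print(search(stem_index, stem_key, query))    # ['meeting', 'vase']
print(search(lemma_index, lemma_key, query))  # ['meeting']
```

With the stem-style key both documents come back, one of which is noise for this query; with the lemma-style key only the document that uses “break” as a noun is returned.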

Consider what happens when we look for an ambiguous word in French: if we follow the stemming methodology, the noise will be much higher.

In most cases, when an ambiguous word form comes from two different words, the stem will be the same, whereas the lemmas show the difference.
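
One French form that behaves this way is “avions”, which is both the plural of the noun “avion” (plane) and a form of the verb “avoir” (“nous avions”, we had). The sketch below is not the original example from this post; it uses a hypothetical two-entry lexicon and a toy suffix rule (`toy_stem`) in place of a real French stemmer and lemmatizer.

```python
# Hypothetical French mini-lexicon keyed by (surface form, part of speech).
LEMMAS = {
    ("avions", "NOUN"): "avion",  # "les avions" (the planes)
    ("avions", "VERB"): "avoir",  # "nous avions" (we had)
}

def toy_stem(word):
    """Strip a trailing 's'; a stand-in for a real rule-based stemmer.
    It sees only the characters of the word, never its context."""
    return word[:-1] if word.endswith("s") else word

def lemma(word, pos):
    """Return the dictionary form for the word in the given part of speech."""
    return LEMMAS.get((word, pos), word)

readings = [("avions", "NOUN"), ("avions", "VERB")]
stems = {toy_stem(word) for word, _pos in readings}
lemmas = {lemma(word, pos) for word, pos in readings}

print(len(stems), sorted(stems))    # 1 ['avion']
print(len(lemmas), sorted(lemmas))  # 2 ['avion', 'avoir']
```

Because the stemmer works on the surface form alone, both readings collapse into a single stem, while the lemma lookup keeps the noun and the verb apart.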

If you are interested in additional multilingual examples, download our benchmark!
