Lemmatization

What is the difference between stemming and lemmatization?

Stemming and lemmatization are methods used by search engines and chatbots to analyze the meaning behind a word. Stemming reduces the word to its stem, while lemmatization takes into account the context in which the word is being used. We will go into more detailed explanations and examples later.

When running a search, we want to find relevant results not only for the exact expression we typed on the search bar, but also for the other possible forms of the words we used.

For example, it’s very likely we will want to see results containing the form “skirt” if we have typed “skirts” in the search bar. Lemmatization and stemming are applied in this case. 

In the case of a chatbot, lemmatization is one of the most effective ways to help it better understand customers’ queries. Because this method carries out a morphological analysis of the words, the chatbot can identify the contextual form of every word and, therefore, better understand the overall meaning of the entire sentence.

Main differences between stemming and lemmatization

The aim of both processes is the same: reducing the inflectional forms of each word to a common base or root. However, the two methods are not the same. They differ in the way they work and, therefore, in the result each of them returns.

  • Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting succeeds on some occasions, but not always, which is why this approach has some limitations. Below we illustrate the method with examples in both English and Spanish.

  • Lemmatization, on the other hand, takes into consideration the morphological analysis of the words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link the form back to its lemma. Again, you can see how it works with the same example words.
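The contrast between the two approaches can be sketched in a few lines of Python. This is only a toy illustration: the suffix list, the length check and the tiny dictionary are assumptions made up for this example, not the rules or data of any real stemmer or lemmatizer.

```python
# Toy illustration only: the suffix list and the dictionary below are
# made up for this example and are far smaller than real resources.
SUFFIXES = ("ing", "es", "ed", "s")  # checked in this order

LEMMA_DICTIONARY = {
    "skirts": "skirt",
    "studies": "study",
    "studying": "study",
    "was": "be",
}


def naive_stem(word: str) -> str:
    """Cut off the first matching suffix, with no linguistic knowledge."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str) -> str:
    """Look the form up in the dictionary; fall back to the form itself."""
    return LEMMA_DICTIONARY.get(word, word)


for form in ("skirts", "studies", "studying", "was"):
    print(f"{form}: stem={naive_stem(form)} lemma={lemmatize(form)}")
```

Both functions agree on “skirts”, but the suffix stripper turns “studies” into “studi” and cannot relate “was” to “be”, while the dictionary lookup can, as long as the form is listed.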

Another important difference to highlight is that a lemma is the base form of all its inflectional forms, whereas a stem isn’t. This is why regular dictionaries are lists of lemmas, not stems. This has two consequences:

  • First, the stem can be the same for the inflectional forms of different lemmas. This translates into noise in our search results. In fact, it is very common to find entire forms as instances of several lemmas; let’s see some examples.

In Telugu (above), the form for “robe” is identical to the form for “I don’t share”, so their stems are indistinguishable too. But they, of course, belong to different lemmas. The same happens in Gujarati (below), where the forms and stems for “beat” and “set up” coincide, but we can separate one from another by looking at their lemmas.

  • Also, the same lemma can correspond to forms with different stems, and we need to treat them as the same word. For example, in Greek, a typical verb has different stems for perfective forms and for imperfective ones. With stemming algorithms we would not be able to relate these forms to the same verb, but with lemmatization we can. We can clearly observe it in the example below:
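Both consequences can also be reproduced with an English analogue, sketched here in Python. The mini-lexicon is hypothetical and stands in for the detailed dictionaries a lemmatizer needs; it is not taken from the Telugu, Gujarati or Greek examples in this post.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: each surface form maps to one or more lemmas.
FORM_TO_LEMMAS = {
    "saw": ["see", "saw"],  # past tense of "see", or the cutting tool
    "sees": ["see"],
    "seen": ["see"],
    "saws": ["saw"],
    "go": ["go"],
    "goes": ["go"],
    "went": ["go"],  # shares no stem with "go"
    "gone": ["go"],
}


def group_by_lemma(forms):
    """Group surface forms under every lemma they can belong to."""
    groups = defaultdict(list)
    for form in forms:
        # Unknown forms are treated as their own lemma.
        for lemma in FORM_TO_LEMMAS.get(form, [form]):
            groups[lemma].append(form)
    return dict(groups)


groups = group_by_lemma(["saw", "sees", "saws", "went", "goes"])
for lemma, forms in groups.items():
    print(lemma, forms)
```

The form “saw” is listed under two lemmas, which is the kind of ambiguity a stem alone cannot resolve, while “went” and “goes” end up under the same lemma even though no suffix rule would connect them.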

 

How do they work?

  • Stemming: there are different algorithms that can be used in the stemming process, but the most common in English is the Porter stemmer. The rules contained in this algorithm are divided into five phases, numbered from 1 to 5. The purpose of these rules is to reduce words to their root.
  • Lemmatization: the key to this methodology is linguistics. To extract the proper lemma, it is necessary to look at the morphological analysis of each word. This requires having dictionaries for every language to provide that kind of analysis.
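To give a feel for what such rules look like, here is a minimal Python sketch of a single rule group from the Porter stemmer, usually called step 1a, which handles plural endings. It covers only this one step of the first phase, not the whole algorithm, and the function name is our own.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter stemmer (plural endings).

    The rules are tried in order and the first one that matches wins.
    """
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]  # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word  # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]  # S    -> ""   (cats -> cat)
    return word


for w in ("caresses", "ponies", "ties", "caress", "cats", "skirts"):
    print(w, "->", porter_step_1a(w))
```

Note that “ponies” becomes “poni”, which is not a dictionary word. That is expected: a stemmer only needs to map related forms to the same string, whereas a lemmatizer must return the actual lemma.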

 

How to increase recall beyond lemmatization?

Lemmatization is a common technique to increase recall (to make sure no relevant document gets lost). However, lemmatization may not be enough in many cases and we may need to further increase recall with other techniques.

For example, if you search for information on “John Kennedy”, documents that contain any of the following will definitely be relevant:

“JFK”, “John F Kennedy”, “John Fitzgerald Kennedy”

Plus all variations with/without spaces or periods: “John F. Kennedy”…

A similar example is “cost of labor”, where you also want to retrieve “cost of labour”.

The same thing happens with “bull market” and “bullish market” or “up market”.

These types of semantic equivalents are popularly known as “synonyms” (although in linguistic terms some are not synonyms but acronyms or regional US/UK variations; our point is to stress that there are many types of variations that we need to consider for increasing recall and query expansion). 
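One simple way to use such equivalents is query expansion: normalize the query, then replace it with every variant in its group. The Python sketch below assumes a small hand-written table of equivalence groups and a normalization that only lowercases, drops periods and collapses whitespace; the names `EQUIVALENTS`, `normalize` and `expand_query` are hypothetical, and a real system would load the groups from a curated resource.

```python
import re

# Hypothetical equivalence groups, stored in normalized form.
EQUIVALENTS = [
    {"john kennedy", "jfk", "john f kennedy", "john fitzgerald kennedy"},
    {"cost of labor", "cost of labour"},
    {"bull market", "bullish market", "up market"},
]


def normalize(text: str) -> str:
    """Lowercase, drop periods, and collapse runs of whitespace."""
    text = text.lower().replace(".", "")
    return re.sub(r"\s+", " ", text).strip()


def expand_query(query: str) -> set:
    """Return every known variant of the query, or just the query itself."""
    key = normalize(query)
    for group in EQUIVALENTS:
        if key in group:
            return set(group)
    return {key}


print(sorted(expand_query("John F. Kennedy")))
print(sorted(expand_query("cost of labour")))
```

Documents would need to be normalized in the same way at indexing time for the expanded variants to match.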

Making sure that your search engine knows about these language nuances will improve results and make the user experience much more positive.

Which one is best: lemmatization or stemming?

In conclusion, developing a stemmer is far simpler than building a lemmatizer. The latter requires deep linguistic knowledge to create the dictionaries that allow the algorithm to look up the proper form of the word. Once this is done, noise will be reduced and the results returned by the information retrieval process will be more accurate.

We have seen the benefits of a lemmatizer for search engines, but there are more applications of lemmatization, like textual bases or e-commerce search. Know your tools!
