Almost all of us use a search engine in our daily working routine; it has become a key tool for getting our tasks done. However, the amount of data and resources available grows by the minute, and providing high-quality results that match users’ queries becomes more complex.
One of the issues that complicates the process is ambiguous words. These terms have different meanings depending on their function in the sentence. Let’s see an example:
Take the word “can”. The same form has two different meanings: as a noun it means a tin, while as a verb it expresses possibility.
When we search large databases for precise information, ambiguous words complicate the task, because the results retrieved by the search engine will include data containing the term “can” in both meanings, even if we are only interested in one of them. Some results will be relevant to us, but the rest are just noise that slows down our job.
Ambiguity might not be the top problem when searching in English; however, it plays a major role in highly inflected languages like French, Spanish or Polish.
These languages make frequent use of declension and of inflected adjectives, pronouns and nouns.
When a term is entered in the search engine, it needs to be normalized, both at query time and at index time, so that what the user is looking for can match a term contained in the database. There are two different approaches to normalizing a word: stemming and lemmatization.
In weakly inflected languages, the method chosen may not influence the quality of the results. But our internal research has shown that, for highly inflected languages, the chosen process determines the accuracy of the end results.
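To make the idea of normalizing at both index time and query time concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `toy_stem` is a crude suffix stripper rather than a real stemmer, `TOY_LEMMAS` is a tiny hand-written dictionary, and `build_index` and `search` are hypothetical helpers, not part of any product or library.

```python
from collections import defaultdict

# Hypothetical toy lemma dictionary: surface form -> lemma.
TOY_LEMMAS = {"runs": "run", "ran": "run", "running": "run"}

DOCS = {
    1: "she runs every morning",
    2: "he ran yesterday",
    3: "running shoes",
}


def toy_stem(word):
    """Crude suffix stripper, for illustration only (not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


def build_index(docs, normalize):
    """Index time: store every document under its normalized terms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize):
    """Query time: apply the same normalizer, then intersect the postings."""
    terms = [normalize(token) for token in query.lower().split()]
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits


stem_index = build_index(DOCS, toy_stem)
lemma_index = build_index(DOCS, toy_lemmatize)

print(search(stem_index, "run", toy_stem))
print(search(lemma_index, "run", toy_lemmatize))
```

In this toy setup the stemmed index only matches document 1 for the query "run", because "ran" is left untouched and "running" is cut to "runn", while the dictionary-based index maps all three forms to "run". The key design point is that the same normalizer must be used on both sides, otherwise query terms and indexed terms cannot meet.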
The main advantage of lemmatization is that it takes the context of the word into consideration to determine which meaning the user intends. This reduces noise and speeds up the user’s task.
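The noise reduction described above can be sketched with the English word “can”. This is only an illustration under strong assumptions: the documents in `TAGGED_DOCS` are hand-tagged with a part of speech (a real system would need a tagger to derive it from context), and `build_pos_index`, `search_any_sense` and `search_sense` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical pre-tagged documents: each token is (surface form, part of speech).
TAGGED_DOCS = {
    1: [("open", "VERB"), ("the", "DET"), ("can", "NOUN")],
    2: [("you", "PRON"), ("can", "VERB"), ("open", "VERB"), ("it", "PRON")],
}


def build_pos_index(tagged_docs):
    """Index each document under (form, part of speech) instead of the bare form."""
    index = defaultdict(set)
    for doc_id, tokens in tagged_docs.items():
        for form, pos in tokens:
            index[(form.lower(), pos)].add(doc_id)
    return index


def search_any_sense(index, term):
    """Context-blind lookup: every document containing the form, in any sense."""
    hits = set()
    for (form, _pos), doc_ids in index.items():
        if form == term:
            hits |= doc_ids
    return hits


def search_sense(index, term, pos):
    """Context-aware lookup: only documents where the form has the requested role."""
    return set(index.get((term, pos), set()))


pos_index = build_pos_index(TAGGED_DOCS)
print(search_any_sense(pos_index, "can"))
print(search_sense(pos_index, "can", "NOUN"))
```

The context-blind lookup returns both documents, while the lookup restricted to the noun reading returns only the document about the tin.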
Consider what happens when we search for an ambiguous word in French: if we follow the stemming methodology, the noise will be much higher.
In most cases, when an ambiguous form comes from two different words, the stem is the same for both. If we pay attention to the lemma, however, we see the difference.
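A small Python sketch can illustrate this contrast between stem and lemma. The French forms below were chosen for this illustration and are not taken from the benchmark: “avions” is both the plural of the noun “avion” (plane) and a form of the verb “avoir” (to have), and “portes” is both the plural of “porte” (door) and a form of “porter” (to carry). The stemmer is a crude suffix stripper, and the part of speech is assumed to be supplied by some upstream tagger.

```python
# Hypothetical toy dictionary: (surface form, part of speech) -> lemma.
FRENCH_LEMMAS = {
    ("avions", "NOUN"): "avion",
    ("avions", "VERB"): "avoir",
    ("portes", "NOUN"): "porte",
    ("portes", "VERB"): "porter",
}


def toy_french_stem(word):
    """Crude suffix stripper, for illustration only; it never looks at context."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_french_lemma(word, pos):
    """Look up the lemma for a form given its part of speech (assumed known)."""
    return FRENCH_LEMMAS.get((word, pos), word)


for form in ("avions", "portes"):
    stem = toy_french_stem(form)
    noun_lemma = toy_french_lemma(form, "NOUN")
    verb_lemma = toy_french_lemma(form, "VERB")
    print(f"{form}: stem={stem}, noun lemma={noun_lemma}, verb lemma={verb_lemma}")
```

Because the stemmer only sees the surface form, both readings of each word collapse onto a single stem, whereas the lemma lookup keeps the noun and the verb apart.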
If you are interested in additional multilingual examples, download our benchmark!