The use of word embeddings has become the standard approach for dealing with text input in AI models.
While extensive research has been carried out in recent years to analyze the theoretical underpinnings of algorithms such as word2vec, fastText and BERT, surprisingly little has been done to solve some of the more complex linguistic issues that arise in practice.
In our previous post, ‘Main Challenges for Word Embeddings: Part I’, we described two main challenges posed by linguistic phenomena: homographs and inflection. In this post, we discuss additional problems that can be overcome thanks to linguistic resources.
While the issues discussed in our previous post are problematic, this one, antonyms, is particularly risky. Since word embedding algorithms generate vectors from the contexts in which words appear, they also tend to generate similar vectors for words with opposite meanings.
It is not unusual, for instance, for love and adore to share a similarity of 0.72. The problem arises when love and hate also show a similarity as high as 0.62.
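To make the measurement behind these figures concrete, here is a minimal sketch of cosine similarity, the standard metric for comparing word vectors. The vectors below are toy, hand-made examples, not output from any real model; they only illustrate why two words that occur in near-identical contexts (such as "I ___ this movie") can end up with very similar vectors regardless of their opposite meanings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 4-dimensional vectors, for illustration only.
# Because "love" and "hate" appear in near-identical contexts, a purely
# context-based model can assign them very similar vectors.
love = [0.9, 0.4, 0.1, 0.3]
adore = [0.9, 0.3, 0.1, 0.4]
hate = [0.8, 0.5, 0.2, 0.3]

print(cosine_similarity(love, adore))  # high, as expected for synonyms
print(cosine_similarity(love, hate))   # also high, despite opposite meanings
```

With real embeddings the same computation is usually run through a library call (e.g. gensim's `KeyedVectors.similarity`), but the underlying metric is the one shown here.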
This issue can have disastrous effects in text classification for sentiment analysis or in conversational agents. On the one hand, a system that cannot properly distinguish between good and bad will not analyze user reviews correctly. On the other hand, a home automation system that considers up and down to be the same will struggle to control a thermostat.
Solution: Bitext solves this problem using lexical knowledge, reliably identifying antonyms and synonyms during the preprocessing stage.
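One way such lexical knowledge can be injected at preprocessing time is sketched below. This is an assumed, simplified illustration, not Bitext's actual pipeline: a tiny hand-made polarity lexicon tags each known token with a POS/NEG marker, so that after training, "love_POS" and "hate_NEG" can no longer collapse into near-identical vectors.

```python
# Hypothetical, tiny lexicon for illustration; a production system would
# draw on a full lexical resource of antonym/synonym relations.
POLARITY_LEXICON = {
    "love": "POS", "adore": "POS", "good": "POS", "up": "POS",
    "hate": "NEG", "bad": "NEG", "down": "NEG",
}

def tag_polarity(tokens):
    """Append a polarity marker to every token found in the lexicon."""
    return [f"{t}_{POLARITY_LEXICON[t]}" if t in POLARITY_LEXICON else t
            for t in tokens]

print(tag_polarity("i love this movie but i hate the ending".split()))
# ['i', 'love_POS', 'this', 'movie', 'but', 'i', 'hate_NEG', 'the', 'ending']
```

Training the embedding model on the tagged corpus keeps antonym pairs in distinct regions of the vector space, since their surface forms no longer coincide.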
Although some words that are spelled alike can be distinguished by their part of speech, the issue of polysemy (same spelling and POS but different meaning) remains largely unsolved.
A good illustration is the adjective social, whose meaning changes with the context, as in social security versus social media.
In token-based word embeddings, social media is not considered a single token. Therefore, it would be hard for an ML system to compare it to the word Twitter, for instance, even by combining the individual vectors for social and media.
Solution: train the word embedding model on a corpus where all noun phrases are marked as single tokens. In this case, a comparison between social_media_NounPhrase and Twitter_NOUN easily reaches a similarity of 0.68.
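The noun-phrase marking step described above can be sketched as a simple corpus preprocessor. This is an illustrative assumption, not Bitext's actual tool: a small set of known noun phrases (which a real system would obtain from linguistic analysis) is greedily merged into single tokens before training.

```python
# Illustrative phrase inventory; a real pipeline would derive this from
# linguistic analysis rather than a hard-coded set.
NOUN_PHRASES = {("social", "media"), ("social", "security")}

def merge_noun_phrases(tokens):
    """Greedily join adjacent token pairs that form a known noun phrase."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in NOUN_PHRASES:
            out.append(f"{tokens[i]}_{tokens[i + 1]}_NounPhrase")
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_noun_phrases("twitter is a social media platform".split()))
# ['twitter', 'is', 'a', 'social_media_NounPhrase', 'platform']
```

Once the corpus is tokenized this way, social_media_NounPhrase receives its own vector, disambiguated from social security and directly comparable to Twitter.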
Bitext’s approach not only helps deal with polysemy but also with expressions and verb phrases such as you’d better or I’d rather. Bitext’s Entity Extraction tool also helps apply this solution to related entities.
Linguistic knowledge enhances machine learning by applying linguistic solutions to the raw data before it enters the learning pipeline.
The study, evaluation, and results show that Bitext technology, based on a linguistic approach, can make any vector space model (VSM) perform better in downstream tasks, not only in standardization but also in information extraction and topic detection.