Main Challenges for Word Embeddings: Part II

The use of word embeddings has become the standard approach for dealing with text input in AI models.

Although extensive research has been carried out in recent years on the theoretical underpinnings of algorithms such as word2vec, fastText and BERT, surprisingly little has been done to solve some of the more complex linguistic issues that arise when these models are put into practice.

In our previous post, ‘Main Challenges for Word Embeddings: Part I’, we described two main challenges posed by linguistic phenomena such as homographs and inflection. In this post, we discuss additional problems that can be overcome with the help of linguistic resources.

Antonyms

While the issues discussed in our previous post are problematic, this one is particularly risky. Word embedding algorithms generate vectors based on the contexts in which words appear, and words with opposite meanings often appear in very similar contexts. As a result, these algorithms tend to generate similar vectors for antonyms.

It is not surprising, for instance, that love and adore have a similarity of 0.72. The problem arises when love and hate also show a close similarity of 0.62.
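Similarity scores like these are usually cosine similarities between word vectors. The post does not state which measure was used, so that is an assumption here. A minimal sketch of the computation, using made-up three-dimensional toy vectors rather than values from any trained model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors for illustration only;
# real embeddings typically have hundreds of dimensions.
toy_vectors = {
    "love":  [0.9, 0.8, 0.1],
    "adore": [0.8, 0.9, 0.2],
    "hate":  [0.9, 0.7, -0.4],
}

love_adore = cosine_similarity(toy_vectors["love"], toy_vectors["adore"])
love_hate = cosine_similarity(toy_vectors["love"], toy_vectors["hate"])

print(f"love vs. adore: {love_adore:.2f}")
print(f"love vs. hate:  {love_hate:.2f}")
```

In this toy setup hate differs from love in only one dimension, so the two vectors still score as highly similar, which mirrors the behaviour described above.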

This issue can have disastrous effects in text classification for sentiment analysis or conversational agents. A system that cannot properly distinguish between good and bad will not analyze user reviews correctly. Likewise, a home automation system that treats up and down as the same will have serious problems controlling a thermostat.

Solution: Bitext technology addresses this problem using lexical knowledge. The solution includes reliable identification of antonyms and synonyms during the preprocessing stage.
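The post does not describe how Bitext implements this, so the following is only a rough sketch of one way lexical knowledge could be applied. It assumes a small hand-made antonym lexicon (the `ANTONYMS` dictionary and both function names are hypothetical) and uses it to discard antonyms from a list of candidate neighbours:

```python
# Hypothetical hand-made lexicon; a real system would rely on a
# full lexical resource rather than three entries.
ANTONYMS = {
    "love": {"hate"},
    "good": {"bad"},
    "up": {"down"},
}

def is_antonym_pair(word_a, word_b, lexicon=ANTONYMS):
    """True if the lexicon lists the two words as antonyms, in either direction."""
    return (word_b in lexicon.get(word_a, set())
            or word_a in lexicon.get(word_b, set()))

def filter_neighbours(word, neighbours, lexicon=ANTONYMS):
    """Drop (candidate, score) pairs whose candidate is an antonym of `word`."""
    return [(cand, score) for cand, score in neighbours
            if not is_antonym_pair(word, cand, lexicon)]

# Scores taken from the example in the text.
neighbours_of_love = [("adore", 0.72), ("hate", 0.62)]
print(filter_neighbours("love", neighbours_of_love))
```

The lookup is symmetric, so listing hate under love is enough to block the pair in both directions.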

Phrases, Entities and Expressions

Although some words that are spelled alike can be distinguished by their part of speech, the issue of polysemy (same spelling and POS but different meaning) remains largely unsolved.

A good illustration is the adjective social, whose meaning depends on the context, as in social security and social media.

With token-based word embeddings, social media is not treated as a single token. It is therefore hard for an ML system to compare it to the word Twitter, for instance, even when the vectors for social and media are combined:

social vs. Twitter: 0.38
media vs. Twitter: 0.32
social + media vs. Twitter: 0.42
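The post does not say how the two vectors were combined. A common choice is element-wise addition, which is what this sketch assumes. The vectors are made-up toy values, so the printed scores will not match the figures above:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def add_vectors(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical toy vectors for illustration only.
social = [0.6, 0.1, 0.3]
media = [0.2, 0.7, 0.1]
twitter = [0.5, 0.5, 0.6]

# The combined vector is built only from the two word vectors,
# so it carries no information specific to the phrase itself.
combined = add_vectors(social, media)

print(f"social vs. Twitter:         {cosine_similarity(social, twitter):.2f}")
print(f"media vs. Twitter:          {cosine_similarity(media, twitter):.2f}")
print(f"social + media vs. Twitter: {cosine_similarity(combined, twitter):.2f}")
```

Because the sum is derived entirely from the individual word vectors, it cannot capture a meaning that exists only at the phrase level.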

Solution: Train a word embedding model on a corpus in which all noun phrases are marked as single tokens. In this case, a comparison between social_media_NounPhrase and Twitter_NOUN yields a similarity of 0.68.
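Marking noun phrases as single tokens is a preprocessing step applied to the corpus before training. The sketch below shows the idea for two-word phrases only. It assumes a small hand-made phrase list (`NOUN_PHRASES` and `merge_noun_phrases` are hypothetical names) where a real pipeline would use a noun-phrase chunker or lexicon, and it leaves out the part-of-speech tagging of single words such as Twitter_NOUN:

```python
# Hypothetical phrase list; in practice this would come from a
# noun-phrase chunker or a lexical resource.
NOUN_PHRASES = {("social", "media"), ("social", "security")}

def merge_noun_phrases(tokens, phrases=NOUN_PHRASES, suffix="_NounPhrase"):
    """Join known two-word noun phrases into single tokens, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            # Replace the two tokens with one marked phrase token.
            merged.append("_".join(pair) + suffix)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

sentence = "Twitter is a social media platform".split()
print(merge_noun_phrases(sentence))
```

An embedding model trained on the merged tokens learns one vector for social_media_NounPhrase, separate from the vectors for social and media.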

Bitext’s approach helps not only with polysemy but also with expressions or verb phrases such as you’d better or I’d rather. Bitext’s Entity Extraction tool also helps apply this solution to related entities and expressions such as:

Places: United States, Buenos Aires, New England, New York…
Companies: Standard & Poor’s, Home Depot…
Discourse markers: on the one hand, of course, by the way…
Phrasal verbs: turn on, turn off…

Linguistic knowledge enhances machine learning by applying linguistic solutions to raw data before it enters the learning process.

The study, the evaluation, and the results show that Bitext technology, based on a linguistic approach, can make any vector space model (VSM) perform better in downstream tasks, not only in standardization but also in information extraction and topic detection.

