Enriched Arabic Text Embeddings
Bitext provides linguistically enriched text embeddings for Arabic that outperform traditional embeddings in a wide range of downstream tasks, such as text classification, topic modeling and semantic search.
We offer both:
- Pre-trained embeddings for morphologically complex languages like Arabic
- Services to create custom embeddings for Arabic
Our Customers
Working with 3 of the Top 5 Largest Companies on NASDAQ
Advantages
Linguistically enriched embeddings have been shown to increase accuracy in a range of downstream tasks. In standard semantic similarity tests, our enriched embeddings outperform regular embeddings by 7 points (from 0.47 to 0.54).
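The move from 0.47 to 0.54 can be read as an absolute or as a relative gain, and the two readings give different numbers. The short calculation below separates them. It assumes both figures are scores on the same 0 to 1 scale, which the text above does not state explicitly.

```python
# Scores quoted above for the semantic similarity test.
baseline = 0.47   # regular embeddings
enriched = 0.54   # enriched embeddings

absolute_gain = enriched - baseline        # difference between the two scores
relative_gain = absolute_gain / baseline   # gain as a fraction of the baseline

print(f"absolute gain: {absolute_gain:.2f}")   # 0.07
print(f"relative gain: {relative_gain:.1%}")   # 14.9%
```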
To show the improvements our embeddings bring to morphologically rich languages more broadly, we have also benchmarked them on more complex downstream tasks:
- Topic Modeling: topic coherence increases by up to 15%
- Semantic Textual Similarity (STS): F1-score increases by 4%
- Question Answering (QA): exact match score increases by 14%
Features
We provide a service to generate enriched embeddings for Arabic, extending traditional unsupervised embeddings into semi-supervised ones by leveraging Bitext’s linguistic resources:
- lexical resources for Modern Standard Arabic (MSA) with 15M words…
- dialectal/regional variants like Egyptian, Gulf or Najdi Arabic
- rich morphological attribute tagging: POS, tense, gender, number, aspect…
- extensive corpora
- named entity dictionaries
- offensive language tags
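One common way to bring lexical knowledge into otherwise unsupervised embedding training is to rewrite each corpus token as a lemma plus a tag before training. The sketch below illustrates that general idea only. It is not a description of Bitext's actual method, and the mini-lexicon, the `enrich` helper and the `lemma_POS` token format are all hypothetical.

```python
# Hypothetical mini-lexicon: surface form -> (lemma, POS). Entries are illustrative only.
LEXICON = {
    # al-kitab, "the book"
    "الكتاب": ("كتاب", "NOUN"),
    # al-kutub, "the books"
    "الكتب": ("كتاب", "NOUN"),
    # qara'a, "he read"
    "قرأ": ("قرأ", "VERB"),
    # yaqra'u, "he reads"
    "يقرأ": ("قرأ", "VERB"),
}

def enrich(tokens, lexicon=LEXICON):
    """Replace each known token with 'lemma_POS'; leave unknown tokens unchanged."""
    out = []
    for tok in tokens:
        if tok in lexicon:
            lemma, pos = lexicon[tok]
            out.append(f"{lemma}_{pos}")
        else:
            out.append(tok)
    return out

# Two sentences with different inflected forms of the same lemmas.
sentences = [
    ["قرأ", "الكتاب"],
    ["يقرأ", "الكتب"],
]
enriched_sentences = [enrich(s) for s in sentences]
for sent in enriched_sentences:
    print(sent)
```

After enrichment both sentences become the same sequence of lemma-level tokens, so a downstream embedding trainer sees fewer, more frequent vocabulary items instead of many sparse inflected forms.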
Bitext also offers pre-trained word embeddings and full pipelines for Arabic designed to take full advantage of these embeddings. These pipelines include:
- high-quality lemmatization
- full pipeline of linguistic services such as tokenization, entity extraction, etc.
Our enriched embeddings offer the following features:
- Ready to use with virtually any platform/pipeline: spaCy, Gensim, BERTopic…
- Wide vocabulary coverage: 50K lemmas, covering up to 15M word forms from Bitext’s balanced 5B word corpus
- Compact size: 150 dimensions
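As a rough illustration of using lemma-keyed vectors for semantic search, the sketch below reads vectors in the plain word2vec text format and compares them with cosine similarity. The file format is an assumption (the page does not say how the embeddings are distributed), and the sample vectors are made-up 4-dimensional stand-ins for real 150-dimensional ones.

```python
import io
import math

def load_text_vectors(handle):
    """Read word2vec text format: a 'count dim' header, then 'word v1 v2 ...' rows."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy stand-in for a real file; values are invented for illustration.
# Rows: kitab "book", maktaba "library", sayyara "car".
SAMPLE = """3 4
كتاب 0.9 0.1 0.0 0.2
مكتبة 0.8 0.2 0.1 0.3
سيارة 0.0 0.9 0.7 0.1
"""

dim, vectors = load_text_vectors(io.StringIO(SAMPLE))
sim_book_library = cosine(vectors["كتاب"], vectors["مكتبة"])
sim_book_car = cosine(vectors["كتاب"], vectors["سيارة"])
print(dim, len(vectors))
print(sim_book_library > sim_book_car)
```

If the vectors are in this format, Gensim's `KeyedVectors.load_word2vec_format` reads the same layout directly; the pure-Python loader here only keeps the example dependency-free.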
Features
The core strengths of the Bitext embeddings & pipelines are:
- Based on large corpora. Embeddings are built on corpora in the range of 25-50 GB of text (5-10 billion tokens). These corpora are designed to be well balanced with respect to text typology, text sources and verticals.
- Extensive vocabulary coverage. All pipeline components use comprehensive lexical resources. For example, our lexicon for Arabic covers 15 million word forms. These lexical resources:
  - ensure high-quality linguistic tagging functions, like POS tagging and lemmatization
  - reduce the number of unknown (OOV) words in embeddings
  - provide rich morphological features like tense, person, number, gender, case…
- Full pipeline coverage. The pipeline allows for control of all the usual steps:
  - Sentence segmentation. Splits texts into sentences.
  - Word segmentation. Covers languages without spaces between words, like Japanese or Chinese, and languages with spaces between syllables, like Vietnamese
  - Tokenization. Splits sentences into individual tokens (words, numbers…)
  - POS tagging. Provides better word type segmentation, grouping word classes that have similar behavior, like nouns, verbs, adjectives or adverbs
  - Lemmatization. Increases the vocabulary coverage of the embedding, particularly for morphologically rich languages like Spanish, French or Italian
  - Decompounding. Reduces vocabulary size by splitting compounds into known words, reducing data sparsity
  - Named Entity Recognition. NER uses entity dictionaries that reduce the number of unknown words
  - Dependency Parsing.
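The effect of lemmatization on unknown words can be sketched with a toy pipeline: split sentences, tokenize, then measure how many tokens are missing from the embedding vocabulary with and without a lemma lookup. Everything here is a simplified assumption for illustration: the `LEMMAS` table, the `EMBEDDING_VOCAB` set and the regex-based splitting are hypothetical stand-ins, not Bitext's components.

```python
import re

# Hypothetical resources, for illustration only.
# Surface form -> lemma (kitab "book", madrasa "school").
LEMMAS = {"الكتاب": "كتاب", "الكتب": "كتاب", "كتابان": "كتاب", "المدارس": "مدرسة"}
# Vocabulary of a lemma-keyed embedding (includes fi "in").
EMBEDDING_VOCAB = {"كتاب", "مدرسة", "في"}

def split_sentences(text):
    """Naive sentence segmentation on Latin and Arabic end punctuation."""
    return [s.strip() for s in re.split(r"[.!?؟]+", text) if s.strip()]

def tokenize(sentence):
    """Naive tokenization: runs of Unicode word characters."""
    return re.findall(r"\w+", sentence)

def oov_rate(tokens, vocab, lemmas=None):
    """Share of tokens missing from vocab, optionally after lemma lookup."""
    if not tokens:
        return 0.0
    missing = 0
    for tok in tokens:
        key = lemmas.get(tok, tok) if lemmas else tok
        if key not in vocab:
            missing += 1
    return missing / len(tokens)

# "The books in the schools. Two books in the school?"
text = "الكتب في المدارس. كتابان في المدرسة؟"
sentences = split_sentences(text)
tokens = [tok for s in sentences for tok in tokenize(s)]

raw_oov = oov_rate(tokens, EMBEDDING_VOCAB)
lemma_oov = oov_rate(tokens, EMBEDDING_VOCAB, LEMMAS)
print(f"OOV without lemmatization: {raw_oov:.2f}")
print(f"OOV with lemmatization:    {lemma_oov:.2f}")
```

In this toy run four of six tokens are unknown as raw word forms, and one of six after the lemma lookup. The remaining miss is a form absent from the small `LEMMAS` table, which is the gap a larger lexicon is meant to close.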
MADRID, SPAIN
Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain
SAN FRANCISCO, USA
541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA