Bitext Linguistic Analysis Platform: the Lemmatizer
Bitext provides a multilingual platform for full text analysis and tagging, covering both lexical and syntactic analysis. The Bitext Lemmatizer is the core of the lexical analysis component. For an overview of our platform, see Bitext Deep Linguistic Analysis Platform.
Main Features of Bitext Lemmatizer:
- covers 100+ languages and variants: 77 languages and 25 variants
- processes 60,000,000+ words per second on a standard server
- source code available, also in escrow
- optimized versions for search & indexing, for chatbots and for SEO
- integrated with complementary tools:
  - decompounding for German, Korean…
  - word segmentation for Japanese, Chinese…
How the Lemmatizer works
Bitext Lemmatization service identifies all potential lemmas (also called roots) for any word, using morphological analysis and lexicons curated by computational linguists. The service returns:
- if the word is an inflected form, all the lemmas that form can correspond to
- if the word is a lemma, the lemma itself
For example, for the word “spoke”, the lemmatization service will return the lemmas “speak” and “spoke”. For the sentence “We spoke yesterday about fixing the bike.”, the Lemmatization service will output a set of lemmas for every word in the sentence.
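The lookup behavior described above can be sketched as follows (the toy lexicon and function names are our own illustration, not the Bitext API):

```python
# Minimal sketch of form-to-lemma lookup (hypothetical toy lexicon):
# each form maps to every lemma it can realize.
LEXICON = {
    "spoke": ["speak", "spoke"],   # past tense of "speak", or the noun (wheel spoke)
    "spoken": ["speak"],
    "bikes": ["bike"],
}

def lemmatize(word):
    """Return all candidate lemmas; a lemma simply maps to itself."""
    w = word.lower()
    return LEXICON.get(w, [w])  # unknown or base forms map to themselves

print(lemmatize("spoke"))  # ['speak', 'spoke']
print(lemmatize("speak"))  # ['speak']
```

A real lexicon also attaches morphosyntactic features (part of speech, tense, number…) to each lemma candidate; they are omitted here for brevity.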
The service uses two main components: a powerful software engine and the most comprehensive lexicons in the market.
1. The software. Bitext Lemmatizer runs on any platform and has a powerful engine capable of processing millions of words per second.
2. The Lexicons. Bitext Lexical Dictionaries contain linguistically curated wordlists that cover all possible words of each language with their morphological and semantic attributes. They are constantly updated against real language corpora.
Use Cases
Languages and Dictionaries
The Lemmatization Service supports 77 languages and 25 language variants. The lexical coverage for each language is highly comprehensive. The number of lemmas and forms varies across languages: from around 60,000 lemmas and 130,000 forms for English, to 65,000 lemmas and 3,000,000 forms for Spanish, and 70,000 lemmas and 32,000,000 forms for Finnish. Lemmas and forms for all languages are enriched with comprehensive morphosyntactic features.
77 languages currently available:
- Afrikaans
- Albanian
- Amharic
- Arabic
- Armenian
- Assamese
- Azeri
- Basque
- Belarusian
- Bengali
- Bulgarian
- Burmese
- Catalan
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- English
- Esperanto
- Estonian
- Finnish
- French
- Galician
- Georgian
- German
- Greek
- Gujarati
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Irish Gaelic
- Italian
- Japanese
- Kannada
- Kazakh
- Khmer
- Korean
- Kyrgyz
- Lao
- Latvian
- Lithuanian
- Macedonian
- Malay
- Malayalam
- Marathi
- Mongolian
- Nepali
- Norwegian Bokmal
- Norwegian Nynorsk
- Oriya
- Persian
- Polish
- Portuguese
- Punjabi
- Romanian
- Russian
- Serbian
- Sindhi
- Sinhala
- Slovak
- Slovenian
- Spanish
- Swahili
- Swedish
- Tagalog
- Tamil
- Telugu
- Thai
- Turkish
- Ukrainian
- Urdu
- Uzbek
- Vietnamese
- Zulu
Variants:
- Arabic (MSA)
- Arabic (Gulf)
- Arabic (Najdi)
- Chinese (Simplified)
- Chinese (Traditional)
- Dutch (Netherlands)
- Dutch (Belgium)
- English (US)
- English (UK)
- English (India)
- Finnish (Standard)
- Finnish (Colloquial)
- French (France)
- French (Canada)
- French (Switzerland)
- German (Germany)
- German (Switzerland)
- Italian (Italy)
- Italian (Switzerland)
- Portuguese (Portugal)
- Portuguese (Brazil)
- Spanish (Spain)
- Spanish (North America)
- Spanish (Central America)
- Spanish (Andes)
- Spanish (Southern Cone)
Lemmatization and Stemming
Sometimes, lemmatization and stemming are used as interchangeable terms; however, there are important differences. A stemmer heuristically strips affixes and often produces truncated strings that are not real words, while a lemmatizer uses morphological analysis and a lexicon to return the actual dictionary form of each word.
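The contrast can be seen in a toy example (illustrative only; neither function reflects Bitext's implementation):

```python
# A crude suffix stemmer vs. a lexicon-based lemmatizer on the same words.
def crude_stem(word):
    # Chop common suffixes with no dictionary check.
    for suf in ("ies", "ing", "ed", "s"):
        if word.endswith(suf):
            return word[: -len(suf)]
    return word

# Toy lemma lexicon (hypothetical, for illustration).
LEMMAS = {"studies": "study", "better": "good", "spoke": "speak"}

def lemmatize(word):
    return LEMMAS.get(word, word)

print(crude_stem("studies"))  # 'stud'   -- not a real word
print(lemmatize("studies"))   # 'study'  -- a dictionary form
print(crude_stem("better"))   # 'better' -- stemmer misses irregular forms
print(lemmatize("better"))    # 'good'
```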
Demos
We have both API and on-premise demo versions of the Bitext Lemmatizer. Let us know about your needs (languages, OS/platform, programming language…) and we will compile a customized evaluation version for you.
Technical Specifications
Software
The Lemmatization service is powered by Bitext’s Lexical Analyzer, which uses a proprietary implementation of Finite State Automata (FSA) that allows for high performance and high levels of compression. Compression rates can reach up to 1:100 (100 MB of raw data can be compressed into 1MB) for more complex languages like Finnish or Turkish.
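As a rough illustration of the idea (not Bitext's engine), a finite-state lexicon stores word forms as paths through shared states. The sketch below uses a plain trie, which shares prefixes only; a full minimized FSA additionally shares suffixes, which is where the large compression gains for highly inflected languages come from:

```python
# Minimal finite-state-style lexicon lookup using a dict-based trie.
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

# Inflected forms of Finnish "talo" (house) share the stem as a common path.
trie = build_trie(["talo", "talon", "taloa", "taloon"])
print(contains(trie, "talon"))  # True
print(contains(trie, "talot"))  # False
```

In a production engine the states would also carry lemma and feature information, so a single traversal performs the full lemma lookup.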
How can you use third-party software and still be on the safe side? We can provide source code upon request under different arrangements, such as escrow.
Footprint
Lexical Data are encoded in a compressed format that allows direct lookups without decompression. For example, the run-time version for English uses 1MB for 130,000 forms; for Spanish, 3,000,000 forms take up 2MB; and for Finnish, 32,000,000 forms take up 3MB, in all cases including full morphosyntactic information.
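A quick back-of-the-envelope check of these figures shows why highly inflected languages compress so well: their many forms share structure, so the average cost per form drops sharply (assuming 1MB = 10^6 bytes):

```python
# Average compressed bytes per inflected form, from the footprint figures above.
cases = {
    "English": (1_000_000, 130_000),      # 1MB, 130,000 forms
    "Spanish": (2_000_000, 3_000_000),    # 2MB, 3,000,000 forms
    "Finnish": (3_000_000, 32_000_000),   # 3MB, 32,000,000 forms
}
for lang, (size_bytes, forms) in cases.items():
    print(f"{lang}: {size_bytes / forms:.2f} bytes per form")
```

English works out to roughly 8 bytes per form, while Finnish needs only about a tenth of a byte per form.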
Throughput
On an 18-core Intel® Xeon® W-2295 @ 3.00GHz with 128GB of RAM and a 512 GB SSD, the Lexical Analyzer can process:
- ~60,000,000 lookups per second, or
- ~750MB of text per second
Scalability
The software is provided as a thread-safe library. It is completely self-contained, with minimal OS dependencies (just the standard C libraries) and no need for complex installation procedures.
OS Independence
The Lexical Analyzer is written in platform-independent C. As a result, it can be run on any OS that can compile C: Windows, Linux, Solaris… It can also be run on mobile devices, given their current processing power.
Complementary tools: decompounding and segmentation
The Bitext Lemmatizer is fully integrated with additional tools that some languages require. A lemmatizer usually processes words, rather than compounds or parts of words; as a result, in languages with productive compounding or unsegmented writing, running a lemmatizer without preprocessing would yield very poor results.
There are two main tools that solve these problems:
- Decompounding Tool: some languages like German, Korean, Dutch, Norwegian or Swedish can create new tokens/strings by joining words, for example the German word “Diskussionsthemen”. In these languages, decompounding is a necessary step before lemmatizing; otherwise, the lemmatizer’s error rate will be far too high. In our example, “Diskussionsthemen” would be wrongly lemmatized unless the compound is first split into the nouns “Diskussion” and “Themen” (whose lemma is “Thema”).
- Word Segmentation Tool: some languages like Chinese, Japanese, Vietnamese or Thai do not separate words with spaces, as Romance and Germanic languages like Spanish, French or English do. The lemmatizer therefore needs to identify word boundaries and split the text into words first (the details of segmentation differ across these languages). For example, in Chinese, the sentence “把音量调低一点” will be split as 把 | 音量 | 调低 | 一点.
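The decompounding step described above can be sketched with a greedy dictionary-based splitter (toy vocabulary and linking elements; not Bitext's decompounder). German compounds often insert a linking element such as “s” (the Fugen-s), as in Diskussion+s+themen:

```python
# Toy vocabulary of known German nouns (lowercased).
VOCAB = {"diskussion", "themen"}

def decompound(word):
    """Return a list of known parts, or None if no full split is found."""
    w = word.lower()
    if w in VOCAB:
        return [w]
    for i in range(2, len(w)):
        head = w[:i]
        if head not in VOCAB:
            continue
        for link in ("", "s", "es", "n", "en"):  # common German linking elements
            tail = w[i:]
            if tail.startswith(link):
                rest = decompound(tail[len(link):])
                if rest:
                    return [head] + rest
    return None

print(decompound("Diskussionsthemen"))  # ['diskussion', 'themen']
```

Each recovered part can then be passed to the lemmatizer, which maps “themen” to its lemma “thema”.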
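The segmentation step can likewise be sketched with forward maximum matching, a classic baseline algorithm (toy dictionary covering only the example sentence; not Bitext's segmenter):

```python
# Toy word dictionary for the example sentence; longest entry is 2 characters.
DICT = {"把", "音量", "调低", "一点"}
MAX_LEN = 2

def segment(text):
    """Greedily match the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(MAX_LEN, 0, -1):  # try longest match first
            piece = text[i : i + n]
            if piece in DICT or n == 1:  # fall back to a single character
                tokens.append(piece)
                i += n
                break
    return tokens

print(" | ".join(segment("把音量调低一点")))  # 把 | 音量 | 调低 | 一点
```

Production segmenters use much larger dictionaries plus statistical or neural disambiguation, but the greedy baseline illustrates why the dictionary and the lemmatizer's lexicon need to agree on what counts as a word.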
Use Case
Bitext provides MarkLogic customers with the leading-edge benefits of Bitext’s Deep Linguistics Analysis Platform.
Benchmark
The report presents a comparison among NLTK, Stanford and Bitext. Check out the results now!
Demo
Do you have questions?
Schedule a conversation with one of our experts to find out which NLP solution works best for you.
Whitepaper
Learn how lemmatization and POS tagging can facilitate Machine Learning projects, using less data and less time.
Benchmark. We have run a benchmark with the most popular enterprise lemmatizers in the market: NLTK, Stanford, TwinWord, CST, Spacy and Simplemma (please contact us if you know of other lemmatizers that should be included). The benchmark focuses on three main points:
- Linguistic accuracy in identifying the proper lemma (for now with English).
- Processing speed: how many millions of words per second and with what hardware resources.
- Customization and maintenance tasks.
You can download the benchmark here