Lexical Resources

Bitext Lexical Data Resources are the most comprehensive and consistent set of language resources in the world, with support for more than 100 languages and dialects. This proprietary data has been developed to meet the highest quality standards in the field of computational linguistics.

77 languages and 26 variants

Bitext data is used in production by some of the world’s largest and most successful software companies, including 3 out of the top 5 NASDAQ companies.

Bitext lexical data contains full morphological descriptions of 77 languages and 26 regional language variants, with attributes such as lemma, form, POS, voice, tense, aspect, person and gender, for up to 18 different tags or attributes.

Bitext lexical data uses a consistent set of descriptive tags for all languages, regardless of their morphological typology (fusional, agglutinative…). Applications can therefore be managed across languages with the same source code, which simplifies extending any application to new languages.
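
As a minimal sketch of what a shared tag set makes possible, the Python below runs one filter function unchanged over a fusional language (Spanish) and an agglutinative one (Finnish). The `LexEntry` class, its field names and the tag values are hypothetical, chosen for illustration only; they are not the actual Bitext delivery format.

```python
from dataclasses import dataclass, field


@dataclass
class LexEntry:
    """One inflected form. Hypothetical layout, not the real Bitext schema."""
    form: str
    lemma: str
    pos: str
    tags: dict = field(default_factory=dict)  # e.g. {"number": "plural"}


def plural_nouns(entries):
    """Return the plural noun forms; the logic is the same for any language."""
    return [
        e.form
        for e in entries
        if e.pos == "noun" and e.tags.get("number") == "plural"
    ]


# Illustrative entries: Spanish "casa" (house) and Finnish "talo" (house).
spanish = [
    LexEntry("casa", "casa", "noun", {"number": "singular", "gender": "feminine"}),
    LexEntry("casas", "casa", "noun", {"number": "plural", "gender": "feminine"}),
]
finnish = [
    LexEntry("talossa", "talo", "noun", {"number": "singular", "case": "inessive"}),
    LexEntry("taloissa", "talo", "noun", {"number": "plural", "case": "inessive"}),
]

print(plural_nouns(spanish))
print(plural_nouns(finnish))
```

Because both lexicons use the same tag names and values, adding a language here means loading more entries rather than writing new code.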

This morphological information is enriched with:

- related phenomena like contractions or clitic pronouns
- complementary entity dictionaries, categorized in 16 types: places, names…
- frequency information in large representative language corpora
- other features like offensiveness or formality information
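
One way such enrichment could be used, sketched in Python under stated assumptions: word suggestions are ranked by form frequency, and forms flagged as offensive are left out by default. The `FormInfo` record, the `suggest` function, the frequency values and the placeholder form `hoflagged` are all invented for illustration and do not come from Bitext data.

```python
from typing import NamedTuple


class FormInfo(NamedTuple):
    """Hypothetical record: a form with its frequency and an offensive flag."""
    form: str
    frequency: float  # made-up illustrative value, not real corpus data
    offensive: bool = False


LEXICON = [
    FormInfo("house", 120.0),
    FormInfo("hotel", 95.0),
    FormInfo("hollow", 12.0),
    FormInfo("hoflagged", 300.0, offensive=True),  # placeholder flagged form
]


def suggest(prefix, lexicon, limit=3, allow_offensive=False):
    """Return up to `limit` forms starting with `prefix`, most frequent first."""
    prefix = prefix.lower()
    candidates = [
        e for e in lexicon
        if e.form.startswith(prefix) and (allow_offensive or not e.offensive)
    ]
    candidates.sort(key=lambda e: e.frequency, reverse=True)
    return [e.form for e in candidates[:limit]]


print(suggest("ho", LEXICON, limit=2))
```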

Companies that already own data and software for mainstream languages (English, Spanish…) can expand the language coverage of existing applications to 77 languages and 26 regional language variants.

Companies that don’t have significant assets in the NLP space can quickly build a suite of Natural Language Processing (NLP) components (tokenizers, lemmatizers, POS taggers, phrase extractors, parsers, etc.), since Bitext Lexical Data can be delivered with source code to perform a full analysis from raw text to full parsing.
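
To illustrate how a form-level lexicon can drive the simplest of these components, here is a minimal Python sketch of a tokenizer feeding a dictionary-based lemmatizer and POS tagger. It is not the source code Bitext delivers; the `LEXICON` contents, the tag labels and the function names are assumptions made for this example.

```python
import re

# Words (runs of word characters) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Hypothetical form -> (lemma, POS) table standing in for real lexical data.
LEXICON = {
    "the": ("the", "DET"),
    "cats": ("cat", "NOUN"),
    "were": ("be", "VERB"),
    "sleeping": ("sleep", "VERB"),
}


def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def analyze(text, lexicon):
    """Return (token, lemma, POS) triples using plain dictionary lookup."""
    result = []
    for token in tokenize(text):
        if not re.fullmatch(r"\w+", token):
            result.append((token, token, "PUNCT"))
            continue
        # Unknown words fall back to their lowercased form with an UNK tag.
        lemma, pos = lexicon.get(token.lower(), (token.lower(), "UNK"))
        result.append((token, lemma, pos))
    return result


for triple in analyze("The cats were sleeping.", LEXICON):
    print(triple)
```

A lookup like this ignores ambiguity (one form with several analyses); a fuller pipeline would keep every analysis per form and disambiguate in context.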

We currently offer 77 languages and 26 variants (and we regularly add support for additional languages as we develop new resources):

Languages

- Afrikaans
- Albanian
- Amharic
- Arabic
- Armenian
- Assamese
- Azeri
- Basque
- Belarusian
- Bengali
- Bulgarian
- Burmese
- Catalan
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- English
- Esperanto
- Estonian
- Finnish
- French
- Galician
- Georgian
- German
- Greek
- Gujarati
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Irish Gaelic
- Italian
- Japanese
- Kannada
- Kazakh
- Khmer
- Korean
- Kyrgyz
- Lao
- Latvian
- Lithuanian
- Macedonian
- Malay
- Malayalam
- Marathi
- Mongolian
- Nepali
- Norwegian Bokmål
- Norwegian Nynorsk
- Oriya
- Persian
- Polish
- Portuguese
- Punjabi
- Romanian
- Russian
- Serbian
- Sindhi
- Sinhala
- Slovak
- Slovenian
- Spanish
- Swahili
- Swedish
- Tagalog
- Tamil
- Telugu
- Thai
- Turkish
- Ukrainian
- Urdu
- Uzbek
- Vietnamese
- Zulu

Variants

- Arabic (MSA)
- Arabic (Gulf)
- Arabic (Najdi)
- Chinese (Simplified)
- Chinese (Traditional)
- Dutch (Netherlands)
- Dutch (Belgium)
- English (US)
- English (UK)
- English (India)
- Finnish (Standard)
- Finnish (Colloquial)
- French (France)
- French (Canada)
- French (Switzerland)
- German (Germany)
- German (Switzerland)
- Italian (Italy)
- Italian (Switzerland)
- Portuguese (Portugal)
- Portuguese (Brazil)
- Spanish (Spain)
- Spanish (North America)
- Spanish (Central America)
- Spanish (Andes)
- Spanish (Southern Cone)

Features

- Lemma: the canonical form of the inflected word.
- POS: the part of speech (noun, verb, adjective, etc.).
- Voice: whether the verb form is active or passive.
- Tense: when the action takes place (past, present, future, etc.).
- Aspect: whether the action is complete, ongoing, habitual, etc.
- Mood: the modality of the verb form (indicative, subjunctive, imperative, etc.).
- Person: whether the verb or pronoun refers to the first, second or third person.
- Number: singular, dual or plural.
- Gender: the gender of the noun, verb or adjective form (masculine, feminine, neuter, etc.).
- Case: the function that the noun or adjective plays within a sentence.
- Degree: whether an adjective is in its positive, comparative or superlative form.
- Definiteness: whether a noun or adjective refers to a specific or a general concept.
- Polarity: whether a verb, adjective or noun is in a negative form.
- Contractions: shortened forms of a word or group of words.
- Pronominal Clitics: clitic pronouns are identified and tagged.
- Formality: indicates the social status of the speaker in relation to the context.
- Frequency: relative frequency of the form based on a large general-purpose corpus.
- Named Entities: pre-defined entities are tagged as person names, places, organizations, etc.
- Offensive: indicates whether the form might be considered offensive in certain contexts.

MADRID, SPAIN

Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain

SAN FRANCISCO, USA

541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA