Lexical Resources
Bitext Lexical Data Resources are the most comprehensive and consistent set of language resources in the world, with support for more than 100 languages and dialects. This proprietary data has been developed to meet the highest quality standards in the field of computational linguistics.
77 Languages | 26 Variants
Bitext data is used in production by some of the world’s largest and most successful software companies, including 3 out of the top 5 NASDAQ companies.
Bitext lexical data contains full morphological descriptions of 77 languages and 26 regional language variants, with up to 18 different tags or attributes, such as:
- lemma, form, POS, voice, tense, aspect, person, gender…
Bitext lexical data uses a consistent set of descriptive tags for all languages, regardless of their morphological typology: fusional, agglutinative… As a result, the same source code can handle every language, which keeps applications consistent across languages and simplifies extending any application to new ones.
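To illustrate what a shared tag set makes possible, here is a minimal sketch in Python. The tab-separated layout, the `LexicalEntry` class, the `parse_line` helper and the tag names are all assumptions made up for this example; they are not the actual Bitext delivery format. The point is only that one parser and one query can serve a fusional language (Spanish) and an agglutinative one (Turkish) alike.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """One inflected form with its tags (hypothetical schema)."""
    form: str
    lemma: str
    pos: str
    features: dict = field(default_factory=dict)


def parse_line(line: str) -> LexicalEntry:
    """Parse 'form<TAB>lemma<TAB>POS<TAB>key=value...' (assumed layout)."""
    form, lemma, pos, *pairs = line.rstrip("\n").split("\t")
    features = dict(pair.split("=", 1) for pair in pairs)
    return LexicalEntry(form, lemma, pos, features)


# Hand-written sample lines, not taken from Bitext data.
SAMPLE = [
    "cantamos\tcantar\tVERB\tperson=1\tnumber=plural",   # Spanish, fusional
    "evlerde\tev\tNOUN\tnumber=plural\tcase=locative",   # Turkish, agglutinative
]

entries = [parse_line(line) for line in SAMPLE]

# The same query works for both languages because the tag names are shared.
plural_forms = [e.form for e in entries if e.features.get("number") == "plural"]
print(plural_forms)
```

Because the feature names do not change from one language to the next, adding a language in this sketch means loading another file rather than writing another parser.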
This morphological information is enriched with:
- related phenomena like contractions or clitic pronouns
- complementary entity dictionaries, categorized in 16 types: places, names…
- frequency information in large representative language corpora
- other features like offensiveness or formality information
- Companies that already own data and software for mainstream languages (English, Spanish…) can expand language coverage of existing applications to cover 77 languages and 26 regional language variants.
- Companies that don’t have significant assets in the NLP space can quickly build a suite of Natural Language Processing (NLP) components (tokenizers, lemmatizers, POS taggers, phrase extractors, parsers, etc.), since Bitext Lexical Data can be delivered with source code to perform a full analysis from raw text to full parsing.
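As a rough idea of how lexical data can drive one of the components mentioned above, the following sketch builds a dictionary-based lemmatizer in Python. The `FORM_TO_LEMMA` table, the regex tokenizer and the function names are assumptions for illustration only; the source code delivered with Bitext Lexical Data may work quite differently.

```python
import re

# Hypothetical form-to-lemma table; a real one would be loaded from lexical data.
FORM_TO_LEMMA = {
    "mice": "mouse",
    "were": "be",
    "running": "run",
}

# Simplistic tokenizer: runs of word characters. Real tokenizers handle
# contractions, clitics and punctuation far more carefully.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text: str) -> list:
    return TOKEN_RE.findall(text)


def lemmatize(text: str, lexicon: dict) -> list:
    """Return one lemma per token; unknown forms fall back to the lowercased token."""
    return [lexicon.get(tok.lower(), tok.lower()) for tok in tokenize(text)]


print(lemmatize("The mice were running", FORM_TO_LEMMA))
# ['the', 'mouse', 'be', 'run']
```

A lookup like this ignores ambiguity (a form that maps to several lemmas), which is where POS tags and the other attributes would come in.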
We currently offer 77 languages and 26 variants (and we regularly add support for additional languages as we develop new resources):
Languages
- Afrikaans
- Albanian
- Amharic
- Arabic
- Armenian
- Assamese
- Azeri
- Basque
- Belarusian
- Bengali
- Bulgarian
- Burmese
- Catalan
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- English
- Esperanto
- Estonian
- Finnish
- French
- Galician
- Georgian
- German
- Greek
- Gujarati
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Irish Gaelic
- Italian
- Japanese
- Kannada
- Kazakh
- Khmer
- Korean
- Kyrgyz
- Lao
- Latvian
- Lithuanian
- Macedonian
- Malay
- Malayalam
- Marathi
- Mongolian
- Nepali
- Norwegian Bokmål
- Norwegian Nynorsk
- Oriya
- Persian
- Polish
- Portuguese
- Punjabi
- Romanian
- Russian
- Serbian
- Sindhi
- Sinhala
- Slovak
- Slovenian
- Spanish
- Swahili
- Swedish
- Tagalog
- Tamil
- Telugu
- Thai
- Turkish
- Ukrainian
- Urdu
- Uzbek
- Vietnamese
- Zulu
Variants
- Arabic (MSA)
- Arabic (Gulf)
- Arabic (Najdi)
- Chinese (Simplified)
- Chinese (Traditional)
- Dutch (Netherlands)
- Dutch (Belgium)
- English (US)
- English (UK)
- English (India)
- Finnish (Standard)
- Finnish (Colloquial)
- French (France)
- French (Canada)
- French (Switzerland)
- German (Germany)
- German (Switzerland)
- Italian (Italy)
- Italian (Switzerland)
- Portuguese (Portugal)
- Portuguese (Brazil)
- Spanish (Spain)
- Spanish (North America)
- Spanish (Central America)
- Spanish (Andes)
- Spanish (Southern Cone)
Data Samples & Language Specifications
Kazakh
Armenian
Slovak
Mongolian
Russian
Portuguese
Bitext’s Lexical Data Resources
Download a full description of the features available in each language
Features
- Lemma: the canonical (dictionary) form of the inflected word.
- POS: the part of speech, such as noun, verb or adjective.
- Voice: whether a verb form is active or passive.
- Tense: when the action takes place, such as past, present or future.
- Aspect: whether the action is complete, ongoing, habitual, etc.
- Mood: the modality of the verb form: indicative, subjunctive, imperative, etc.
- Person: whether a verb or pronoun refers to the first, second or third person.
- Number: whether the form is singular, dual or plural.
- Gender: the grammatical gender of noun, verb or adjective forms: masculine, feminine, neuter, etc.
- Case: the function that the noun or adjective plays within a sentence.
- Degree: whether an adjective is in its positive, comparative or superlative form.
- Definiteness: whether a noun or adjective refers to a specific or a general concept.
- Polarity: whether a verb, adjective or noun is in a negative form.
- Contractions: shortened forms of a word or group of words.
- Pronominal Clitics: clitic pronouns are identified and tagged.
- Formality: indicates the social status of the speaker in relation to the context.
- Frequency: the relative frequency of the form in a large general-purpose corpus.
- Named Entities: pre-defined entities are tagged as person names, places, organizations, etc.
- Offensive: whether the form might be considered offensive in certain contexts.
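Attributes such as Frequency and Offensive lend themselves to simple filtering. The Python sketch below shows one possible use: keeping only common, non-offensive forms, for example to build an autocomplete vocabulary. The `FormRecord` type, the 0 to 1 frequency scale, the threshold and the sample values are invented for this example and do not reflect how Bitext encodes these attributes.

```python
from typing import NamedTuple


class FormRecord(NamedTuple):
    """A form with two of the attributes listed above (hypothetical encoding)."""
    form: str
    frequency: float   # assumed scale: 0.0 (rare) to 1.0 (very common)
    offensive: bool


def safe_common_forms(records, min_frequency):
    """Keep forms that are not flagged offensive and meet the frequency threshold."""
    return [
        r.form
        for r in records
        if not r.offensive and r.frequency >= min_frequency
    ]


# Placeholder records with made-up values.
records = [
    FormRecord("house", 0.8, False),
    FormRecord("abode", 0.1, False),
    FormRecord("badword", 0.5, True),
]

print(safe_common_forms(records, min_frequency=0.3))
# ['house']
```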
MADRID, SPAIN
Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain
SAN FRANCISCO, USA
541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA