Linguistic Services
Bitext provides core tools that automatically pre-annotate custom corpora and datasets. The tools annotate text at the word level (lemmatization/stemming, inflection, etc.) and at the sentence level (Topic-Based Sentiment Analysis, Categorization, Parsing, etc.). We provide:
Lexical Services (No Grammar)
Sentence Segmentation
- Splits text into sentences, according to language-specific punctuation rules.
- Available in all languages.
- Example: Hello! How are you doing? → Hello! | How are you doing?
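To make the idea concrete, here is a minimal Python sketch of punctuation-based sentence splitting. It is not Bitext's implementation: the function name `split_sentences` and the single boundary rule are assumptions for illustration, and a real segmenter also handles abbreviations, quotes, and language-specific punctuation.

```python
import re

# Naive rule (an assumption for this sketch): a sentence ends at '.', '!' or '?'
# when that mark is followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text: str) -> list[str]:
    """Split text into sentences using the naive boundary rule above."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]

print(" | ".join(split_sentences("Hello! How are you doing?")))
# prints: Hello! | How are you doing?
```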
Tokenization
- Splits a sentence into words, according to language-specific space and punctuation rules.
- Available in all languages (except Chinese, Japanese, Vietnamese, Thai…)
- Example: How are you doing? → How | are | you | doing | ?
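A rough Python sketch of space-and-punctuation tokenization follows. The `tokenize` helper and its regular expression are illustrative assumptions, not the Bitext tokenizer; language-specific rules (clitics, hyphenation, abbreviations) are deliberately left out.

```python
import re

# \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
_TOKEN = re.compile(r"\w+|[^\w\s]")

def tokenize(sentence: str) -> list[str]:
    """Return words and punctuation marks as separate tokens."""
    return _TOKEN.findall(sentence)

print(" | ".join(tokenize("How are you doing?")))
# prints: How | are | you | doing | ?
```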
Word Segmentation (No-space Tokenization)
- Splits text into words for languages that do not use spaces to separate them.
- Available in Chinese, Japanese, Vietnamese.
- Example: 把音量调低一点 → 把 | 音量 | 调低 | 一点
Decompounding
- Splits compound words/tokens into their individual component words.
- Available in German, Dutch, Norwegian, Swedish, Korean
- Example: Rindfleischetikettierung → Rind | Fleisch | Etikettierung
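One common way to approach decompounding is lexicon lookup with backtracking, sketched below in Python. The tiny `LEXICON` and the `decompound` function are hypothetical stand-ins for illustration only; a production system would use a full lexicon and handle linking elements (such as the German "s" between components), which this sketch ignores.

```python
from typing import Optional

# Toy lexicon (an assumption for this sketch); a real one is far larger.
LEXICON = {"rind", "fleisch", "etikett", "etikettierung"}

def decompound(word: str, lexicon: set = LEXICON) -> Optional[list]:
    """Return component words that cover the whole input, or None if no split exists."""
    lower = word.lower()
    if not lower:
        return []
    # Try the longest prefix first; backtrack if the remainder cannot be split.
    for end in range(len(lower), 0, -1):
        prefix = lower[:end]
        if prefix in lexicon:
            rest = decompound(lower[end:], lexicon)
            if rest is not None:
                return [prefix.capitalize()] + rest
    return None

print(decompound("Rindfleischetikettierung"))
# prints: ['Rind', 'Fleisch', 'Etikettierung']
```

Components are capitalized on output only because German nouns are capitalized; that choice would not suit every language.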
Lemmatization (Ambiguous)
- Returns the possible roots for a word form
- Available in most languages (except Chinese, Vietnamese, Thai and other languages without inflection)
- Example: running → run
POS Tagging (Ambiguous)
- Returns the possible parts of speech (and optionally other attributes) of a word
- Available in all languages
- Example: run → verb (infinitive), verb (1st person singular, present tense), noun (singular)
Inflection
- Returns all forms of a root word
- Available in most languages (except Chinese, Vietnamese, Thai, and other languages without inflection)
- Example: run → run, runs, ran, running
Language Identification
- Detects the language(s) used in each sentence of a longer input text
- Available in all languages
- Example: Oui! I love Paris → “Oui!” – French, “I love Paris” – English
Spell Checking
- Checks if a word is spelled correctly
- Available in all languages
- Example: excelent → incorrect
Spell Suggestions
- Suggests corrections for incorrectly spelled words
- Available in all languages
- Example: excelent → excellent
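Spell checking and spell suggestions can be illustrated together with Python's standard `difflib` module. This is only a sketch under stated assumptions: the four-word `WORDS` list, the helper names, and the 0.8 similarity cutoff are made up for the example and say nothing about how the Bitext services work internally.

```python
import difflib

# Toy word list (an assumption); a real checker uses a full lexicon per language.
WORDS = ["excellent", "excel", "except", "example"]

def is_correct(word: str) -> bool:
    """Spell checking: is the word in the word list?"""
    return word.lower() in WORDS

def suggest(word: str, n: int = 3) -> list[str]:
    """Spell suggestions: closest entries by difflib's similarity ratio."""
    return difflib.get_close_matches(word.lower(), WORDS, n=n, cutoff=0.8)

print(is_correct("excelent"))  # prints: False
print(suggest("excelent"))     # prints: ['excellent']
```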
Syntactic and Semantic Services (Grammar and Meaning)
Entity Extraction
- Detects proper names (like people and places) and other special text (like phone numbers and URLs)
- Available in Dutch, English, French, German, Italian, Portuguese, Spanish.
- Example: John lives in New York → “John” – person name, “New York” – place
Offensive Language Detection
- Detects offensive or vulgar expressions in text
- Available in all languages.
- Example: tell John to f*ck off → “f*ck off” – offensive
Anonymization
- Removes sensitive or personally identifiable information (PII) from text
- Available in Dutch, English, French, German, Italian, Portuguese, Spanish
- Example: My name is John and my account number is 1234567 → My name is XXXX and my account number is XXXX
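The masking step in the example above can be sketched in a few lines of Python. Everything here is an illustrative assumption: `KNOWN_NAMES` stands in for the output of an entity extractor, and the digit pattern is a crude proxy for account-number detection. It is not a description of the Bitext anonymizer.

```python
import re

# Hypothetical stand-in for names found by an entity extraction step.
KNOWN_NAMES = {"John"}

# Crude assumption: any run of four or more digits is treated as sensitive.
_DIGITS = re.compile(r"\b\d{4,}\b")

def anonymize(text: str, names: set = KNOWN_NAMES, mask: str = "XXXX") -> str:
    """Replace long digit runs and known names with a fixed mask."""
    text = _DIGITS.sub(mask, text)
    for name in names:
        text = re.sub(rf"\b{re.escape(name)}\b", mask, text)
    return text

print(anonymize("My name is John and my account number is 1234567"))
# prints: My name is XXXX and my account number is XXXX
```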
POS Tagging (Disambiguated)
- Returns the part of speech for each word in a sentence
- Available in English, Dutch, Danish, Czech, Catalan
- Example: John runs back home → “John” – proper noun, “runs” – verb, “back” – adverb, “home” – noun
Phrase Extraction
- Returns the constituents (like noun phrases and verb phrases) of a sentence
- Available in English, French, German, Dutch, Italian, Portuguese, Spanish, Catalan.
- Example: John’s sister was performing in the theatre → “John’s sister” – NP, “was performing” – VP, “in the theatre” – PP
Topic-Based Sentiment Analysis
- Returns the sentiment and corresponding topic of opinions in text
- Available in Catalan, Dutch, English, French, German, Italian, Portuguese, Spanish.
- Example: I hate my old phone → opinion: “hate” (negative), topic: “my old phone”
Categorization
- Returns the categories applicable to a text, based on pre-defined rules
- Available in Dutch, English, French, German, Italian, Portuguese, Spanish.
- Example: John is feeling great. → HAPPINESS [RULE: feel + great → HAPPINESS]
- Example: John was weeping like a willow. → SADNESS [RULE: weep + like + willow → SADNESS]
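The two rules shown above can be mimicked with a small Python sketch. The `LEMMAS` table, the `RULES` format, and the `categorize` function are assumptions made for illustration; in particular, this sketch treats a rule as an unordered set of lemmas, whereas the actual rule formalism may well be sensitive to word order.

```python
import re

# Toy lemma table (an assumption); a real system would call a lemmatizer.
LEMMAS = {"feeling": "feel", "weeping": "weep", "is": "be", "was": "be"}

# Each rule: the lemmas that must all be present, and the category they trigger.
RULES = [
    ({"feel", "great"}, "HAPPINESS"),
    ({"weep", "like", "willow"}, "SADNESS"),
]

def categorize(text: str) -> list[str]:
    """Return every category whose required lemmas all occur in the text."""
    tokens = re.findall(r"\w+", text.lower())
    lemmas = {LEMMAS.get(token, token) for token in tokens}
    return [category for required, category in RULES if required <= lemmas]

print(categorize("John is feeling great."))           # prints: ['HAPPINESS']
print(categorize("John was weeping like a willow."))  # prints: ['SADNESS']
```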
Parsing
- Produces a tree with the hierarchical constituent parts of a sentence (words, phrases, clauses, etc.)
- Available in Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, French, German, Hungarian, Italian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Ukrainian
Languages
- Afrikaans
- Albanian
- Amharic
- Arabic
- Armenian
- Assamese
- Azeri
- Basque
- Belarusian
- Bengali
- Bulgarian
- Burmese
- Catalan
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- English
- Esperanto
- Estonian
- Finnish
- French
- Galician
- Georgian
- German
- Greek
- Gujarati
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Irish Gaelic
- Italian
- Japanese
- Kannada
- Kazakh
- Khmer
- Korean
- Kyrgyz
- Lao
- Latvian
- Lithuanian
- Macedonian
- Malay
- Malayalam
- Marathi
- Mongolian
- Nepali
- Norwegian Bokmål
- Norwegian Nynorsk
- Oriya
- Persian
- Polish
- Portuguese
- Punjabi
- Romanian
- Russian
- Serbian
- Sindhi
- Sinhala
- Slovak
- Slovenian
- Spanish
- Swahili
- Swedish
- Tagalog
- Tamil
- Telugu
- Thai
- Turkish
- Ukrainian
- Urdu
- Uzbek
- Vietnamese
- Zulu
Data Samples & Language Specifications
Kazakh
Armenian
Slovak
Mongolian
Russian
Portuguese
Variants
- Arabic (MSA)
- Arabic (Gulf)
- Arabic (Najdi)
- Chinese (Simplified)
- Chinese (Traditional)
- Dutch (Netherlands)
- Dutch (Belgium)
- English (US)
- English (UK)
- English (India)
- Finnish (Standard)
- Finnish (Colloquial)
- French (France)
- French (Canada)
- French (Switzerland)
- German (Germany)
- German (Switzerland)
- Italian (Italy)
- Italian (Switzerland)
- Portuguese (Portugal)
- Portuguese (Brazil)
- Spanish (Spain)
- Spanish (North America)
- Spanish (Central America)
- Spanish (Andes)
- Spanish (Southern Cone)
Contact us for more information about our evaluation and training data
MADRID, SPAIN
Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain
SAN FRANCISCO, USA
541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA