Bitext Methodology
With our personalized consulting service, we develop your bot with Bitext's expertise, supporting your company through the launch and the full lifecycle of the chatbot.
Access to Our Repositories
You can access our GitHub repository and Hugging Face dataset.
How to make your bot data work at 90% accuracy
Value Proposition
- 100% custom-made
- Chatbot platform independence: Lex, Luis, Dialogflow…
- Multilingual, 14 languages available.
- Setup, scalability and adjustability 100% guaranteed
- QA & improvement service included across the full lifecycle of the bot
- Guaranteed accuracy 90%
- Bootstrapping: Reduce time to market
- Reduce customer service costs from day 1
- Training data without privacy problems or manual errors
- Create a custom bot from scratch or improve your current bot.
How we do it
Bitext Methodology
For different languages
🏳 English 🏳 Spanish 🏳 German 🏳 French 🏳 Italian 🏳 Dutch 🏳 Portuguese
🏳 Danish 🏳 Swedish 🏳 Polish 🏳 Turkish 🏳 Korean 🏳 Chinese (under preparation) 🏳 Japanese (under preparation)
And language variants
🏳 Spanish from Mexico, Argentina and Colombia
🏳 German from Germany, Switzerland and Austria
🏳 French from France, Belgium and Switzerland
Linguistic features included in our datasets
The dataset contains annotations for all relevant linguistic phenomena that can be customized to adapt bot training to different user language profiles. Some of the most relevant annotations are:
Lexical variation:
- M – Morphological variation: inflectional and derivational
“is my SIM card active”
“is my SIM card activated”
- L – Semantic variations: synonyms, use of hyphens, compounding…
“what’s my billing date”
“what’s my anniversary date”
Syntactic structure variation:
- B – Basic syntactic structure:
“activate my SIM card”
“I need to activate my SIM card”
- I – Interrogative structure
“can you activate my SIM card”
“how do I activate my SIM card”
- C – Coordinated syntactic structure
“I have a new SIM card, what do I need to do to activate it?”
- D – Indirect speech
“ask my agent to activate my SIM card”
Language register variations:
- P – Politeness variation
“could you help me activate my SIM card, please?”
- Q – Colloquial variation
“can u activ8 my SIM?”
- R – Respect structures – Language-dependent variations
English: “may” vs “can…”
French: “tu” vs “vous…”
Spanish: “tú” vs “usted…”
- W – Offensive language
“I want to talk to a f*cking agent”
Stylistic variations:
- K – Keyword mode
“activate SIM”
“new SIM”
- E – Use of abbreviations:
“I’m / I am interested in getting a new SIM”
- Z – Errors and Typos: spelling issues, wrong punctuation…
“how can i activaet my card”
- G – Regional variations
US English vs UK English: “truck” vs “lorry”
France French vs Canadian French: “tchatter” vs “clavarder”
- Y – Code switching
“activer ma SIM card”
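As a rough illustration of how these annotation codes can be used to adapt training data to a user language profile, the sketch below filters a handful of tagged utterances. The `Utterance` record, the intent names and the tag assignments given to each example are assumptions made for this sketch; they do not describe the actual file format of the Bitext datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    text: str
    intent: str          # hypothetical intent label
    tags: frozenset      # single-letter annotation codes, e.g. {"B", "P"}


def filter_by_profile(utterances, allowed):
    """Keep only utterances whose tags all belong to the allowed set."""
    allowed = frozenset(allowed)
    return [u for u in utterances if u.tags <= allowed]


# Example utterances from the list above; the tag sets are illustrative guesses.
SAMPLE = [
    Utterance("activate my SIM card", "activate_sim", frozenset("B")),
    Utterance("could you help me activate my SIM card, please?",
              "activate_sim", frozenset("BIP")),
    Utterance("can u activ8 my SIM?", "activate_sim", frozenset("IQ")),
    Utterance("I want to talk to a f*cking agent", "contact_agent", frozenset("BW")),
    Utterance("how can i activaet my card", "activate_sim", frozenset("IZ")),
]

# A formal profile: basic and interrogative structures plus politeness,
# with no colloquial (Q), offensive (W) or typo (Z) variants.
formal = filter_by_profile(SAMPLE, allowed="BIP")
```

With this profile only the first two sample utterances are kept; widening `allowed` to include Q, W or Z brings the colloquial, offensive or misspelled variants back in.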
Available on Demand
Any Language
Linguistic Resources
75+ Languages and Variants
Data Selection for Bot Training
From the hundreds of utterances Bitext can generate for a chatbot, a careful selection has to be made, because the number of utterances that common chatbot platforms can hold is small. The selection is based on the following criteria:
- The quantitative limitations of the NLU engine of chatbot platforms
- The fact that some utterances can make one intent overlap with another one
- The balance needed between the number of total utterances and the number of intents, also depending on the platform
- A careful qualitative profiling of the language expected to be used by the users of the chatbot. This includes: language register (more colloquial or formal language, offensive language expected or not, spelling errors and mistakes…) and the expected users’ region (UK/US English, Spain/Mexico Spanish…)
In summary, Bitext selects the most appropriate utterances (and intents) to best adapt a generic NLU engine (like a chatbot) to a specific language, a vertical and a user profile.
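Two of the criteria above, the per-intent size limit and the overlap between intents, can be sketched in a few lines. This is a simplified illustration, not Bitext's selection procedure: the function name, the `max_per_intent` parameter and the intent labels are hypothetical, and real platform limits vary.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so near-identical texts compare equal."""
    return " ".join(text.lower().split())


def select_utterances(candidates, max_per_intent):
    """Select utterances per intent from a priority-ordered list of (text, intent).

    Texts that appear under more than one intent are dropped, since they
    would make the intents overlap. Each intent keeps at most
    max_per_intent utterances, in the order given.
    """
    owner = {}
    ambiguous = set()
    for text, intent in candidates:
        key = normalize(text)
        if key in owner and owner[key] != intent:
            ambiguous.add(key)
        owner.setdefault(key, intent)

    selected = defaultdict(list)
    seen = set()
    for text, intent in candidates:
        key = normalize(text)
        if key in ambiguous or key in seen:
            continue
        if len(selected[intent]) < max_per_intent:
            selected[intent].append(text)
            seen.add(key)
    return dict(selected)


candidates = [
    ("activate my SIM card", "activate_sim"),
    ("I need to activate my SIM card", "activate_sim"),
    ("how do I activate my SIM card", "activate_sim"),
    ("new SIM", "activate_sim"),
    ("new SIM", "order_sim"),
    ("I am interested in getting a new SIM", "order_sim"),
]
result = select_utterances(candidates, max_per_intent=2)
```

Here "new SIM" is discarded because it is listed under two intents, and the third `activate_sim` utterance is dropped by the size cap.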
Data Customization for any NLU Engine
Bitext has created a new paradigm for taking a general-purpose NLU engine and adapting it to a specific vertical or industry.
This paradigm relies on a knowledge-transfer methodology that models the linguistic knowledge that defines a vertical and transfers it to a general NLU engine. The transfer covers several areas: the language of the vertical (via dictionaries, grammars and ontologies), its contents (taken from public and private data sources, such as FAQs from its main companies or logs obtained in previous experiences), and the linguistic profile of the expected users of the NLU engine (their region, their register, and the peculiarities of their language). We then model that information for the specific NLU platform that is going to be used (Amazon Lex, MS LUIS, Dialogflow…).
All the knowledge used to perform this transfer is contained in our training and evaluation datasets.
How to generate your bot data in one day
Bitext accelerates the deployment of any chatbot or virtual assistant by bootstrapping the training process with prebuilt chatbots for a wide range of verticals, so customers can have a multilingual system up and running in a few hours.
Building effective conversational agents requires large amounts of training data. Producing this data manually is an expensive, time-consuming and error-prone process that does not scale. Platform providers usually do not have the infrastructure required to tackle the wide range of verticals, languages and locales that their large clients need to handle, while clients rarely have the expertise necessary to collect and annotate their data. Outsourcing the task is complicated by the fact that the data often contains sensitive information that cannot be exposed to third parties.
Bitext offers an easy solution to bootstrap new bots or boost existing ones in minutes, providing a high level of accuracy out-of-the-box and replacing the need for weeks or months of manual bot development.
Verticals
Each Prebuilt Chatbot contains the 20 to 40 most frequent intents for the corresponding vertical, designed to give you the best performance out-of-the-box.
Our Prebuilt Chatbots are trained to deal with language register variations including polite/formal, colloquial and offensive language. We have profiled the language register use in user queries from a wide range of vertical bots, and we use this information to generate training data with a similar profile, ensuring maximum linguistic coverage.
We also introduce noise into the training data, including spelling mistakes, run-on words and missing punctuation. This makes the data even more realistic, which makes our Prebuilt Chatbots more robust to the type of “noisy” input that is common in real life.
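The three kinds of noise mentioned above can be sketched as follows. This is a minimal illustration under assumed rules (adjacent-character swaps for spelling mistakes, dropped spaces for run-on words, and removal of ASCII punctuation); the function names and the probability `p` are hypothetical and do not describe Bitext's actual noise model.

```python
import random
import string


def swap_adjacent(word, i):
    """Simulate a typo by swapping the characters at positions i and i + 1."""
    if not 0 <= i < len(word) - 1:
        return word
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def join_words(words, i):
    """Simulate a run-on by merging words i and i + 1 of a list of words."""
    if not 0 <= i < len(words) - 1:
        return list(words)
    return words[:i] + [words[i] + words[i + 1]] + words[i + 2:]


def strip_punctuation(text):
    """Remove all ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def add_noise(text, rng, p=0.3):
    """Apply each kind of noise independently with probability p."""
    if rng.random() < p:
        text = strip_punctuation(text)
    words = text.split()
    if len(words) > 1 and rng.random() < p:
        words = join_words(words, rng.randrange(len(words) - 1))
    if words and rng.random() < p:
        w = rng.randrange(len(words))
        if len(words[w]) > 1:
            words[w] = swap_adjacent(words[w], rng.randrange(len(words[w]) - 1))
    return " ".join(words)


# A seeded generator keeps the noisy data reproducible between runs.
noisy = add_noise("how can I activate my card?", random.Random(42))
```

Passing a seeded `random.Random` instance rather than using the module-level functions means the same seed always regenerates the same noisy dataset.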
Methodology
We employ a scalable and data-driven linguist-in-the-loop methodology. We begin by collecting large volumes of text from domain-specific public data sources such as FAQs, knowledge bases and technical documentation. We then apply our Deep Parsing technology to automatically extract the most frequent actions and objects that appear in those texts. This results in a knowledge graph that captures the semantic structure of the vertical, which is then curated by computational linguists to identify synonyms and to ensure consistency and completeness. Actions are grouped into categories and intents, and the intent structure is then validated against FAQs and with domain experts.
Finally, the linguistic structure of each intent is defined, together with the applicable frame types which allow our Natural Language Generation (NLG) technology to generate utterances which are predictable and consistent semantic variations of each intent request. This approach provides a measurable improvement to NLU performance: benchmarks comparing a manual baseline with our synthetic data show a >30% increase in intent detection and slot filling accuracy across multiple platforms.
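To give a feel for frame-based generation, the sketch below expands a few frames into utterances for one intent. It is a deliberately simplified stand-in using plain string templates, not Bitext's NLG technology; the `FRAMES` table, the `generate` function and the intent name are assumptions, and the frame texts are taken from the example utterances listed earlier.

```python
from itertools import product

# Hypothetical frames keyed by annotation code (B = basic, I = interrogative,
# P = politeness).
FRAMES = {
    "B": ["{action} my {object}", "I need to {action} my {object}"],
    "I": ["can you {action} my {object}", "how do I {action} my {object}"],
    "P": ["could you help me {action} my {object}, please?"],
}


def generate(intent, actions, objects, frames=FRAMES):
    """Return one record per (frame, action, object) combination."""
    rows = []
    for tag, templates in frames.items():
        for template, action, obj in product(templates, actions, objects):
            rows.append({
                "intent": intent,
                "tag": tag,
                "text": template.format(action=action, object=obj),
            })
    return rows


rows = generate("activate_sim", actions=["activate"], objects=["SIM card", "SIM"])
```

Because every record carries its intent and tag, adding a frame or a synonym and rerunning `generate` regenerates the whole set consistently, which is the property the regeneration workflow relies on.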
Our methodology and tools allow us to easily customize and adapt the datasets to changing needs, including new intents, corporate terminology, language registers, new regions, markets and languages. With each change, the data is automatically regenerated, allowing for continuous improvement in a scalable fashion.
MADRID, SPAIN
Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain
SAN FRANCISCO, USA
541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA