NLG Technology to Generate Hybrid Datasets for LLM Fine-tuning
Our datasets are hybrid datasets because they combine the scale and volume of synthetic text generation with the quality of expert curation. These datasets are tagged with linguistic properties that motivate variation: colloquial/formal language, intentional spelling errors, different syntactic structures, etc.
The datasets are designed to fine-tune Large Language Models (LLMs) for conversational applications and, in particular, for customer support. Our datasets use a hybrid methodology that merges synthetic techniques and linguistic supervision to solve problems that are typical of text produced with generative AI like hallucination, bias, and PII.
Bitext Open-Source Dataset
We have shared a sample Hybrid Dataset to enable the AI community to evaluate and leverage it. Here are the main features of this sample dataset:
- Primary Objective:
The dataset is chiefly designed for training Large Language Models (LLMs) aimed at enhancing the efficiency of conversational applications, particularly in customer support. It addresses common issues associated with data produced through generative AI, such as hallucination, bias, and PII. This is achieved by utilizing a hybrid methodology that combines synthetic techniques with linguist supervision, facilitating the creation of smaller, easier-to-operate LLMs with higher accuracy. Importantly, our datasets comply with AWS (Amazon Web Services) and Apple policies for PII and data sharing, which make them ideal platforms, applications, and models (like AWS, Siri, or Lex). - Language Coverage:
Currently, the dataset covers English and Spanish, with some data generated in German. The technology is also ready for another 8 languages, including German, French, Italian, Dutch, Portuguese, Swedish, Polish, and Korean. In the future we plan to expand to cover Danish, Turkish, Chinese, and Japanese for a total of 14 languages. - Content Characteristics:
The dataset encompasses questions and answers typical for Customer Support within the e-commerce domain. These questions are enriched with extensive linguistic tagging (formal, colloquial, noisy, etc.). The primary facets of these datasets are categorized under ‘intent’, ‘instruction’, and ‘response’, with the option to include additional fields such as ‘context’ or ‘system prompt’. - Volume Metrics:
In terms of volume, the dataset comprises 3.5 million tokens and 27,000 question-answer pairs.
Bitext Dataset Language Tagging
The Bitext Datasets are enriched with a large set of language tags, capturing the diverse ways in which language can generate variants.
The corpus contains more than 12 different tag types (a detailed description of the tagging is provided below).
Tags for Lexical variation
M – Morphological variation: inflectional and derivational “is my SIM card active”, “is my SIM card activated”
L – Semantic variations: synonyms, use of hyphens, compounding… “what’s my billing date”, “what’s my anniversary date”
Tags for Syntactic structure variation
B – Basic syntactic structure: “activate my SIM card”, “I need to activate my SIM card”
I – Interrogative structure “can you activate my SIM card?”, “how do I activate my SIM card?”
C – Coordinated syntactic structure “I have a new SIM card, what do I need to do to activate it?”
N – Negation “I do not want this item, where to cancel my order?”
Tags for language register variations
P – Politeness variation “could you help me activate my SIM card, please?”
Q – Colloquial variation “can u activ8 my SIM?”
W – Offensive language “I want to talk to a f*&%*g agent”
Tags for stylistic variations
K – Keyword mode “activate SIM”, “new SIM”
E – Use of abbreviations: “I’m / I am interested in getting a new SIM”
Z – Errors and Typos: spelling issues, wrong punctuation… “how can i activaet my card”
Other tags not in use in this Dataset
D – Indirect speech “ask my agent to activate my SIM card”
G – Regional variations US English vs UK English: “truck” vs “lorry” France French vs Canadian French: “tchatter” vs “clavarder”
R – Respect structures – Language-dependent variations English: “may” vs “can…” French: “tu” vs “vous…” Spanish: “tú” vs “usted…”
Y – Code switching “activer ma SIM card”
Here are some examples:
- Handling LLMs with Multiple Registers:
In natural language, the same content can be expressed in different language registers, typically formal and colloquial. For example:- Tag “COLLOQUIAL”: Indicates the utterance contains informal expressions.
Ex: “can u close my account” - Tag “FORMAL”: Indicates the utterance contains formal language.
Ex: “Could you please help me close my account?”
- Tag “COLLOQUIAL”: Indicates the utterance contains informal expressions.
With the information about variants captured in tagging, it’s possible to fine-tune the model for different types of speakers and their language preferences: informal for younger audiences and more formal for senior audiences.
Detecting Non-Desired Language:
Variants of texts with biased and offensive language have been included and tagged. This tagging facilitates the training and evaluation of models to detect undesired biased or offensive language. These texts guarantee that all biased or offensive language is tagged.
- Tag “OFFENSIVE”: Indicates the utterance contains offensive expressions.
Ex: “open my f*&%* account”
- Tag “OFFENSIVE”: Indicates the utterance contains offensive expressions.
- Errors/Typos in Texting:
To better replicate actual texts from users querying LLMs, classical errors like typos have been included.- Tag “NOISE”: Indicates the utterance contains typos.
Ex: “how can i activaet my card”
- Tag “NOISE”: Indicates the utterance contains typos.
MADRID, SPAIN
Camino de las Huertas, 20, 28223 Pozuelo
Madrid, Spain
SAN FRANCISCO, USA
541 Jefferson Ave Ste 100, Redwood City
CA 94063, USA