Training data is the data used to train an NLU (Natural Language Understanding) engine, the component that allows a chatbot to understand the intent behind user queries.
The training data is enriched through data labeling, also called data annotation, with information such as entities and slots.
This training process provides the bot with the ability to hold a meaningful conversation with real people.
After the training process, the bot is evaluated to measure the accuracy of the NLU engine. Evaluation identifies errors in the bot's behavior, and these errors are then fixed by improving the training data. This cycle is repeated until the bot reaches the desired level of accuracy.
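As a rough, engine-agnostic sketch, annotated training examples and the evaluate-and-improve cycle might look like the following in code. The intent and entity names, the `NluEngine` placeholder and the 0.95 accuracy target are hypothetical, not tied to any specific product.

```python
# Hypothetical annotated training examples: each utterance is labeled with an
# intent and with the entities/slots it contains (all names are illustrative).
training_data = [
    {
        "text": "I want to cancel my order 12345",
        "intent": "cancel_order",
        "entities": [{"value": "12345", "entity": "order_id", "start": 26, "end": 31}],
    },
    {
        "text": "where is my package?",
        "intent": "track_order",
        "entities": [],
    },
]

def evaluate(engine, test_set):
    """Return intent accuracy of a trained NLU engine on a held-out test set."""
    correct = sum(1 for ex in test_set if engine.predict(ex["text"]) == ex["intent"])
    return correct / len(test_set)

# Sketch of the train -> evaluate -> improve cycle described above.
# `NluEngine`, `test_set` and `fix_errors` are placeholders for your own stack.
# engine = NluEngine()
# while evaluate(engine, test_set) < 0.95:
#     training_data += fix_errors(engine, test_set)   # add or repair examples
#     engine.train(training_data)
```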
When working on AI projects, owning the data that nurtures your solution is key to good performance.
Gathering e-mails and conversation logs to train your bot is, at best, a makeshift solution, but this data scarcity can now be tackled at the root. Why not start farming your own data instead of harvesting it?
Building effective customer support agents requires large amounts of data so that the bot understands every query a user may make. Nevertheless, obtaining example utterances and manually tagging them for AI training are expensive, time-consuming and error-prone tasks: both are slow processes that will likely lead to inconsistencies and overall poor NLU performance.
Too often, companies also get a simple bot up and running, hoping that user interactions will produce enough logs to improve and augment the training data.
This approach is risky since a bot performing poorly may drive users away, and the resulting low engagement means that not enough data is collected.
We propose an entirely different approach: generating artificial training data.
Synthetic data is data that is artificially created. It is typically generated with the help of algorithms and is used for a wide range of purposes, including as test data for new products and tools, for model validation, and for AI model training.
Synthetic training data, also called artificial training data, is not a brand-new idea. It has been used in various Machine Learning (ML) fields, notably computer vision for self-driving cars, either by augmenting existing data through image transformations (mirroring, darkening, etc.) or by generating completely new data, such as adapting driving simulation games to act as environments in which to train self-driving cars.
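As a minimal sketch of the "augment existing data" approach for images, using the Pillow library (the file names and the 50% brightness factor are arbitrary placeholders):

```python
from PIL import Image, ImageEnhance, ImageOps

# Load an existing, already labeled training image (path is a placeholder).
image = Image.open("car_front.jpg")

# Create cheap synthetic variants: a mirrored copy and a darkened copy.
mirrored = ImageOps.mirror(image)                        # horizontal flip
darkened = ImageEnhance.Brightness(image).enhance(0.5)   # 50% brightness

# Each variant keeps the original label, so the dataset grows "for free".
mirrored.save("car_front_mirrored.jpg")
darkened.save("car_front_darkened.jpg")
```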
However, its usefulness is limited by how well we can model the data we are trying to generate. For example, synthetic data is used extensively in physics simulations, where the 'rules' are well known.
At the same time, advances are being made in training GANs (Generative Adversarial Networks), where one network generates data and a second network tries to detect the 'fake' data; the generator is optimized until it produces synthetic data that is indistinguishable from real data.
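To make the adversarial setup concrete, here is a heavily simplified GAN training loop in PyTorch, using a toy one-dimensional distribution as the "real" data; the architectures, batch size and hyperparameters are placeholders, not a recipe.

```python
import torch
from torch import nn

# Tiny generator and discriminator for a toy 1-D data distribution.
generator = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
discriminator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(64, 1) * 2 + 5      # stand-in "real" data: N(5, 2)
    noise = torch.randn(64, 16)
    fake = generator(noise)

    # 1) Train the discriminator: label real samples 1 and generated samples 0.
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Train the generator: try to make the discriminator label fakes as 1.
    g_loss = bce(discriminator(generator(noise)), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```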
As in physics, the rules that govern language are well known: humans have been studying language for hundreds of years.
As seen in our previous post, artificial training data helps automate your bot’s training phase.
In the AI field, you can make use of ontologies/knowledge graphs to model a specific domain (for example, retail), describing the relevant objects, actions, modifiers and the ways in which they are related to one another.
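As a toy illustration of such a domain model, here is what a miniature retail ontology could look like as a plain Python structure; all the objects, actions and synonyms are made up for the example.

```python
# A toy retail ontology: objects, actions, modifiers, and how they relate.
# Everything here is illustrative; a real ontology would be far richer.
ontology = {
    "objects": {
        "order":   {"attributes": ["order_id"], "actions": ["cancel", "track"]},
        "account": {"attributes": ["email"],    "actions": ["create", "delete"]},
    },
    "actions": {
        "cancel": {"synonyms": ["cancel", "call off", "revoke"]},
        "track":  {"synonyms": ["track", "locate", "check the status of"]},
        "create": {"synonyms": ["create", "open", "set up"]},
        "delete": {"synonyms": ["delete", "remove", "close"]},
    },
    "modifiers": {"politeness": ["", "please ", "could you please "]},
}
```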
Using linguistics, you can define structures for the various ways in which these concepts can be expressed in language, covering changes in morphology and syntax, synonyms, different levels of politeness, and commands versus questions.
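Combining that toy ontology with a couple of sentence templates, a minimal generator might produce tagged utterances like this; again, this is a sketch built on the hypothetical ontology above, not an actual production pipeline.

```python
import itertools

# Two sentence templates: a command form and a question form.
templates = [
    "{politeness}{action} my {object}",
    "can you {action} my {object}, please?",
]

def generate_examples(ontology):
    """Yield fully tagged training examples from the ontology and templates."""
    for obj_name, obj in ontology["objects"].items():
        for action in obj["actions"]:
            synonyms = ontology["actions"][action]["synonyms"]
            politeness = ontology["modifiers"]["politeness"]
            for template, syn, pol in itertools.product(templates, synonyms, politeness):
                text = template.format(politeness=pol, action=syn, object=obj_name)
                # The intent label comes straight from the ontology, so every
                # generated utterance is tagged consistently by construction.
                yield {"text": text, "intent": f"{action}_{obj_name}"}

# examples = list(generate_examples(ontology))
# examples[0] -> {"text": "cancel my order", "intent": "cancel_order"}
```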
The resulting generated data is correct, fully tagged, consistent and customizable (e.g. for specific sub-domains).
Generating data for a new vertical only requires building a new ontology, a task that can be highly automated using various NLP tools.
Results can be incrementally improved to handle even non-explicit, implied requests (e.g. 'I forgot my password' should be interpreted as a request to reset the user's password) as they are encountered.
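One lightweight way to capture such implicit requests is to attach known paraphrases directly to the relevant intent, so the generator emits them as additional tagged examples; this is a hypothetical extension of the sketch above.

```python
# Hypothetical paraphrases for implicit requests, keyed by the intent they imply.
implicit_requests = {
    "reset_password": ["I forgot my password", "I can't remember my password"],
}

def generate_implicit_examples(implicit_requests):
    """Yield tagged examples for implicit requests collected over time."""
    for intent, phrasings in implicit_requests.items():
        for text in phrasings:
            yield {"text": text, "intent": intent}
```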
While AI algorithms have become a commodity, useful data for training them is scarce.
Smaller companies do not have enough resources or access to the large volumes of training data required to train high-quality models.
Therefore, artificial (synthetic) training data generation is the answer to ‘democratize’ the field.
Thus, this approach delivers the best results when the data can be modeled using well-known rules (as in physics or language).
Synthetic data has several benefits, some of which we have already mentioned: it is correct, fully tagged, consistent and customizable.
If you would like further details, you can check some additional tools and resources in the references below:
References:
– https://en.wikipedia.org/wiki/Synthetic_data#Synthetic_data_in_machine_learning
– https://blog.aimultiple.com/synthetic-data/
– https://lmb.informatik.uni-freiburg.de/projects/synthetic-data/
– https://blog.valohai.com/synthetic-training-dataset-generation-with-unity