
Unstructured Synthetic Text: Beyond Tabular Data

The case for evaluation of NLU platforms

Synthetic images and video have proven to be a big success for cost-cutting. Synthetic text is following suit: tabular data (that is, data organized in a table with rows and columns) is already becoming mainstream, and the next step is synthetic unstructured text, i.e. data that doesn't have a predefined format.

Synthetic unstructured text supports more complex cases, where actual text in the form of full sentences or documents is required.

One of the most popular use cases for synthetic unstructured text is the evaluation of NLU engines, also known as intent classification engines. Evaluating an NLU engine like Dialogflow, Lex, RASA, Ada or Kore.ai is a time-consuming task. It involves:

  • finding and augmenting the data, or writing it by hand
  • making sure the data is comprehensive enough to test all intents or classes
  • making sure the data captures the language of different user profiles: young people use more colloquial language and make more typos, while senior users tend to be more formal, etc.
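
The evaluation loop the steps above feed into can be sketched as follows. This is a minimal illustration, not any vendor's actual API: `classify` is a stand-in for whatever callable wraps your NLU engine's predict endpoint, and the test set and intent names are invented. Per-intent accuracy makes it easy to spot intents the synthetic data has not covered well.

```python
from collections import defaultdict

def per_intent_accuracy(test_set, classify):
    """Score a classifier on (utterance, intent) pairs, broken down by intent."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for utterance, intent in test_set:
        totals[intent] += 1
        if classify(utterance) == intent:
            hits[intent] += 1
    return {intent: hits[intent] / totals[intent] for intent in totals}

# Toy stand-in classifier and a tiny labeled test set, for illustration only
def toy_classify(utterance):
    return "cancel_order" if "cancel" in utterance else "check_refund"

test_set = [
    ("i want to cancel my order", "cancel_order"),
    ("pls cancel it asap", "cancel_order"),
    ("where is my refund", "check_refund"),
]
scores = per_intent_accuracy(test_set, toy_classify)
print(scores)  # {'cancel_order': 1.0, 'check_refund': 1.0}
```

In a real evaluation, `test_set` would hold thousands of synthetic utterances per intent, stratified by the user profiles mentioned above (colloquial vs. formal register, with and without typos).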

This is particularly relevant in multilingual scenarios, where languages like Arabic, Japanese or German are low-resource compared to English, even though they are mainstream languages in business terms.

Additionally, synthetic unstructured text provides the usual advantages of synthetic data: 

  • Faster evaluation cycles: generating data with NLG (Natural Language Generation) is faster than compiling it manually
  • Fewer GDPR issues: anonymized text is never 100% safe, while synthetic data contains no real user information
  • Wider coverage: there is virtually no limit to the amount of text that can be generated
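
The speed and coverage points can be seen in even the simplest form of NLG, template expansion. The templates, slot values and intent label below are all invented for illustration; a production generator would use richer grammars or language models, but the combinatorics are the same: a handful of fragments yields dozens of labeled utterances instantly.

```python
import itertools

# Hypothetical slot values -- three small lists expand into 36 utterances
openers = ["I want to", "I'd like to", "pls", "how do I"]
actions = ["cancel", "track", "return"]
objects = ["my order", "order 12345", "the package"]

def generate_utterances():
    """Yield (utterance, intent_label) pairs from the cross-product of slots."""
    for opener, action, obj in itertools.product(openers, actions, objects):
        yield f"{opener} {action} {obj}", f"{action}_order"

utterances = list(generate_utterances())
print(len(utterances))  # 36
print(utterances[0])    # ('I want to cancel my order', 'cancel_order')
```

Each additional slot value multiplies the output, which is why generated coverage scales far beyond what manual data collection can reach.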

The key point: unstructured text allows us to handle more complex cases than tabular data.

To help push forward research on this use case, we have published a dataset with more than 260,000 utterances, labeled with intent, semantic category, language register and more.
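
A dataset labeled this way lends itself to simple slicing by label. The snippet below mimics that structure with an inline CSV excerpt; the column names and rows are assumptions for illustration, not the repository's actual schema.

```python
import csv
import io

# Hypothetical excerpt mimicking the labeling scheme (invented rows and columns)
sample = """utterance,intent,category,register
i wanna cancel my order,cancel_order,ORDER,colloquial
I would like to cancel my order,cancel_order,ORDER,formal
where's my refund??,check_refund,REFUND,colloquial
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Slice out one register to test, e.g., how the engine handles colloquial input
colloquial = [r["utterance"] for r in rows if r["register"] == "colloquial"]
print(len(colloquial))  # 2
```

Filtering by register or semantic category this way lets you build targeted test suites, e.g. only colloquial utterances for one intent, without touching any real user data.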

Take a look at our GitHub repository and access our dataset to try it yourself.

Please feel free to use it for your testing tasks and share your results.

Synthetic unstructured text is also being used for training purposes, but we will cover that in another post.
