
Unstructured Synthetic Text: Beyond Tabular Data

The case for evaluation of NLU platforms

Synthetic images and video have proven to be a big success for cost-cutting. Synthetic text is following suit: tabular data (that is, data organized in a table with rows and columns) is already becoming mainstream, and the next step is synthetic unstructured text, which is data that doesn't have a predefined format.

Synthetic unstructured text supports more complex cases, where actual text in the form of full sentences or documents is required.

One of the most popular use cases of synthetic unstructured text is the evaluation of NLU engines or intent classification engines. Evaluating an NLU engine like Dialogflow, Lex, Rasa, Ada or Kore.ai is a time-consuming task. It involves:

  • finding and augmenting the data, or generating it by hand
  • making sure the data is comprehensive enough to test all intents or classes
  • making sure the data captures the language of different user profiles: young people use more colloquial language and make more typos, while senior users tend to be more formal, etc.

This is particularly relevant in multilingual scenarios, where languages like Arabic, Japanese or German have fewer resources than English, even though they are mainstream languages in terms of business.

Additionally, synthetic unstructured text provides the usual advantages of synthetic data: 

  • Speeding up evaluation cycles: using NLG (Natural Language Generation) is faster than compiling data manually
  • Avoiding GDPR issues: anonymized text is not as safe as synthetic data
  • Guaranteeing wider coverage: there is virtually no limit to the amount of text that can be generated

The key point: unstructured text allows us to handle more complex cases than tabular data.

To help push forward research on this use case, we have published a dataset with more than 260,000 utterances, labeled with intent, semantic category, language register and more.
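As a rough illustration of how labeled utterances like these might be used to evaluate an intent classifier, here is a minimal Python sketch. Everything in it is an assumption made for the example: the field names (utterance, intent, register), the sample rows and the toy keyword classifier are hypothetical, and they do not reflect the actual schema of the published dataset or the API of any NLU engine. In practice, the toy classifier would be replaced by a call to the engine under test.

```python
from collections import defaultdict

# Hypothetical sample rows: field names and values are assumptions,
# not the actual schema of the published dataset.
SAMPLES = [
    {"utterance": "I would like to cancel my order, please",
     "intent": "cancel_order", "register": "formal"},
    {"utterance": "cancel my order pls",
     "intent": "cancel_order", "register": "colloquial"},
    {"utterance": "Could you tell me where my order is?",
     "intent": "track_order", "register": "formal"},
    {"utterance": "wheres my stuff",
     "intent": "track_order", "register": "colloquial"},
]


def toy_classifier(utterance):
    """Stand-in for a real NLU engine call: a naive keyword matcher."""
    text = utterance.lower()
    if "cancel" in text:
        return "cancel_order"
    if "where" in text and "order" in text:
        return "track_order"
    return "unknown"


def accuracy_by(samples, predict, key):
    """Return accuracy per group, where groups are the values of `key`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in samples:
        group = row[key]
        totals[group] += 1
        if predict(row["utterance"]) == row["intent"]:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Break results down by intent and by language register.
    print(accuracy_by(SAMPLES, toy_classifier, "intent"))
    print(accuracy_by(SAMPLES, toy_classifier, "register"))
```

Breaking accuracy down by register as well as by intent is the point of the sketch: a classifier can look fine on formal utterances and still miss colloquial ones, which an overall accuracy figure would hide.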

Take a look at our GitHub repository and access our dataset to try it yourself.

Please feel free to use it for your testing tasks and share your results.

Synthetic unstructured text is also being used for training purposes, but we will cover that in another post.

admin
