Synthetic images and video have proven to be a big success for cost-cutting. Synthetic text is following suit: tabular data (that is, data organized in a table with rows and columns) is already becoming mainstream, and the next step is synthetic unstructured text, which is data that doesn't have a predefined format.
Synthetic unstructured text supports more complex cases, where actual text in the form of full sentences or documents is required.
One of the most popular use cases of synthetic unstructured text is the evaluation of NLU engines or intent classification engines. Evaluating an NLU engine like Dialogflow, Lex, Rasa, Ada or Kore.ai is a time-consuming task. It involves:
This is particularly relevant in multilingual scenarios, where languages like Arabic, Japanese or German have fewer resources available than English, even though they are mainstream languages in business terms.
Additionally, synthetic unstructured text provides the usual advantages of synthetic data:
The key point: unstructured text allows us to handle more complex cases than tabular data.
To help push forward research on this use case, we have published a dataset with more than 260,000 utterances, labeled with intent, semantic category, language register and more.
Please feel free to use it for your testing tasks and share your results.
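As a rough illustration of what such a testing task can look like, here is a minimal Python sketch that scores an intent classifier against labeled utterances and reports overall accuracy plus per-intent recall. Everything in it is an assumption made for illustration: the `evaluate_intents` helper, the `toy_classifier` stand-in (in practice this would be a call to your NLU engine), and the sample utterances and intent names, which are not taken from the published dataset.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple


def evaluate_intents(
    samples: Iterable[Tuple[str, str]],
    classify: Callable[[str], str],
) -> Tuple[float, Dict[str, float]]:
    """Return (overall accuracy, per-intent recall) for labeled utterances."""
    total = 0
    correct = 0
    per_intent_total: Counter = Counter()
    per_intent_correct: Counter = Counter()

    for utterance, expected in samples:
        predicted = classify(utterance)
        total += 1
        per_intent_total[expected] += 1
        if predicted == expected:
            correct += 1
            per_intent_correct[expected] += 1

    accuracy = correct / total if total else 0.0
    recall = {
        intent: per_intent_correct[intent] / count
        for intent, count in per_intent_total.items()
    }
    return accuracy, recall


def toy_classifier(utterance: str) -> str:
    """Hypothetical keyword-based stand-in for a real NLU engine call."""
    text = utterance.lower()
    if "cancel" in text:
        return "cancel_order"
    if "where" in text or "track" in text:
        return "track_order"
    return "fallback"


# Hypothetical labeled utterances: (text, expected intent).
SAMPLES = [
    ("I want to cancel my order", "cancel_order"),
    ("please cancel the purchase I made", "cancel_order"),
    ("where is my package", "track_order"),
    ("has my parcel shipped yet", "track_order"),
]

accuracy, recall = evaluate_intents(SAMPLES, toy_classifier)
print(f"accuracy: {accuracy:.2f}")
for intent, value in sorted(recall.items()):
    print(f"recall[{intent}]: {value:.2f}")
```

The last sample is worded without any of the keywords the toy classifier expects, so it is misclassified. That is the kind of gap that varied synthetic utterances are meant to expose.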
Synthetic unstructured text is being used for training purposes too, but we will cover that in another post.