When shopping online, customers frequently need to modify their order, for example by exchanging an item in the basket or deleting something they already added.
Customers ask for these kinds of changes in many different ways, like “how do I change my order?” or “I need to delete a product from my basket”.
Customers may use a formal register (“can you please help me…”) or an informal one (“can u help me…”), use only keywords (“delete item”), or make spelling or grammar errors (“need change my baskt”), among other phenomena.
To illustrate this variety in practice, with this post we are releasing a tagged dataset that contains 10,000 ways of asking for an order modification, this time in English.
Our first reaction to this number may be: are there really 10,000 ways to ask for a change in a customer’s order?
Indeed, there are 10,000 and 100,000 and 1,000,000 ways to ask to modify your basket. This is a feature of all natural languages.
Natural language can produce a virtually unlimited number of ways to express the same content.
This expressive power serves many different purposes. For one, it allows for expressions of subjectivity, something essential to humans, and it keeps natural language from being boring in the way formal languages are.
That’s why customers, when they express themselves, may want to be polite and formal, or colloquial and informal; they may include offensive language if they are angry; or they may signal their geographical origin, as Canadian French speakers do compared with French speakers from France.
Language has the power to express these and many other variations.
The dataset we are releasing is tagged with these variants and many more; see here for a comprehensive list.
Now, the first question is: where do you get enough data to cover all these variations in your chatbot training and evaluation for all the intents your virtual assistant needs to cover?
If you don’t have historical data to leverage, or if you just want to avoid privacy issues, the typical answer is to generate and tag this data by hand.
As chatbots grow in scope, crowdsourcing text generation or tagging is becoming more challenging. As in any other field, the trend is toward automating data generation.
As NLG (Natural Language Generation) develops, synthetic text is becoming a solid alternative for generating and labeling textual data for question/answer systems.
These large datasets can of course be used for training, which is the first need in the chatbot development cycle. But they can be used for evaluation too, particularly in the absence of real data.
See this post on evaluation.
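As a rough illustration of using one tagged dataset for both training and evaluation, here is a minimal Python sketch. It makes several assumptions: the record fields (`utterance`, `intent`, `tags`), the intent name `change_order`, and the tag labels are hypothetical and are not the schema of the released dataset. The helpers `split_train_eval` and `tag_counts` are likewise made up for this example.

```python
import random
from collections import Counter

# Hypothetical records: field names, the intent name, and the tag labels are
# illustrative assumptions, not the schema of the released dataset.
SAMPLE = [
    {"utterance": "can you please help me change my order?", "intent": "change_order", "tags": ["polite"]},
    {"utterance": "can u help me change my order", "intent": "change_order", "tags": ["colloquial"]},
    {"utterance": "delete item", "intent": "change_order", "tags": ["keyword"]},
    {"utterance": "need change my baskt", "intent": "change_order", "tags": ["typo"]},
    {"utterance": "I need to delete a product from my basket", "intent": "change_order", "tags": []},
    {"utterance": "how do I change my order?", "intent": "change_order", "tags": []},
]


def split_train_eval(records, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the records with a fixed seed and hold out a fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one record so the evaluation set is never empty.
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def tag_counts(records):
    """Count how often each variation tag appears; untagged records are counted too."""
    counts = Counter()
    for record in records:
        counts.update(record["tags"] or ["untagged"])
    return counts


if __name__ == "__main__":
    train, held_out = split_train_eval(SAMPLE, eval_fraction=0.33)
    print("train tags:", dict(tag_counts(train)))
    print("eval tags: ", dict(tag_counts(held_out)))
```

Comparing the tag counts of the two splits is a quick way to see whether the evaluation set covers the same kinds of variation (register, keywords, typos) as the training set.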
The sample dataset we have released is just an example of what current technology can achieve.
Download it here and let us know your thoughts: does it work for you?
This is just the beginning. We will soon publish another 20+ intents to complete a full chatbot for customer support.
For more information, visit our website and follow Bitext on Twitter or LinkedIn.