Evaluate the Quality of your Chatbots and Conversational Agents

It is always important to evaluate the quality of your chatbots and conversational agents in order to know their real health, accuracy and efficiency.

Chatbot accuracy can only be increased by constantly evaluating the bot and retraining it with new data that answers your customers' queries.

Chatbots require large amounts of training data to perform correctly. If you want your chatbot to recognize a specific intent, you need to provide a large number of sentences that express that intent, usually generated by hand. This manual generation is error-prone and can cause erroneous results.

How can we solve it?

With artificially-generated data. Since Dialogflow is one of the most popular chatbot-building platforms, we chose to perform our tests using it.

We tested how Dialogflow can benefit from the Artificial Training Data approach, comparing chatbots trained with hand-tagged sentences against chatbots trained with automatically generated data. Our tests show that bots trained with only 2 or 3 example sentences per intent in Dialogflow perform poorly, and that increasing to 10 sentences per intent brings only minimal improvement.

On the other hand, extending these hand-tagged corpora with additional variants automatically generated by the Artificial Training Data service produces a much larger improvement in overall chatbot accuracy.
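To make the idea concrete, here is a toy sketch of how seed sentences can be expanded into many surface variants by combining slot alternatives. The intent name, templates and word lists below are invented for illustration only; the real Bitext service relies on linguistic knowledge rather than simple slot combination.

```python
from itertools import product

# Hypothetical templates for a single lighting intent: each slot lists
# interchangeable alternatives, and every combination yields one variant.
TEMPLATES = {
    "turn_on_light": [
        ("turn on", "switch on", "put on"),     # verb variants
        ("the",),
        ("kitchen", "bedroom", "living room"),  # room variants
        ("light", "lights", "lamp"),            # object variants
    ],
}

def expand(intent: str) -> list[str]:
    """Combine the alternatives of every slot into full training sentences."""
    return [" ".join(parts) for parts in product(*TEMPLATES[intent])]

variants = expand("turn_on_light")
print(len(variants))  # 3 * 1 * 3 * 3 = 27 variants from one pattern
```

Even this naive combinatorial scheme turns a handful of seed words into dozens of training sentences, which is why automatically generated variants scale so much better than hand-tagging.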

We carried out two different tests (A and B), both using the same 5 intents related to house lighting management. In each test, we trained two different bots:

  • In test A, the first bot (A1) was trained with only 12 hand-tagged sentences (2 to 3 sentences per intent). Using those sentences as input, our Bitext Artificial Training Data service generated 391 sentences which, combined with the 12 sentences from bot A1, were used to train a second bot (A2) with around 80 sentences per intent.
  • Test B was very similar; the only difference was the size of the hand-tagged training set. Here the first bot (B1) was trained with 50 hand-tagged sentences (10 per intent). Using those sentences as input, the service generated 798 sentences which, combined with the 50 sentences from bot B1, were used to train a second bot (B2) with around 170 sentences per intent. We used the same 100 evaluation sentences from test A as the evaluation set.
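The evaluation step above boils down to scoring each bot's predictions against gold labels for the evaluation sentences. A minimal sketch of how such an intent-detection score can be computed; the intent names and labels here are hypothetical, not taken from the actual benchmark.

```python
def intent_accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of evaluation sentences whose predicted intent matches gold."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted labels must align one-to-one")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Hypothetical labels for 5 of the evaluation sentences.
gold = ["turn_on", "turn_off", "dim", "turn_on", "status"]
pred = ["turn_on", "turn_off", "dim", "turn_off", "status"]
print(intent_accuracy(gold, pred))  # 4 correct out of 5 -> 0.8
```

Slot filling can be scored the same way, with each slot value compared against its gold annotation instead of the intent label.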

In both tests, the bot trained with the extended corpus showed a significant improvement, reaching at least 90% accuracy in both intent detection and slot filling. Do you want to see the results for yourself? Download our Dialogflow Full Benchmark Dataset now.

The Bitext Artificial Training Data service lets you create large training sets with no effort. Even if you only write one or two sentences per intent, our service can generate the rest of the variants needed to go from poor results to great chatbot accuracy.


If you would like further details, you can check our additional tools.
