What do you evaluate in your chatbots? Some ideas

In this post we will discuss three ways of evaluating your chatbot, using:

  1. real-world evaluation data
  2. synthetic data
  3. “in scope” or “out of scope” queries

You have a chatbot up and running, offering help to your customers. But how do you know whether the help it provides is correct? Chatbot evaluation can be complex, especially because it is affected by many factors.

We have gathered some ideas based on our experience in helping our clients improve their bots:

  • If you can get your hands on real-world evaluation data (external datasets for your domain, with test utterances and their corresponding intents), you have everything you need to carry out a proper performance evaluation.
    We usually compute a confusion matrix, which lets us easily measure chatbot accuracy, precision, and recall (more about these terms here).
    Apart from that, we also measure whether there are cases where the model’s prediction is “unclear” (i.e. the difference in confidence score between the first and second candidates is small), which is often an indication of potential overlap between two intents (or their training utterances). Some bot platforms include tools to help you perform these evaluations, and there are also third-party model evaluators around the web (see the first sketch after this list).
  • If real-world evaluation data is not available, we use our own data to build evaluation sets, taking utterances that were not used in the training set.
    Rather than having a single evaluation set, we construct several from different modules (core, colloquial, polite…) and carry out multiple evaluation iterations, testing how the bot performs across language registers (see the second sketch after this list).
  • In the chatbot evaluation work we do for end clients, another concept we work with is “in scope” vs. “out of scope” queries: including “out of scope” utterances in the evaluation data is key to identifying both true negatives and false positives. For this, we often take data from our datasets for other industries/verticals (e.g. testing a Banking chatbot with utterances from the Travel industry; see the third sketch after this list).
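
To make the first approach concrete, here is a minimal sketch of computing a confusion matrix and per-intent precision/recall with scikit-learn, plus a check for “unclear” predictions. The gold labels, candidate lists, and the 0.10 margin threshold are illustrative assumptions, not values from a real bot:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical evaluation data: gold intents and the bot's ranked candidates.
gold_intents = ["check_balance", "transfer_money", "check_balance"]
predictions = [
    [("check_balance", 0.91), ("transfer_money", 0.05)],
    [("transfer_money", 0.48), ("check_balance", 0.44)],
    [("check_balance", 0.88), ("card_lost", 0.07)],
]

top1 = [candidates[0][0] for candidates in predictions]
print(confusion_matrix(gold_intents, top1))
print(classification_report(gold_intents, top1, zero_division=0))  # per-intent precision/recall

# Flag "unclear" predictions: a small gap between the first and second
# candidates often signals overlap between two intents.
MARGIN = 0.10  # assumed threshold
for gold, candidates in zip(gold_intents, predictions):
    (first, c1), (second, c2) = candidates[0], candidates[1]
    if c1 - c2 < MARGIN:
        print(f"Unclear: {first} vs {second} (gold: {gold}) - possible intent overlap")
```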
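
For the second approach, a minimal sketch of splitting held-out utterances into register-specific evaluation sets; the register tags and the sample data are hypothetical:

```python
from collections import defaultdict

# Hypothetical held-out utterances (not used for training), tagged by register.
held_out = [
    {"text": "I want to check my balance", "intent": "check_balance", "register": "core"},
    {"text": "yo, how much cash have I got?", "intent": "check_balance", "register": "colloquial"},
    {"text": "Could you kindly tell me my balance?", "intent": "check_balance", "register": "polite"},
]

eval_sets = defaultdict(list)
for utterance in held_out:
    eval_sets[utterance["register"]].append(utterance)

# One evaluation iteration per register, e.g. run the metrics above on each set.
for register, utterances in eval_sets.items():
    print(f"{register}: {len(utterances)} utterances")
```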
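
Finally, for the third approach, a sketch of checking “out of scope” handling by feeding a Banking bot utterances from the Travel vertical. Here `classify` is a hypothetical stand-in for the bot platform’s NLU endpoint, and the 0.30 rejection threshold is assumed:

```python
OOS_THRESHOLD = 0.30  # assumed: below this confidence, the bot answers "out of scope"

def classify(text):
    """Hypothetical stand-in for the Banking bot's NLU endpoint."""
    # In practice this would call the bot platform's API and return
    # the top intent with its confidence score.
    return ("transfer_money", 0.12)

# Utterances from another vertical (Travel) should be rejected as out of scope.
travel_utterances = [
    "I need to book a flight to Madrid",
    "Can I change my hotel reservation?",
]

true_negatives = false_positives = 0
for text in travel_utterances:
    intent, confidence = classify(text)
    if confidence < OOS_THRESHOLD:
        true_negatives += 1   # correctly rejected as out of scope
    else:
        false_positives += 1  # wrongly matched to a Banking intent
print(f"true negatives: {true_negatives}, false positives: {false_positives}")
```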

All these steps help us measure the usefulness of our chatbots or chatbot training datasets.

You can use any of these methods to evaluate the Free Dataset we offer, created with our Multilingual Synthetic Data technology and centered on Customer Support: feel free to download it here and give us your feedback!

For more information, visit our website and follow Bitext on Twitter or LinkedIn.
