
Why Do You Need to Fine-tune Your Conversational LLM with 100s (If Not 1,000s) of Examples?

For Consistent LLM Answers, Fine-tune with Examples. LOTS of Examples

LLMs tend to be very creative, introducing diversity and variation into their answers.

That’s good for certain types of questions like:

  • What can you tell me about La Cibeles?
  • What gothic buildings should I visit in Madrid?

Questions that do not have a single obvious answer, and that knowledgeable people might answer very differently, are a great fit for a retrieval-based approach like RAG.

For some other questions, the right answer is consistent and precise. In these cases, creativity can be flat-out wrong. Some good examples of these types of questions are:

  • What time does the Metropolitan Museum open?
  • Do you need tickets to visit The Cathedral? Can I buy the tickets online?
  • Who is the architect of Reina Sofia Museum? Does it have paintings by Picasso?
  • Is there underground service from Atocha to Barajas airport?

For these questions, excessive creativity may cause significant problems: a creative answer is far less likely to be the correct one. In a real-life application, getting these questions wrong seriously undermines user confidence.

Does the Museum open at 9am or at 10am? Variability in this answer is risky.

A single answer, both consistent and precise, is required.

To achieve this consistency in an LLM-based application, like a chatbot, a training dataset with hundreds of variations of these types of questions can help with the task. The dataset should contain (a minimal sketch follows the list below):

  • Variations of the factual questions, like:
      • What time does the Metropolitan Museum open?
      • What’s the schedule for the Metropolitan Museum?
      • Is the Metropolitan Museum open on Mondays?
  • Example answers to be fed to the LLM
  • Optionally, some tagging about the linguistic rationale behind each variant: colloquial vs. formal language, etc.
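As an illustration, here is a minimal sketch of what such a dataset could look like as JSONL fine-tuning records. The schema (the intent, question, answer, and tags fields), the file name, and the placeholder opening time are illustrative assumptions, not the structure of any particular Bitext dataset:

```python
import json

# A minimal sketch of a fine-tuning dataset in JSONL form.
# Several phrasings of the same factual question all map to one
# canonical, precise answer; an optional tag records the linguistic
# register of each variant. Field names and the placeholder answer
# are illustrative assumptions.
records = [
    {
        "intent": "museum_opening_hours",
        "question": "What time does the Metropolitan Museum open?",
        "answer": "The Metropolitan Museum opens daily at 10am.",
        "tags": ["neutral"],
    },
    {
        "intent": "museum_opening_hours",
        "question": "What's the schedule for the Metropolitan Museum?",
        "answer": "The Metropolitan Museum opens daily at 10am.",
        "tags": ["colloquial"],
    },
    {
        "intent": "museum_opening_hours",
        "question": "Is the Metropolitan Museum open on Mondays?",
        "answer": "The Metropolitan Museum opens daily at 10am.",
        "tags": ["yes_no_question"],
    },
]

# One JSON object per line, the format most instruction fine-tuning
# pipelines accept.
with open("opening_hours_variants.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Anchoring every variant to the same canonical answer is what pushes the fine-tuned model toward one consistent response instead of a creative paraphrase.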

How many variants of the question are required to safely fine-tune the LLM and be sure that the question will be properly understood and answered? Our experimental trial, run on an example of this type of dataset that Bitext provides for Customer Support, with 3M tokens and 27,000 question-answer pairs (which you can find here), suggests that the number is a little under 1,000.

The dataset is freely available, including for commercial use. It can be used in real-life applications to check how effective additional training data is at preventing both hallucinations and excessively creative answers for factual questions.
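For readers who want to run this check themselves, below is a sketch of how the dataset could be loaded with the Hugging Face datasets library. The dataset ID is an assumption based on Bitext's public Hugging Face organization and may differ from the actual published name:

```python
# Sketch: loading the Bitext customer-support dataset for inspection.
# The dataset ID below is an assumption and may differ from the
# name Bitext actually publishes under.
from datasets import load_dataset

dataset = load_dataset(
    "bitext/Bitext-customer-support-llm-chatbot-training-dataset"
)

train = dataset["train"]
print(len(train))  # expect on the order of 27,000 question-answer pairs
print(train[0])    # inspect one record: question variant, answer, tags
```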
