Any Solutions to the Endless Data Needs of GenAI?

The GenAI community has turned to synthetic data to satisfy its hunger for data. However, it has done so with significant reluctance, sometimes even assuming that synthetic data is equivalent to “fake data”. This article published by Bloomberg is a very good expression of this trend:
This is a great article, and definitely worth reading, but with one caveat in our opinion: it seems to assume that GenAI is the only technique available to generate synthetic text. However, there are other approaches, based on rules (symbolic approach) rather than on pre-existing data, that provide alternative solutions.
What are the advantages of the symbolic approach?
  • Data generated is 100% reliable, which means that all generated variations keep 100% semantic integrity with the seed sentence (the seed sentence being the initial text for which we request variants)
  • This implies that the data is 100% free of hallucination, bias and PII issues
The reason why this approach reaches 100% accuracy is that the generation of variants is controlled by rules with predictable behavior. Before actually generating any text, the Natural Language Generation (NLG) platform requires a seed sentence and a set of linguistic parameters.

As a result, this approach is a combination of the best of two worlds:

  • Full Automation based on Symbolic NLG Technology
  • Expert parametrization by linguists to verticalize the generated data
This makes this approach particularly valuable for Conversational AI applications, since they are sometimes based on exchanges of short questions and answers.
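
To make the idea of a seed sentence plus linguistic parameters more concrete, here is a minimal sketch of rule-based variant generation. It is an illustration only, not Bitext's actual technology: the templates, the determiner list, and the function name `generate_variants` are all hypothetical, and the sketch assumes a two-word seed of the form "verb noun".

```python
from itertools import product

# Hypothetical linguistic parameters, grouped by register.
# These templates are illustrative assumptions, not Bitext's rule set.
TEMPLATES = {
    "neutral": [
        "{verb} {det} {noun}",
        "I want to {verb} {det} {noun}",
        "I need to {verb} {det} {noun}",
    ],
    "polite": [
        "I would like to {verb} {det} {noun}",
        "could you help me {verb} {det} {noun}?",
    ],
}
DETERMINERS = ["my", "the"]


def generate_variants(seed, registers=("neutral", "polite")):
    """Expand a 'verb noun' seed into variants using fixed rules only."""
    # Raises ValueError if the seed is not exactly two words.
    verb, noun = seed.split()
    variants = []
    for register in registers:
        for template, det in product(TEMPLATES[register], DETERMINERS):
            variants.append(template.format(verb=verb, det=det, noun=noun))
    return variants


if __name__ == "__main__":
    for variant in generate_variants("cancel order"):
        print(variant)
```

Because every variant comes from a fixed template, the output is deterministic and can only contain the words of the seed plus the wording the rules allow. Nothing about refunds, fees or confirmations can appear unless a rule introduces it.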

To illustrate the differences between statistical and symbolic approaches, we’ve run a benchmark to compare data generated by ChatGPT 3.5, by ChatGPT 4 and by Bitext NLG technology.

The Benchmark description is as follows:

  • The seed sentence is: “cancel order”
  • The minimum requested number of generated variants is: 20

The results are a bit surprising:

  • With ChatGPT 3.5, 60% of the generated utterances changed the meaning: 12 out of 20 variants are wrong, which is a very high number
  • With ChatGPT 4, 40% of the generated utterances changed the meaning: 8 out of 20 variants are wrong
  • With Bitext, 100% of utterances are semantically equivalent to the seed sentence
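
The percentages follow directly from the OK/KO totals reported in the tables at the end of the article. As a quick check, the arithmetic can be sketched as follows; the dictionary layout and the function name `error_rate` are illustrative choices, while the counts are the ones in the tables.

```python
# (OK, KO) totals taken from the result tables at the end of the article.
RESULTS = {
    "ChatGPT 3.5": (8, 12),
    "ChatGPT 4": (12, 8),
    "Bitext NLG": (20, 0),
}


def error_rate(ok, ko):
    """Percentage of variants that changed the meaning of the seed."""
    return 100 * ko / (ok + ko)


if __name__ == "__main__":
    for system, (ok, ko) in RESULTS.items():
        print(f"{system}: {error_rate(ok, ko):.0f}% of {ok + ko} variants are wrong")
```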
We include a few examples below (and the full list at the end of the article):
What are the reasons for errors in ChatGPT? Not surprisingly they are related to what’s called “hallucinations”, surfaced in the form of addition of content that didn’t exist in the seed sentence. For example, some of the variants generated include information about a refund, which is not included in the seed sentence:
  • Can you cancel my order and refund the payment, please?

Some other variants simply don’t make sense, like this one, which asks for the reason for the cancellation. The reason should be known by the customer, not by the chatbot, so this is a clear case of hallucination:

  • Can you please cancel my order and provide a reason for the cancellation?
There is also irrelevant content added about fees, confirmation…
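
A crude way to surface this kind of added content is to check each variant for topics that are absent from the seed. The sketch below is only a keyword heuristic under stated assumptions: the topic list is hand-picked from the errors described above, the function name `added_topics` is hypothetical, and the check does not measure semantic equivalence, which in the benchmark was a matter of human judgment.

```python
import re

# Hand-picked topics based on the errors discussed above (an assumption,
# not an exhaustive list). Each topic maps to the word forms that signal it.
ADDED_TOPICS = {
    "refund": {"refund", "refunds"},
    "fees": {"fee", "fees"},
    "confirmation": {"confirm", "confirmation"},
    "reason": {"reason", "reasons"},
}


def added_topics(variant, seed="cancel order"):
    """Return the listed topics that appear in the variant but not in the seed."""
    seed_words = set(re.findall(r"[a-z']+", seed.lower()))
    variant_words = set(re.findall(r"[a-z']+", variant.lower()))
    return [
        topic
        for topic, forms in ADDED_TOPICS.items()
        if forms & variant_words and not forms & seed_words
    ]


if __name__ == "__main__":
    print(added_topics("Can you please cancel my order and provide a refund?"))
    print(added_topics("I need to cancel my order, please"))
```

A variant that returns an empty list is not necessarily correct; it only means none of the listed topics were added.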
We include below the full set of the 20 sentences generated by each of the three systems.

In summary, synthetic text remains one of the top alternatives for solving the pressing data issue that GenAI faces in deployment, particularly at the enterprise level. GenAI-based approaches are not providing the solution that the market needs. However, there are other approaches that generate useful data, like the symbolic approach, although they are a bit obscured by the current hype.

Sample results obtained with ChatGPT 3.5, 4 and Bitext NLG:

ChatGPT 3.5 Generation
ID Generated Variant for Seed "cancel order" OK KO
1 Can you cancel my order and refund the payment, please? X
2 I need to cancel my order for personal reasons. X
3 Can you please cancel my order and provide a refund? X
4 I need to cancel my order and would like to know the refund process. X
5 Could you please cancel my order and provide a cancellation confirmation? X
6 I need to cancel my order and would like to know when I will receive a refund. X
7 Can you please cancel my order and let me know if there are any fees? X
8 I've decided to cancel my order, can you please help me with that? X
9 I need to cancel my order due to a mistake in the order details. X
10 Can you cancel my order and confirm the refund amount? X
11 I regret to say that I need to cancel my order. X
12 Can you please cancel my order and provide a reason for the cancellation? X
13 I need to cancel my order and would appreciate your prompt assistance. X
14 Can you cancel my order and let me know if I need to take any additional steps? X
15 I need to cancel my order and would like to know the status of the refund. X
16 Can you please cancel my order and let me know about the refund? X
17 I need to cancel my order due to a change in circumstances. X
18 Can you cancel my order and provide a confirmation email? X
19 I would like to cancel my order and receive a full refund. X
20 I need to cancel my order, please X
Total: 8 OK, 12 KO
ChatGPT 4 Generation
ID Generated Variant for Seed "cancel order" OK KO
1 I'd like to cancel my order, please. X
2 Can you help me cancel the order I just placed? X
3 I made a mistake and need to cancel my order. X
4 I've changed my mind and would like to cancel my recent order. X
5 I need to stop my order before it's shipped, can you help me with that? X
6 Please cancel my order, I no longer need the items. X
7 Could you please assist me in canceling my purchase? X
8 Can you abort the shipment of my order? X
9 I want to halt my order, how do I do that? X
10 Please undo the order I just made. X
11 I don't want to proceed with my order anymore, can you cancel it? X
12 Is it possible to terminate my order? X
13 I accidentally placed an order. Can it be cancelled? X
14 Help me in cancelling the order I placed a few minutes ago. X
15 I would like to retract my recent order. X
16 I've decided against my order, can we cancel it? X
17 I need to nullify my order, could you please assist? X
18 How do I go about revoking my order? X
19 I need to cancel my purchase, can you guide me? X
20 Please, discontinue my order processing. X
Total: 12 OK, 8 KO
Bitext NLG
ID Generated Variant for Seed "cancel order" OK KO
1 can you give me information about canceling an order? X
2 problems with canceling orders X
3 I do not know how I can cancel the last order I made X
4 I would like to cancel the order I made X
5 I need help canceling an order X
6 how could I get information about cancelling an order? X
7 I need help cancelling an order I made X
8 I have problems with cancelling orders X
9 I have a question about canceling an order X
10 problem with cancelling orders X
11 problems with cancelling an order X
12 how could I cancel an order? X
13 where could I get information about order cancellations? X
14 don't know how to cancel order X
15 I need assistance cancelling an order I made X
16 I am trying to find information about order cancelations X
17 I want help cancelling an order I have made X
18 I need help with cancelling an order I have made X
19 where to cancel an order I made? X
20 how do I cancel my order X
Total: 20 OK, 0 KO
Antonio Valderrábanos
