LLMs Cannot Find Any More Data; What Are They Going to Do Now?

Finite Data: Where to Go When the Data Runs Out

Data is often called the oil of the AI industry. The metaphor almost works, with one difference: the world has a surplus of oil, but it is running out of data.

What’s the Problem in the AI Market?

Businesses are investing heavily in LLM-based applications built on GPT, LLaMA, MPT, Falcon, and similar models. Because these models rely on very similar datasets and architectures, they are nearly indistinguishable from each other in practice, and the applications built on them offer undifferentiated experiences. A16z elaborated on this differentiation dilemma in their article “Who Owns the Generative AI Platform?”

What Solutions Are Available?

Since LLM architectures are mostly public and open source, data offers one potential path out of this maze of uniformity. At Bitext we have produced data for NLP/NLU/AI applications for a few years. To help overcome the mimicry sickness infecting LLMs, we have produced “Hybrid Datasets” (as in “hybrid cars”). We call them hybrid because they combine manual and synthetic data, created with a methodology that pairs NLG technology with curation by linguists and vertical experts.

Hybrid Datasets Have Two Advantages

First, despite being synthetic, hybrid datasets still avoid the typical problems of the generative approach because they are:

Hallucination-free. The corpus is 100% hallucination-free, which makes it particularly suitable for high-quality LLM fine-tuning.
Bias-free. The corpus includes tagging for offensive language, generated from human-curated dictionaries.
PII-free. The corpus is 100% free of Personally Identifiable Information; placeholders or slots stand in for actual names.
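
To make the placeholder idea concrete, here is a minimal sketch of how slot-based text might be inspected and filled at use time. The double-brace slot syntax, the slot names, and the helper functions are all assumptions made for illustration; the actual placeholder format of the datasets is not described in this post.

```python
import re

# Hypothetical slot syntax: {{Slot Name}}. This format is an assumption,
# not the documented format of any particular dataset.
SLOT_PATTERN = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def find_slots(template: str) -> list[str]:
    """Return the slot names that appear in a template, in order."""
    return SLOT_PATTERN.findall(template)


def fill_slots(template: str, values: dict[str, str]) -> str:
    """Replace each slot with a supplied value; leave unknown slots untouched."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        # Fall back to the original placeholder text when no value is given.
        return values.get(name, match.group(0))

    return SLOT_PATTERN.sub(replace, template)


# Made-up template and values, used only to exercise the helpers.
template = "Hello {{Customer Name}}, your order {{Order Number}} has been cancelled."
filled = fill_slots(template, {"Customer Name": "Alex", "Order Number": "A-1001"})
```

Because the stored text contains only slots, real names and numbers enter the text only when a downstream application fills them in.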

Second, our hybrid datasets have extensive synthetic tagging. All of the data is enriched with information about formality level, the category of language being used, and other idiomatic features. For example:

A request like “can u send me a new pw?” is tagged as “colloquial”.
Another request like “just cancel right now the f***g order” is tagged as “offensive”.

Bitext uses 12 different tags to mark the data in its datasets. This tagging strategy lets us easily create different vertical datasets for different customers. For example, customers who use colloquial language can be directed to a model that responds in the way they are most comfortable with. We will discuss this in more detail in Part 2.
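
As a rough illustration of how tags could drive dataset selection, the sketch below stores each utterance with a set of tags and filters on them. Only the “colloquial” and “offensive” tag names come from the examples above; the record layout, the `select` helper, and the third sample row are hypothetical and are not Bitext's actual schema or tooling.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedUtterance:
    """One dataset row: the utterance text plus its language tags."""
    text: str
    tags: set[str] = field(default_factory=set)


def select(rows, include=(), exclude=()):
    """Keep rows that carry every tag in `include` and none in `exclude`."""
    include, exclude = set(include), set(exclude)
    return [
        row for row in rows
        if include <= row.tags and not (exclude & row.tags)
    ]


rows = [
    TaggedUtterance("can u send me a new pw?", {"colloquial"}),
    TaggedUtterance("just cancel right now the f***g order", {"offensive"}),
    # Made-up untagged row, added only to show the filters at work.
    TaggedUtterance("I would like to reset my password."),
]

# A subset for customers who write colloquially.
colloquial_subset = select(rows, include={"colloquial"})

# A subset that drops offensive language before fine-tuning.
clean_subset = select(rows, exclude={"offensive"})
```

Keeping tags as a set per row makes it cheap to combine inclusion and exclusion rules when assembling a subset for a given customer.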

admin


