How will the data shortage for LLMs get solved?
Resolves 2026
Synthetic Data: 55%
Multimodal Data: 88%

The Limitations of Scaling Laws

Scaling laws for Large Language Models (LLMs) have often turned out to be conservative. In practice, the appetite for data is insatiable: the more data, the better the model performs. This is evident in models like Mistral 7B, which is rumored to have been trained on a staggering 7 trillion (7T) tokens. Such examples underscore how central vast datasets are to significant advances in AI capability.
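To make "scaling laws have been conservative" concrete, here is a minimal sketch of the widely cited Chinchilla rule of thumb (roughly 20 training tokens per parameter for a compute-optimal run). The ~20x ratio is a published heuristic; the specific model sizes in the loop are illustrative assumptions, not figures from this market.

```python
# Chinchilla-style rule of thumb (Hoffmann et al., 2022):
# a compute-optimal model wants roughly 20 training tokens per parameter.
# The model sizes below are illustrative assumptions.

def optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token budget under the ~20 tokens/param heuristic."""
    return n_params * tokens_per_param

# A 7B-parameter model "wants" only ~140B tokens under this heuristic,
# yet rumored frontier runs train on trillions of tokens, i.e. far past
# the compute-optimal point, because extra data keeps paying off at
# inference time.
for n in (7e9, 70e9, 400e9):
    print(f"{n:.0e} params -> ~{optimal_tokens(n):.1e} tokens")
```

The gap between the ~140B-token "optimal" budget for a 7B model and the rumored multi-trillion-token runs is exactly what the paragraph above means by the laws being conservative.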

The Finite Nature of Internet Corpuses

Despite this hunger for data, a challenge looms: the usable internet text corpus is finite, with estimates on the order of 20 trillion (20T) tokens. This ceiling poses a significant hurdle for the continued scaling of LLMs, as the available data cannot keep pace with the ever-growing demands of more sophisticated models.

The Role of Synthetic Data

One potential solution to this conundrum is the generation and utilization of synthetic data. Synthetic data, artificially created rather than obtained by direct measurement, can be tailored to specific needs and potentially provide an inexhaustible source of information for training LLMs. This could represent a paradigm shift in how training datasets are compiled and used.
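One concrete recipe for synthetic data is verifier-filtered self-training: sample candidate solutions from a model, keep only the ones a checker verifies, and fine-tune on the survivors. The toy below is a deliberately simplified sketch of that loop; the "model" is a noisy arithmetic solver and every name in it is a hypothetical stand-in, not a real training pipeline.

```python
# Toy sketch of verifier-filtered synthetic data generation (in the
# spirit of self-training loops such as ReST). Everything here is a
# hypothetical stand-in: the "model" is a noisy arithmetic solver.
import random

random.seed(0)

def noisy_model(a: int, b: int) -> int:
    """Stand-in for an LLM: answers a + b, but is sometimes wrong."""
    answer = a + b
    if random.random() < 0.3:          # assumed 30% error rate
        answer += random.choice([-1, 1])
    return answer

def verifier(a: int, b: int, answer: int) -> bool:
    """Ground-truth checker; in practice this could be unit tests,
    a proof checker, or majority voting."""
    return answer == a + b

# Sample candidate solutions, then keep only verified ones as
# synthetic training pairs.
problems = [(random.randint(0, 99), random.randint(0, 99)) for _ in range(1000)]
synthetic = [((a, b), noisy_model(a, b)) for a, b in problems]
filtered = [(p, ans) for p, ans in synthetic if verifier(*p, ans)]

print(f"kept {len(filtered)}/{len(synthetic)} verified examples")
# Every kept example is correct, so fine-tuning on the filtered set
# cannot reinforce the model's own mistakes.
```

The key property is that the filter, not the generator, supplies the signal: the synthetic set is only as trustworthy as the verifier, which is why domains with cheap checkers (math, code) have seen the biggest gains from this approach.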

The Promise of Multimodal Data

Another avenue is multimodal data, which is particularly intriguing when considered against the human learning process. Humans do not rely solely on textual information to develop intelligence and understanding; we learn from a rich tapestry of sensory experiences: visual, auditory, and kinesthetic. This human-inspired view suggests that LLMs could also benefit significantly from more diverse, multimodal datasets.

Incorporating various types of data, such as images, videos, and audio, alongside traditional text, can provide a more holistic learning experience for AI models. This method could potentially reduce the sheer volume of text data required by imitating the human ability to derive complex understandings from multiple data types. By leveraging multimodal data, we can aim to create AI that not only processes information more efficiently but also understands and interacts with the world in a way that is more akin to human cognition. This approach could be a key stepping stone towards more advanced, nuanced AI systems, capable of better understanding and interacting within our multifaceted world.
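One way to see why multimodal data is attractive as a supply-side fix is a rough token-count comparison. The per-hour token rates and hour counts below are loose assumptions invented for illustration; only the ~20T text figure comes from the description above.

```python
# Rough sketch: how much could non-text modalities expand the token
# supply? The hour counts and tokens-per-hour rates are loose
# assumptions for illustration only.

TEXT_CORPUS = 20e12            # ~20T text tokens (estimate quoted above)

# hypothetical: (hours available, tokens per hour once encoded)
modalities = {
    "video": (1e9, 1e6),       # 1B hours, ~1M visual tokens/hour
    "audio": (1e9, 5e4),       # 1B hours of speech, ~50k tokens/hour
}

multimodal_tokens = sum(hours * rate for hours, rate in modalities.values())
print(f"text: {TEXT_CORPUS:.1e} tokens")
print(f"multimodal (assumed): {multimodal_tokens:.1e} tokens")
```

Even under conservative encoding rates, non-text modalities dwarf the text corpus by orders of magnitude; the open question the market is really asking is whether those tokens improve language ability, not whether they exist.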

Resolution Criterion

The resolution of this market will be based on the consensus as of January 1, 2026. It will evaluate which approach – synthetic data, multimodal data, a combination of both, or neither – has emerged as the predominant solution to the data shortage challenge for LLMs. The market will consider expert opinions, published research, and industry trends to determine the resolution. Participants are encouraged to predict and trade based on which solution they believe will gain the most traction in addressing the data needs of future LLMs.



Comments
9mo

Resolution on this is tricky.

9mo

Hope by 2026, it will be trivial

12mo

When I created this market, I believed multimodal was the way to go. Yet in the past 6 months, while no evidence suggests multimodal training benefits LLMs, we have seen a huge improvement in reasoning ability thanks to synthetic data.

Both Claude 3 and Llama 3 suggest that most of the utility boost comes from synthetic data.

1y

https://arxiv.org/abs/2312.06585

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

1y

What do you guys think of the new Gemini? It seems like multimodal data played a big role. Maybe AlphaCode 2 can be seen as an example of using synthetic data.

1y

How does this market resolve if the expert consensus is that both methods have failed?

1y

@NathanShowell Then both resolve to NO. This question is multiple choice with multiple answers; the two choices will resolve independently.

1y

@HanchiSun They’re linked and add up to 100%, which is usually understood to imply one and exactly one resolves YES

1y

@TheBayesian That looks like a coincidence. Many people believe both will resolve YES, so the 100% total just means some others think they will both resolve NO.

Ohhh you're right! That's a funny coincidence

1y

I expect there not to be consensus, and that each approach will have large advantages depending on the use case. Sometimes you can use multimodal data, and in those cases you use it; but sometimes you can't, and you use synthetic data. In that case, how does this market resolve?

1y

@TheBayesian We have several years before settling on the resolution criterion.

Maybe one way to do it is to consider which method the best model at the time uses.
