How will the data shortage for LLMs get solved?
Resolves 2026
Synthetic Data: 55%
Multimodal Data: 88%

The Limitations of Scaling Laws

Scaling laws for Large Language Models (LLMs) have often turned out to be conservative. In practice, the appetite for data is insatiable: the more data, the better the model performs. This is evident in models like Mistral 7B, which is rumored to have been trained on a staggering 7 trillion (7T) tokens. Such examples underscore how central vast datasets are to significant advances in AI capability.
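To make "scaling laws have been conservative" concrete, here is a minimal sketch of the widely cited Chinchilla rule of thumb (roughly 20 training tokens per parameter for a compute-optimal run). The ~20x ratio is a published heuristic; the specific model sizes in the loop are illustrative assumptions, not figures from this market.

```python
# Chinchilla-style rule of thumb (Hoffmann et al., 2022):
# a compute-optimal model wants roughly 20 training tokens per parameter.
# The model sizes below are illustrative assumptions.

def optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token budget under the ~20 tokens/param heuristic."""
    return n_params * tokens_per_param

# A 7B-parameter model "wants" only ~140B tokens under this heuristic,
# yet rumored frontier runs train on trillions of tokens, i.e. far past
# the compute-optimal point, because extra data keeps paying off at
# inference time.
for n in (7e9, 70e9, 400e9):
    print(f"{n:.0e} params -> ~{optimal_tokens(n):.1e} tokens")
```

The gap between the ~140B-token "optimal" budget for a 7B model and the rumored multi-trillion-token runs is exactly what the paragraph above means by the laws being conservative.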

The Finite Nature of Internet Corpuses

Despite this hunger for data, a challenge looms: the usable internet text corpus is finite, with estimates on the order of 20 trillion (20T) tokens. This ceiling poses a significant hurdle for the continued scaling of LLMs, as the available data cannot keep pace with the ever-growing demands of more sophisticated models.

The Role of Synthetic Data

One potential solution to this conundrum is the generation and utilization of synthetic data. Synthetic data, artificially created rather than obtained by direct measurement, can be tailored to specific needs and potentially provide an inexhaustible source of information for training LLMs. This could represent a paradigm shift in how training datasets are compiled and used.
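One concrete recipe for synthetic data is verifier-filtered self-training: sample candidate solutions from a model, keep only the ones a checker verifies, and fine-tune on the survivors. The toy below is a deliberately simplified sketch of that loop; the "model" is a noisy arithmetic solver and every name in it is a hypothetical stand-in, not a real training pipeline.

```python
# Toy sketch of verifier-filtered synthetic data generation (in the
# spirit of self-training loops such as ReST). Everything here is a
# hypothetical stand-in: the "model" is a noisy arithmetic solver.
import random

random.seed(0)

def noisy_model(a: int, b: int) -> int:
    """Stand-in for an LLM: answers a + b, but is sometimes wrong."""
    answer = a + b
    if random.random() < 0.3:          # assumed 30% error rate
        answer += random.choice([-1, 1])
    return answer

def verifier(a: int, b: int, answer: int) -> bool:
    """Ground-truth checker; in practice this could be unit tests,
    a proof checker, or majority voting."""
    return answer == a + b

# Sample candidate solutions, then keep only verified ones as
# synthetic training pairs.
problems = [(random.randint(0, 99), random.randint(0, 99)) for _ in range(1000)]
synthetic = [((a, b), noisy_model(a, b)) for a, b in problems]
filtered = [(p, ans) for p, ans in synthetic if verifier(*p, ans)]

print(f"kept {len(filtered)}/{len(synthetic)} verified examples")
# Every kept example is correct, so fine-tuning on the filtered set
# cannot reinforce the model's own mistakes.
```

The key property is that the filter, not the generator, supplies the signal: the synthetic set is only as trustworthy as the verifier, which is why domains with cheap checkers (math, code) have seen the biggest gains from this approach.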

The Promise of Multimodal Data

Another avenue is multimodal data, which is particularly intriguing when considered against the human learning process. Humans do not rely solely on textual information to develop intelligence and understanding; we learn from a rich tapestry of sensory experiences: visual, auditory, and kinesthetic. This human-inspired view suggests that LLMs could also benefit significantly from more diverse, multimodal datasets.

Incorporating various types of data, such as images, videos, and audio, alongside traditional text, can provide a more holistic learning experience for AI models. This method could potentially reduce the sheer volume of text data required by imitating the human ability to derive complex understandings from multiple data types. By leveraging multimodal data, we can aim to create AI that not only processes information more efficiently but also understands and interacts with the world in a way that is more akin to human cognition. This approach could be a key stepping stone towards more advanced, nuanced AI systems, capable of better understanding and interacting within our multifaceted world.
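One way to see why multimodal data is attractive as a supply-side fix is a rough token-count comparison. The per-hour token rates and hour counts below are loose assumptions invented for illustration; only the ~20T text figure comes from the description above.

```python
# Rough sketch: how much could non-text modalities expand the token
# supply? The hour counts and tokens-per-hour rates are loose
# assumptions for illustration only.

TEXT_CORPUS = 20e12            # ~20T text tokens (estimate quoted above)

# hypothetical: (hours available, tokens per hour once encoded)
modalities = {
    "video": (1e9, 1e6),       # 1B hours, ~1M visual tokens/hour
    "audio": (1e9, 5e4),       # 1B hours of speech, ~50k tokens/hour
}

multimodal_tokens = sum(hours * rate for hours, rate in modalities.values())
print(f"text: {TEXT_CORPUS:.1e} tokens")
print(f"multimodal (assumed): {multimodal_tokens:.1e} tokens")
```

Even under conservative encoding rates, non-text modalities dwarf the text corpus by orders of magnitude; the open question the market is really asking is whether those tokens improve language ability, not whether they exist.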

Resolution Criterion

The resolution of this market will be based on the consensus as of January 1, 2026. It will evaluate which approach – synthetic data, multimodal data, a combination of both, or neither – has emerged as the predominant solution to the data shortage challenge for LLMs. The market will consider expert opinions, published research, and industry trends to determine the resolution. Participants are encouraged to predict and trade based on which solution they believe will gain the most traction in addressing the data needs of future LLMs.



Comments
9mo

Resolution on this is tricky.

9mo

Hope by 2026, it will be trivial

12mo

When I created this market, I believed multimodal was the way to go. Yet in the past 6 months, while no evidence suggests multimodal training benefits LLMs, we have seen a huge improvement in reasoning ability thanks to synthetic data.

Both Claude 3 and Llama 3 suggest that most of the utility boost comes from synthetic data.

1y

https://arxiv.org/abs/2312.06585

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

1y

What do you guys think of the new Gemini? It seems like multimodal data played a big role. Maybe AlphaCode 2 can be seen as an example of using synthetic data.

1y

How does this market resolve if the expert consensus is that both methods have failed?

1y

@NathanShowell Then both resolve to NO. This question is multiple choice with multiple answers; the two choices will resolve independently.

1y

@HanchiSun They’re linked and add up to 100%, which is usually understood to imply one and exactly one resolves YES

1y

@TheBayesian That looks like a coincidence. Many people believe both will resolve YES, so the 100% total just means some others think they will both resolve NO.

Ohhh you're right! That's a funny coincidence

1y

I expect there not to be consensus, and that each approach will have large advantages depending on the use case. Sometimes you can use multimodal data, and in those cases you use it; but sometimes you can't, and you use synthetic data. In that case, how does this market resolve?

1y

@TheBayesian We have several years before settling on the resolution criterion.

Maybe one way to do it is to consider which method the best model at the time uses.
