"AGI" is defined here as a computer program that can do any intellectual task that a human can do, excluding unfair cases like asking for secret information that only the human knows or testing sub-second reaction time. It must be able to do the task at least as reliably and as quickly as the most skilled living human (at market creation) could do it.
It also must be capable of running on all of the Earth's computing power as of market creation, to prevent trick answers like "use an LLM to implement brute force search on 10^10 galaxies worth of GPUs". For training, it must be trainable within 100 years.
In the event of disagreement over whether something counts as AGI, I will err towards no; it needs to be pretty clear that computers have attained intellectual parity with humans. e.g. it must be the case that there is no longer any economic reason why a company would want to hire humans over AIs, excluding physical tasks and things like "our customers will pay more for human outputs" or "we don't trust the AIs to not defraud us".
"LLM" is defined here as any neural network that's trained primarily on human text. It can be multimodal, but any non-text inputs like images or synthetic mathematical data must constitute less than 50% of the training data. Reasoning models and agentic frameworks do count as LLMs, provided the scaffolding is just some form of feeding its answers back into itself, giving it multiple scratchpads, or other simple things like that. It cannot add any external system that would itself seem "intelligent".
In the event of disagreement over whether something counts as an LLM, I'll err towards seeing how other people - particularly those who developed the AGI and other experts in the field - are using the term and whether they think it applies. If there's disagreement among experts, I'll err towards no; it must be pretty unambiguously an LLM.
This market resolves once there's a broad consensus as to the correct answer, which likely won't be until after AGI has been reached and humanity has a much better understanding of what "intelligence" is and how it works.
This definition is open to minor modifications if a problem in it is pointed out to me that makes it not align with the spirit of the title. In the event of severe disagreement over what constitutes an LLM or AGI according to this description, I'll defer to a vote among Manifold users.
Oct 2023 worldwide compute was around 1e22 FLOP/s.
There are around 3e7 seconds in a year.
That gives us an upper bound of ~3e29 FLOP for a model trained in a year with Oct 2023 levels of compute.
The most compute-hungry frontier model (Grok 4) used around 5e26 FLOP according to epoch.ai.
So under the reading that the model must be trainable in a year at Oct 2023 levels of compute, we can grow by ~3 OoMs beyond current frontier models (not much!).
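A quick sanity check of that arithmetic as a minimal Python sketch (the constants are just the rough estimates quoted above, treated as order-of-magnitude assumptions, not authoritative figures):

```python
import math

world_compute_flops = 1e22   # est. worldwide compute, Oct 2023 (FLOP/s)
seconds_per_year = 3e7       # ~1 year in seconds
grok4_train_flop = 5e26      # epoch.ai's rough estimate for Grok 4

one_year_budget = world_compute_flops * seconds_per_year       # ~3e29 FLOP
headroom_oom = math.log10(one_year_budget / grok4_train_flop)  # ~2.8
print(f"one-year training budget: {one_year_budget:.0e} FLOP")
print(f"headroom over Grok 4: ~{headroom_oom:.1f} OoM")
```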
Alternatively, if we read
"It also must be capable of running on all of the Earth's computing power as of market creation"
as "at inference time, it must be able to run at least one instance with 1e22 FLOP/s", without requiring that the model be trainable with that computing power in less than several years, that makes the task (much) easier.
If we need the LLM to generate 5 tokens/s, that's
1e22 FLOP/s = 2 FLOP/param/token * x params * 5 tokens/s
-> x ~= 1e21 active params
GPT-4 is around 1.8T (1.8e12) total params and around 200B (2e11) active params.
So the largest model we can run at inference time while using all of Oct 2023's compute has around 5e8 times more total params (5e9 times more active params) than GPT-4, and than current frontier models (which are likely around the size of GPT-4).
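Under that reading, the same arithmetic as a short sketch (the GPT-4 sizes are the rumored figures quoted above, treated as assumptions):

```python
# Solve 1e22 FLOP/s = 2 FLOP/param/token * N_active * 5 tokens/s for N_active.
world_compute_flops = 1e22       # Oct 2023 worldwide compute (FLOP/s)
flop_per_param_per_token = 2     # forward-pass rule of thumb
tokens_per_second = 5

n_active = world_compute_flops / (flop_per_param_per_token * tokens_per_second)
print(f"max active params at inference: {n_active:.0e}")   # ~1e21

gpt4_total, gpt4_active = 1.8e12, 2e11   # rumored GPT-4 sizes (assumption)
print(f"ratio vs GPT-4 total params:  {n_active / gpt4_total:.1e}")   # ~5e8
print(f"ratio vs GPT-4 active params: {n_active / gpt4_active:.1e}")  # ~5e9
```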
That's the same difference as the gap between GPT-2 and GPT-4, ~4 times over (so ~GPT-12). However, to make the most of these parameters (to make the most out of 1e21 active params), we would need around 1000 tokens of diverse, high-quality training data per active parameter, meaning 1e24 tokens (more than 1000 probably doesn't hurt, but the Chinchilla-optimal amount is 20 tokens per param and current frontier models train on around 1000 tokens per param, so that seems a good baseline). Current non-junk internet tokens are on the order of 50T (5e13) tokens, so we'd need around 10 OoM more human text, which would require a lot of effort from humans generating really high-quality training data. It could be done across decades or centuries and stuff.. (edit: more like millions of years), or we could probably do with not saturating the params. The FLOP required to train such a model optimized to resolve this market positively would be around
C_train = 6 * N * D ~= 6 FLOP/a-param/tok * 1e21 a-params * 1e24 tok ~= 6e45 FLOP
Which at Oct 2023 amounts of compute would take
6e45 FLOP / 1e22 FLOP/s ~= 6e23 seconds, i.e. 2e16 years or 20,000 trillion years, to train. um..
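A minimal sketch of that estimate, assuming the standard C ≈ 6·N·D training-cost approximation and the token budget above:

```python
# Chinchilla-style training cost C_train ≈ 6 * N * D for the hypothetical model above.
n_active = 1e21              # active params from the inference-bound estimate
tokens = 1e24                # ~1000 tokens per active param, as assumed above
world_compute_flops = 1e22   # Oct 2023 worldwide compute (FLOP/s)
seconds_per_year = 3e7

c_train = 6 * n_active * tokens                      # ~6e45 FLOP
train_years = c_train / world_compute_flops / seconds_per_year
print(f"C_train: {c_train:.0e} FLOP")                # ~6e45
print(f"wall-clock time: {train_years:.0e} years")   # ~2e16 years
```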
Another issue is that at that scale, the bottleneck would not be FLOP, it would be memory and bandwidth. I'm not sure if those are limited to Oct 2023 values.
With 0.4 OoM/year in algorithmic improvements according to epoch.ai, it seems likely we're not near an upper bound of training efficiency. With better data quality, AI models learn much more efficiently, and we are also likely not near an upper bound in data quality. Many things LLMs are "dumb" about, like modified puzzles, are due to inductive biases caused by internet data, like the same puzzle appearing 10,000 times in the training corpus with modifications of it appearing dozens of times at most. Those issues can be alleviated with better data quality.
An important bottleneck mentioned in the description is
"LLM" is defined here as any neural network that's trained primarily on human text. It can be multimodal, but any non-text inputs like images or synthetic mathematical data must constitute less than 50% of the training data
Text inputs are highly information-dense compared to images, so this might be an important limiting factor. Most bits of sensory info humans get are visual, but they are much more repetitive / compressible. It's not clear to me that a "well-rounded LLM" with textual and visual capabilities similar to a human's wouldn't need e.g. 99% of its training tokens to be visual or multimodal. But overall I don't suspect that to be the case, and/or you could probably give it a bunch of junk text at a low learning rate to cheat around this issue. Maybe that's not allowed though.
I think my main uncertainty is in what
It also must be capable of running on all of the Earth's computing power as of market creation
means, but if it means the more advantageous interpretation, it seems pretty obvious to me that LLMs at that upper bound of compute, alongside upper bounds of data quality and algorithmic improvements pushed close to realistic limits, generating a single output stream from the entire compute availability of the world in Oct 2023, would be more than enough for a system that is AGI. If we take the less advantageous interpretation, it would be quite hard, but I'd lean toward algorithmic improvements and years-long humanity-wide data curation being enough for that.
@Bayesian Good point on inference vs. training. I don't want to be too restrictive, but it still needs to be realistic, so I added a cap of 100 years to train.
I think this is an extremely important question, so I've added an M$10,000 subsidy in an attempt to get more action here. Granted this doesn't matter much if you don't expect it to resolve for decades, but hey, I tried.
I've rewritten the description to be more clear, and sold my stake since it might end up being a little subjective.
I think scaling something like a ~vanilla transformer is probably insufficient to reach the required OOD sample efficiency in learning, but I think this is probably doable with smth like one deep insight™️ that produces an architecture that's transformer-ish but also meaningfully different. "Unambiguously" is the key bit of phrasing.
Also, LLM is not equal to transformer, if we are going by the exact phrasing this is a question like “could all the compute in the world, ~perfectly pointed towards language modeling, produce AGI” and the answer there is obviously yes.
Unless by exist you mean extant instead of the platonist sort of sense of exist, in which case shrug, back to being an interesting question. This is a sufficiently ill-defined question that I’ll leave this market.
This effectively cannot resolve to no and will just resolve to yes as soon as AGI exists whether that be in 5 years or 50 lol
EDIT: my point is about the wording logic of the market not about LLMs or AGI
@FriendlyMerc this question should not be resolved as soon as AGI exists, but only once its architecture is known.
Are agentic systems purely an LLM? I don't know their architecture, but I highly doubt they are "unambiguously an LLM".
Surely an LLM is an important part. But LLMs - I guess - will just be one important part among many if they play a role in AGI. (A car is not unambiguously a motor with seats either. There's more to it, e.g. a gearbox.)
TL;DR: If LLMs are one of several parts of comparable importance in an AGI system, I definitely do not consider this system "unambiguously an LLM". I think we are still a few breakthrough technologies away from AGI.
@bbb Pretraining on language is the primary way large language models reach the level of intelligence (and learning capability) that they already do in practice. However, the question is, in the limit, can LLMs reach AGI-level, with the required efficiency? This is a matter of understanding universality (oversimplifying a bit here for brevity) and complexity classes. Current LLMs are constant-time token predictors, so it's extremely unlikely that inference efficiency will be a problem on the order of causing this to resolve NO (there are also alternative samplers with different levels of efficiency which I won't discuss). As for universality, LLMs can learn to use tools in the same way humans do, applying their general intelligence to solve problems. LLMs learn to search over spaces of circuits which are able to solve problems they haven't seen before; it's a general problem-solving method, and successes and failures can be used to inform future searches.
If you're familiar with the rate of improvement of LLM coding abilities (i.e. program search), you'll appreciate how much territory search abilities can cover already in practice, and that the limits of those abilities are far from saturated. Just taking the rate of improvement of coding abilities of Claude Sonnet 3.5 (old), Sonnet 3.5 (new) and then Sonnet 3.7, there was major ability improvement between each release, and each release was only 4 months after its predecessor. For those requiring direct evidence of trajectories, that should be a significant update.
There are many papers demonstrating that transformers can be improved in many ways, without needing to train on other modalities, but I'm making trades based on my own independent awareness of what's feasible. While this market is not really an open question from my perspective, you're right that the question would not likely resolve as soon as AGI exists, because it's unlikely that the first lab to produce a broadly general model will detail its architecture and the training methods used. I'm betting at the rate I am to signal that I have knowledge of the way the market would resolve, independent of valuing the payoff.
Emmett Shear: "It has been increasingly obvious that "just scale up transformers bigger" is not going to lead to human level general intelligence. [...]"
https://x.com/eshear/status/1858660987530023148?t=0t5lYNS07G1Txp1uGa_E2Q&s=19
"Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, told Reuters recently that results from scaling up pre-training - the phase of training an AI model that use s a vast amount of unlabeled data to understand language patterns and structures - have plateaued."
@MartinRandall imo giving it training data like: "these are the thousand shortest ways to create an AGI" would not make the LLM itself an AGI.
What hypothetical data do you have in mind?
@MartinRandall Has to be human language in order to count as a language model, but it can be any such human language, yeah.
Does this count LMMs like GPT-4o as LLMs?
i.e. is the question more: are autoregressive transformers capable of reaching AGI? Or is the transformer architecture capable of reaching AGI? (including things like Sora)
How does this question resolve if the architecture uses LLMs as the crucial subcomponent behind its intelligence, but nonetheless its overall architecture isn't an LLM? Specifically I'm thinking of agentic systems like AutoGPT, which have a state-machine architecture with explicitly coded elements like short-term and long-term memory, but use LLMs to form (natural-language) plans and decide which state transitions should be made. If these systems become AGI when LLMs are scaled up, how does the question resolve?
@Mira Each sub-program may be an LLM, but I think you'd be hard-pressed to say that the overarching one is. Also, it would be too slow to qualify as an AGI. Same problem faced by the computable variations of AIXI.
@IsaacKing Oh no, I meant a single model, frozen and unchanging during the whole process, which when clocked implements a universal dovetail. So there would be only one program.
But it would take more than 1000 years to destroy humanity, so your update wouldn't count it...
@Mira Oh, I see. Yeah that's not what I had in mind, so I've edited the description to fix that.
@IsaacKing Also Mira's proposal would not work in the real world, not even after 1000 years. The machinery / memory / whatever would fail long before anything intelligent happened.