The human baseline is now 83.7%. It's unfortunate that the old baseline is in the market's name, but I will resolve YES if any model exceeds the human baseline published on https://simple-bench.com.
We have a new reported human baseline (83.7%). Is this question about the 92% figure or about human-level performance?
Seems unlikely without a major paradigm shift. 27% is SOTA, and it doesn't seem to be increasing much with successive model generations.
Is it true that this benchmark could be anything, and could be changed at any point? There are no hashes, no large public sample of problems, no error bars, no evaluation code, no specifics on what a model can or cannot use... How do we know what the true performance is, other than what the author says?
Description of the benchmark here: https://simple-bench.com/about.html
I have made some irrational bets to subsidize the market, as I cannot be bothered to figure out the correct way to do this.