What year will the first AI exceed 80% on MLE-bench?

Ṁ6711

2031

2024: 34%
2025: 33%
2026: 13%
2027
2028
2029
2030
After 2030

https://arxiv.org/abs/2410.07095

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (this http URL) to facilitate future research in understanding the ML engineering capabilities of AI agents.

Resolves on the paper's primary metric: an AI achieves at least a Kaggle bronze medal in >=80% of competitions, without explicitly training on the test set.
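To make the threshold concrete, here is a minimal Python sketch of the check. The 75-competition count comes from the abstract above; the function names and the single-run list of medal flags are hypothetical, and the paper's reported rates may be averaged over several runs, which this sketch does not model.

```python
NUM_COMPETITIONS = 75   # from the MLE-bench abstract
THRESHOLD_PERCENT = 80  # from the market's resolution criterion

def min_medals_needed(n=NUM_COMPETITIONS, percent=THRESHOLD_PERCENT):
    """Smallest number of bronze-or-better results that reaches the threshold.

    Integer ceiling division avoids floating-point rounding surprises.
    """
    return (n * percent + 99) // 100

def meets_threshold(medal_flags, percent=THRESHOLD_PERCENT):
    """True if the share of competitions with bronze or better is >= percent.

    medal_flags is a hypothetical list with one boolean per competition.
    """
    medals = sum(1 for flag in medal_flags if flag)
    return medals * 100 >= percent * len(medal_flags)
```

Under these assumptions, `min_medals_needed()` gives 60 for the 75 competitions, so 59 medals would fall just short.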

This question is managed and resolved by Manifold.

#Technology

#AI

#OpenAI

#Technical AI Timelines

#AI Safety

6 Comments

37 Holders

160 Trades

shouldn't this market add up to 100% across the relevant options?

(I think normally when people make markets like this, they do "By when" so that the relevant property is just monotonicity. See e.g. my market here: https://manifold.markets/RyanGreenblatt/by-when-will-85-be-reached-on-the-p?play=true)

bought Ṁ25 NO

@RyanGreenblatt Yeah, I just bought NO on everything which seems like free mana
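A rough sketch of why buying NO on everything can look like free mana, assuming the year options are mutually exclusive and exhaustive so that exactly one resolves YES. The function name is hypothetical, and the sketch ignores fees and the price movement caused by each purchase, so it is an idealised bound, not a description of how Manifold fills orders.

```python
def no_basket_profit(yes_probs):
    """Guaranteed profit from buying one NO share on every option.

    Each NO share costs (1 - p) and pays 1 unless that option resolves YES.
    With exactly one YES, all but one of the NO shares pay out, so the
    profit equals sum(yes_probs) - 1.
    """
    cost = sum(1.0 - p for p in yes_probs)
    payout = len(yes_probs) - 1
    return payout - cost
```

The result is positive only when the listed probabilities sum to more than 100%, which is the inconsistency the comment above points at. If the options sum to exactly 100%, the basket breaks even.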

Primary metric, so >=80% bronze+ without explicitly training on the test set for memorisation.

opened a Ṁ250 YES at 20% order

@NMcA Should add this to question text.

80% bronze+?
