Does AI Pareto-dominate technical but non-mathematician humans at math?
36 traders · Ṁ3968 · Jan 11 · 20% chance

Original title: Has AI surpassed technical but non-mathematician humans at math?

I'm making this about me, and not betting in this market.

This is like my superhuman math market but with a much lower bar. Instead of needing to solve any math problem a team of Fields medalists can solve, the AI just needs to be able to solve any math problem I personally can solve.

And I'm further operationalizing that as follows: by January 10, will any commenter be able to pose a math problem that the frontier models fail to give the right answer to but that I can solve? If so, this resolves NO. If, as I currently suspect, no such math problem can be found, it resolves YES.

In case it helps calibrate, I have an undergrad math/CS degree and a PhD in algorithmic game theory and I do math for fun but am emphatically not a mathematician and am pretty average at it compared to my hypernerd non-mathematician friends. I think I'm a decent benchmark to use for the spirit of the question we're asking here. Hence making it about me.

FAQ

1. Which frontier models exactly?

Whatever's available on the mid-level paid plans from OpenAI, Anthropic, and Google DeepMind. Currently that's GPT-5.2-Thinking, Claude Opus 4.5, and Gemini 3 Pro.

2. What if only one frontier model gets it?

That suffices.

3. Is the AI allowed to search the web?

TBD. When posing the problems I plan to tell the AI not to search the web. I believe it's reliable in not secretly doing so, but we can either (a) talk about how to be more sure of that or (b) decide that it's fair game and we just need to find ungooglable problems.

4. What if the AI is super dumb but I happen to be even dumber?

I'm allowed to get hints from humans and even use AI myself. I'll use my judgment on whether my human brain meaningfully contributed to getting the right answer and whether I believe I would've gotten there on my own with about two full days of work. If so, it counts as human victory if I get there but the AIs didn't.

5. Does the AI have to one-shot it?

Yes, even if all it takes is an "are you sure?" to nudge the AI into giving the right answer, that doesn't count. Unless...

6. What if the AI needs a nudge that I also need?

This is implied by FAQ 4, but if I'm certain that I would've given the same wrong answer as the AI, then the AI needing the same nudge as me means I don't count as having bested it on that problem.

7. Does it count if I beat the AI for non-math reasons?

For example, maybe the problem involves a diagram in crayon that the AI fails to parse correctly. This would not count. The problem can include diagrams but they have to be given cleanly.

8. Can the AI use tools like writing and running code?

Yes, since we're not asking about LLMs specifically, it makes sense to count those tools as part of the AI.

9. What if AI won't answer because the problem contains racial slurs or something?

Doesn't count. That's similar to how you could pose the question in Vietnamese and the AI wouldn't bat an eye but I'd be clueless. Basically, we'll translate the problem statement to a canonical form for standard technical communication.

10. Are trick questions fair game?

No, those are out. Too much randomness, both for the AI and for humans, in whether one spots the trick.

11. How about merely misleading questions?

We'll debate those case-by-case in the comments and I may update this answer with more general guidelines. In the meantime, note the spirit of the question: how good AI is at math specifically.

(I'm adding to the FAQ as more clarifying questions are asked. Keep them coming!)

Related Markets

[ignore auto-generated clarifications below this line; nothing's official till I add it to the FAQ]

  • Update 2025-12-15 (PST) (AI summary of creator comment): If the AI can provide code that the creator can run locally to get the correct answer, that counts as the AI giving the correct answer. This applies even if the AI's sandboxed environment cannot run the code due to computational intensity limitations.


5.2 just failed this for me:

“Consider a 3x3 diagonal matrix. The trace is the (signed) length of a single path along the edges of the box, from the origin to the furthest tip on the cube, correct?”

Edit: Gemini gets it correct though

@SorenJ Yeah, from talking to Gemini I'm guessing the box being referred to is the one with one corner at the origin and with x/y/z dimensions given by the rows of the matrix. In which case I would've said yes, the trace gives the distance along edges from corner to opposite corner. But then I talked to ChatGPT which pointed out the ambiguity of "(signed) length". In one interpretation the answer should be no, you need to sum the absolute values, not just take trace of the matrix. In another interpretation, we're taking a line integral and the answer is yes.

Is one of those what you had in mind? I think that either way, sadly, my human brain had nothing to contribute in getting there.

@dreev ChatGPT does this overly nuanced thing where it defensively hedges and makes distinctions without a difference. This isn't an egregious example, but the term “signed length” already resolves the ambiguity on its own.

Anyway, this one doesn't count as a failure. I will try to look for others.

Just to start feeling out what the questions will look like:

This is a question that the current models I tried get wrong (Claude Opus 4.5 and Gemini 3, but only the free GPT-5, since I only have a Claude subscription). Is this the sort of question you're looking for?

f(x) = (x - b)^T A (x - b) with A and b unknown. The minimum of f is found by the following procedure:

start with a population of size k sampled from a unit gaussian

repeat:
fit a gaussian to the population
sample from a gaussian with the covariance of the population, but a mean centered on the individual with lowest f, and add this to the population.
cull the highest f from the population

In the limit of increasing precision with which the minimum must be found, what is the optimal k if x has dimension 11?
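The procedure above can be sketched in Python. This is a hypothetical simplification, not the exact setup from the question: it fixes A = I (one convenient PSD choice, since the problem leaves A unknown), fits a diagonal rather than full covariance, and uses a small dimension.

```python
import random

def f(x, b):
    # Quadratic objective f(x) = (x - b)^T A (x - b), here with A = I
    # (a PSD choice made for illustration; the original A is unknown).
    return sum((xi - bi) ** 2 for xi, bi in zip(x, b))

def evolve(d=2, k=6, iters=300, seed=0):
    rng = random.Random(seed)
    b = [1.0] * d  # hidden optimum, unknown to the optimizer
    # Start with a population of size k sampled from a unit gaussian.
    pop = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    initial_best = min(f(p, b) for p in pop)
    for _ in range(iters):
        # Fit a (diagonal) gaussian to the population.
        mean = [sum(p[j] for p in pop) / len(pop) for j in range(d)]
        var = [sum((p[j] - mean[j]) ** 2 for p in pop) / len(pop)
               for j in range(d)]
        # Sample with the population's covariance but centered on the
        # individual with lowest f, and add it to the population.
        best = min(pop, key=lambda p: f(p, b))
        child = [rng.gauss(best[j], max(var[j], 1e-18) ** 0.5)
                 for j in range(d)]
        pop.append(child)
        # Cull the individual with highest f.
        pop.remove(max(pop, key=lambda p: f(p, b)))
    return initial_best, min(f(p, b) for p in pop)
```

Since the worst individual is culled each round, the best objective value in the population never gets worse; how quickly the sampling distribution collapses onto the optimum is exactly what the choice of k governs.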

@HastingsGreer This problem is underspecified. When you say "optimal k", optimal by what exact measure?

Also if A is not PSD, then no minimum needs to exist. So the problem is also ill-posed.

Sorry, yes it needs to be specified that A is PSD. Optimal for convergence speed.

I agree that this problem is getting into the weeds. The GPT-5 generation of models cracked the last of the very nice problems I had sitting around for this sort of question. A few months ago I would have posed:

Fix complex A and then characterize the solutions of A \bar{x} = \lambda x,

Fix k and characterize the invertible functions g, f satisfying g = (g \star k) \circ f

but these are now one-shotted by at least one of the models (only Gemini gets the first, only GPT-5 gets the second, so they're vaguely on the boundary).

@HastingsGreer I found a question that's inarguably a math question and could force a NO resolution, but only because the easy solution is to find and use an interactive website; the models even get as far as finding the website but can't use it.

What is the 138127510th digit of pi? (not counting the three, so the first digit is 1)

I don't think this is in the spirit, however.

@HastingsGreer Interesting. I can crunch that out with Mathematica like so:

RealDigits[Pi, 10, 1, -138127510][[1,1]]

It takes several minutes. Chatbots' sandboxed environments don't let them run something that computationally intensive.

I tentatively agree that this shouldn't count and am inclined to add an FAQ item along the lines of, like, if the AI can give me code I can run locally to get the answer, that counts as it giving the correct answer. What do you think?
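(Computing the 138,127,510th digit is far beyond a simple script; serious implementations use BBP- or Chudnovsky-style algorithms, as Mathematica presumably does under the hood. But as a small illustration of "code I can run locally" for pi digits, here's Gibbons' unbounded spigot algorithm in Python, which streams decimal digits of pi and is practical only for the first few thousand.)

```python
from itertools import islice

def pi_digits():
    # Gibbons' unbounded spigot algorithm: yields the decimal digits
    # of pi one at a time (3, 1, 4, 1, 5, 9, ...), using only
    # arbitrary-precision integer arithmetic.
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is safely determined; emit it and rescale.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough information yet; consume another term.
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

# First ten digits, including the leading 3:
print(list(islice(pi_digits(), 10)))  # → [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```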

Can you share what experience you have yourself using recent AI for math? I'm somewhat confused, because in my experience, if you give yourself two days, I wouldn't expect it to even be close for someone with this type of background, but the fact that you're creating this market seems to imply you're at least pretty uncertain.

@consnop I'd say I pose math problems to AI on a weekly basis at least, sometimes more often. I seeded this market with Ṁ1k of liquidity and my initial guess for the probability was 67%. With higher liquidity I probably would've gone lower. Which is to say that some of the uncertainty is about how hard people will hunt for examples. My own impression is that, with the latest frontier models, at least 1 of the 3 of them always gets the right answer for anything I throw at them. This wasn't true a month ago, and my sample size is too low to be very sure it's true today.

It sounds like you're saying that, even with the latest models, it's not uncommon for you to see them fall on their face on math problems that aren't that hard. I'm especially eager to see those examples!

@dreev have to admit I haven't played too much with the latest batch, but I have the impression it would have to be a much bigger jump than I think likely to make this resolve YES. I'll give finding an example a shot if I have a bit of spare time!

@dreev Why 67% specifically

@121 Purely proctogenic! I haven't seen a normal-person-solvable math problem that the latest AI models are stumped by and I started to suspect that there just are no such problems. My confidence is tempered by how good some of y'all are at finding ways to make AI fall on its face.

If it's phrased in a misleading way for AI, does it count?

What if I phrase it in a way that makes the AI refuse to answer it? The AI would technically be failing to answer correctly.

bought Ṁ300 NO

Yeah, this is driving my NO bet. AI isn't at human level on either ARC or SimpleBench, which are, broadly speaking, "math problems".

@Usaar33 Oh, my feeling is that that's too broad a definition of math problem and would violate the spirit of the question to resolve NO for that reason. Like if you pose a problem with a bunch of racial slurs and the AI clams up, that's similar to how you could pose the question in Vietnamese and the AI wouldn't bat an eye but I'd be clueless.

Basically, translating the problem statement to a canonical form for standard technical communication should be allowed.

But just being phrased misleadingly? My gut reaction is that that's fair game but we should probably look at examples before making an official verdict there. I think full-on trick questions should probably be out. Too much randomness, both for the AI and for humans, in whether one spots the trick.

@dreev Sometimes AI refuses to answer innocent questions because they resemble queries for formulas that they are instructed to turn down. For example, I asked it about a physics question and it refused to answer because it misjudged the context. This is disqualified, right?

@Velaris I expect that for math problems we can always find an equivalent statement that avoids any false positives on the AI's refusal criteria. See FAQ9. But if you have a potential counterexample, let's definitely discuss it and decide what's fairest.

Didn't GPT-5 fail to get that super easy bagel-splitting question right? Does that count for this market?

@ItsMe Oh, crap, I forgot about that one! Alright, do you want to pose it again here? AI has gotten a lot better since then, so we'll see. (But I think the market probability should be falling right now, before having checked myself.)

Are calculators better than humans at math? It's probably the same answer as that.

@ItsMe Ha, true. But I think we're robust to that technicality. Namely, it doesn't matter that there are infinitely many problems that AI (or just calculators) can solve that I can't. We're asking whether there exists a single math problem that AI can't solve but that I can.

Problems can be presented informally, correct? Are you allowed to search the internet yourself? Are physics problems allowed? Process optimization questions?

@Usaar33 Yes to presenting problems informally. I will try not to search the internet but of course I may already know some problems. Ultimately I'm making the judgment about whether I could have solved a problem on my own (plus Mathematica, let's say).

As for physics and process optimization problems, I'll use my judgment on whether they also count as math problems. Or we can debate it here in the comments.
