Update 2025-11-05 (PST) (AI summary of creator comment): Resolution will be based on:
Minimal agent configuration (as described on the SWE-bench Verified website)
No parallel test time compute
Anthropic's official reporting of the score
Surprised this is 80%. Is everyone thinking that Anthropic will manipulate the evals due to facing pressure from Google?
https://manifold.markets/JaundicedBaboon/will-claude-opus-45-achieve-a-sota
Made a similar market about swe-rebench to test this.
@JaundicedBaboon Yeah, I don't get it. If anything, rumors of an earlier Opus 4.5 release should lower expectations for this score, yet the market is moving in the opposite direction.
A 2.8% jump in 2 months (1.4% a month) is somewhat faster than the rate of progress over the second half of this year (~1.2% a month). Not only that, but a YOLO-type release would be expected to show less progress than a well-timed one (Opus 4.1 reached only 74.5%, under 1% a month of progress).
My expectation is ~79% for a release this week.
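For anyone who wants to check the arithmetic, here is a rough sketch of the rate comparison. It assumes the target is 80.0% (inferred from the 77.2% Sonnet 4.5 score plus the quoted 2.8-point jump) and takes the 2-month window and the ~1.2 points/month trend straight from the comment; none of these are official figures, and the function names are just illustrative.

```python
# Back-of-the-envelope check of the rates quoted in the comment above.
# Assumptions (not confirmed by the thread): the target is 80.0%, inferred
# from 77.2% plus the quoted 2.8-point jump; the window is 2 months.

BASELINE_SCORE = 77.2  # Sonnet 4.5 score cited in the thread (%)
TARGET_SCORE = 80.0    # assumed threshold: 77.2 + 2.8
MONTHS_ELAPSED = 2     # window quoted in the comment
RECENT_RATE = 1.2      # approximate points per month, as quoted


def required_rate(baseline: float, target: float, months: float) -> float:
    """Points per month needed to move from baseline to target."""
    return (target - baseline) / months


def projected_score(baseline: float, rate: float, months: float) -> float:
    """Score reached if progress continues linearly at the given rate."""
    return baseline + rate * months


needed = required_rate(BASELINE_SCORE, TARGET_SCORE, MONTHS_ELAPSED)
trend = projected_score(BASELINE_SCORE, RECENT_RATE, MONTHS_ELAPSED)

print(f"needed: {needed:.2f} points/month")
print(f"trend projection: {trend:.1f}%")
```

Under these assumptions the linear trend falls a bit short of the assumed 80% target, which is the gap the comment is pointing at. A linear extrapolation is of course a crude model of benchmark progress.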
@Usaar33 Keep in mind Claude Opus 4 scored lower on SWE-bench than Sonnet 4. I wouldn't be surprised if Opus doesn't even get 78%.
@JaundicedBaboon So Sonnet 4.5's score under this standard would have been 77.2%? Just want to be sure I understand the resolution criteria.