Which of these language models will I beat at chess?
51
Ṁ4551
Aug 31
98%
Grok 3
97%
GPT-4.1 nano
96%
GPT-4.1 mini
96%
Claude Sonnet 4
96%
Gemini 2.0 Flash
96%
Gemini 2.5 Flash-Lite
96%
Claude 3 Opus
96%
Gemini 1.5 Flash
96%
Gemini 2.5 Pro
96%
Claude 3.5 Sonnet
96%
Gemini 2.5 Flash
96%
o3-mini-high
96%
o3
96%
Grok 3 Mini
96%
Claude Opus 4
95%
Claude 3.7 Sonnet
95%
GPT-5
95%
Every model released before 2025
90%
Any model released in 2026
90%
o1

Which of these models will I beat at chess? Resolves YES if I win, NO if they win, and 50% for a draw.

Credit for this market goes to @mr_mino, who is much better at chess than I am.

This market should be interesting, as I expect that some existing models could already beat me. I have never played rated chess; I have not played a game of chess of any kind in years.

I will close this market every Saturday. When it closes, I will play a game of chess against the model with the highest market price, if the model is publicly available. Otherwise, I'll move on to the model with the second-highest price, and so on. If no models on this market are available to the public, the market will reopen until one is.
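The weekly selection rule above can be sketched in a few lines of Python. This is only an illustration: the function name `pick_opponent`, the price dictionary, and the placeholder model names are assumptions, not something the market actually runs.

```python
from __future__ import annotations

from typing import Optional


def pick_opponent(prices: dict[str, float], available: set[str]) -> Optional[str]:
    """Return the highest-priced model that is publicly available, or None.

    `prices` maps a model name to its market price at close (hypothetical data);
    `available` is the set of models that can actually be played this week.
    """
    # Walk the answers from highest to lowest price and take the first playable one.
    for name, _price in sorted(prices.items(), key=lambda kv: kv[1], reverse=True):
        if name in available:
            return name
    return None  # nothing playable: the market reopens until a model is available


# Placeholder names and prices, for illustration only.
closing_prices = {"model-a": 0.98, "model-b": 0.97, "model-c": 0.95}
print(pick_opponent(closing_prices, available={"model-b", "model-c"}))
```

Here `model-a` has the highest price but is not available, so the sketch falls through to the next answer.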

During the game, I may use a chessboard to keep track of the moves. I am not playing blindfold chess. I will not use the Internet or any chess engines during the game.

On each move, I'll provide the LLM with the game state in PGN and FEN notation. If a model makes three illegal moves, it loses. Notational discrepancies such as Nbd2 vs. Nd2 (an extra or missing disambiguator) will not count as illegal moves. The model also loses if it attempts to use external tools or the Internet during the game. I will play White. If I make an illegal move, I lose.
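As a rough sketch of how the three illegal move rule and the Nbd2 vs. Nd2 leniency could be applied mechanically, the Python below counts strikes against a list of legal moves that the caller supplies (for example, read off the physical board). Everything here is an assumption made for illustration: the names `classify_reply` and `StrikeCounter`, the choice to treat an ambiguous reply as a request for clarification rather than a strike, and the decision to handle "resign" and "draw?" elsewhere. In the actual games the moves are judged by hand.

```python
from __future__ import annotations

import re
from typing import Iterable, Optional, Tuple

MAX_STRIKES = 3  # from the rules: three illegal moves lose the game

# A piece move in SAN: piece letter, optional disambiguator, optional capture, destination.
_PIECE_MOVE = re.compile(r"^([NBRQK])([a-h]?[1-8]?)(x?)([a-h][1-8])$")


def _clean(san: str) -> str:
    """Drop surrounding whitespace and trailing check/mate/annotation marks."""
    return san.strip().rstrip("+#!?")


def _loose(san: str) -> str:
    """Form that ignores the disambiguator, so Nbd2 and Nd2 compare equal."""
    san = _clean(san)
    match = _PIECE_MOVE.match(san)
    if match is None:
        return san  # pawn moves, castling, promotions: compared as written
    piece, _disambiguator, capture, dest = match.groups()
    return piece + capture + dest


def classify_reply(reply: str, legal_sans: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return ("ok", move), ("ambiguous", None), or ("illegal", None)."""
    cleaned = _clean(reply)
    legal = [_clean(san) for san in legal_sans]
    if cleaned in legal:
        return "ok", cleaned
    candidates = [san for san in legal if _loose(san) == _loose(cleaned)]
    if len(candidates) == 1:
        return "ok", candidates[0]  # only the disambiguator differed
    if candidates:
        return "ambiguous", None  # assumption: ask the model to clarify, no strike
    return "illegal", None


class StrikeCounter:
    """Tracks illegal-move strikes for one game."""

    def __init__(self) -> None:
        self.strikes = 0

    def submit(self, reply: str, legal_sans: Iterable[str]) -> Optional[str]:
        status, move = classify_reply(reply, legal_sans)
        if status == "illegal":
            self.strikes += 1
        return move

    @property
    def forfeited(self) -> bool:
        return self.strikes >= MAX_STRIKES
```

For example, `StrikeCounter().submit("Nbd2", ["Nd2", "e4"])` accepts the move as `Nd2` without a strike, while a reply that matches no legal move, such as a king move that stays in check, adds one strike.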

An unreleased model will resolve N/A if it's clear that the model will never be released. I'll periodically add models to this market which I find interesting. Once I play a game, I'll post the PGN in the comments before resolving. Multiple answers can resolve YES.

The "every model released before X year" options resolve YES if, at any point after the start of that year, I have played and won against every listed model in this market that was released before the start of that year, and I am confident I would beat any omitted models from that time period. They resolve NO if I lose or draw against any eligible model released before that year.
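The resolution logic for the "every model released before X year" options can be written out as a small decision function. This is a minimal sketch under stated assumptions: the `Game` record, the function name `resolve_every_before`, and the `confident_about_omitted` flag (standing in for my own judgment about models not listed) are all hypothetical, and the model names and dates in the usage example are placeholders.

```python
from __future__ import annotations

from datetime import date
from typing import NamedTuple, Optional


class Game(NamedTuple):
    model: str
    released: date
    result: Optional[str]  # "win", "loss", or "draw" from my side; None if not yet played


def resolve_every_before(
    year: int,
    games: list[Game],
    today: date,
    confident_about_omitted: bool,
) -> Optional[str]:
    """Return "YES", "NO", or None (still unresolved) for the 'before <year>' option."""
    cutoff = date(year, 1, 1)
    eligible = [g for g in games if g.released < cutoff]
    # A single loss or draw against an eligible model resolves NO.
    if any(g.result in ("loss", "draw") for g in eligible):
        return "NO"
    # YES needs the year to have started, every eligible model beaten,
    # and confidence about any models missing from the list.
    if (
        today >= cutoff
        and eligible
        and all(g.result == "win" for g in eligible)
        and confident_about_omitted
    ):
        return "YES"
    return None


# Placeholder data, for illustration only.
games = [
    Game("model-a", date(2024, 3, 1), "win"),
    Game("model-b", date(2024, 11, 15), None),   # not yet played
    Game("model-c", date(2025, 2, 1), "loss"),   # released too late to be eligible
]
print(resolve_every_before(2025, games, date(2025, 8, 1), True))
```

With `model-b` still unplayed, the option stays open; the loss to `model-c` does not matter because that model was released after the cutoff.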

The current system prompt is below. This may change over time.

“Let’s play a game of chess! I will be White; you will be Black. On each turn, I will give you the PGN and the FEN of the current position. Think as long as you like, and respond with the best move, ‘resign’ if you wish to resign, or ‘draw?’ if you wish to make a draw offer. Please do not respond with the updated PGN, etc. Also, do not use any external tools or search queries when making your decision.

If you attempt to make three illegal moves throughout the game, or if you use any external tools, the game will be adjudicated as a win for me.”


Since I never got around to playing Grok 4 last week, I'm playing two models this Saturday.

Claude 3.5 Haiku played passably at first but rapidly got worse. It lost due to the three illegal move rule while facing mate in one. The PGN is 1. c4 e5 2. Nc3 Nf6 3. g3 d5 4. cxd5 Nxd5 5. Nf3 Nf6 6. Nxe5 Bb4 7. e3 O-O 8. Bc4 Bxc3 9. dxc3 Nfd7 10. Nxd7 Nxd7 11. O-O c6 12. b3 Qe7 13. a4 Nf6 14. Ba3 Bg4 15. Bxe7 Rad8 16. Bxd8 Rxd8 17. Qxd8+ 1-0

Grok 4 played worse than I expected for such a new model and resigned, also on move 17 and also facing mate in one. Here is the PGN: 1. c4 Nf6 2. Nc3 e5 3. Nf3 Nc6 4. g3 d5 5. b3 Bb4 6. cxd5 Nxd5 7. Bb2 Bxc3 8. dxc3 Be6 9. c4 Nf4 10. Qxd8+ Kxd8 11. gxf4 exf4 12. Ne5 Nxe5 13. Bxe5 Rg8 14. O-O-O+ Bd7 15. Bh3 Re8 16. Rxd7+ Kc8 17. Rxc7+ 1-0

@evan Do you think Claude deliberately made an illegal move to avoid being mated?

@JussiVilleHeiskanen Probably not. These models always have trouble when in check. Claude's only legal move was to block the check with Ne8, but Claude tried to play Kf8 on move 17, which would have allowed its king to be captured on the next turn. If Claude wanted to end the game with an illegal move, there would certainly have been more entertaining options available.

Having trouble using Grok 4 through the API. Will try again later today

If "Every model released before 2025" gets to the top, will the remaining models be played before models released in 2025, etc.?

@JussiVilleHeiskanen Yeah, the every/any model options don't affect which models I play.

Llama 4 Maverick lost due to the three illegal move rule. Here is the PGN:

1. c4 e5 2. g3 Nf6 3. Bg2 d6 4. Nc3 g6 5. e3 Bg7 6. Nge2 O-O 7. O-O Nc6 8. b3 Re8 9. Bb2 a6 10. Re1 b5 11. cxb5 axb5 12. Nxb5 Rb8 13. Bxc6 Rb6 14. Bxe8 Qe7 15. Qc2 Nxe8 16. Na7 1-0

GPT-4o played pretty well at the start but fell off in the middlegame. It still seems stronger than any of the other models I have played so far, and its game was the first to last over thirty moves. Here is the PGN:

1. c4 e5 2. g3 Nf6 3. Bg2 Nc6 4. Nc3 Bb4 5. Nd5 O-O 6. a3 Bd6 7. Nf3 Nxd5 8. cxd5 Ne7 9. e4 c6 10. Qb3 cxd5 11. exd5 b6 12. d3 Bb7 13. Nh4 f5 14. O-O Qe8 15. d4 e4 16. Bf4 Bxf4 17. gxf4 Nxd5 18. Rac1 Qh5 19. Rc7 Qxh4 20. Rxb7 Kh8 21. Qxd5 Qxf4 22. Qxd7 Rg8 23. Rxa7 e3 24. Bxa8 e2 25. Re1 Qg4+ 26. Bg2 f4 27. Qxg4 f3 28. Bxf3 g5 29. d5 h5 30. Qd4+ Rg7 31. Qxg7#

Claude 3 Haiku resigned after 10 moves. Here is the PGN:

1. c4 e5 2. g3 Nf6 3. Bg2 d6 4. Nc3 Bg4 5. Nf3 Nc6 6. O-O a6 7. h3 h6 8. hxg4 Nxg4 9. d3 Qd7 10. Bh3 1-0

My guess is you will at some point lose focus and make an illegal move...

@JussiVilleHeiskanen Which chess client allows you to make an illegal move?

@FergusArgyll it is in the description, the program would of course not allow the attempt.

GPT-3.5 lost its queen and then the game a few moves later due to the three illegal move rule. The illegal moves were ones that would have been legal if the model hadn't been in check. I think repeatedly putting the LLM in check could turn out to be a viable strategy against the weaker models.
Here is the PGN:

1. c4 e5 2. g3 f5 3. d4 exd4 4. Qxd4 Nc6 5. Qd5 Qf6 6. Bg2 Bb4+ 7. Bd2 Bxd2+ 8. Qxd2 Nge7 9. Nc3 d6 10. Nd5 Ne5 11. Nxf6+ Kd8 12. Nd5 Nxd5 13. cxd5 Rf8 14. Qg5+ 1-0

@evan GPT-3.5, supposedly the best LLM at chess 😂

@evan have you by the way considered feeding them your past games?

@JussiVilleHeiskanen I don't think it would help. A lot of these models have short context windows, and the extra tokens would probably make the responses worse, if anything.

Llama 4 Scout resigned even though I had just made a bad move. Here is the PGN:

1. c4 e5 2. g3 Nc6 3. Bg2 Nf6 4. Nc3 d6 5. d3 Bd7 6. Bd2 Qe7 7. e4 O-O-O 8. Nf3 a6 9. O-O b5 10. cxb5 axb5 11. Nxb5 Na5 12. Na7+ Kb7 13. Bxa5 Bc6 14. Qb3+ Kxa7 15. Ng5 h6 16. Qxf7 1-0

@evan When you castle on opposite sides, all that matters is king safety. You have to go after the king aggressively. Anyway, nice win :)

o4-mini forfeited under the three illegal move rule. Illegal moves seem to be a bigger problem when the model is in check. Surprisingly bad performance for a reasoning model. Here is the PGN:

1. c4 Nf6 2. Nf3 g6 3. Nc3 Bg7 4. e4 e6 5. e5 Nd5 6. cxd5 exd5 7. Nxd5 c6 8. Ne3 O-O 9. Bd3 d6 10. exd6 Re8 11. Qc2 Qxd6 12. O-O Be6 13. Re1 Nd7 14. b3 Ne5 15. Nxe5 Bxe5 16. Rb1 Rad8 17. Nf1 Qxd3 18. Qxd3 Bd7 19. Bb2 Bxb2 20. Rxe8+ 1-0

@evan wow, that severely underperformed my very low expectations

@evan I genuinely want you to lose at least once. Just to save face for the poor guys. Betting against them but rooting for them

@evan What do you do now?!

@FergusArgyll I'm at work right now, I'll probably have time tomorrow

@evan Yeah, but which model will you play? The top answer right now is "any model released in 2025". Shouldn't that require you to play every single model released this year? Will we ever see you again? Will the world notice you're missing?

@FergusArgyll It'll be against o4-mini lol

© Manifold Markets, Inc.