The human baseline is now 83.7%. It's unfortunate that the old baseline is in the market's name, but I will resolve YES if any model exceeds the human baseline published on https://simple-bench.com.
We have a new reported human baseline (83.7%). Is this question about the 92% figure or about human-level performance?
Seems unlikely without a major paradigm shift. 27% is SOTA, and it doesn't seem to be increasing much with successive model generations.
Is it true that this benchmark could be anything, and could be changed at any point? There are no hashes, no large public sample of problems, no error bars, no evaluation code, no specifics on what a model can or cannot use... How do we know what the true performance is, other than what the author says?
Description of the benchmark here: https://simple-bench.com/about.html
I have made some irrational bets to subsidize the market, as I cannot be bothered to figure out the correct way to do this.