
Background
Humanity's Last Exam (HLE) is a benchmark developed by the Center for AI Safety and Scale AI to test AI models at the frontiers of human expertise. The exam consists of expert-level questions spanning a broad range of academic fields, deliberately crafted to be extremely challenging. Current AI models perform poorly on this benchmark, with leading models answering fewer than 10% of questions correctly.
Resolution Criteria
This market will resolve YES if any AI model achieves a verified score of 80% or higher on Humanity's Last Exam before April 1, 2025. The score must be:
- Independently verified by Scale AI or another reputable organization
- Achieved on the full exam, not a subset
- Publicly announced and documented
- Achieved by a single model (not by an ensemble of multiple models or with human assistance)
The market will resolve NO if no AI model achieves a verified score of 80% or higher by April 1, 2025.
Considerations
- The gap between current AI performance (<10%) and the target (80%) is substantial
- Experts predict models might exceed 50% accuracy by the end of 2025, which makes an 80% score by April 2025 particularly ambitious (see the back-of-envelope sketch after this list)
- The exam is specifically designed to probe the limits of AI capabilities, so rapid improvement is harder to achieve than on typical benchmarks
- Scale AI's methodology and scoring criteria may evolve, but resolution will be based on their official scoring system at the time of evaluation
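
As a rough illustration of the scale of the ask, the Python sketch below compares the monthly improvement rate implied by an 80% score before April 2025 against the rate implied by the expert forecast of ~50% by year-end. The baseline date, starting score, and linear-progress assumption are illustrative only and are not part of the resolution criteria.

```python
from datetime import date

# Illustrative assumptions (not resolution criteria):
start_score = 0.10      # leading models currently score below 10%
target_score = 0.80     # score required for YES resolution
expert_forecast = 0.50  # expert forecast for the end of 2025

start = date(2025, 1, 1)       # assumed baseline date
deadline = date(2025, 4, 1)    # market deadline
year_end = date(2025, 12, 31)  # horizon for the expert forecast

months_to_deadline = (deadline - start).days / 30.44
months_to_year_end = (year_end - start).days / 30.44

# Percentage points of improvement per month under each scenario,
# assuming (unrealistically) linear progress.
required_rate = (target_score - start_score) * 100 / months_to_deadline
forecast_rate = (expert_forecast - start_score) * 100 / months_to_year_end

print(f"Required: {required_rate:.1f} points/month to reach 80% by April 2025")
print(f"Forecast: {forecast_rate:.1f} points/month implied by expert predictions")
# Roughly 23.7 vs. 3.3 points/month: about a 7x faster pace than forecast.
```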