
Explained the FrontierMath benchmark for measuring AI mathematical reasoning capabilities and discussed AI's advancement in solving PhD-level math problems.
How media typically covers Greg Burnham
Greg Burnham as author
Benchmark scores across AI models are dominated by a single 'General Capability' dimension, with a secondary 'Claudiness' dimension that captures Claude's unique performance profile across agentic tasks, vision, and math.
“Author of "Benchmark Scores = General Capability + Claudiness" at Epoch AI”
Referenced in coverage
State-of-the-art AI models now solve over 40% of FrontierMath benchmark problems, up from 2% at the benchmark's launch, while Google DeepMind's Aletheia autonomously produced publishable PhD-level mathematics, creating the need for new, harder benchmarks.