The METR horizon-length plot has only 14 samples in the 1-4 hour range, where 2025 frontier AI progress occurred, yet it is being used to draw outsized inferences about AGI timelines and research priorities.

“Author of "How to Game the METR Plot"”

AI Benchmarks · AI Research · AI Safety · Foundation Models

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

research · positive · Sep 21, 2025

Short-task benchmarks create an illusion of diminishing returns in LLM scaling; marginal gains in single-step accuracy compound into exponential improvements in long-horizon task execution, with larger models demonstrating significantly better execution capability across extended task sequences.

“Author of the research paper on measuring long horizon execution in LLMs”

LLMs · AI Research · Foundation Models · Reinforcement Learning
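
The compounding claim in the summary above can be illustrated with a small calculation. This is a minimal sketch under two simplifying assumptions that are not stated in the listing: every step succeeds independently, and per-step accuracy stays constant over the whole task. The function name `horizon_length` and the 50% success threshold are illustrative choices. Under those assumptions, a task of n steps succeeds with probability p^n, so the longest task completed at success rate s is ln(s) / ln(p).

```python
import math


def horizon_length(step_accuracy: float, success_rate: float = 0.5) -> float:
    """Task length (in steps) at which overall success falls to success_rate.

    Assumes independent steps with constant accuracy, so an n-step task
    succeeds with probability step_accuracy ** n. Solving
    step_accuracy ** n = success_rate for n gives ln(s) / ln(p).
    """
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    return math.log(success_rate) / math.log(step_accuracy)


if __name__ == "__main__":
    # Each row cuts the per-step error rate by a factor of ten.
    for p in (0.90, 0.99, 0.999):
        print(f"step accuracy {p:.3f} -> horizon of about {horizon_length(p):.0f} steps")
```

In this toy model, moving from 90% to 99% step accuracy looks like a modest gain on a single-step benchmark but lengthens the horizon roughly tenfold. Real models need not satisfy the independence assumption, so the numbers are illustrative only.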

Researchers from Cambridge and team show small models succeed briefly but fail quickly on extended multi-step tasks

research · neutral · Sep 15, 2025

Small language models appear to succeed on short tasks but fail rapidly on extended multi-step tasks due to execution errors and self-conditioning degradation, while scaling and sequential test-time compute significantly improve long-horizon task completion.

“Co-author of research on long-horizon task execution in language models”

LLMs · AI Research · Reasoning and Planning · Model Scaling