
Thursday, April 9, 2026

Anthropic launches Managed Agents

Anthropic just dropped Managed Agents with a decoupled architecture for long-running tasks (bold move), while GPT-5.4 is crushing the new APEX-Agents-AA benchmark at 33% on professional tasks—though that number feels... low? Meanwhile, GLM-5.1 is claiming 94.6% of Claude Opus's coding performance at a fraction of the cost, and Meta's pushing 13x faster training for ML engineering agents with synthetic sandboxes. Would you trust an agent that only succeeds a third of the time?

Top Stories

1. APEX-Agents-AA (Artificial Analysis)

GPT-5.4 tops the APEX-Agents-AA benchmark for AI agents at 33.3%, slightly ahead of Claude Opus 4.6 and Gemini 3.1 Pro, though all leading models score below 35% on these long-horizon professional tasks. The results highlight that agentic AI capabilities remain challenging even for frontier models.

agents · benchmark · openai · anthropic
2. Anthropic's Managed Agents (Anthropic)

Anthropic's Managed Agents is a hosted service that decouples an AI agent's components (brain, hands, session) behind stable interfaces. The design enables reliable long-horizon task execution, dramatically improves performance, and allows flexible deployment to customer infrastructure.

anthropic · agents · claude · infrastructure
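Anthropic hasn't published the interface details, but the decoupling idea can be sketched with hypothetical `Brain`, `Hands`, and `Session` types (all names invented here, not Anthropic's actual API): each component sits behind a stable interface, so any one can be swapped or redeployed without touching the others.

```python
from dataclasses import dataclass, field
from typing import Protocol

# Hypothetical interfaces illustrating the decoupling idea -- not Anthropic's API.

class Brain(Protocol):
    """Decides the next action; could be any model behind this interface."""
    def next_action(self, history: list[str]) -> str: ...

class Hands(Protocol):
    """Executes an action; could run on customer infrastructure."""
    def execute(self, action: str) -> str: ...

@dataclass
class Session:
    """Durable state that survives restarts of either component."""
    history: list[str] = field(default_factory=list)

def run_step(brain: Brain, hands: Hands, session: Session) -> str:
    # One iteration of the agent loop: plan, act, persist.
    action = brain.next_action(session.history)
    result = hands.execute(action)
    session.history += [action, result]
    return result
```

Because `run_step` only depends on the interfaces, a long-running task can resume from the `Session` even if the brain or hands are redeployed mid-task.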
3. Meta AI Scales RL for ML Engineering Agents (arXiv)

Meta AI's SandMLE framework enables scalable reinforcement learning for ML engineering agents by using synthetic micro-scale datasets, reducing training time by 13x while achieving 20-67% performance improvements over supervised fine-tuning approaches. This breakthrough makes on-policy RL practical for training agents that can handle complex machine learning workflows beyond basic software engineering tasks.

reinforcement-learning · agents · meta · llm
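The paper's actual framework isn't reproduced here, but the reason micro-scale data makes on-policy RL practical can be shown with a toy REINFORCE loop (entirely invented for illustration): the policy chooses among candidate pipeline "variants", and because each rollout scores a variant on a tiny dataset, hundreds of on-policy iterations stay cheap.

```python
import math
import random

def train_policy(variants, reward_fn, micro_data, steps=300, lr=0.5, seed=0):
    """Minimal on-policy REINFORCE over discrete pipeline variants.
    Each rollout scores one variant on micro_data, so iteration is fast."""
    rng = random.Random(seed)
    prefs = [0.0] * len(variants)   # softmax preferences
    avg_reward = 0.0                # running baseline
    for t in range(1, steps + 1):
        exps = [math.exp(p) for p in prefs]
        z = sum(exps)
        probs = [e / z for e in exps]
        i = rng.choices(range(len(variants)), weights=probs)[0]
        r = reward_fn(variants[i], micro_data)   # cheap: data is tiny
        avg_reward += (r - avg_reward) / t
        # Policy-gradient update with baseline.
        for j in range(len(prefs)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            prefs[j] += lr * (r - avg_reward) * grad
    return prefs
```

The same loop would be hopeless if each `reward_fn` call meant training a full-scale model, which is the bottleneck the synthetic micro-scale datasets are meant to remove.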
4. Harness hill-climbing (LangChain)

LangChain presents a practical framework for self-improving AI agents that uses evals as training data to systematically hill-climb agent performance through iterative harness improvements. The company is releasing tooling to enable teams to build autonomous agent improvement systems with proper evaluation guardrails.

agents · langchain · evals · self-improvement
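LangChain's actual tooling isn't shown here, but the hill-climbing pattern itself is simple to sketch. In this minimal version (all function names hypothetical), the eval suite is the objective, and a harness change is kept only if it raises the score:

```python
import random

def hill_climb(initial, mutate, evaluate, iterations=50, seed=0):
    """Greedy hill-climb: propose a harness change, keep it only if
    the eval suite scores it higher than the current best."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    for _ in range(iterations):
        candidate = mutate(best, rng)
        score = evaluate(candidate)
        if score > best_score:   # eval guardrail: never accept a regression
            best, best_score = candidate, score
    return best, best_score
```

In practice `initial` might be a prompt or tool configuration, `mutate` an LLM-proposed edit, and `evaluate` a pass rate over the eval set; the guardrail is that every accepted change must beat the incumbent on the evals.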
5. GLM-5.1 Scores 94.6% of Claude Opus on Coding at a Fraction of the Cost (Hugging Face)

Z.ai's GLM-5.1 delivers 94.6% of Claude Opus's coding performance at a fraction of the cost, excelling at long-horizon agentic tasks by sustaining optimization through hundreds of iterations rather than plateauing early like previous models.

llm · coding · agents · benchmarks


Enjoyed this issue?

Get daily AI intel delivered to your inbox. No fluff, just the stories that matter.