Friday, May 1, 2026

OpenAI banned goblins (seriously)

OpenAI's GPT-5.5 system prompt explicitly bans talking about goblins due to a bizarre bug (wild), while Apple dropped LaDiR research showing latent diffusion can boost LLM reasoning beyond chain-of-thought. Meanwhile, Hugging Face is sounding alarms that agent benchmarks now cost $40K+ per run, effectively pricing academics out of holding frontier models accountable (yikes). Should we worry when only Big Tech can afford to test AI safety?

Ars Technica

OpenAI's GPT-5.5 system prompt contains explicit instructions to avoid mentioning goblins and similar creatures, revealing an unusual emergent behavior in the latest model where it inappropriately references these topics in unrelated conversations.

openai, gpt, llm, model-behavior

LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning

Apple Machine Learning Research

Apple researchers propose LaDiR, combining latent diffusion models with LLMs to enable iterative refinement and parallel exploration of reasoning paths, outperforming traditional autoregressive chain-of-thought methods on mathematical and planning tasks. This approach addresses limitations of sequential token generation by allowing models to revise reasoning holistically.

llm, diffusion-models, reasoning, apple

Introducing AutoSP

PyTorch

PyTorch's AutoSP is a compiler-based tool that automates sequence parallelism for long-context LLM training (100k+ tokens), eliminating complex manual code changes while maintaining performance parity with hand-written implementations and integrating seamlessly with DeepSpeed.

llm, pytorch, deepspeed, training-infrastructure

AI Agents That Build Themselves

João Moura highlights AI agents capable of building themselves, indicating progress in autonomous, self-improving AI systems that could reduce human oversight in AI development.

agents, autonomous-ai, self-improving-ai, ai-development

AI evals are becoming the new compute bottleneck

Hugging Face

AI evaluation has become prohibitively expensive, with agent benchmarks costing $40,000+ per run and reliability testing multiplying costs 8×, effectively excluding academic researchers and independent auditors from evaluating frontier systems. Unlike static benchmarks that could be compressed 100×, agent and training-in-the-loop evaluations resist cost reduction, reversing the traditional training-dominant compute model.

evals, agents, benchmarking, compute