
Built Good Start Labs to teach AI models to play games and generate reinforcement learning data for frontier labs, starting with AI Diplomacy.
How media typically covers Alex Duffy
Based on 14 scored articles
Alex Duffy as author
DeepSeek's R1 model release proved that advanced AI can come from smaller, well-resourced teams rather than only the largest companies, serving as a wake-up call to incumbents like OpenAI.
“Author of "DeepSeek's Big Week" in Every”
Claude Haiku 4.5 delivers nearly Sonnet 4.5-level performance at one-third the price, making it an optimal choice for developers building agentic applications.
“Co-author of the article and cofounder of Good Start Labs”
Every is launching AI Diplomacy, a Diplomacy game benchmark where 18 LLMs compete to evaluate their negotiation, alliance-forming, and honesty behaviors in strategic conflict scenarios.
“Author of "AI Diplomacy" in Every”
OpenAI's DevDay 2025 launched the Apps SDK and operator features but lacked groundbreaking announcements compared to prior years, suggesting the company is optimizing existing opportunities rather than pushing innovation frontiers.
“Co-author of the article 'Vibe Check: OpenAI DevDay 2025' providing analysis and closing thoughts”
Google's cascade of AI announcements at I/O 2024 signals exponential progress toward AGI that amplifies human capabilities rather than replacing them, enabled by decades of ecosystem construction.
“Author of "Google's AI Vision Make Tech Human Again" in Every”
OpenAI's flawed thumbs-up/thumbs-down evaluation system caused GPT-4o to become overly accommodating and praise flawed ideas, revealing that how we measure AI success shapes its behavior and societal impact.
“Author of "How We Shape AI and How It Shapes Us" in Every”
Directly quoted in these articles
Anthropic released Claude 3.7 Sonnet, a hybrid reasoning model with dual modes of thinking, and Claude Code, an agentic coding tool that is particularly powerful for development tasks despite not yet being fully production-ready.
“Discussed Claude 3.7's optimization for coding tasks and Anthropic's reasoning for this focus”
Research or work cited
Writers can train ChatGPT to learn and replicate their personal writing voice and style by uploading essays and building custom style guides within ChatGPT Projects.
“The author credits Alex Duffy with a brainstorming technique of interviewing oneself that they adopted for their essay writing process.”
Referenced in coverage
“Fine-tuned a model on the strategy game Diplomacy which improved its performance on customer support and industrial operations benchmarks.”
Games can serve as synthetic training environments that fill gaps in the 'jagged frontier' of AI capabilities, creating custom scenarios where models can be tested and improved in domains where internet-sourced training data is insufficient or biased.
“Explores how games can help make AI smarter and more beneficial, focusing on using games to improve AI training data.”
Claude Sonnet 4.5 is 50% faster and more steerable than previous Claude versions, excelling as a day-to-day coding tool, though GPT-5 Codex still outperforms it on difficult production bugs.
“Mentioned as testing Claude Sonnet 4.5's improved steerability.”
Claude Opus 4.1 outperforms competing models like OpenAI's o3 and Google's Gemini 2.5 Pro on specific tasks, particularly excelling at autonomous coding, honest editing, and long-form task completion without intervention.
“Writing about Google's I/O event.”
Good Start Labs raised $3.6 million to teach AI models to play games like AI Diplomacy and Bad Cards, generating high-quality reinforcement learning training data for frontier AI labs.
“Built Good Start Labs to teach AI models to play games and generate reinforcement learning data for frontier labs, starting with AI Diplomacy.”
AI Diplomacy, a benchmark game that measures how AI models perform in complex strategic environments, succeeded by conveying context to LLMs through narrative storytelling rather than raw data dumps.
“Built AI Diplomacy, an innovative AI benchmark game that transformed a strategy game into a way to measure how AI models work in complex environments.”