DSGym is a standardized framework for evaluating and training data science agents that addresses benchmark fragmentation. A 4B model trained on DSGym outperforms GPT-4o on standardized analysis benchmarks.

“Author of "DSGym: A Holistic Framework for Evaluating and Training Data Science Agents" on arXiv”

AI Agents, AI Research, Foundation Models, Benchmarking

Also Mentioned In

Referenced in coverage

Reinforcement Learning at Test Time

GitHub · announcement · positive · Jan 26, 2026

TTT-Discover enables large language models to perform reinforcement learning at test time, adapting to problem-specific tasks and achieving state-of-the-art results in mathematics, GPU kernels, algorithms, and biology.

“Co-authored TTT-Discover reinforcement learning system for test-time LLM training.”

Reinforcement Learning, LLMs, AI Research, Open Source AI