
Co-created Terminal-Bench and the Harbor framework for testing and improving AI agents in containerized environments.
How media typically covers Alex Shaw
Referenced in coverage
Terminal-Bench 2.0 launches as the new standard benchmark for evaluating autonomous AI agents with 89 rigorously validated tasks, alongside Harbor, a framework for testing agents in containerized environments at scale.