Open source tools for evaluating, testing, and understanding frontier AI models.
Petri joins Meridian Labs, with a major architecture overhaul focused on hackability. Dish and Bloom extensions provide new capabilities.
A new package adding managed cloud sandbox providers to Inspect, so you can run evals at scale without provisioning your own infrastructure.
New tools for long-horizon agents including deep agents, timelines, compaction, checkpointing, and bridges for Claude Code and Codex CLI.
A workflow layer for Inspect that makes it easier to run evals at scale with declarative configs, matrix sweeps, and automatic log reuse.
A tool for in-depth analysis of AI agent transcripts, with LLM-based and pattern-based scanners for detecting issues beyond simple metrics.
A new package that makes 80+ Harbor benchmarks including SWE-Bench Pro, Terminal-Bench, and more available to run in Inspect.
A new data visualization framework for Inspect evals, featuring pre-built plots for commonly used views of evaluation data.