Updates on our open source tools for AI research and evaluation.
A workflow layer for Inspect that makes it easier to run evals at scale with declarative configs, matrix sweeps, and automatic log reuse.
A tool for in-depth analysis of AI agent transcripts, with LLM-based and pattern-based scanners for detecting issues beyond simple success/failure metrics.
A new package that makes 80+ Harbor benchmarks including SWE-Bench, Terminal-Bench, LawBench, and more available to run directly in Inspect.
A new data visualization framework for Inspect evals, featuring pre-built plots for commonly used views of evaluation data.