Harbor is a framework for evaluating AI agents in sandboxed, containerized environments. Its registry hosts a growing collection of popular benchmarks, including SWE-Bench, Terminal-Bench, LawBench, MedAgentBench, Finance Agent, and ReplicationBench, which makes it a go-to resource for teams that need rigorous, reproducible agent evaluations.
We’re excited to share Inspect Harbor, a new package that brings 80+ Harbor task implementations directly into Inspect. You can now run Harbor’s extensive library of containerized agent evaluations with Inspect’s workflow, tooling, and agent integrations, without setting up Harbor separately.
Check out the documentation and the full listing of Harbor evals available in Inspect.