Harbor Tasks for Inspect

A new package that makes 80+ Harbor benchmarks, including SWE-Bench, Terminal-Bench, and LawBench, available to run directly in Inspect.
Author: Alexandra Abbas

Published: February 12, 2026

Harbor is a framework for evaluating AI agents in sandboxed, containerized environments. Its registry hosts a growing collection of popular benchmarks, including SWE-Bench, Terminal-Bench, LawBench, MedAgentBench, Finance Agent, and ReplicationBench, which makes it a go-to resource for teams that need rigorous, reproducible agent evaluations.

We’re excited to share Inspect Harbor, a new package that brings 80+ Harbor task implementations directly into Inspect. This means you can run Harbor’s extensive library of containerized agent evaluations using Inspect’s workflow, tooling, and agent integrations without needing to set up Harbor separately.

Check out the documentation and the full listing of Harbor evals available in Inspect.