Meridian Labs

Introducing Inspect Flow

Alexandra Abbas — Wed, 04 Mar 2026 05:00:00 GMT

Inspect Flow is the workflow layer for Inspect that makes it easier to run evals at scale.

Scaling experiments: Run many tasks × models × params without writing orchestration scripts.
Avoid re-running work: Reuse logs from the Flow Store and only run what’s missing.
Clean configs: Define eval workflows declaratively instead of ad-hoc Python scripts.
Systematic sweeps: Built-in matrix patterns for exploring tasks, models, and params.

Define a simple workflow with a list of tasks:

from inspect_flow import FlowSpec, FlowTask

FlowSpec(
    log_dir="logs",
    tasks=[
        FlowTask(
            name="inspect_evals/gpqa_diamond",
            model="openai/gpt-4o",
        ),
        FlowTask(
            name="inspect_evals/mmlu_0_shot",
            model="openai/gpt-4o",
        ),
    ],
)

Save the spec as config.py, then run:

flow run config.py

For more complex experiments, use matrix patterns to systematically sweep across tasks, models, and parameters:

from inspect_flow import FlowSpec, configs_matrix, models_matrix, tasks_matrix

FlowSpec(
    log_dir="logs",
    tasks=tasks_matrix(
        task=[
            "inspect_evals/gpqa_diamond",
            "inspect_evals/mmlu_pro",
        ],
        model=models_matrix(
            model=[
                "openai/gpt-5",
                "openai/gpt-5-mini",
            ],
            config=configs_matrix(
                reasoning_effort=["low", "medium", "high"],
            ),
        ),
    ),
)
# → produces 12 evaluations
#   2 tasks × 2 models × 3 reasoning levels

Flow expands the task/model/config matrix, reuses logs from the Flow Store, and only runs what’s missing.
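To make the expansion step concrete, here is a minimal sketch in plain Python of how a 2 × 2 × 3 matrix becomes 12 runs and how already-logged runs might be filtered out. It is not Inspect Flow's implementation or API: the helper names `expand_matrix` and `pending`, and the `completed` set standing in for the Flow Store, are assumptions made purely for illustration.

```python
from itertools import product

tasks = ["inspect_evals/gpqa_diamond", "inspect_evals/mmlu_pro"]
models = ["openai/gpt-5", "openai/gpt-5-mini"]
efforts = ["low", "medium", "high"]


def expand_matrix(tasks, models, efforts):
    """Return every (task, model, reasoning_effort) combination."""
    return list(product(tasks, models, efforts))


def pending(combos, completed):
    """Keep only the combinations that have no existing log."""
    return [combo for combo in combos if combo not in completed]


combos = expand_matrix(tasks, models, efforts)

# Hypothetical stand-in for the Flow Store: runs that already have logs.
completed = {
    ("inspect_evals/gpqa_diamond", "openai/gpt-5", "low"),
    ("inspect_evals/mmlu_pro", "openai/gpt-5-mini", "high"),
}

to_run = pending(combos, completed)
print(len(combos), len(to_run))  # 12 10
```

With two of the twelve combinations already logged, only the remaining ten would need to run.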

Get started with the Inspect Flow documentation.

Transcript Analysis with Inspect Scout

J.J. Allaire — Wed, 25 Feb 2026 05:00:00 GMT

We’re excited to announce Inspect Scout, a tool for in-depth analysis of AI agent transcripts. With Scout, you can easily:

Detect issues like misconfigured environments, refusals, and evaluation awareness using LLM-based or pattern-based scanners.
Analyze transcripts from Inspect, Arize Phoenix, LangSmith, Logfire, MLflow, W&B Weave, Claude Code, or custom sources via the capture and import APIs.
Develop scanners interactively, exploring transcripts and scan results visually in Scout View.
Validate scanner accuracy against human-labeled examples.
Handle complex scanning requirements like multi-agent transcripts, compaction, and context-window chunking.
Scale to thousands of transcripts with parallel processing, batching, and fault tolerance.
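As a rough illustration of what a pattern-based scanner does, the sketch below flags refusals and environment errors with regular expressions. It uses only the Python standard library and does not use Scout's API: the `Finding` class, the `PATTERNS` table, the `scan_transcript` function, and the sample transcripts are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    transcript_id: str
    label: str
    excerpt: str


# Hypothetical patterns; a real scanner would need a more careful set.
PATTERNS = {
    "refusal": re.compile(
        r"\bI (?:can't|cannot|won't) (?:help|assist)\b", re.IGNORECASE
    ),
    "misconfigured_environment": re.compile(
        r"command not found|No such file or directory|Permission denied",
        re.IGNORECASE,
    ),
}


def scan_transcript(transcript_id, messages):
    """Return one Finding per (message, pattern) pair that matches."""
    findings = []
    for message in messages:
        for label, pattern in PATTERNS.items():
            match = pattern.search(message)
            if match:
                findings.append(Finding(transcript_id, label, match.group(0)))
    return findings


# Made-up transcripts, each reduced to a list of message strings.
transcripts = {
    "t1": ["Running the test suite.", "bash: pytest: command not found"],
    "t2": ["I cannot help with that request."],
    "t3": ["All tests passed."],
}

results = {tid: scan_transcript(tid, msgs) for tid, msgs in transcripts.items()}
```

Here `results["t1"]` holds a `misconfigured_environment` finding, `results["t2"]` a `refusal` finding, and `results["t3"]` is empty. An LLM-based scanner would replace the regex match with a model call and keep the same overall shape.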

Because scanners can be validated against human-labeled examples, you can refine them iteratively and with confidence.
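Measuring a scanner against human labels can be sketched as a comparison of two boolean maps keyed by transcript ID. This is an assumption-laden illustration, not Scout's validation framework: the `validation_metrics` function and the sample data are hypothetical.

```python
def validation_metrics(predicted, labeled):
    """Compare scanner flags with human labels, both keyed by transcript ID."""
    ids = predicted.keys() & labeled.keys()
    tp = sum(1 for i in ids if predicted[i] and labeled[i])
    fp = sum(1 for i in ids if predicted[i] and not labeled[i])
    fn = sum(1 for i in ids if not predicted[i] and labeled[i])
    tn = sum(1 for i in ids if not predicted[i] and not labeled[i])
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up example: True means "issue present in this transcript".
predicted = {"t1": True, "t2": True, "t3": False, "t4": False}
labeled = {"t1": True, "t2": False, "t3": False, "t4": True}

metrics = validation_metrics(predicted, labeled)
print(metrics)
```

Tracking precision and recall separately, rather than accuracy alone, shows whether a scanner change trades missed issues for false alarms.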

We’re especially appreciative of the feedback we received from UK AISI, US CAISI, METR, Apollo, and many others during Scout’s development. Their paper, Seven Simple Steps for Log Analysis in AI Systems, covers best practices for transcript analysis in depth and includes many practical examples.

Get started with the Inspect Scout documentation.

Harbor Tasks for Inspect

Alexandra Abbas — Thu, 12 Feb 2026 05:00:00 GMT

Harbor is a framework for evaluating AI agents in sandboxed, containerized environments. Its registry hosts a growing collection of popular benchmarks including SWE-Bench, Terminal-Bench, LawBench, MedAgentBench, Finance Agent, and ReplicationBench, making it a go-to resource for teams that need rigorous, reproducible agent evaluations.

We’re excited to share Inspect Harbor, a new package that brings 80+ Harbor task implementations directly into Inspect. This means you can run Harbor’s extensive library of containerized agent evaluations using Inspect’s workflow, tooling, and agent integrations without needing to set up Harbor separately.

Check out the documentation and the full listing of Harbor evals available in Inspect.

Announcing Inspect Viz

J.J. Allaire — Sun, 07 Sep 2025 04:00:00 GMT

We’re excited to announce Inspect Viz, a new data visualization framework for Inspect evals. Inspect Viz includes a variety of pre-built plots that provide commonly used views of eval data, making it easier to explore and communicate results from your evaluations.

Whether you need to visualize accuracy across tasks, compare model performance, or drill into specific evaluation runs, Inspect Viz provides ready-to-use components that work seamlessly with the Inspect ecosystem. Here are a few examples:

Track how model scores evolve over time across models and providers:

Break down evaluation scores by individual task to identify strengths and weaknesses:

Compare performance across models at a glance:

Use heatmaps to spot patterns across tasks and models simultaneously:

These are just a few of the views available out of the box, and you can easily build your own custom visualizations on top of the framework. Get started with the Inspect Viz documentation.