<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Meridian Labs</title>
<link>https://meridianlabs.ai/blog/</link>
<atom:link href="https://meridianlabs.ai/blog/index.xml" rel="self" type="application/rss+xml"/>
<description>Open source tools for AI research and evaluation</description>
<generator>quarto-1.9.36</generator>
<lastBuildDate>Wed, 04 Mar 2026 05:00:00 GMT</lastBuildDate>
<item>
  <title>Introducing Inspect Flow</title>
  <dc:creator>Alexandra Abbas</dc:creator>
  <link>https://meridianlabs.ai/blog/posts/inspect-flow/</link>
  <description><![CDATA[ 





<p><a href="https://meridianlabs-ai.github.io/inspect_flow/">Inspect Flow</a> is the workflow layer for Inspect that makes it easier to run evals at scale.</p>
<ul>
<li><strong>Scaling experiments:</strong> Run many tasks × models × params without writing orchestration scripts.</li>
<li><strong>Avoid re-running work:</strong> Reuse logs from the Flow Store and only run what’s missing.</li>
<li><strong>Clean configs:</strong> Define eval workflows declaratively instead of writing ad hoc Python scripts.</li>
<li><strong>Systematic sweeps:</strong> Built-in matrix patterns for exploring tasks, models, and params.</li>
</ul>
<p>Define a simple workflow with a list of tasks:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> inspect_flow <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> FlowSpec, FlowTask</span>
<span id="cb1-2"></span>
<span id="cb1-3">FlowSpec(</span>
<span id="cb1-4">    log_dir<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"logs"</span>,</span>
<span id="cb1-5">    tasks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb1-6">        FlowTask(</span>
<span id="cb1-7">            name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"inspect_evals/gpqa_diamond"</span>,</span>
<span id="cb1-8">            model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openai/gpt-4o"</span>,</span>
<span id="cb1-9">        ),</span>
<span id="cb1-10">        FlowTask(</span>
<span id="cb1-11">            name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"inspect_evals/mmlu_0_shot"</span>,</span>
<span id="cb1-12">            model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openai/gpt-4o"</span>,</span>
<span id="cb1-13">        ),</span>
<span id="cb1-14">    ],</span>
<span id="cb1-15">)</span></code></pre></div></div>
<p>Then run:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">flow</span> run config.py</span></code></pre></div></div>
<p><img src="https://meridianlabs-ai.github.io/inspect_flow/images/config_progress_terminal.png" class="border img-fluid"></p>
<p>For more complex experiments, use matrix patterns to systematically sweep across tasks, models, and parameters:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">FlowSpec(</span>
<span id="cb3-2">    log_dir<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"logs"</span>,</span>
<span id="cb3-3">    tasks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>tasks_matrix(</span>
<span id="cb3-4">        task<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb3-5">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"inspect_evals/gpqa_diamond"</span>,</span>
<span id="cb3-6">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"inspect_evals/mmlu_pro"</span>,</span>
<span id="cb3-7">        ],</span>
<span id="cb3-8">        model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>models_matrix(</span>
<span id="cb3-9">            model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[</span>
<span id="cb3-10">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openai/gpt-5"</span>,</span>
<span id="cb3-11">                <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"openai/gpt-5-mini"</span>,</span>
<span id="cb3-12">            ],</span>
<span id="cb3-13">            config<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>configs_matrix(</span>
<span id="cb3-14">                reasoning_effort<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"low"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"medium"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"high"</span>],</span>
<span id="cb3-15">            ),</span>
<span id="cb3-16">        ),</span>
<span id="cb3-17">    ),</span>
<span id="cb3-18">)</span>
<span id="cb3-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># → produces 12 evaluations</span></span>
<span id="cb3-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#   2 tasks × 2 models × 3 reasoning levels</span></span></code></pre></div></div>
<p>Flow expands the task/model/config matrix, reuses logs from the Flow Store, and only runs what’s missing.</p>
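<p>As a rough illustration of the idea (not Flow’s actual implementation), the matrix expansion and the skip-what-is-done step can be sketched in plain Python. The helper names <code>expand_matrix</code> and <code>pending_runs</code>, and the in-memory <code>completed</code> set standing in for the Flow Store, are assumptions made for this example:</p>

```python
from itertools import product

# Hypothetical stand-ins for illustration; none of this is the inspect_flow API.
tasks = ["inspect_evals/gpqa_diamond", "inspect_evals/mmlu_pro"]
models = ["openai/gpt-5", "openai/gpt-5-mini"]
reasoning_efforts = ["low", "medium", "high"]


def expand_matrix(tasks, models, efforts):
    """Return every (task, model, reasoning_effort) combination."""
    return list(product(tasks, models, efforts))


def pending_runs(combos, completed):
    """Keep only combinations that have no existing log, preserving order."""
    return [combo for combo in combos if combo not in completed]


combos = expand_matrix(tasks, models, reasoning_efforts)

# Assumption: pretend one combination already has a log (a stand-in for the Flow Store).
completed = {("inspect_evals/gpqa_diamond", "openai/gpt-5", "low")}
todo = pending_runs(combos, completed)

print(len(combos))  # 12 combinations: 2 tasks x 2 models x 3 reasoning levels
print(len(todo))    # 11 left to run after skipping the completed one
```

<p>The 12 combinations match the count in the example above, and the one combination marked as completed is filtered out, leaving 11 to run.</p>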
<p>Get started with the <a href="https://meridianlabs-ai.github.io/inspect_flow/">Inspect Flow documentation</a>.</p>



 ]]></description>
  <guid>https://meridianlabs.ai/blog/posts/inspect-flow/</guid>
  <pubDate>Wed, 04 Mar 2026 05:00:00 GMT</pubDate>
  <media:content url="https://meridianlabs-ai.github.io/inspect_flow/images/config_progress_terminal.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Transcript Analysis with Inspect Scout</title>
  <dc:creator>J.J. Allaire</dc:creator>
  <link>https://meridianlabs.ai/blog/posts/inspect-scout/</link>
  <description><![CDATA[ 





<p>We’re excited to announce <a href="https://meridianlabs-ai.github.io/inspect_scout/">Inspect Scout</a>, a tool for in-depth analysis of AI agent transcripts. With Scout, you can easily:</p>
<ul>
<li>Detect issues like misconfigured environments, refusals, and evaluation awareness using LLM-based or pattern-based scanners.</li>
<li>Analyze transcripts from Inspect, Arize Phoenix, LangSmith, Logfire, MLflow, W&amp;B Weave, Claude Code, or custom sources via the capture and import APIs.</li>
<li>Develop scanners interactively, exploring transcripts and scan results visually in Scout View.</li>
<li>Validate scanner accuracy against human-labeled examples.</li>
<li>Handle complex scanning requirements like multi-agent transcripts, compaction, and context-window chunking.</li>
<li>Scale to thousands of transcripts with parallel processing, batching, and fault tolerance.</li>
</ul>
<p><a href="https://meridianlabs-ai.github.io/inspect_scout/images/view-result.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://meridianlabs-ai.github.io/inspect_scout/images/view-result.png" class="border img-fluid"></a></p>
<p>Scout also includes a validation framework for measuring scanner accuracy against human-labeled examples, so you can iteratively refine your scanners with confidence.</p>
<p><a href="https://meridianlabs-ai.github.io/inspect_scout/images/validation-panel-transcripts.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://meridianlabs-ai.github.io/inspect_scout/images/validation-panel-transcripts.png" class="border img-fluid"></a></p>
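<p>To make the idea of validating against human labels concrete, here is a minimal sketch in plain Python. It does not use Scout’s API. The function name <code>validation_metrics</code>, the transcript ids, and the boolean labels are all made up for illustration:</p>

```python
def validation_metrics(predicted, labeled):
    """Score boolean scanner results against human labels, keyed by transcript id."""
    # Only compare transcripts that have both a scanner result and a human label.
    shared = sorted(set(predicted).intersection(labeled))
    tp = sum(1 for t in shared if predicted[t] and labeled[t])
    fp = sum(1 for t in shared if predicted[t] and not labeled[t])
    fn = sum(1 for t in shared if not predicted[t] and labeled[t])
    tn = len(shared) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(shared) if shared else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical data: True means "issue detected" for that transcript.
scanner = {"t1": True, "t2": False, "t3": True, "t4": False}
human = {"t1": True, "t2": False, "t3": False, "t4": True}

metrics = validation_metrics(scanner, human)
print(metrics)
```

<p>Tracking precision and recall separately, rather than accuracy alone, shows whether a scanner is over-flagging or missing cases as it is refined.</p>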
<p>We’re especially appreciative of the feedback we received from UK AISI, US CAISI, METR, Apollo, and many others during Scout’s development. Their paper, <a href="https://cdn.prod.website-files.com/663bd486c5e4c81588db7a1d/699f3a9b918419fe89c8c740_Seven_simple_steps_forlog_analysis_in_AI_systems_corrected.pdf">Seven Simple Steps for Log Analysis in AI Systems</a>, covers best practices for transcript analysis in depth and includes many practical examples.</p>
<p>Get started with the <a href="https://meridianlabs-ai.github.io/inspect_scout/">Inspect Scout documentation</a>.</p>



 ]]></description>
  <guid>https://meridianlabs.ai/blog/posts/inspect-scout/</guid>
  <pubDate>Wed, 25 Feb 2026 05:00:00 GMT</pubDate>
  <media:content url="https://meridianlabs-ai.github.io/inspect_scout/images/validation-panel-transcripts.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Harbor Tasks for Inspect</title>
  <dc:creator>Alexandra Abbas</dc:creator>
  <link>https://meridianlabs.ai/blog/posts/inspect-harbor/</link>
  <description><![CDATA[ 





<p><a href="https://harborframework.com">Harbor</a> is a framework for evaluating AI agents in sandboxed, containerized environments. Its <a href="https://registry.harborframework.com">registry</a> hosts a growing collection of popular benchmarks including SWE-Bench, Terminal-Bench, LawBench, MedAgentBench, Finance Agent, and ReplicationBench, making it a go-to resource for teams that need rigorous, reproducible agent evaluations.</p>
<p>We’re excited to share <a href="https://meridianlabs-ai.github.io/inspect_harbor/">Inspect Harbor</a>, a new package that brings 80+ <a href="https://inspect.aisi.org.uk/evals/#/?source=harbor">Harbor task implementations</a> directly into Inspect. This means you can run Harbor’s extensive library of containerized agent evaluations using Inspect’s workflow, tooling, and agent integrations without needing to set up Harbor separately.</p>
<p>Check out the <a href="https://meridianlabs-ai.github.io/inspect_harbor/">documentation</a> and the full <a href="https://inspect.aisi.org.uk/evals/#/?source=harbor">listing of Harbor evals</a> available in Inspect.</p>



 ]]></description>
  <guid>https://meridianlabs.ai/blog/posts/inspect-harbor/</guid>
  <pubDate>Thu, 12 Feb 2026 05:00:00 GMT</pubDate>
  <media:content url="https://meridianlabs.ai/blog/posts/inspect-harbor/cover.png" medium="image" type="image/png" height="47" width="144"/>
</item>
<item>
  <title>Announcing Inspect Viz</title>
  <dc:creator>J.J. Allaire</dc:creator>
  <link>https://meridianlabs.ai/blog/posts/inspect-viz/</link>
  <description><![CDATA[ 





<p>We’re excited to announce <a href="https://meridianlabs-ai.github.io/inspect_viz/">Inspect Viz</a>, a new data visualization framework for Inspect evals. Inspect Viz includes a variety of pre-built plots that provide commonly used views of eval data, making it easier to explore and communicate results from your evaluations.</p>
<p>Whether you need to visualize accuracy across tasks, compare model performance, or drill into specific evaluation runs, Inspect Viz provides ready-to-use components that work seamlessly with the Inspect ecosystem. Here are a few examples:</p>
<p>Track how model scores evolve over time across models and providers:</p>
<p><img src="https://meridianlabs-ai.github.io/inspect_viz/images/scores_timeline_gpqa.png" class="border img-fluid"></p>
<p>Break down evaluation scores by individual task to identify strengths and weaknesses:</p>
<p><img src="https://meridianlabs-ai.github.io/inspect_viz/view-scores-by-task_files/placeholder/7194617527e820cb.png" class="border img-fluid"></p>
<p>Compare performance across models at a glance:</p>
<p><img src="https://meridianlabs-ai.github.io/inspect_viz/view-scores-by-model_files/placeholder/641e3e25e0133dc2.png" class="border img-fluid"></p>
<p>Use heatmaps to spot patterns across tasks and models simultaneously:</p>
<p><img src="https://meridianlabs-ai.github.io/inspect_viz/view-scores-heatmap_files/placeholder/97559e02f53807b9.png" class="border img-fluid"></p>
<p>These are just a few of the views available out of the box, and you can build your own custom visualizations on top of the framework. Get started with the <a href="https://meridianlabs-ai.github.io/inspect_viz/">Inspect Viz documentation</a>.</p>



 ]]></description>
  <guid>https://meridianlabs.ai/blog/posts/inspect-viz/</guid>
  <pubDate>Sun, 07 Sep 2025 04:00:00 GMT</pubDate>
  <media:content url="https://meridianlabs-ai.github.io/inspect_viz/view-scores-by-task_files/placeholder/7194617527e820cb.png" medium="image" type="image/png"/>
</item>
</channel>
</rss>
