# Guide: Drift Management & Edge-Case Triage

This guide details how to use the Semantic Bridge features to ingest real-world behavior and analyze agent failures.

## 1. Drift Importer (`import-drift`)

The Drift Importer allows you to convert production traces (agent/user interaction logs) into reusable evaluation scenarios. This is critical for building regression suites from real-world "drift."

### Usage

multiagent-eval import-drift --input path/to/trace.json --industry telecom

### Trace Format

The importer expects a list of interaction objects:

[
  {"role": "user", "content": "I need help with my bill."},
  {"role": "assistant", "content": "I can help with that. What is your account number?"}
]
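Before running the importer, it can help to confirm that a trace matches the shape shown above. The following is a minimal sketch, not part of `multiagent-eval`: the function name `validate_trace` and the set of allowed roles are assumptions based only on the example trace.

```python
import json

# Assumption: only the two roles shown in the example trace are valid.
ALLOWED_ROLES = {"user", "assistant"}


def validate_trace(raw: str) -> list[dict]:
    """Parse a trace and check it is a list of {"role", "content"} objects."""
    trace = json.loads(raw)
    if not isinstance(trace, list):
        raise ValueError("trace must be a JSON list of interaction objects")
    for i, turn in enumerate(trace):
        if not isinstance(turn, dict):
            raise ValueError(f"turn {i} is not an object")
        role = turn.get("role")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"turn {i} has unexpected role: {role!r}")
        if not isinstance(turn.get("content"), str):
            raise ValueError(f"turn {i} is missing a string 'content'")
    return trace


example = '[{"role": "user", "content": "I need help with my bill."}]'
print(len(validate_trace(example)), "turn(s) validated")
```

A check like this fails fast on malformed logs, so a bad export is caught before a scenario file is written.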

### Result

The importer writes a new v2 scenario file to `industries/[industry]/scenarios/drift-[hash].json`. The file stores the original conversation under `ground_truth_history`.
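To make the output layout concrete, here is a rough sketch of how such a path and payload could be derived. The hashing scheme (SHA-256 over canonical JSON, truncated to 8 hex characters) and the function name `drift_scenario` are purely illustrative assumptions. The importer's actual hash and the other v2 scenario fields are not documented here.

```python
import hashlib
import json
from pathlib import Path


def drift_scenario(trace: list[dict], industry: str) -> tuple[Path, dict]:
    """Build an illustrative output path and payload for an imported trace."""
    # Assumption: hash the canonical JSON so identical traces map to one file.
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]
    path = Path("industries") / industry / "scenarios" / f"drift-{digest}.json"
    # Only the field named in this guide is shown; other v2 fields are omitted.
    scenario = {"ground_truth_history": trace}
    return path, scenario


trace = [{"role": "user", "content": "I need help with my bill."}]
path, scenario = drift_scenario(trace, "telecom")
print(path.as_posix())
```

Hashing the content rather than using a timestamp means re-importing the same trace would not create a duplicate scenario, which is one plausible reason for a hash-based filename.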


## 2. Edge-Case Triage Library

The Triage Engine automatically analyzes failed evaluation tasks and applies tags based on known failure patterns.

### Built-in Triage Tags

| Tag | Description |
| --- | --- |
| `CONNECTION_ERROR` | Agent communication failed (e.g., timeout, 500 reset). |
| `POLICY_VIOLATION` | The agent attempted an action forbidden by the ToolSandbox policies. |
| `TOOL_ERROR` | A mock tool returned an error status during execution. |
| `STALL` | The agent hit the maximum number of turns without reaching a final answer. |

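### How it Works

After a run, the Triage Engine inspects the `conversation_history` and `EvaluationContext` and matches them against a heuristic for each known failure pattern. Matching tags are attached to the failed task in the report, for example:

Task: refund_processing [FAILURE [CONNECTION_ERROR]] FAILED Metric: generic_accuracy | Score: 0.00 | Threshold: 0.80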
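The heuristic matching described above can be sketched roughly as follows. This is not the Triage Engine's implementation: `RunSummary` and all of its fields are hypothetical stand-ins for whatever the engine reads from the conversation history and evaluation context, and the sketch treats each history entry as one turn.

```python
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    """Hypothetical stand-in for the data available after a run."""
    conversation_history: list[dict]
    max_turns: int
    reached_final_answer: bool
    connection_failed: bool = False
    policy_violations: list[str] = field(default_factory=list)
    tool_statuses: list[str] = field(default_factory=list)


def triage(run: RunSummary) -> list[str]:
    """Return every built-in tag whose heuristic matches the run."""
    tags = []
    if run.connection_failed:
        tags.append("CONNECTION_ERROR")
    if run.policy_violations:
        tags.append("POLICY_VIOLATION")
    if any(status == "error" for status in run.tool_statuses):
        tags.append("TOOL_ERROR")
    # Assumption: one history entry counts as one turn.
    if len(run.conversation_history) >= run.max_turns and not run.reached_final_answer:
        tags.append("STALL")
    return tags


stalled = RunSummary(
    conversation_history=[{"role": "assistant", "content": "..."}] * 10,
    max_turns=10,
    reached_final_answer=False,
)
print(triage(stalled))
```

Because each check is independent, a single failed task can carry more than one tag in this sketch, for instance a tool error that eventually leads to a stall.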

## 3. Automated Diagnostics (`explain`)

While triage applies categorical tags, the `explain` command performs a deep forensic analysis of the execution trace to identify the root cause of a failure.

### Usage
```bash
multiagent-eval explain --path runs/run.jsonl
```

### Forensic Features

  • Tiered Confidence Scoring: Distinguishes between explicit policy violations (100%), induced system/tool errors (85%), and heuristic fallbacks (50%).
  • Actionable Remediation: Provides targeted advice based on the identified pattern (e.g., prompt refinement, sandbox optimization).
  • Pinpoint Diagnostics: Identifies the exact turn (index) where the failure logic diverged.
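The tiered scoring and turn pinpointing listed above can be illustrated with a small sketch. The confidence values come from this guide. Everything else is an assumption made for illustration: the function name `explain_failure`, the per-turn keys `policy_violation` and `tool_status`, and the rule that the fallback points at the last turn. None of it reflects the real record schema of `runs/run.jsonl`.

```python
# Confidence tiers as stated in this guide.
CONFIDENCE = {
    "policy_violation": 1.00,  # explicit policy violation
    "tool_error": 0.85,        # induced system/tool error
    "heuristic": 0.50,         # heuristic fallback
}


def explain_failure(turns: list[dict]) -> dict:
    """Return the most confident cause and the turn index where it appears."""
    # Check the highest-confidence tier first across all turns.
    for index, turn in enumerate(turns):
        if turn.get("policy_violation"):  # hypothetical key
            return {"turn": index, "cause": "policy_violation",
                    "confidence": CONFIDENCE["policy_violation"]}
    for index, turn in enumerate(turns):
        if turn.get("tool_status") == "error":  # hypothetical key
            return {"turn": index, "cause": "tool_error",
                    "confidence": CONFIDENCE["tool_error"]}
    # Assumption: with no explicit signal, fall back to the last turn.
    last = len(turns) - 1 if turns else None
    return {"turn": last, "cause": "heuristic",
            "confidence": CONFIDENCE["heuristic"]}


turns = [
    {"role": "user", "content": "Refund my order."},
    {"role": "assistant", "content": "Calling refund tool.", "tool_status": "error"},
    {"role": "assistant", "content": "Retrying."},
]
print(explain_failure(turns))
```

Checking tiers in descending order of confidence means an explicit violation anywhere in the trace takes precedence over an earlier, less certain signal.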

> [!TIP]
> **Visual Triage:** Use `multiagent-eval console` to view these failure tags interactively. The dashboard highlights `POLICY_VIOLATION` and `STALL` events with visual cues in the trajectory timeline.