# User Manual – MultiAgentEval
This guide is for users who want to run and understand evaluations without diving into internal implementation details.
## Table of Contents
- Core Concepts
- Running Evaluations (CLI)
- Scenario Structure
- Metrics Explained
- Drift & Triage (Advanced)
## Core Concepts

### Scenario
A scenario is the unit of evaluation. It's a JSON file that defines:
- `scenario_id` – unique identifier
- `title` – human-friendly name
- `industry` – category for grouping
- `dataset` – (optional) path to a synthetic CSV/JSONL dataset to ground the scenario. Path Decoupling (v1.1+): relative paths (e.g., `./data.csv`) are resolved relative to the scenario file itself.
- `tasks` – list of tasks to run
- `tools` – mock tool behaviors (optional)
- `policies` – rules and governance checks (optional)
- `initial_state` – starting state (optional)
### Task
A task is a step in a scenario:
- `task_id` – unique identifier within the scenario
- `description` – the prompt sent to the agent
- `success_criteria` – metrics + thresholds used to determine success
- `required_tools` – which tools the agent should call (optional)
- `expected_state_changes` – expected changes to sandbox state (optional)
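Putting those fields together, a single task entry might look like the sketch below (the `success_criteria` shape matches the scenario example later in this guide; the tool name `reset_router` is purely illustrative):

```json
{
  "task_id": "t1",
  "description": "Reset the customer's router and confirm connectivity.",
  "success_criteria": [
    {"metric": "policy_compliance", "threshold": 1.0}
  ],
  "required_tools": ["reset_router"]
}
```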
### Tool Sandbox
The harness uses a sandbox to simulate tool calls and external dependencies. This means:

- No real API calls are made unless explicitly configured
- Tool behavior is controlled by the scenario definition
- Policies can be enforced during tool execution
### AES (Agent Eval Specification)
AES is a standardized YAML/JSON schema for defining benchmarks, tasks, expected outcomes, metrics, and policy constraints. It enables consistent sharing of evaluation scenarios across tools and repositories.
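As a rough sketch only (every field name below is illustrative, not the normative schema – generate the real template with `multiagent-eval aes scaffold`), an AES file groups a benchmark's tasks, metrics, and policy constraints in one document:

```yaml
# Illustrative AES sketch; field names are assumptions, not the spec.
benchmark: telecom_smoke_test
tasks:
  - task_id: t1
    description: "Diagnose the reported outage."
metrics:
  - policy_compliance
policies: []
```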
### Metrics

Metrics score an agent's performance. Built-in metrics include:
- `policy_compliance` – avoids policy violations
- `path_parsimony` – prefers fewer turns (efficiency)
- `state_verification` – validates expected state changes. Supports dot-notation (e.g., `user.profile.balance`) for nested object verification.
- `calculation_accuracy` – high-fidelity: extracts and validates numerical results within a configurable tolerance.
- `planning_quality` – evaluates strategic sequencing and decision-making logic.
- `root_cause_analysis_correctness` – assesses the accuracy of agent diagnostics.
- `consistency_score` – checks stability across multiple runs.
- `luna_judge_score` – semantic and behavioral evaluation via LLM-Judge (calibratable to human ground truth, with required provider guards).
## Running Evaluations (CLI)

### evaluate – batch evaluation

```shell
multiagent-eval evaluate --path industries/telecom --format jsonl --output reports/latest_results.json --attempts 3
```
Key options:
- --path (required): Scenario file or directory
- --format: jsonl (default) or csv
- --output: Output report path
- --limit: Limit number of scenarios executed
- --attempts: Run pass@k (multiple attempts per scenario)
What happens during evaluation:
1. Loads each scenario
2. Runs each task for up to `EVAL_MAX_TURNS` turns
3. Calls the agent (default: HTTP adapter)
4. Executes sandbox tool calls
5. Computes metrics
6. Writes reports and trace files
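In outline, the steps above form a simple loop. The sketch below uses stubbed agent and sandbox functions to show the shape of that loop; it is not the harness's actual internals, and `call_agent` / `run_tool` stand in for the HTTP adapter and tool sandbox.

```python
# Simplified sketch of the evaluation loop; stubs replace the real
# HTTP adapter and sandbox.
EVAL_MAX_TURNS = 5

def call_agent(prompt, history):
    # Stub: a real adapter would POST to the agent's HTTP endpoint.
    return {"tool": "send_email", "done": True}

def run_tool(action, state):
    # Stub: the sandbox applies the tool's declared state changes.
    state["emails.sent"] = True
    return {"status": "success"}

def evaluate_task(task):
    state, history = {}, []
    for turn in range(EVAL_MAX_TURNS):
        action = call_agent(task["description"], history)
        history.append(run_tool(action, state))
        if action.get("done"):
            break
    # Metrics are then computed over the final state and history.
    return {"turns": turn + 1, "state": state}

result = evaluate_task({"description": "Send a greeting email."})
```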
## CLI Quick Reference

| Command | Common options | What it does |
|---|---|---|
| `multiagent-eval evaluate` | `--path`, `--format`, `--output`, `--limit`, `--attempts` / `-k` | Run a set of scenarios (batch mode) |
| `multiagent-eval run` | `-k` | Run a single scenario JSON file |
| `multiagent-eval replay` | `--path` | Replay a recorded run trace |
| `multiagent-eval aes validate` | `--path` | Validate AES benchmark YAML (v1.1) |
| `multiagent-eval aes scaffold` | `--output` | Generate a starter AES v1.1 template |
| `multiagent-eval console` | `--port` | Launch the Visual Debugger local API and interactive GUI (Integrated Visual Suite) |
| `multiagent-eval quickstart` | (none) | 60-second CLI demo (spawns agent + runs evaluation) |
| `multiagent-eval doctor` | (none) | Check environment, dependencies, and configuration |
| `multiagent-eval init` | `--dir`, `--industry` | Scaffold a new benchmark environment and linked datasets |
| `multiagent-eval report` | `--path` | Generate a Premium HTML report with reconstructed trajectories |
| `multiagent-eval scenario generate` | (none) | Interactively bootstrap new scenarios |
| `multiagent-eval record` | `--agent` | Capture real interactions into an executable trace |
| `multiagent-eval playground` | `--agent` | Interactive REPL for rapid experimentation |
| `multiagent-eval spec-to-eval` | `--input`, `--output` | Convert a Markdown spec to scenario JSON |
| `multiagent-eval auto-translate` | `--input`, `--model`, `--industry` | Translate raw documents (PDF, DOCX) into scenario JSON via a local LLM |
| `multiagent-eval import-drift` | `--input`, `--industry`, `--output-dir` | Convert a production trace to a scenario |
| `multiagent-eval mutate` | `--input`, `--type`, `--output` | Generate adversarial scenario variants |
| `multiagent-eval list` | `--search` | Search and explore the scenario catalog |
| `multiagent-eval lint` | `--path` | Score and validate scenario quality/compliance |
| `multiagent-eval plugin` | `<plugin_name> <cmd>` | Secure namespace for executing 3rd-party plugin commands |
### run – single scenario
Use this for rapid iteration and debugging.
### Run Trace Management
The harness records every event (agent messages, tool calls, metrics) into trace files for debugging and auditing.
Default Behavior:
- All runs are appended to runs/run.jsonl.
- Each run is also saved to its own file: runs/run-<run_id>.jsonl.
Configuration (Environment Variables):
| Variable | Default | Description |
|---|---|---|
| RUN_LOG_DIR | runs | Directory where trace files are stored. |
| RUN_LOG_PER_RUN | true | Save each run to a separate file. |
| RUN_LOG_MASTER | true | Append all runs to a master run.jsonl. |
| RUN_LOG_ROTATE_COUNT | 0 | Number of per-run files to keep. 0 means keep all. |
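Trace files are plain JSONL (one JSON event per line), so they are easy to post-process with standard tooling. A minimal reader, assuming nothing about the event schema:

```python
import json

def load_trace(path):
    """Read one event dict per non-empty line from a run trace file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `load_trace("runs/run.jsonl")` returns a list of event dictionaries you can filter by whatever fields your traces contain.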
## Adoption & Productivity Utilities
These utilities are designed to get you from "installation" to "first eval" in seconds.
### quickstart – The 60-Second Demo
Runs a complete evaluation loop using the built-in sample agent in your terminal.
- Spawns the sample agent server process.
- Runs a telecom troubleshooting scenario.
- Generates a Premium HTML report (Mermaid trajectories enabled) in `reports/`.

Note: This command is designed for CLI-only instant feedback; use `multiagent-eval console` for the visual experience.
### console – React Visual Debugger GUI
Launch a high-fidelity visual dashboard to run scenarios, inspect trace lines chronologically, and review system documentation locally.
#### Key Features

- Scenario Explorer: Browse the catalog with search filters. View real-time Lint Scores and quality status badges.
- Background Execution: Trigger evaluations directly from the UI; monitor progress in real time.
- Visual DNA Debugger: Live trajectory playback, state inspection, and trace export via the `DebuggerStateStore` hook.
- API Reference: Integrated technical documentation drawer for one-click access to guides.
### doctor – Environment Validator
Troubleshoot your installation and connectivity.
### report – Premium Visual Reporting
Generate a premium HTML report with interactive trajectory maps reconstructed from historical trace events.
### scenario generate – Interactive Scaffolding
Bootstrap new test cases without writing JSON by hand.
### record – Trace Capture
Capture real interactions with your agent to create new eval scenarios.
### playground – Interactive Experimentation
Talk to your agent directly in the CLI and see how it performs.
### list – Scenario Catalog Search
Discover scenarios across the built-in and downloaded libraries.
### lint – Quality Scoring
Check your scenarios for AES compliance and technical quality.
- 90-100: High quality, CI-ready.
- 70-89: Warning (missing metadata or low complexity).
- <70: Fail (structural errors or zero tasks).

## Advanced CLI Utilities & UX
### install – Scenario Packs
Rapidly deploy industry-specific scenario bundles (e.g., telecom-pack, rag-agent-pack).
### analyze – Repo Scanning
Scan agent repositories to identify tool patterns and auto-generate matching AES scenarios.
### explain – Trace Analysis
Automated diagnostic analysis of run.jsonl traces to identify root causes of agent failures.
### Visual Scenario Editor
Built into the Visual Debugger (multiagent-eval console), this tool provides a visual interface for constructing complex AES logic and saving it directly to the local industry catalog.
## Scenario Structure (Example)
```json
{
  "scenario_id": "example_01",
  "title": "Basic instruction",
  "industry": "generic",
  "tasks": [
    {
      "task_id": "t1",
      "description": "Write a friendly greeting.",
      "success_criteria": [
        {"metric": "policy_compliance", "threshold": 1.0},
        {"metric": "path_parsimony", "threshold": 0.5}
      ]
    }
  ]
}
```
## Adding Industries & Scenarios
The simplest way to add a completely new industry is to generate a bootstrapped setup using init, which automatically creates a starter scenario and linked synthetic datasets.
If you prefer to add them manually:
The harness loads scenarios from industries/<industry>/scenarios/.
- Create a directory for your industry (if it doesn't exist).
- Add a JSON scenario file (any name ending in `.json`):
```json
{
  "scenario_id": "my_scenario_01",
  "title": "Example scenario",
  "industry": "<your_industry>",
  "tasks": [ ... ]
}
```
- Run it via CLI:
- (Optional) Run an industry batch:
### Tip

Keep `scenario_id` unique within the industry and prefer descriptive file names like `scenario_<short-name>.json`.
## Agent Topology (Multi-Agent Scenarios)
When a scenario involves more than one agent, you can define an agent_topology object to control which agents can read or write which parts of the shared state.
This prevents agents from unintentionally interfering with each other and enables fine-grained multi-agent evaluation.
Example:
"agent_topology": {
"agent_a": {"reads": ["user.*"], "writes": ["user.profile"]},
"agent_b": {"reads": ["user.*", "order.*"], "writes": ["order.status"]}
}
### Best practices (Agent Topology)
- Use topology only when you need multi-agent separation. For single-agent scenarios, omitting `agent_topology` keeps things simpler.
- Start with broad read permissions and tighten over time. This helps avoid accidental "permission denied" failures while you iterate.
- Match topology to tool behavior. If a tool writes a state path, ensure the calling agent has `writes` rights for that path.
- Avoid overlap for sensitive state. If two agents shouldn't see each other's data, keep their `reads` sets disjoint.
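For instance, a fully disjoint split following the same schema as the example above (the agent names and state paths here are illustrative) would be:

```json
"agent_topology": {
  "billing_agent": {"reads": ["billing.*"], "writes": ["billing.invoice"]},
  "support_agent": {"reads": ["tickets.*"], "writes": ["tickets.status"]}
}
```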
## Tool Definitions
Tools are defined within the scenario and can:

- Apply state changes
- Return structured outputs
- Enforce policies
Example:
"tools": {
"send_email": {
"state_changes": [
{"path": "emails.sent", "value": true}
],
"output": {"status": "success", "message": "Email sent"}
}
}
## Policies
Policies enforce guardrails during tool execution.
Example:
If the agent calls `withdraw_money` with an amount above the limit, the harness returns a `policy_violation` response.
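A `policies` block for that example could be sketched as follows. Note that every key name here is an assumption for illustration; consult a scaffolded scenario for the exact policy schema.

```json
"policies": {
  "withdrawal_limit": {
    "tool": "withdraw_money",
    "max_amount": 500,
    "on_violation": "policy_violation"
  }
}
```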
## Metrics Explained
Metrics are evaluated per task and decide if the task succeeded.
### policy_compliance
Ensures the agent didnโt trigger policy violations via tool calls.
### path_parsimony
Rewards fewer turns (more concise/efficient behavior).
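One plausible scoring shape, shown purely as an illustration (the harness's actual formula is internal and may differ), normalizes the turn count against the turn budget:

```python
def path_parsimony(turns, max_turns):
    """Illustrative only: 1.0 for a single turn, 0.0 at the budget."""
    if max_turns <= 1:
        return 1.0
    return max(0.0, 1.0 - (turns - 1) / (max_turns - 1))
```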
### state_verification
Validates that the sandbox state matches expected changes.
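The dot-notation resolution mentioned earlier (e.g., `user.profile.balance`) can be sketched like this; the real metric is internal to the harness, but the mechanics are straightforward:

```python
def get_path(state, dotted):
    """Resolve a dot-notation path like 'user.profile.balance'."""
    node = state
    for key in dotted.split("."):
        node = node[key]
    return node

def verify_state(state, expected):
    """True if every expected dotted path holds its expected value."""
    return all(get_path(state, path) == value
               for path, value in expected.items())
```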
### tool_call_correctness
Ensures the agent called required tools.
### calculation_accuracy
High-Fidelity: Uses regex to extract numerical values from the agent's summary and compares them against the expected values with a 0.01 tolerance.
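In sketch form (the exact regex and pass/fail semantics are the harness's own; this only illustrates the extract-and-compare idea with the 0.01 tolerance mentioned above):

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def calculation_accuracy(summary, expected, tol=0.01):
    """Pass if any number extracted from the summary is within tol."""
    values = [float(m) for m in NUMBER.findall(summary)]
    return any(abs(v - expected) <= tol for v in values)
```

For example, a summary reading "Total due: 42.005 USD" would pass against an expected value of 42.0, while "Total due: 41.5 USD" would not.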
### planning_quality & root_cause_analysis
Advanced cognitive metrics using domain-specific LLM rubrics to evaluate the quality of an agent's planning and diagnostic accuracy.
### consistency_score

Used when `--attempts` > 1 to measure stability across runs.
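One simple way to express such a stability score (an assumption for illustration, not the harness's actual formula) is the fraction of attempts that agree with the majority outcome:

```python
from collections import Counter

def consistency_score(outcomes):
    """Illustrative: fraction of attempts matching the majority outcome."""
    if not outcomes:
        return 0.0
    (_, count), = Counter(outcomes).most_common(1)
    return count / len(outcomes)
```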
## Drift & Triage (Advanced)
### Importing Drift
Convert production traces into evaluation scenarios:
```shell
multiagent-eval import-drift --input production_trace.jsonl --industry telecom --output-dir industries/telecom/scenarios
```
### State-Level Trajectory Triage (How Root Cause Isolation Works)
AgentEval isolates the root cause of a failure by combining three layers of analysis, not just scanning logs.
**Layer 1: State Parity Check (VFS Delta)**

Every World Shim (Database, Jira, Git, API, etc.) is "VFS-aware". When an agent calls a tool, AgentEval compares the resulting system state against the "Ground Truth" defined in the scenario. If the agent queries the wrong table or fails to commit a file, the State Divergence is flagged immediately as the "Patient Zero" step. This catches "silent failures" where the agent thinks it succeeded but the environment changed incorrectly.
**Layer 2: Heuristic Triage Engine (triage.py)**
A specialized engine scans the entire trace for failure patterns:
- Stall Detection – identifies if an agent is looping (e.g., calling `list_dir` 3 times with no change in behavior).
- Tool-Level Exceptions – captures internal simulator errors (e.g., a "404 Not Found" from the API Shim) that the agent might have ignored.
- Policy Violations – if a Security Shim blocks an action (like a regex-based data leak), the triage engine flags the exact guardrail triggered.
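The stall heuristic described above can be sketched as follows; the window size and the representation of a tool call are illustrative, not taken from `triage.py`:

```python
def detect_stall(tool_calls, window=3):
    """Flag a loop: the same tool call repeated `window` times in a row.
    Each entry is a (tool_name, args) pair; illustrative heuristic only."""
    for i in range(len(tool_calls) - window + 1):
        chunk = tool_calls[i:i + window]
        if all(call == chunk[0] for call in chunk):
            return True
    return False
```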
**Layer 3: Visual Timeline Mapping**

The Visual Debugger's "Isolate Root Cause" automatically scrolls the timeline to the first Non-Success Signal – the exact failing node, highlighted in red.
| Layer | What it detects | Why it matters |
|---|---|---|
| State | Data/File divergence | Catches silent failures where the agent thinks it succeeded. |
| Logic | Loops & Stalls | Identifies when an agent's reasoning has hit a dead-end. |
| Security | Policy Violations | Pinpoints exactly which guardrail was triggered and why. |
By combining these layers, AgentEval can distinguish between an agent that hallucinated a tool's existence vs. an agent that used the right tool but with the wrong parameters.
**Why use Shims instead of Real APIs?**

- Safety: No risk of accidentally deleting a production database or emailing a real customer.
- Determinism: You can force the shim to fail (e.g., simulate a 500 error) to test error handling.
- Speed: Simulated responses are near-instant vs. real network latency.
See the full technical deep-dive: 06_TRIAGE_ENGINE_AND_VFS.md
## Next Steps
For architecture and extension patterns, read the Developer Guide:
- docs/guides/help/03_DEVELOPER_GUIDE.md
- docs/guides/help/06_TRIAGE_ENGINE_AND_VFS.md