
🧠 User Manual — MultiAgentEval

This guide is for users who want to run and understand evaluations without diving into internal implementation details.


📌 Table of Contents

  1. Core Concepts
  2. Running Evaluations (CLI)
  3. Scenario Structure
  4. Metrics Explained
  5. Drift & Triage (Advanced)

📚 Core Concepts

๐Ÿ—‚๏ธ Scenario

A scenario is the unit of evaluation. It's a JSON file that defines:

  • scenario_id — unique identifier
  • title — human-friendly name
  • industry — category for grouping
  • dataset — (optional) path to a synthetic CSV/JSONL dataset to ground the scenario. Path decoupling (v1.1+): relative paths (e.g., ./data.csv) are resolved relative to the scenario file itself.
  • tasks — list of tasks to run
  • tools — mock tool behaviors (optional)
  • policies — rules and governance checks (optional)
  • initial_state — starting state (optional)
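The path-decoupling rule for dataset can be sketched as follows; `resolve_dataset_path` is a hypothetical helper for illustration, not part of the harness API:

```python
import os
from pathlib import Path

def resolve_dataset_path(scenario_file: str, dataset: str) -> Path:
    """Resolve a scenario's dataset path relative to the scenario file itself (v1.1+ rule)."""
    dataset_path = Path(dataset)
    if dataset_path.is_absolute():
        return dataset_path
    # A relative path like ./data.csv is anchored at the scenario's directory,
    # not the current working directory.
    return Path(os.path.normpath(Path(scenario_file).parent / dataset_path))
```

With this rule, a scenario at industries/telecom/scenarios/s1.json referencing ./data.csv always loads industries/telecom/scenarios/data.csv, regardless of where the CLI is invoked from.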

✅ Task

A task is a step in a scenario:

  • task_id
  • description — the prompt sent to the agent
  • success_criteria — metrics + thresholds used to determine success
  • required_tools — which tools the agent should call (optional)
  • expected_state_changes — expected changes to sandbox state (optional)

🧰 Tool Sandbox

The harness uses a sandbox to simulate tool calls and external dependencies. This means:

  • No real API calls are made unless explicitly configured
  • Tool behavior is controlled by the scenario definition
  • Policies can be enforced during tool execution

🧩 AES (Agent Eval Specification)

AES is a standardized YAML/JSON schema for defining benchmarks, tasks, expected outcomes, metrics, and policy constraints. It enables consistent sharing of evaluation scenarios across tools and repositories.

๐Ÿ“ Metrics

Metrics score an agent's performance. Built-in metrics include:

  • policy_compliance — avoids policy violations
  • path_parsimony — prefers fewer turns (efficiency)
  • state_verification — validates expected state changes; supports dot-notation (e.g., user.profile.balance) for nested object verification
  • tool_call_correctness — ensures the agent called required tools
  • calculation_accuracy — high-fidelity: extracts and validates numerical results within a configurable tolerance
  • planning_quality — evaluates strategic sequencing and decision-making logic
  • root_cause_analysis_correctness — assesses the accuracy of agent diagnostics
  • consistency_score — checks stability across multiple runs
  • luna_judge_score — semantic and behavioral evaluation via LLM-Judge (calibratable to human ground truth, with required provider guards)


โ–ถ๏ธ Running Evaluations (CLI)

🧪 evaluate — batch evaluation

multiagent-eval evaluate --path industries/telecom --format jsonl --output reports/latest_results.json --attempts 3

Key options:

  • --path (required): Scenario file or directory
  • --format: jsonl (default) or csv
  • --output: Output report path
  • --limit: Limit number of scenarios executed
  • --attempts: Run pass@k (multiple attempts per scenario)

What happens during evaluation:

  1. Loads each scenario
  2. Runs each task for up to EVAL_MAX_TURNS turns
  3. Calls the agent (default: HTTP adapter)
  4. Executes sandbox tool calls
  5. Computes metrics
  6. Writes reports and trace files
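The evaluation loop can be sketched roughly as follows. The function and callback names here are illustrative, not the harness's internal API, and the fallback turn budget of 10 is an arbitrary assumption:

```python
import os

# Turn budget per task; mirrors the EVAL_MAX_TURNS environment variable.
# The fallback value of 10 is an assumption, not a documented default.
EVAL_MAX_TURNS = int(os.environ.get("EVAL_MAX_TURNS", "10"))

def evaluate_scenario(scenario, call_agent, run_tool, score):
    """Run every task in a scenario for up to EVAL_MAX_TURNS turns, then score it."""
    results = []
    for task in scenario["tasks"]:
        transcript = []
        for _turn in range(EVAL_MAX_TURNS):
            reply = call_agent(task["description"], transcript)  # step 3: call the agent
            transcript.append(reply)
            if reply.get("tool_call"):                           # step 4: sandbox tool call
                transcript.append(run_tool(reply["tool_call"]))
            if reply.get("done"):
                break
        results.append({"task_id": task["task_id"],
                        "metrics": score(task, transcript)})     # step 5: compute metrics
    return results
```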

⚡ CLI Quick Reference

| Command | Common options | What it does |
|---|---|---|
| multiagent-eval evaluate | --path, --format, --output, --limit, --attempts / -k | Run a set of scenarios (batch mode) |
| multiagent-eval run | -k | Run a single scenario JSON file |
| multiagent-eval replay | --path | Replay a recorded run trace |
| multiagent-eval aes validate | --path | Validate AES benchmark YAML (v1.1) |
| multiagent-eval aes scaffold | --output | Generate a starter AES v1.1 template |
| multiagent-eval console | --port | Launch the Visual Debugger local API and interactive GUI (Integrated Visual Suite) |
| multiagent-eval quickstart | (none) | 60-second CLI demo (spawns agent + runs evaluation) |
| multiagent-eval doctor | (none) | Check environment, dependencies, and configuration |
| multiagent-eval init | --dir, --industry | Scaffold a new benchmark environment and linked datasets |
| multiagent-eval report | --path | Generate a premium HTML report with reconstructed trajectories |
| multiagent-eval scenario generate | (none) | Interactively bootstrap new scenarios |
| multiagent-eval record | --agent | Capture real interactions into an executable trace |
| multiagent-eval playground | --agent | Interactive REPL for rapid experimentation |
| multiagent-eval spec-to-eval | --input, --output | Convert a Markdown spec to scenario JSON |
| multiagent-eval auto-translate | --input, --model, --industry | Translate raw documents (PDF, DOCX) into scenario JSON via a local LLM |
| multiagent-eval import-drift | --input, --industry, --output-dir | Convert a production trace to a scenario |
| multiagent-eval mutate | --input, --type, --output | Generate adversarial scenario variants |
| multiagent-eval list | --search | Search and explore the scenario catalog |
| multiagent-eval lint | --path | Score and validate scenario quality/compliance |
| multiagent-eval plugin | <plugin_name> <cmd> | Secure namespace for executing 3rd-party plugin commands |

🧩 run — single scenario

multiagent-eval run --scenario scenarios/your_scenario.json -k 2

Use this for rapid iteration and debugging.

๐Ÿ“ฝ๏ธ Run Trace Management

The harness records every event (agent messages, tool calls, metrics) into trace files for debugging and auditing.

Default behavior:

  • All runs are appended to runs/run.jsonl.
  • Each run is also saved to its own file: runs/run-<run_id>.jsonl.

Configuration (environment variables):

| Variable | Default | Description |
|---|---|---|
| RUN_LOG_DIR | runs | Directory where trace files are stored. |
| RUN_LOG_PER_RUN | true | Save each run to a separate file. |
| RUN_LOG_MASTER | true | Append all runs to a master run.jsonl. |
| RUN_LOG_ROTATE_COUNT | 0 | Number of per-run files to keep; 0 means keep all. |
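For ad-hoc analysis, a JSONL trace file can be read with a few lines of Python. This is a sketch; the exact event schema (e.g., a run_id field on each event) is an assumption, not documented here:

```python
import json
from pathlib import Path

def load_run_events(trace_path, run_id=None):
    """Read a JSONL trace file, optionally keeping only events for one run_id."""
    events = []
    for line in Path(trace_path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines defensively
        event = json.loads(line)
        if run_id is None or event.get("run_id") == run_id:
            events.append(event)
    return events
```

This pattern works for both the master runs/run.jsonl (filter by run_id) and the per-run files (load everything).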


🚀 Adoption & Productivity Utilities

These utilities are designed to get you from "installation" to "first eval" in seconds.

๐Ÿƒ quickstart โ€” The 60-Second Demo

Runs a complete evaluation loop using the built-in sample agent in your terminal.

multiagent-eval quickstart
  • Spawns the sample agent server process.
  • Runs a telecom troubleshooting scenario.
  • Generates a premium HTML report (Mermaid trajectories enabled) in reports/.
  • Note: this command is designed for CLI-only instant feedback; use multiagent-eval console for the visual experience.

๐Ÿ–ฅ๏ธ console โ€” React Visual Debugger GUI

Launch a high-fidelity visual dashboard to run scenarios, inspect trace lines chronologically, and review system documentation locally.

Key Features:

  • Scenario Explorer: Browse the catalog with search filters. View real-time Lint Scores and quality status badges.
  • Background Execution: Trigger evaluations directly from the UI; monitor progress in real-time.
  • Visual DNA Debugger: Live trajectory playback, state inspection, and trace export via the DebuggerStateStore hook.
  • API Reference: Integrated technical documentation drawer for one-click access to guides.
multiagent-eval console --port 5000

๐Ÿ” doctor โ€” Environment Validator

Troubleshoot your installation and connectivity.

multiagent-eval doctor

🎨 report — Premium Visual Reporting

Generate a premium HTML report with interactive trajectory maps reconstructed from historical trace events.

multiagent-eval report --path runs/run-<run_id>.jsonl

✨ scenario generate — Interactive Scaffolding

Bootstrap new test cases without writing JSON by hand.

multiagent-eval scenario generate

โบ record โ€” Trace Capture

Capture real interactions with your agent to create new eval scenarios.

multiagent-eval record --agent http://localhost:5001/execute_task

🎮 playground — Interactive Experimentation

Talk to your agent directly in the CLI and see how it performs.

multiagent-eval playground --agent http://localhost:5001/execute_task

🗂️ list — Scenario Discovery

Discover scenarios across the built-in and downloaded libraries.

multiagent-eval list --search "telecom"

🧹 lint — Quality Scoring

Check your scenarios for AES compliance and technical quality.

multiagent-eval lint --path industries/telecom/scenarios/troubleshooting_v1.json
  • 90-100: High quality, CI-ready.
  • 70-89: Warning (missing metadata or low complexity).
  • <70: Fail (structural errors or zero tasks).


🚀 Advanced CLI Utilities & UX

📦 install — Scenario Packs

Rapidly deploy industry-specific scenario bundles (e.g., telecom-pack, rag-agent-pack).

multiagent-eval install telecom-pack

🔬 analyze — Repo Scanning

Scan agent repositories to identify tool patterns and auto-generate matching AES scenarios.

multiagent-eval analyze https://github.com/example/agent

🤖 explain — Trace Analysis

Automated diagnostic analysis of run.jsonl traces to identify root causes of agent failures.

multiagent-eval explain --path runs/run.jsonl

๐Ÿ› ๏ธ Visual Scenario Editor

Built into the Visual Debugger (multiagent-eval console), this tool provides a visual interface for constructing complex AES logic and saving it directly to the local industry catalog.



🧩 Scenario Structure (Example)

{
  "scenario_id": "example_01",
  "title": "Basic instruction",
  "industry": "generic",
  "tasks": [
    {
      "task_id": "t1",
      "description": "Write a friendly greeting.",
      "success_criteria": [
        {"metric": "policy_compliance", "threshold": 1.0},
        {"metric": "path_parsimony", "threshold": 0.5}
      ]
    }
  ]
}

🧱 Adding Industries & Scenarios

The simplest way to add a completely new industry is to generate a bootstrapped setup using init, which automatically creates a starter scenario and linked synthetic datasets.

multiagent-eval init --dir industries/my_industry --industry my_industry

If you prefer to add them manually: The harness loads scenarios from industries/<industry>/scenarios/.

  1. Create a directory for your industry (if it doesn't exist):

     mkdir -p industries/<your_industry>/scenarios

  2. Add a JSON scenario file (any name ending in .json):

     {
       "scenario_id": "my_scenario_01",
       "title": "Example scenario",
       "industry": "<your_industry>",
       "tasks": [ ... ]
     }

  3. Run it via the CLI:

     multiagent-eval run --scenario industries/<your_industry>/scenarios/<file>.json

  4. (Optional) Run an industry batch:

     multiagent-eval evaluate --path industries/<your_industry>

🔎 Tip

Keep scenario_id unique within the industry and prefer descriptive file names like scenario_<short-name>.json.

🧩 Agent Topology (Multi-Agent Scenarios)

When a scenario involves more than one agent, you can define an agent_topology object to control which agents can read or write which parts of the shared state. This prevents agents from unintentionally interfering with each other and enables fine-grained multi-agent evaluation.

Example:

"agent_topology": {
  "agent_a": {"reads": ["user.*"], "writes": ["user.profile"]},
  "agent_b": {"reads": ["user.*", "order.*"], "writes": ["order.status"]}
}
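Assuming the reads/writes patterns are glob-style matches against dot-notation state paths (an assumption; the manual does not specify the matching rules), a permission check could look like:

```python
from fnmatch import fnmatch

def can_read(topology: dict, agent: str, state_path: str) -> bool:
    """True when any of the agent's read patterns matches the dot-notation path."""
    return any(fnmatch(state_path, p) for p in topology.get(agent, {}).get("reads", []))

def can_write(topology: dict, agent: str, state_path: str) -> bool:
    """True when any of the agent's write patterns matches the dot-notation path."""
    return any(fnmatch(state_path, p) for p in topology.get(agent, {}).get("writes", []))
```

Under this reading, agent_a above may write user.profile but not order.status, while agent_b may write order.status.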

✅ Best practices (Agent Topology)

  • Use topology only when you need multi-agent separation. For single-agent scenarios, omitting agent_topology keeps things simpler.
  • Start with broad read permissions and tighten over time. This helps avoid accidental "permission denied" failures while you iterate.
  • Match topology to tool behavior. If a tool writes a state path, ensure the calling agent has writes rights for that path.
  • Avoid overlap for sensitive state. If two agents shouldn't see each other's data, keep their reads sets disjoint.

🛠️ Tool Definitions

Tools are defined within the scenario and can:

  • Apply state changes
  • Return structured outputs
  • Enforce policies

Example:

"tools": {
  "send_email": {
    "state_changes": [
      {"path": "emails.sent", "value": true}
    ],
    "output": {"status": "success", "message": "Email sent"}
  }
}

🚨 Policies

Policies enforce guardrails during tool execution.

Example:

"policies": {
  "withdraw_money": {"max_limit": 500}
}

If the agent calls withdraw_money with an amount above the limit, the harness returns a policy_violation response.
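A minimal sketch of how such a max_limit guardrail could be enforced. This is illustrative only; `check_policy` and the response shape are assumptions, not the harness API:

```python
def check_policy(policies: dict, tool_name: str, args: dict) -> dict:
    """Return a policy_violation response when a tool call exceeds its max_limit rule."""
    rule = policies.get(tool_name, {})
    if "max_limit" in rule and args.get("amount", 0) > rule["max_limit"]:
        return {"status": "policy_violation",
                "message": f"{tool_name}: amount {args['amount']} exceeds limit {rule['max_limit']}"}
    return {"status": "success"}
```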


📊 Metrics Explained

Metrics are evaluated per task and decide if the task succeeded.

✅ policy_compliance

Ensures the agent didn't trigger policy violations via tool calls.

🧭 path_parsimony

Rewards fewer turns (more concise/efficient behavior).
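The exact scoring formula is not documented here; purely as an illustration, a budget-based parsimony score might look like:

```python
def path_parsimony(actual_turns: int, budget: int) -> float:
    """Full score at or under the turn budget, decaying linearly past it.
    Illustrative only: the harness's real formula is not documented in this manual."""
    if actual_turns <= budget:
        return 1.0
    return max(0.0, 1.0 - (actual_turns - budget) / budget)
```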

🧱 state_verification

Validates that the sandbox state matches expected changes. Dot-notation paths (e.g., user.profile.balance) address nested objects.
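A minimal sketch of dot-notation state verification; `get_by_dot_path` and `verify_state` are hypothetical helpers, not the harness's internals:

```python
def get_by_dot_path(state: dict, path: str):
    """Walk nested dicts following a dot-notation path like 'user.profile.balance'."""
    node = state
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None  # path doesn't exist in the state
        node = node[key]
    return node

def verify_state(state: dict, expected: dict) -> bool:
    """True when every expected dot-path resolves to its expected value."""
    return all(get_by_dot_path(state, path) == value for path, value in expected.items())
```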

🧰 tool_call_correctness

Ensures the agent called required tools.

๐Ÿ“ calculation_accuracy

High-fidelity: uses regex to extract numerical values from the agent's summary and compares them against the expected values within a configurable tolerance (0.01 by default).
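A rough sketch of this extract-and-compare approach; the regex and function name are illustrative, not the harness's actual implementation:

```python
import re

def numbers_match(summary: str, expected: float, tolerance: float = 0.01) -> bool:
    """Pull every numeric literal out of the summary and accept if any is within tolerance."""
    values = [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", summary)]
    return any(abs(v - expected) <= tolerance for v in values)
```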

🧠 planning_quality & root_cause_analysis_correctness

Advanced cognitive metrics using domain-specific LLM rubrics to evaluate the quality of an agent's planning and diagnostic accuracy.

๐Ÿ” consistency_score

Used when --attempts > 1 to measure stability across runs.
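One plausible way to compute such a stability score, assuming each attempt yields a discrete outcome (illustrative only; not the harness's documented formula):

```python
from collections import Counter

def consistency_score(outcomes: list) -> float:
    """Fraction of attempts that agree with the most common outcome."""
    if not outcomes:
        return 0.0
    (_, count), = Counter(outcomes).most_common(1)
    return count / len(outcomes)
```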


🧠 Drift & Triage (Advanced)

๐ŸŒช๏ธ 5.1 Importing Drift

Convert production traces into evaluation scenarios:

multiagent-eval import-drift --input production_trace.jsonl --industry telecom --output-dir industries/telecom/scenarios

🔬 5.2 State-Level Trajectory Triage (How Root Cause Isolation Works)

AgentEval isolates the root cause of a failure by combining three layers of analysis — not just scanning logs.

Layer 1: State Parity Check (VFS Delta)

Every World Shim (Database, Jira, Git, API, etc.) is "VFS-aware". When an agent calls a tool, AgentEval compares the resulting system state against the "Ground Truth" defined in the scenario. If the agent queries the wrong table or fails to commit a file, the State Divergence is flagged immediately as the "Patient Zero" step. This catches "silent failures" where the agent thinks it succeeded but the environment changed incorrectly.

Layer 2: Heuristic Triage Engine (triage.py)

A specialized engine scans the entire trace for failure patterns:

  • Stall Detection — identifies if an agent is looping (e.g., calling list_dir 3 times with no change in behavior).
  • Tool-Level Exceptions — captures internal simulator errors (e.g., a "404 Not Found" from the API Shim) that the agent might have ignored.
  • Policy Violations — if a Security Shim blocks an action (like a regex-based data leak), the triage engine flags the exact guardrail triggered.
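The stall-detection heuristic, for example, could be approximated like this (a sketch under assumed trace shapes, not the actual triage.py logic):

```python
def detect_stall(tool_calls: list, window: int = 3) -> bool:
    """Flag a loop when the same tool call (name + arguments) repeats `window` times in a row."""
    for i in range(len(tool_calls) - window + 1):
        span = tool_calls[i:i + window]
        if all(call == span[0] for call in span):
            return True
    return False
```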

Layer 3: Visual Timeline Mapping

The Visual Debugger's "Isolate Root Cause" automatically scrolls the timeline to the first Non-Success Signal — the exact failing node, highlighted in red.

| Layer | What it detects | Why it matters |
|---|---|---|
| State | Data/file divergence | Catches silent failures where the agent thinks it succeeded. |
| Logic | Loops & stalls | Identifies when an agent's reasoning has hit a dead end. |
| Security | Policy violations | Pinpoints exactly which guardrail was triggered and why. |

By combining these layers, AgentEval can distinguish between an agent that hallucinated a tool's existence vs. an agent that used the right tool but with the wrong parameters.

Why use Shims instead of Real APIs?

  • Safety: no risk of accidentally deleting a production database or emailing a real customer.
  • Determinism: you can force the shim to fail (e.g., simulate a 500 error) to test error handling.
  • Speed: simulated responses are near-instant vs. real network latency.

📖 See the full technical deep-dive: 06_TRIAGE_ENGINE_AND_VFS.md


📎 Next Steps

For architecture and extension patterns, read the Developer Guide:

  • docs/guides/help/03_DEVELOPER_GUIDE.md
  • docs/guides/help/06_TRIAGE_ENGINE_AND_VFS.md