Skip to content

CLI Reference

The multiagent-eval (or python -m eval_runner) CLI provides a comprehensive suite of tools for agent evaluation, management, and debugging.

Core Commands

evaluate

Run evaluations on one or more scenarios.

multiagent-eval evaluate --path <path> [--attempts K] [--limit N] [--verbose]
- --path: Path to a single Scenario JSON file, a directory containing scenarios, or a Benchmark URI (e.g., gaia://2023). Supports Path Decoupling: If a scenario is located outside the standard /industries directory, the harness automatically resolves relative dataset paths and tags the scenario as local/unclassified. - --attempts: Number of attempts (K) per scenario for pass@k calculation. - --limit: Max number of scenarios to run. - --agent-name: Human-readable name for the agent (for reports and leaderboards). Priority: CLI Flag > Zero-Touch Discovery > Endpoint URL. - --verbose: Enable detailed execution logs. - --pilot: Quick-run pilot mode (forces --limit 5 --attempts 1). - --seed: Set random seed for reproducibility. - --retry-failed: Retry only previously failed scenarios from the latest trace. - --push-hf: HuggingFace repo ID to push results to after evaluation. - --output: Path to save the final results (default: reports/latest_results.json).

Environment Variables: | Variable | Default | Description | |---|---|---| | AGENT_API_URL | http://localhost:5001/execute_task | Agent endpoint for http protocol | | EVAL_MAX_TURNS | 5 | Max conversation turns per task | | JUDGE_PROVIDER | ollama | LLM Judge provider (openai, anthropic, gemini, ollama, grok) | | JUDGE_MODEL | - | Specific model for the judge (e.g., gpt-4o, claude-3-5-sonnet) | | LUNA_JUDGE_TEMPERATURE| 0.0 | Temperature for judge generation | | OLLAMA_HOST | http://localhost:11434 | Ollama service endpoint | | OPENAI_API_KEY | - | API key for OpenAI provider | | OPENAI_BASE_URL | https://api.openai.com/v1 | Base URL for OpenAI-compatible APIs | | ANTHROPIC_API_KEY| - | API key for Anthropic/Claude provider | | GOOGLE_API_KEY | - | API key for Google/Gemini provider | | XAI_API_KEY | - | API key for xAI/Grok provider | | AUTOGEN_API_URL | http://localhost:5002/execute_task | Endpoint for autogen protocol | | DEFAULT_ADAPTER_TIMEOUT| 30 | Network timeout for agent adapters |

  • --protocol: Agent protocol (e.g., http, local, socket, autogen, langgraph, etc.). Note: All ecosystem adapters are discovery-driven; the CLI dynamically populates these choices from available plugins.
  • --agent: Unified agent target. Can be a URL (for http, autogen, langgraph), a shell command (for local), or an address (for socket).
  • --agent-cmd: (Legacy) Shell command for local protocol.
  • --agent-socket: (Legacy) Socket address for socket protocol.
  • --format: Dataset format (jsonl or csv).

Research Summary Output: When --attempts > 1, the harness generates: - reports/research_summary.json: Raw aggregate data. - reports/research_summary.md: A formatted Markdown table of Pass@k, Success Consistency, and Semantic Stability.

run

Execute a single scenario file or a Benchmark URI.

multiagent-eval run --scenario <path_or_uri> [--attempts K] [--agent <url>]
- --scenario: Path to a single scenario file or a Benchmark URI. - --attempts: Number of attempts (pass@k) for this single scenario. - --agent: Override the default agent URL for this specific run.

Example (Benchmark): multiagent-eval run --scenario gaia://2023 (Executes all scenarios in the GAIA 2023 benchmark).

list

Search the scenario catalog with keyword and faceted filtering.

multiagent-eval list [--search <query>]
- --search: Search scenarios by title, industry, or tags. - --refresh: Rebuild the scenario index from source.

lint

Verify scenario quality and AES specification compliance.

multiagent-eval lint --path <path_to_scenario_or_dir>
- Runs automated checks for metadata quality, valid structure, and duplicate detection. - Provides a quality score (0-100) and detailed warning/error report.

init

Scaffold a new benchmark directory with starter scenarios for a specific industry. Automatically links the scenario to realistic synthetic CSV datasets.

multiagent-eval init --dir <directory_name> --industry <industry_name>

install

Install curated scenario packs (e.g., telecom-pack, rag-agent-pack).

multiagent-eval install <pack-name>
Example: multiagent-eval install telecom-pack downloads and registers a bundle of 100+ telecom-specific agent scenarios.

analyze

Scan an agent's GitHub repository to identify tool patterns and auto-generate matching AES scenarios.

multiagent-eval analyze <github_url>
Example: multiagent-eval analyze https://github.com/my-org/my-agent scaffolds scenarios in scenarios/auto/ based on detected tool definitions.

verify

Verify the integrity of a run trace using SHA-256 checksums and manifest validation.

multiagent-eval verify --path <path_to_run.jsonl> [--manifest <path_to_manifest.json>]

contribute

Launch the interactive wizard to create and submit new scenarios to the public catalog.

multiagent-eval contribute

leaderboard

Generate a performance comparison table from multiple execution traces.

multiagent-eval leaderboard [--dir <trace_directory>] [--output <filename.md>]
- --dir: Directory containing .jsonl traces (default: runs). - --output: Output Markdown filename (default: LEADERBOARD.md).

Specification & Validation

aes validate

Validate Agent Eval Specification (.aes.yaml) files against the official schema.

multiagent-eval aes validate --path <path>
- Performs deep structure checking using jsonschema. - Ensures all mandatory benchmark fields are present.

aes scaffold

Automatically generate a template Agent Eval Specification (.aes.yaml) file.

multiagent-eval aes scaffold --output <path>
- Creates a baseline YAML structure compliant with the latest benchmark standards.

spec-to-eval

Convert a Markdown PRD/Spec file into a structured Scenario JSON.

multiagent-eval spec-to-eval --input <prd.md> [--output <scenario.json>] [--fill-defaults]
- --input: Path to the Markdown specification file. - --output: Optional. Custom output path for the generated JSON. - --fill-defaults: Optional. Automatically populates mandatory AES fields. - Intelligent Classification: The command includes a Semantic Similarity Classifier (using sentence-transformers) that automatically identifies industry, use_case, and core_function from the spec's conceptual context (e.g., distinguishing between finance and legal based on the nature of the request).

auto-translate

Translate raw, unstructured documents (TXT, MD, PDF, DOCX) into structured Scenario JSON files using a local LLM.

multiagent-eval auto-translate --input <document.pdf> [--model <model_name>] [--industry <industry>]
- --input: Path to the source document (PDF, TXT, MD, DOCX). - --model: Local Ollama model to use (default: llama3). - --industry: Force a specific industry category; if omitted, the tool attempts semantic classification.

Requirement: Ollama must be running locally.

ci generate

Scaffold a .github/workflows/agent_eval.yml file to run evaluations automatically on Pull Requests.

multiagent-eval ci generate

Drift & Research

import-drift

Convert production traces (interaction logs) into reusable evaluation scenarios.

multiagent-eval import-drift --input <trace.json> --industry <industry>

mutate

Generate adversarial variants of a scenario (e.g., adding typos, prompt injection).

multiagent-eval mutate --input <scenario.json> --type <mutation_type>

export

Convert internal execution traces (run.jsonl) into externally shareable dataset formats (like HuggingFace Datasets).

multiagent-eval export --input <run.jsonl> --format hf --output <dataset.json>

Query the global Failure Corpus to retrieve known failing edge cases for specific topics (e.g., PII, timeouts).

multiagent-eval failures search <query>
Example: multiagent-eval failures search "pii leaks" discovers and imports 5 realistic failing scenarios from the corpus.

inspect

Show detailed metadata and task descriptions for a specific scenario file.

multiagent-eval inspect --scenario-path <path_to.json>

taxonomy

Show the official failure taxonomy used for triage and reporting.

multiagent-eval taxonomy

Perform a deep keyword search across the title, ID, and description of all scenarios.

multiagent-eval catalog-search --query <search_term>

list-metrics

List all registered evaluation metrics (including those provided by plugins).

multiagent-eval list-metrics

cleanup-runs

Perform housekeeping by removing old execution traces.

multiagent-eval cleanup-runs [--days N] [--force]
- --days: Remove files older than N days (default: 7). - --force: Skip confirmation prompt.

Debugging & Exploration

replay

Re-execute a run.jsonl flight recorder log to debug "wrong turns".

multiagent-eval replay --path <path/to/run.jsonl>

explain

Automatically analyze a run.jsonl trace to diagnose root causes with high-fidelity tiered scoring and actionable technical fixes.

multiagent-eval explain --path <path/to/run.jsonl>
Forensic Features: - Tiered Confidence Scoring: Distinguishes between explicit policy violations (100%), induced system/tool errors (85%), and heuristic fallbacks (50%). - Actionable Recommendations: Provides targeted remediation advice (e.g., prompt refinement, tool sandbox optimization) based on the identified failure pattern. - Pinpoint Diagnostics: Identifies the exact turn (index) in the trajectory where the failure logic diverged.

calibrate

Measure alignment between the LLM judge and human ground truth in a flight recorder log.

multiagent-eval calibrate --path <path/to/run.jsonl>
Metrics: Calculates Pearson Correlation and Mean Absolute Error (MAE) based on paired luna_judge_score and human_score events.

playground

Launch an interactive REPL to talk to an agent directly in the terminal.

multiagent-eval playground [--agent <url>]

record

Record a live interaction with an agent and save it as a structured trace.

multiagent-eval record [--agent <url>]

Utilities

console

Launch the Visual Debugger backend API and Unified React SPA. The debugger provides a high-density dashboard for end-to-end evaluation management: - Scenario Explorer: Browse the catalog with faceted filters, global search, and real-time Lint Scores. - Visual AES Builder: Drag-and-drop integrated logic builder that saves production-ready JSON directly to the industry catalog. - Background Evaluation: Trigger runs directly from the UI; the console handles background execution and event streaming. - Visual Debugger: Real-time trajectory playback with interactive state inspection powered by the DebuggerStateStore.

multiagent-eval console [--host 127.0.0.1] [--port 5000]

doctor

Check the local environment for missing dependencies or configuration issues.

multiagent-eval doctor

quickstart

Run a 60-second guided demo that spawns a mock agent and executes an evaluation.

multiagent-eval quickstart

report

Generate a standalone Premium HTML report from an execution trace.

multiagent-eval report --path <path/to/run.jsonl>
Feature Highlights: - Trace Reconstruction: Automatically reconstructs hierarchical task results, metrics, and triage tags from historical JSONL events. - Visual Trajectories: Generates interactive Mermaid maps for every task in the trace. - Reproduction Scripts: After every evaluation run, the harness generates an inert reproduction script in reports/repro/repro_<id>.txt containing exact CLI instructions to re-run the scenario.

scenario generate

Interactively workspace to generate new test scenarios via a terminal wizard.

multiagent-eval scenario generate

Plugin Commands

plugin <name> <command>

Execute plugin-specific subcommands. Plugins register their own commands under a secure namespace to prevent command hijacking.

multiagent-eval plugin <plugin_name> <command> [options]

Security Note: All plugin commands are namespaced under multiagent-eval plugin <name> to prevent command hijacking. The legacy extend_cli hook has been removed.