CLI Reference¶
The multiagent-eval CLI (also invocable as python -m eval_runner) provides a suite of tools for agent evaluation, management, and debugging.
Core Commands¶
evaluate¶
Run evaluations on one or more scenarios.
- --path: Path to a single Scenario JSON file, a directory containing scenarios, or a Benchmark URI (e.g., gaia://2023). Supports Path Decoupling: If a scenario is located outside the standard /industries directory, the harness automatically resolves relative dataset paths and tags the scenario as local/unclassified.
- --attempts: Number of attempts (K) per scenario for pass@k calculation.
- --limit: Max number of scenarios to run.
- --agent-name: Human-readable name for the agent (for reports and leaderboards). Priority: CLI Flag > Zero-Touch Discovery > Endpoint URL.
- --verbose: Enable detailed execution logs.
- --pilot: Quick-run pilot mode (forces --limit 5 --attempts 1).
- --seed: Set random seed for reproducibility.
- --retry-failed: Retry only previously failed scenarios from the latest trace.
- --push-hf: HuggingFace repo ID to push results to after evaluation.
- --output: Path to save the final results (default: reports/latest_results.json).
Environment Variables:
| Variable | Default | Description |
|---|---|---|
| AGENT_API_URL | http://localhost:5001/execute_task | Agent endpoint for http protocol |
| EVAL_MAX_TURNS | 5 | Max conversation turns per task |
| JUDGE_PROVIDER | ollama | LLM Judge provider (openai, anthropic, gemini, ollama, grok) |
| JUDGE_MODEL | - | Specific model for the judge (e.g., gpt-4o, claude-3-5-sonnet) |
| LUNA_JUDGE_TEMPERATURE | 0.0 | Temperature for judge generation |
| OLLAMA_HOST | http://localhost:11434 | Ollama service endpoint |
| OPENAI_API_KEY | - | API key for OpenAI provider |
| OPENAI_BASE_URL | https://api.openai.com/v1 | Base URL for OpenAI-compatible APIs |
| ANTHROPIC_API_KEY | - | API key for Anthropic/Claude provider |
| GOOGLE_API_KEY | - | API key for Google/Gemini provider |
| XAI_API_KEY | - | API key for xAI/Grok provider |
| AUTOGEN_API_URL | http://localhost:5002/execute_task | Endpoint for autogen protocol |
| DEFAULT_ADAPTER_TIMEOUT | 30 | Network timeout for agent adapters |
- --protocol: Agent protocol (e.g., http, local, socket, autogen, langgraph). Note: All ecosystem adapters are discovery-driven; the CLI dynamically populates these choices from available plugins.
- --agent: Unified agent target. Can be a URL (for http, autogen, langgraph), a shell command (for local), or an address (for socket).
- --agent-cmd: (Legacy) Shell command for the local protocol.
- --agent-socket: (Legacy) Socket address for the socket protocol.
- --format: Dataset format (jsonl or csv).
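For the http protocol, the agent answers POST requests at its endpoint (AGENT_API_URL above). A minimal stub built on Python's standard library can stand in for a real agent during local testing; the task/result field names below are illustrative assumptions, not the harness's actual wire format:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Illustrative payload shape; the real wire format is defined by the harness.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        reply = {"result": f"echo: {payload.get('task', '')}"}
        body = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep output quiet during tests

def serve(port: int = 5001):
    """Serve the stub agent forever on the given port."""
    HTTPServer(("localhost", port), AgentHandler).serve_forever()
```

Point AGENT_API_URL at the stub to smoke-test the evaluation loop without a real agent.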
Research Summary Output:
When --attempts > 1, the harness generates:
- reports/research_summary.json: Raw aggregate data.
- reports/research_summary.md: A formatted Markdown table of Pass@k, Success Consistency, and Semantic Stability.
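With K attempts per scenario, pass@k is typically estimated with the standard unbiased formula over n attempts and c successes; a sketch of that computation (the harness's exact aggregation may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts with c successes."""
    if n - c < k:
        return 1.0  # every size-k sample contains at least one success
    # Probability that a random size-k sample contains no success, inverted.
    return 1.0 - comb(n - c, k) / comb(n, k)
```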
run¶
Execute a single scenario file or a Benchmark URI.
- --scenario: Path to a single scenario file or a Benchmark URI.
- --attempts: Number of attempts (pass@k) for this single scenario.
- --agent: Override the default agent URL for this specific run.
Example (Benchmark): multiagent-eval run --scenario gaia://2023 (Executes all scenarios in the GAIA 2023 benchmark).
list¶
Search the scenario catalog with keyword and faceted filtering.
- --search: Search scenarios by title, industry, or tags.
- --refresh: Rebuild the scenario index from source.
lint¶
Verify scenario quality and AES specification compliance.
- Runs automated checks for metadata quality, valid structure, and duplicate detection.
- Provides a quality score (0-100) and a detailed warning/error report.

init¶
Scaffold a new benchmark directory with starter scenarios for a specific industry. Automatically links the scenario to realistic synthetic CSV datasets.
install¶
Install curated scenario packs (e.g., telecom-pack, rag-agent-pack).
multiagent-eval install telecom-pack downloads and registers a bundle of 100+ telecom-specific agent scenarios.
analyze¶
Scan an agent's GitHub repository to identify tool patterns and auto-generate matching AES scenarios.
Example: multiagent-eval analyze https://github.com/my-org/my-agent scaffolds scenarios in scenarios/auto/ based on detected tool definitions.
verify¶
Verify the integrity of a run trace using SHA-256 checksums and manifest validation.
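The underlying check is straightforward to reproduce; a sketch assuming a manifest of the form {"files": {name: digest}} (the real manifest layout may differ):

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest_path: Path) -> bool:
    """Check every file listed in the manifest against its recorded digest."""
    manifest = json.loads(manifest_path.read_text())
    root = manifest_path.parent
    return all(
        sha256_of(root / name) == digest
        for name, digest in manifest["files"].items()
    )
```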
contribute¶
Launch the interactive wizard to create and submit new scenarios to the public catalog.
leaderboard¶
Generate a performance comparison table from multiple execution traces.
- --dir: Directory containing .jsonl traces (default: runs).
- --output: Output Markdown filename (default: LEADERBOARD.md).
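Conceptually, the leaderboard reduces each trace to per-agent success rates; a simplified sketch assuming flat JSONL events with agent and success fields (the real trace schema is richer):

```python
import json
from collections import defaultdict

def success_rates(trace_lines):
    """Aggregate per-agent success rates from JSONL trace lines."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for line in trace_lines:
        event = json.loads(line)
        totals[event["agent"]] += 1
        wins[event["agent"]] += bool(event["success"])
    return {agent: wins[agent] / totals[agent] for agent in totals}
```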
Specification & Validation¶
aes validate¶
Validate Agent Eval Specification (.aes.yaml) files against the official schema.
- Performs deep structure checking using jsonschema.
- Ensures all mandatory benchmark fields are present.
aes scaffold¶
Automatically generate a template Agent Eval Specification (.aes.yaml) file.
- Creates a baseline YAML structure compliant with the latest benchmark standards.

spec-to-eval¶
Convert a Markdown PRD/Spec file into a structured Scenario JSON.
- --input: Path to the Markdown specification file.
- --output: Optional. Custom output path for the generated JSON.
- --fill-defaults: Optional. Automatically populates mandatory AES fields.
- Intelligent Classification: The command includes a Semantic Similarity Classifier (using sentence-transformers) that automatically identifies industry, use_case, and core_function from the spec's conceptual context (e.g., distinguishing between finance and legal based on the nature of the request).
auto-translate¶
Translate raw, unstructured documents (TXT, MD, PDF, DOCX) into structured Scenario JSON files using a local LLM.
multiagent-eval auto-translate --input <document.pdf> [--model <model_name>] [--industry <industry>]
- --input: Path to the source document (PDF, TXT, MD, DOCX).
- --model: Local Ollama model to use (default: llama3).
- --industry: Force a specific industry category; if omitted, the tool attempts semantic classification.
Requirement: Ollama must be running locally.
ci generate¶
Scaffold a .github/workflows/agent_eval.yml file to run evaluations automatically on Pull Requests.
Drift & Research¶
import-drift¶
Convert production traces (interaction logs) into reusable evaluation scenarios.
mutate¶
Generate adversarial variants of a scenario (e.g., adding typos, prompt injection).
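A typo mutation can be as simple as seeded adjacent-character swaps; an illustrative sketch (not the harness's actual mutation engine):

```python
import random

def add_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Generate a typo variant by swapping adjacent alphabetic characters."""
    rng = random.Random(seed)  # seeded for reproducible variants
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # don't re-swap the pair back
        else:
            i += 1
    return "".join(chars)
```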
export¶
Convert internal execution traces (run.jsonl) into externally shareable dataset formats (like HuggingFace Datasets).
failures search¶
Query the global Failure Corpus to retrieve known failing edge cases for specific topics (e.g., PII, timeouts).
Example: multiagent-eval failures search "pii leaks" discovers and imports 5 realistic failing scenarios from the corpus.
inspect¶
Show detailed metadata and task descriptions for a specific scenario file.
taxonomy¶
Show the official failure taxonomy used for triage and reporting.
catalog-search¶
Perform a deep keyword search across the title, ID, and description of all scenarios.
list-metrics¶
List all registered evaluation metrics (including those provided by plugins).
cleanup-runs¶
Perform housekeeping by removing old execution traces.
- --days: Remove files older than N days (default: 7).
- --force: Skip confirmation prompt.
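The housekeeping logic amounts to an mtime cutoff; a simplified sketch in which force=True deletes immediately instead of prompting (illustrative, not the actual implementation):

```python
import time
from pathlib import Path

def cleanup_runs(runs_dir: Path, days: int = 7, force: bool = False) -> list:
    """Return trace files older than `days`; delete them when force=True."""
    cutoff = time.time() - days * 86400
    stale = [p for p in runs_dir.glob("*.jsonl") if p.stat().st_mtime < cutoff]
    if force:
        for p in stale:
            p.unlink()
    return stale
```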
Debugging & Exploration¶
replay¶
Re-execute a run.jsonl flight recorder log to debug "wrong turns".
explain¶
Automatically analyze a run.jsonl trace to diagnose root causes with high-fidelity tiered scoring and actionable technical fixes.
calibrate¶
Measure alignment between the LLM judge and human ground truth in a flight recorder log.
Metrics: Calculates Pearson Correlation and Mean Absolute Error (MAE) based on paired luna_judge_score and human_score events.
playground¶
Launch an interactive REPL to talk to an agent directly in the terminal.
record¶
Record a live interaction with an agent and save it as a structured trace.
Utilities¶
console¶
Launch the Visual Debugger backend API and Unified React SPA. The debugger provides a high-density dashboard for end-to-end evaluation management:
- Scenario Explorer: Browse the catalog with faceted filters, global search, and real-time Lint Scores.
- Visual AES Builder: Drag-and-drop integrated logic builder that saves production-ready JSON directly to the industry catalog.
- Background Evaluation: Trigger runs directly from the UI; the console handles background execution and event streaming.
- Visual Debugger: Real-time trajectory playback with interactive state inspection powered by the DebuggerStateStore.
doctor¶
Check the local environment for missing dependencies or configuration issues.
quickstart¶
Run a 60-second guided demo that spawns a mock agent and executes an evaluation.
report¶
Generate a standalone Premium HTML report from an execution trace.
Feature Highlights:
- Trace Reconstruction: Automatically reconstructs hierarchical task results, metrics, and triage tags from historical JSONL events.
- Visual Trajectories: Generates interactive Mermaid maps for every task in the trace.
- Reproduction Scripts: After every evaluation run, the harness generates an inert reproduction script in reports/repro/repro_<id>.txt containing exact CLI instructions to re-run the scenario.
scenario generate¶
Launch an interactive terminal wizard to generate new test scenarios.
Plugin Commands¶
plugin <name> <command>¶
Execute plugin-specific subcommands. Each plugin registers its commands under its own namespace.
Security Note: All plugin commands are namespaced under multiagent-eval plugin <name> to prevent command hijacking. The legacy extend_cli hook has been removed.