MultiAgentEval Guide¶
This document is the single authoritative reference for using, extending, and developing the MultiAgentEval framework. It is organized into three levels: Quick Start, User Manual, and Developer Guide.
1) Quick Start (Get Running in 60 Seconds)¶
1.1 Install and Run¶
-
Clone and install MultiAgentEval:
-
Run the Quickstart Demo:
What it does: Spawns a mock sample agent, runs a troubleshooting evaluation, and generates a Premium HTML report in reports/.
[!TIP] Integrated Visual Suite: Launch the premium web dashboard via
multiagent-eval console. It now features a Live Debugger that streams real-time state from your evaluations using the Zero-TouchRemoteBridgePlugin.
1.2 Useful CLI Commands¶
multiagent-eval console: Launch the React Visual Debugger GUI for visual management.multiagent-eval doctor: Check your environment health.multiagent-eval list --search <query>: Search the scenario catalog (supports faceted filtering).multiagent-eval lint --path <path>: Verify scenario quality and AES specification compliance.multiagent-eval spec-to-eval --fill-defaults: Convert rough specs into valid, lint-passable scenarios.multiagent-eval auto-translate --input <doc>: Convert PDFs/Docs into JSON scenarios using Ollama.multiagent-eval run --scenario <benchmark-uri>: Zero-config execution for community benchmarks (GAIA, AssistantBench).multiagent-eval report --path <path>: Generate a standalone Premium HTML report (reconstructed from any.jsonltrace).multiagent-eval replay --path <path>: View the step-by-step history of a run in terminal.- Advanced Utilities:
install,analyze,ci generate,failures search, andexplain.
[!NOTE] Path Decoupling: Evaluations can now be run from any directory. The harness automatically resolves relative dataset paths and tags ad-hoc scenarios as
local.
2) User Manual (Scenario & Execution Guide)¶
2.1 Core Concepts¶
- AES (Agent Eval Specification): A standardized JSON/YAML format for defining portable benchmarks.
- Scenario: A JSON file containing tasks, tools, and success criteria.
- Task: An individual instruction sent to the agent within a scenario.
- Tool Sandbox: A state-aware execution environment for agent tool calls.
- Synthetic Datasets: Auto-generated industry-specific CSV tabular data used to ground scenarios in reality.
- Plugin System: Lifecycle hooks to extend the harness without modifying the core.
2.2 Advanced CLI¶
evaluate --path <dir> --attempts 3: Run batch evaluations with pass@k scoring.import-drift --input <pord.jsonl>: Convert production traces into regression tests.mutate --input <file> --type injection: Generate adversarial test variants for red-teaming.- Advanced Utilities:
install,analyze,ci generate,failures search, andexplain(Reference). - Visual Scenario Editor: A visual React UI for constructing AES logic and saving it directly to the catalog, accessible via the
consolecommand.
3) Developer Guide (Zero-Touch Core Architecture)¶
3.1 Architecture Overview¶
The harness uses a Zero-Touch Core design, where all major capabilities are hot-swappable via the Modular Plugin Bus: 1. Runner: Orchestrates high-level loop (e.g., pass@k). 2. Session: Manages multi-turn conversation and context. 3. EventEmitter: Broadcasts state changes through a central bus. 4. AgentAdapterRegistry: Dynamically discovers and registers agent protocols at runtime. 5. Plugins: Standard lifecycle listeners and interceptors.
3.2 Extensibility¶
- Interception: Use
on_tool_requestto block or proxy tool calls. - Observability: Subscribe to
CoreEventsvia theEventEmitterfor non-blocking logging. - Advanced Evaluation: Standard support for dot-notation in state verification, High-Fidelity Calculation Accuracy, and Judge Guarding (strict failure for required metrics).
- Secure Namespaces: Legacy
extend_cliis removed. Useon_register_commandsto register commands under themultiagent-eval plugin <name>sub-command. - Ecosystem Adapters: Official, zero-touch support for LangChain, Ollama, OpenAI, Gemini, Claude, AutoGen, and xAI Grok.
- Immutability: TurnContext and EvaluationContext are frozen to prevent accidental side-effects.
🧪 Testing & Verification¶
Run the full suite:
For industry-specific scenarios:
For deep-dives into specific topics, see the specialized guides in docs/guides/help/:
- State-Level Triage & VFS Guide — how root cause isolation works
- Adding World Shims — extending the simulator suite
- Plugin Development — zero-touch extensibility