MultiAgentEval Guide¶

This document is the single authoritative reference for using, extending, and developing the MultiAgentEval framework. It is organized into three levels: Quick Start, User Manual, and Developer Guide.

1) Quick Start (Get Running in 60 Seconds)¶

1.1 Install and Run¶

Clone and install MultiAgentEval:

git clone https://github.com/najeed/ai-agent-eval-harness.git
cd ai-agent-eval-harness
pip install -e .

Run the Quickstart Demo:
```
multiagent-eval quickstart
```

What it does: Spawns a mock sample agent, runs a troubleshooting evaluation, and generates a Premium HTML report in reports/.

[!TIP] Integrated Visual Suite: Launch the premium web dashboard via multiagent-eval console. It now features a Live Debugger that streams real-time state from your evaluations using the Zero-Touch RemoteBridgePlugin.

1.2 Useful CLI Commands¶

multiagent-eval console: Launch the React Visual Debugger GUI for visual management.
multiagent-eval doctor: Check your environment health.
multiagent-eval list --search <query>: Search the scenario catalog (supports faceted filtering).
multiagent-eval lint --path <path>: Verify scenario quality and AES specification compliance.
multiagent-eval spec-to-eval --fill-defaults: Convert rough specs into valid, lint-passable scenarios.
multiagent-eval auto-translate --input <doc>: Convert PDFs/Docs into JSON scenarios using Ollama.
multiagent-eval run --scenario <benchmark-uri>: Zero-config execution for community benchmarks (GAIA, AssistantBench).
multiagent-eval report --path <path>: Generate a standalone Premium HTML report (reconstructed from any .jsonl trace).
multiagent-eval replay --path <path>: View the step-by-step history of a run in terminal.
Advanced Utilities: install, analyze, ci generate, failures search, and explain.

[!NOTE] Path Decoupling: Evaluations can now be run from any directory. The harness automatically resolves relative dataset paths and tags ad-hoc scenarios as local.

2) User Manual (Scenario & Execution Guide)¶

2.1 Core Concepts¶

AES (Agent Eval Specification): A standardized JSON/YAML format for defining portable benchmarks.
Scenario: A JSON file containing tasks, tools, and success criteria.
Task: An individual instruction sent to the agent within a scenario.
Tool Sandbox: A state-aware execution environment for agent tool calls.
Synthetic Datasets: Auto-generated industry-specific CSV tabular data used to ground scenarios in reality.
Plugin System: Lifecycle hooks to extend the harness without modifying the core.

2.2 Advanced CLI¶

evaluate --path <dir> --attempts 3: Run batch evaluations with pass@k scoring.
import-drift --input <pord.jsonl>: Convert production traces into regression tests.
mutate --input <file> --type injection: Generate adversarial test variants for red-teaming.
Advanced Utilities: install, analyze, ci generate, failures search, and explain (Reference).
Visual Scenario Editor: A visual React UI for constructing AES logic and saving it directly to the catalog, accessible via the console command.

3) Developer Guide (Zero-Touch Core Architecture)¶

3.1 Architecture Overview¶

The harness uses a Zero-Touch Core design, where all major capabilities are hot-swappable via the Modular Plugin Bus: 1. Runner: Orchestrates high-level loop (e.g., pass@k). 2. Session: Manages multi-turn conversation and context. 3. EventEmitter: Broadcasts state changes through a central bus. 4. AgentAdapterRegistry: Dynamically discovers and registers agent protocols at runtime. 5. Plugins: Standard lifecycle listeners and interceptors.

3.2 Extensibility¶

Interception: Use on_tool_request to block or proxy tool calls.
Observability: Subscribe to CoreEvents via the EventEmitter for non-blocking logging.
Advanced Evaluation: Standard support for dot-notation in state verification, High-Fidelity Calculation Accuracy, and Judge Guarding (strict failure for required metrics).
Secure Namespaces: Legacy extend_cli is removed. Use on_register_commands to register commands under the multiagent-eval plugin <name> sub-command.
Ecosystem Adapters: Official, zero-touch support for LangChain, Ollama, OpenAI, Gemini, Claude, AutoGen, and xAI Grok.
Immutability: TurnContext and EvaluationContext are frozen to prevent accidental side-effects.

🧪 Testing & Verification¶

Run the full suite:

pytest tests/

For industry-specific scenarios:

multiagent-eval run --scenario industries/<ind>/scenarios/<file>.json

For deep-dives into specific topics, see the specialized guides in docs/guides/help/: - State-Level Triage & VFS Guide — how root cause isolation works - Adding World Shims — extending the simulator suite - Plugin Development — zero-touch extensibility