
🏁 Quick Start — MultiAgentEval

Get up and running in under 60 seconds. This guide is for people who want to run an evaluation with minimal setup.


🚀 The 60-Second Demo

The fastest way to see the harness in action is the quickstart command. It automatically handles agent setup, scenario execution, and report generation.

# 1. Install the harness
pip install -e .

# 2. Run the Quickstart
multiagent-eval quickstart

What happens:

1. Spawns a sample agent server in the background.
2. Executes a telecom troubleshooting evaluation.
3. Generates a Premium HTML report in reports/ (with Mermaid trajectories).
4. Shuts down the agent automatically.
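The spawn, run, and shut-down lifecycle described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the harness's actual implementation: `run_with_background` is a hypothetical helper, and it does not wait for the server to become ready before starting the task.

```python
import subprocess


def run_with_background(server_cmd, task_cmd, shutdown_timeout=5.0):
    """Start server_cmd, run task_cmd to completion, then stop the server.

    Hypothetical helper for illustration only. Returns task_cmd's exit code.
    """
    server = subprocess.Popen(server_cmd)  # background process
    try:
        # Blocks until the foreground task finishes.
        return subprocess.run(task_cmd).returncode
    finally:
        # Always stop the background process, even if the task failed.
        server.terminate()
        try:
            server.wait(timeout=shutdown_timeout)
        except subprocess.TimeoutExpired:
            server.kill()
            server.wait()


# Example call, using commands that appear elsewhere in this guide:
# run_with_background(
#     ["python", "sample_agent/agent_app.py"],
#     ["multiagent-eval", "evaluate", "--path", "industries/telecom"],
# )
```

The `finally` block is the important part: the background agent is stopped whether the evaluation succeeds, fails, or raises.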


🏗️ Building Your Own Suite

When you're ready to start building benchmarks for your own use-case:

1. Scaffold the Project

Generate a starter workspace linked to automatically generated realistic datasets.

multiagent-eval init --dir my_benchmarks --industry finance

2. Auto-Translate Existing Specs

If you already have PDF or Markdown guidelines, convert them into JSON scenarios automatically (requires local Ollama):

multiagent-eval auto-translate --input specs/loan_approval.pdf --industry finance


🛠 Manual Setup (The "Standard" Way)

If you're ready to integrate your own agent, follow these steps.

1. Start your Agent

Ensure your agent is running and accessible via an HTTP endpoint.

# Example (starting the sample agent)
python sample_agent/agent_app.py
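If you do not have an agent yet, a stand-in HTTP endpoint might look roughly like the sketch below. The port and path mirror the default AGENT_API_URL; everything else is an assumption. In particular, the request and response schema the harness expects is not described in this guide, so `handle_task` simply echoes the payload back and should be replaced with your agent's real logic.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_task(payload: dict) -> dict:
    """Hypothetical task handler: echoes the request back.

    The real payload/response shape is defined by the harness and is NOT
    documented here; this is a placeholder.
    """
    return {"status": "ok", "received": payload}


class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only serve the path used in the default AGENT_API_URL.
        if self.path != "/execute_task":
            self.send_error(404, "unknown path")
            return
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length) if length else b"{}"
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            self.send_error(400, "body must be JSON")
            return
        body = json.dumps(handle_task(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="localhost", port=5001):
    """Build (but do not start) the server."""
    return HTTPServer((host, port), AgentHandler)


# To serve requests until interrupted:
# make_server().serve_forever()
```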

2. Configure Environment

Set the AGENT_API_URL to point to your agent.

# Windows (Command Prompt)
set AGENT_API_URL=http://localhost:5001/execute_task
# macOS / Linux
export AGENT_API_URL=http://localhost:5001/execute_task

3. Run an Evaluation

Run an evaluation against a directory from the global library of 5,000+ industry-grade scenarios, for example the telecom set:

multiagent-eval evaluate --path industries/telecom

4. Validate Benchmarks

Ensure your custom benchmarks are AES v0.2 compliant.

multiagent-eval aes validate --path my_benchmarks


📊 Viewing Results

After the run, you'll see a summary in the console and detailed logs in:

  • reports/latest_results.json
  • runs/run.jsonl (Flight Recorder)
  • reports/report_<id>.html (Premium HTML Report with trace reconstruction)
  • Interactive Dashboard: Run multiagent-eval console for visual background evaluation and live DNA debugging.


🔍 Replay a Run Trace

Inspect exactly what happened during an evaluation (agent prompts, tool calls, metrics):

multiagent-eval replay --path runs/run.jsonl
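If you would rather inspect the trace file directly, the sketch below reads it in Python. It assumes only that runs/run.jsonl is a JSON Lines file (one JSON value per line), as the extension suggests. The record schema is not described in this guide, so the code counts top-level keys rather than assuming any field names; `read_jsonl` and `summarize_keys` are illustrative helpers, not part of the harness.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of the file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_no}: invalid JSON ({exc.msg})"
                ) from exc


def summarize_keys(records):
    """Count how often each top-level key appears across dict records."""
    counts = Counter()
    for record in records:
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


# Example:
# for key, n in summarize_keys(read_jsonl("runs/run.jsonl")).most_common():
#     print(key, n)
```

Counting keys first is a quick way to learn what the trace actually contains before writing anything that depends on specific fields.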

⚙️ Useful Environment Variables

| Variable | Default | Purpose |
|---|---|---|
| AGENT_API_URL | http://localhost:5001/execute_task | Agent endpoint URL |
| EVAL_MAX_TURNS | 5 | Max conversation turns per task |
| RUN_LOG_DIR | runs | Directory for execution traces |
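To check which values would be in effect in your current shell, a small script along these lines can help. It is a sketch that applies the defaults listed above when a variable is unset; `resolve_settings` is a hypothetical helper, and the harness's own configuration loading may differ.

```python
import os

# Defaults as listed in this guide.
DEFAULTS = {
    "AGENT_API_URL": "http://localhost:5001/execute_task",
    "EVAL_MAX_TURNS": "5",
    "RUN_LOG_DIR": "runs",
}


def resolve_settings(env=None):
    """Return effective settings: environment values override defaults."""
    env = os.environ if env is None else env
    merged = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    # Turn count is numeric; fail early if it is not a valid integer.
    merged["EVAL_MAX_TURNS"] = int(merged["EVAL_MAX_TURNS"])
    return merged


# Example:
# for name, value in resolve_settings().items():
#     print(f"{name}={value}")
```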

🧭 Next Steps

  • Want more detail? Read the User Manual: docs/guides/help/02_USER_MANUAL.md
  • Planning to extend or contribute? See the Developer Guide: docs/guides/help/03_DEVELOPER_GUIDE.md