# Onboarding Tutorial – MultiAgentEval

Welcome to the harness! This tutorial walks a first-time user (e.g., a Product Manager or an engineer new to the project) through a complete evaluation workflow with real commands and expected outputs.

🎯 **Goal:** Get from zero to a full evaluation run, inspect the results, and understand where to look next.

## 👤 Persona: New Evaluator (Product Manager / QA)

You are responsible for validating whether an agent behaves correctly for a business scenario. You don't need to know the internal code; you should be able to:
- Run an evaluation end-to-end
- Understand the key outputs (reports + traces)
- Make a small update to a scenario and re-run
## 1) Setup (First Time)

### ✅ Step 1: Clone & Install
Run:

```shell
git clone https://github.com/najeed/ai-agent-eval-harness.git
cd ai-agent-eval-harness
python -m venv .venv

# Activate the venv (macOS/Linux)
source .venv/bin/activate

# Activate the venv (Windows PowerShell)
# .\.venv\Scripts\Activate.ps1

# Install the harness and its dependencies
python -m pip install -e .
python -m pip install -r requirements.txt
```
### ✅ Step 2: Verify the CLI is available

Run:

✅ Expected output snippet:
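If the exact verification command is unclear in your build, a quick portable check is to look the entry point up on `PATH`. A sketch — the command name `multiagent-eval` is taken from the Path Decoupling note in the Next Steps section:

```python
# Check that the CLI entry point is resolvable from the active shell's PATH.
import shutil


def cli_available(name: str = "multiagent-eval") -> bool:
    return shutil.which(name) is not None


if not cli_available():
    print("multiagent-eval not found - did you activate the venv?")
```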
### ✅ Step 3: Explore the Scenario Catalog

Before running an evaluation, discover what's available:

✅ Expected output: A table of matching scenarios with IDs and titles.
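If you prefer the file system over the CLI, the same discovery can be scripted. A minimal sketch, assuming scenarios live as JSON files under `industries/` (the on-disk format is not specified in this tutorial, so treat the glob pattern as an assumption):

```python
from pathlib import Path


def list_scenario_files(root: str = "industries") -> list[str]:
    """Recursively collect scenario definition files under `root`."""
    base = Path(root)
    if not base.is_dir():
        return []
    # "*.json" is an assumption about the scenario file format.
    return sorted(str(p) for p in base.rglob("*.json"))


for path in list_scenario_files():
    print(path)
```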
## 2) Run a Quick Evaluation (First Smoke Test)

### ✅ Step 1: Pick a built-in industry scenario

The harness includes industry scenarios under `industries/`. You can use the list command above or browse the file system.
### ✅ Step 2: Run the evaluation
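Runs can also be driven from a script rather than the shell. A hedged sketch built on the documented `multiagent-eval evaluate --path <folder>` invocation (the `cli` parameter exists only so you can substitute a different binary while experimenting):

```python
import subprocess


def run_evaluation(scenario_path: str, cli: str = "multiagent-eval") -> int:
    """Invoke `<cli> evaluate --path <scenario_path>` and return its exit code."""
    result = subprocess.run(
        [cli, "evaluate", "--path", scenario_path],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    return result.returncode
```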
## 3) Add your own industry / scenario

Instead of manually typing out JSON, the easiest way to start a new industry benchmark is the `init` command.
- Scaffold the environment:
- Run the newly generated scenario:
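To make the scaffolding step concrete, here is what an `init`-style command conceptually produces on disk. Everything below — folder layout, file name, and JSON fields — is illustrative, not the harness's actual schema:

```python
import json
from pathlib import Path


def scaffold_industry(root: str, industry: str) -> Path:
    """Create a minimal (hypothetical) scenario skeleton under industries/."""
    folder = Path(root) / "industries" / industry
    folder.mkdir(parents=True, exist_ok=True)
    scenario = {
        "id": f"{industry}-001",  # hypothetical field names throughout
        "title": "Describe the business scenario here",
        "tags": [],
    }
    path = folder / "scenario_001.json"
    path.write_text(json.dumps(scenario, indent=2))
    return path
```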
## 4) Validate Scenario Quality
Before sharing or running complex benchmarks, ensure your scenarios meet the AES standard:
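The AES lint rules themselves aren't reproduced in this tutorial, but a validation pass conceptually reduces to loading each scenario and reporting problems. A toy sketch — the required field names are assumptions, not the real schema:

```python
import json
from pathlib import Path

# Assumption: the minimal fields a scenario needs; the real AES rules differ.
REQUIRED_FIELDS = ("id", "title")


def validate_scenario(path: str) -> list[str]:
    """Return a list of problems found; an empty list means the file passed."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, json.JSONDecodeError) as exc:
        return [f"unreadable scenario file: {exc}"]
    return [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if field not in data
    ]
```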
## 5) Inspect the Output

### ✅ Replay the run trace

### ✅ View in the Visual Dashboard
Inspect results natively using the Visual Debugger. The suite provides a unified hub for the entire lifecycle:

- **Scenario Editor:** Design and save scenarios directly to the industry libraries.
- **Background Runner:** Trigger evaluations and monitor them via the UI.
- **Visual DNA Debugger:** Live trajectory playback, state inspection, and trace export.
- **Search:** Use the sidebar to search for scenarios by title or tags.
- **Quality Badges:** Look for the "Lint Score" on each scenario. Scenarios with a score of 90+ are considered high-fidelity benchmarks.
- **Documentation Drawer:** Click the "API Reference" icon to see these guides directly within the app.
## 6) Next Steps
- ✅ Read the User Manual (`docs/guides/help/02_USER_MANUAL.md`).
- 🔧 Read the Developer Guide (`docs/guides/help/03_DEVELOPER_GUIDE.md`) for adapters and plugins.
- 📂 **Path Decoupling (v1.1+):** You don't have to keep scenarios in `industries/`. You can now run `multiagent-eval evaluate --path <any_folder>` and it will work out of the box!
Happy evaluating!