Skip to content

Evaluation Guide

This guide explains the philosophy behind our evaluation scenarios and how to interpret them.

The Scenario Corpus

The harness ships with a production-grade corpus of 5,000+ scenarios across 45+ industries. This includes specialized high-stakes categories: - Cross-Industry: Inter-sector policy negotiation. - Ethical Guardrails: Hardened safety and PII tests. - Interactive Complexity: Multi-turn flows with HITL. - Simulations: High-fidelity lab environments.

Scenario Structure

Each evaluation is defined by a .json file in an industry's scenarios directory. The file has the following top-level keys:

  • scenario_id: A unique identifier for the scenario (e.g., telecom-cs-001).
  • title: A human-readable title.
  • description: A brief explanation of the overall goal of the scenario.
  • use_case: The specific business function being tested (e.g., Customer Service).
  • industry: The industry this scenario belongs to.
  • core_function: The category within the use case this scenario belongs to.
  • dataset: (Optional) Path and format of a synthetic or real dataset required for this scenario. Supports Path Decoupling (v1.1+): Relative paths (e.g., ./data.csv) are resolved relative to the scenario file itself, allowing for portable scenario bundles.
  • initial_state: (Optional) The starting state for the sandbox. Supports nested dictionaries with dot-notation verification.
  • policies: (Optional) Governance rules to enforce during execution.
  • tasks: An array of task objects that represent the steps an agent must take.

Task Structure

Each object in the tasks array represents a single step and contains:

  • task_id: A unique ID for the task within the scenario (e.g., task-1).
  • description: A clear description of what the agent needs to accomplish.
  • expected_outcome: A description of what a successful completion of the task looks like.
  • required_tools: A list of tool/API names that the agent is expected to use for this task.
  • expected_state_changes: (Optional) A list of state paths and values that should be true after the task.
  • success_criteria: An array defining how to measure success.

LLM-as-Judge & Rubrics

For semantic or safety-critical evaluations, you can use the luna_judge_score metric. This metric can be customized using a judge_config object within the criterion:

  • judge_rubric: Select a specialized rubric (e.g., clinical_safety, fiduciary_accuracy, policy_adherence).
  • judge_provider: Override the global judge model (e.g., openai, gemini).
  • required: Setting this to true (v1.1+) ensures that if the judge fails to initialize (e.g., missing API key), the evaluation terminates immediately with a clear error rather than falling back to weak heuristics.

A task is only considered successful if all of its success criteria are met.

State Verification

The state_verification metric now supports dot-notation for inspecting nested objects in the sandbox state. - Example: A path of user.profile.balance will correctly navigate through nested dictionaries in the actual_state to verify the final value.

Advanced Orchestration

  • HITL (Human-In-The-Loop): Pause evaluations for manual intervention via the human adapter.

Visual Evaluation & Debugging (Visual Debugger)

Beyond the CLI, the harness provides a Unified React SPA Visual Debugger for visual management: - Scenario Explorer: Browse the catalog with faceted filters and global search. - Visual AES Builder: Drag-and-drop integrated logic builder for complex flows. - Reports & Traces: Historical execution timeline with detailed analysis, discovered Agent Identity (names/models), and instant "View Report" navigation. - Visual Debugger: Real-time trajectory playback with interactive state inspection, human-readable agent labels, and trace export.

Launch with:

multiagent-eval console

Available Metrics

Metric Function Description
tool_call_correctness calculate_tool_call_correctness Exact set-match of expected vs. actual tools
state_verification calculate_state_correctness Verify persistent system state changes
policy_compliance calculate_policy_compliance Detect governance policy violations
delegation_latency calculate_delegation_latency Measures the 'Thinking Cost' of handoffs
delegation_loop_risk calculate_delegation_loop_risk Detects 'Infinite Re-planning' cycles
consensus_scoring calculate_consensus_scoring Semantic similarity judge for agent agreement
communication_clarity calculate_communication_clarity Checks summary length and semantic quality
factual_accuracy calculate_factual_accuracy LLM-based verification of agent's final answer
performance_efficiency calculate_efficiency Weighted score of turns taken vs. goal reached
security_guardrail calculate_guardrail_violation Detection of prompt injection or sensitive data leaks

Community Benchmarks

Instead of relying solely on local .json or .aes.yaml files, the multiagent-eval can now pull and format datasets from major community benchmarks on-the-fly using URIs.

# Load the 2023 GAIA validation set
multiagent-eval evaluate --path gaia://2023

# Load AssistantBench tasks
multiagent-eval evaluate --path assistantbench://v1
The universal loader will dynamically download these datasets, wrap them in compatible Scenario objects, and apply the correct specific evaluation metrics.

Multi-Agent Scenario Schema

Scenarios now support complex topologies. Example snippet:

{
  "version": "2.0.0",
  "agent_topology": {
    "agent_1": { "writes": ["ns1"], "reads": ["ns2"] }
  }
}

Schema Validation

Scenarios are validated against schemas/scenario.schema.json at load time. Any scenario that fails validation will raise a ValueError with a clear error message.