Architecture

The "Zero-Touch Core" is a design philosophy ensuring the central evaluation engine remains framework-agnostic. All industry-specific logic, communication protocols (adapters), and World Shims (Environment Simulators) are implemented as modular plugins.

Unified Command-Line Interface

The CLI (cli.py) now features Dynamic Discovery. Before parsing arguments, the harness triggers the on_discover_adapters hook across all registered plugins. This allows ecosystem-specific protocols like autogen:// or langgraph:// to be recognized as first-class citizens in the --protocol argument without hardcoding.
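A minimal sketch of how discovery before argument parsing could work. The `AdapterRegistry` class, its `register` method, and `LangGraphPlugin` are illustrative assumptions, not the harness's actual API; only the `on_discover_adapters` hook name, the `--protocol` flag, and the protocol names come from this document.

```python
import argparse


class AdapterRegistry:
    """Hypothetical registry mapping protocol names to adapter factories."""

    def __init__(self):
        # Built-in protocols named elsewhere in this document.
        self._adapters = {"http": object, "local": object, "socket": object}

    def register(self, protocol, factory):
        self._adapters[protocol] = factory

    def protocols(self):
        return sorted(self._adapters)


class LangGraphPlugin:
    """Hypothetical plugin that contributes a langgraph protocol."""

    def on_discover_adapters(self, registry):
        registry.register("langgraph", object)


def build_parser(plugins):
    registry = AdapterRegistry()
    # Discovery runs first, so plugin protocols become valid --protocol choices.
    for plugin in plugins:
        hook = getattr(plugin, "on_discover_adapters", None)
        if hook is not None:
            hook(registry)
    parser = argparse.ArgumentParser(prog="multiagent-eval")
    parser.add_argument("--protocol", choices=registry.protocols(), default="http")
    return parser


args = build_parser([LangGraphPlugin()]).parse_args(["--protocol", "langgraph"])
print(args.protocol)
```

Because the choices are computed after the hooks run, the parser itself never needs to know which frameworks exist.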

The unified --agent flag provides a single entry point for specifying target endpoints, which are routed to the appropriate adapter based on the metadata context. The harness also implements Zero-Touch Identity Discovery: it extracts human-readable agent names from response metadata (via session.py), or takes them from a manual --agent-name override, to enrich leaderboards and visual trajectories without framework-level changes.

High-Level Data Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                             CLI (eval_runner/cli.py)                        │
│  • evaluate / console / import-drift / aes validate / replay                │
└──────────────┬───────────────────┬────────────────────────┬─────────────────┘
               │                   │                        │
               ▼                   ▼                        ▼
┌───────────────────────────┐  ┌──────────────────────┐  ┌───────────────────────┐
│     Loader (loader.py)    │  │  AES Spec (/spec)    │  │  Drift (drift_imp...) │
│ • Universal Registry      │  │ • Schema Validation  │  │ • Production Traces   │
│ • JSON v2 / CSV / JSONL   │  │ • Portable Benchmarks│  │ • Scenario Conversion │
└──────────────┬────────────┘  └──────────┬───────────┘  └──────────┬────────────┘
               │                          │                         │
               └──────────────────────────┼─────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│                            Engine (eval_runner/engine.py)                    │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │     Multi-turn Conversation Loop (with hooks)                          │  │
│  │  [before_evaluation] -> [on_turn_end] -> [after_eval]                  │  │
│  └───────────────────────────────────┬────────────────────────────────────┘  │
│                                      │                                       │
│  ┌────────────────────────────┐       ▼        ┌──────────────────────────┐  │
│  │ Metrics (/metrics)         │◀─────────────▶│ Tool Sandbox (sandbox.py)│  │
│  │ • Modular Category Modules │                │ • Governance Policies    │  │
│  │ • High-Fidelity Judging    │                │ • SharedStateRegistry    │  │
│  └────────────────────────────┘                └──────────────────────────┘  │
└──────────────────────┬───────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                          Persistence & Reporting                            │
│                                                                             │
│  • run.jsonl (Flight Recorder): Deterministic, streamable execution logs    │
│  • trajectories/: Mermaid visual flows (Reconstructed from traces)          │
│  • triage.py: Heuristic failure tagging (CONNECTION_ERROR, etc.)            │
│  • coverage/: HTML grounding heatmaps                                       │
│                                       │                                     │
│  • catalog/: Optimized scenario indexing and faceted search                 │
│  • linter/: AES compliance and quality scoring logic                        │
│  • dashboard/: SPA Frontend (Integrated Visual Suite)                       │
└─────────────────────────────────────────────────────────────────────────────┘

Module Inventory

| Module | File | Purpose |
| --- | --- | --- |
| Integrated Visual Suite | eval_runner/console/ & ui/visual-debugger/ | Flask proxy API (Background Eval) and Unified React SPA |
| Rubrics | eval_runner/rubrics.py | Registry for industry-standard evaluation prompts |

EventEmitter Bus: Passive Observation

The core engine is built around a central EventEmitter (see eval_runner/events.py). Every state transition in the harness, from the start of a run to a tool call or an agent response, is emitted as an event. This allows plugins to observe the system's behavior without modifying the core logic.

Key Event Types:

  • RUN_START / RUN_END: lifecycle of the entire evaluation.
  • TASK_START / TASK_END: lifecycle of a single scenario task.
  • PROMPT: when the harness sends a request to the agent.
  • AGENT_RESPONSE: when the agent returns an action.
  • TOOL_CALL / TOOL_RESULT: execution of a sandbox tool.
  • HITL_PAUSE / HITL_RESUME: Human-In-The-Loop events.
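The passive-observation pattern can be sketched as a small publish/subscribe bus. This is an illustrative stand-in, not the class in eval_runner/events.py: the `on` and `emit` method names and the payload fields are assumptions, while the event type names are the ones listed above.

```python
from collections import defaultdict


class EventEmitter:
    """Minimal publish/subscribe bus (illustrative, not the harness's class)."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def on(self, event_type, listener):
        self._listeners[event_type].append(listener)

    def emit(self, event_type, payload):
        # Listeners receive a shallow copy so they cannot mutate the emitter's dict.
        for listener in list(self._listeners[event_type]):
            listener(event_type, dict(payload))


bus = EventEmitter()
seen = []
bus.on("TOOL_CALL", lambda kind, data: seen.append((kind, data["tool"])))

bus.emit("RUN_START", {"run_id": "demo"})        # no listener registered: ignored
bus.emit("TOOL_CALL", {"tool": "lookup_order"})  # observed by the subscriber
print(seen)
```

The emitter never inspects what listeners do with an event, which is what keeps observers such as a flight recorder decoupled from the engine.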

Plugin Lifecycle

Plugins (inheriting from BaseEvalPlugin) hook into specific stages of the evaluation loop. The PluginManager triggers these hooks synchronously, ensuring a deterministic execution order.

| Hook | Trigger Point | Use Case |
| --- | --- | --- |
| on_run_start | Before the evaluation starts | Setup monitoring or telemetry |
| on_tool_request | When an agent requests a tool | Interception, blocking, or masking |
| on_discover_adapters | During agent initialization | Register custom agent protocols |
| on_eval_complete | After metrics are calculated | Custom reporting or triage |
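As a rough illustration of synchronous, ordered hook dispatch, the sketch below shows a plugin vetoing a tool call through on_tool_request. The hook signatures, the boolean return convention, and the `allow_tool` method are assumptions made for the example; the real BaseEvalPlugin and PluginManager may differ.

```python
class BaseEvalPlugin:
    """Stand-in base class with no-op hooks (signatures are assumptions)."""

    def on_run_start(self, context):
        pass

    def on_tool_request(self, tool_name, arguments):
        return True  # True lets the call proceed


class BlockShellPlugin(BaseEvalPlugin):
    """Hypothetical plugin that blocks a tool named "shell"."""

    def on_tool_request(self, tool_name, arguments):
        return tool_name != "shell"


class PluginManager:
    def __init__(self, plugins):
        self._plugins = list(plugins)  # registration order is execution order

    def allow_tool(self, tool_name, arguments):
        # Hooks run synchronously, one after another; the first veto wins.
        for plugin in self._plugins:
            if plugin.on_tool_request(tool_name, arguments) is False:
                return False
        return True


manager = PluginManager([BaseEvalPlugin(), BlockShellPlugin()])
print(manager.allow_tool("search", {"q": "refund policy"}))
print(manager.allow_tool("shell", {"cmd": "ls"}))
```

Running hooks in a fixed order on a single thread is what makes the outcome repeatable from run to run.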

Module Inventory (continued)

| Module | File | Purpose |
| --- | --- | --- |
| Engine | eval_runner/engine.py | Minimal entry point for initializing the evaluation context |
| Runner | eval_runner/runner.py | Pluggable orchestration strategies (e.g., DefaultRunner for pass@k) |
| Session | eval_runner/session.py | Handles immutable turn-contexts and conversation state management |
| Event Hub | eval_runner/events.py | Centralized EventEmitter for decoupled, non-blocking observation |
| Plugin Manager | eval_runner/plugins.py | Robust lifecycle hooks and interception for Enterprise extensions |
| Tool Sandbox | eval_runner/tool_sandbox.py | Stateful mock executor with policy guardrails and observer signals |
| Reporting | eval_runner/reporting_plugin.py | Decoupled report generation (HTML/Console) and triage automation |
| Flight Recorder | eval_runner/flight_recorder.py | Passive event logger subscribing to the core event bus |
| Metrics | eval_runner/metrics/ | Modular, high-fidelity evaluators: Accuracy, Planning, and Defense |
| Simulators | eval_runner/simulators.py | World Shim suite (20+ simulators) for high-fidelity testing |
| Triage | eval_runner/triage.py | High-fidelity trajectory forensics and confidence-based root cause isolation |
| Visual Suite | ui/visual-debugger/ | React Flow powered dashboard for real-time trajectory analysis |
| Analyzer | eval_runner/analyzer.py | Proactive GitHub repo scanning and AES scenario scaffolding |
| Explainer | eval_runner/explainer.py | Heuristic-based trace diagnostics and root cause analysis |

Foundational Core: AES & Flight Recorder

Phase 1 establishes the "Standardized Evaluation" layer:

  • AES (Agent Eval Specification): A framework-agnostic YAML format defining agent tasks, expected states, and safety policies. It enables benchmark sharing across repositories.
  • run.jsonl (Flight Recorder): Every evaluation emits an append-only, deterministic log. This serves as the "source of truth" for replaying and debugging agent behavior without re-running the actual models.
  • Agent Crash Replayer: The replay CLI command reconstructs the agent's timeline from a run.jsonl file, enabling step-by-step inspection.
  • Scenario Editor: A visual drag-and-drop tool integrated into the Visual Suite for authoring and modifying AES logic without writing JSON.
  • Trace Explainer: High-fidelity root cause diagnostics with forensic reasoning (e.g., policy violations vs. induced errors) and confidence scoring (100% for violations, 85% for tool/logic errors, 50% for heuristic fallbacks).
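The append-only log and its replay can be sketched in a few lines. The record fields (`type`, `turn`, `action`) and the `record` and `replay` helpers are hypothetical; the document does not specify the run.jsonl schema, only that it is append-only, deterministic, and replayable.

```python
import json
import tempfile
from pathlib import Path


def record(path, event):
    # Append one JSON object per line; sort_keys keeps the output byte-stable.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")


def replay(path):
    # Yield events in the order they were written, without calling any model.
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


log_path = Path(tempfile.mkdtemp()) / "run.jsonl"
record(log_path, {"type": "RUN_START", "turn": 0})
record(log_path, {"type": "AGENT_RESPONSE", "turn": 1, "action": "lookup_order"})
record(log_path, {"type": "RUN_END", "turn": 1})

timeline = [event["type"] for event in replay(log_path)]
print(timeline)
```

Since each line is a complete JSON document, a partially written run can still be replayed up to its last full line.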

Semantic Bridge & Drift Management

Phase 2 focuses on operationalizing evaluation data:

  • Drift Management: The import-drift command creates a "Semantic Bridge" between production behavior and evaluation rigor, allowing developers to quickly capture and fix real-world edge cases.
  • Edge-Case Triage: A library of heuristics that automatically tags failed runs (e.g., POLICY_VIOLATION, CONNECTION_ERROR), reducing manual debugging time.
  • Grounding Coverage: Tracks the utilization of domain-specific tools and knowledge bases during execution, viewable as an HTML heatmap.

Advanced Orchestration: HITL & Branching

Phase 3 introduces advanced orchestration capabilities for research and complex production replay:

  • Native HITL (Human-In-The-Loop): The human adapter allows scenarios to pause and wait for human intervention. This is integrated directly into the SessionManager loop, emitting HITL_PAUSE and HITL_RESUME events.
  • Non-Linear Trajectories: SessionManager.fork() enables creators to explore multiple agent paths from a single checkpoint. This is essential for studying agent decision-making under ambiguity.
  • Universal Agent Adapters: The AgentAdapterRegistry allows switching between http, local (subprocess), and socket protocols. High-level metadata is propagated from the CLI to ensure the correct communication shim is used without scenario-level changes.
  • Advanced Adapter Discovery: The registry supports plugin-driven discovery. External plugins can register custom protocols (e.g., mock_proto, proprietary_rpc) using the on_discover_adapters hook.
  • Scenario Catalog & Intelligence: A centralized indexer (catalog.py) enables high-performance discovery across thousands of scenarios. It supports keyword search and powers the Visual Suite "Scenario Explorer".
  • AES Quality Linter: The linter.py module implements automated quality scoring, ensuring scenarios have required metadata, balanced task counts, and no duplicates.
  • Visual Debugger Hook: A dedicated DebuggerStateStore in the console backend captures live world state and tool signals via the EventEmitter for real-time UI synchronization.
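To make the forking idea concrete, here is a simplified sketch of branching from a checkpoint under the depth and breadth limits listed under Security Guardrails. The `Session` class and its `add_turn` method are hypothetical stand-ins; the real SessionManager.fork() signature is not shown in this document.

```python
from dataclasses import dataclass, field

MAX_FORK_DEPTH = 3    # values from the Security Guardrails table
MAX_FORK_BREADTH = 5


@dataclass
class Session:
    """Hypothetical session holding an immutable history tuple."""

    history: tuple = ()
    depth: int = 0
    children: list = field(default_factory=list)

    def add_turn(self, message):
        # Build a new tuple instead of mutating, so forks never share edits.
        self.history = self.history + (message,)

    def fork(self):
        if self.depth >= MAX_FORK_DEPTH:
            raise RuntimeError("fork depth limit reached")
        if len(self.children) >= MAX_FORK_BREADTH:
            raise RuntimeError("fork breadth limit reached")
        child = Session(history=self.history, depth=self.depth + 1)
        self.children.append(child)
        return child


root = Session()
root.add_turn("user: cancel my order")
branch_a = root.fork()
branch_b = root.fork()
branch_a.add_turn("agent: cancels immediately")
branch_b.add_turn("agent: asks for confirmation")
print(len(root.history), len(branch_a.history), branch_b.depth)
```

Each branch starts from the same checkpoint and then diverges, so two candidate agent decisions can be scored against identical prior context.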

Simulation Lab & Research Metrics

  • High-Fidelity Metrics: Decoupled framework with specialized modules for Calculation, Strategic Planning, and Causal Inference. Features robust numerical extraction and domain-specific LLM rubrics.
  • Research Metrics: Native support for pass@k (robustness across attempts) and Success Consistency. For multi-attempt evaluations, the harness generates a research_summary.md and an ASCII table capturing semantic stability and outcome variance.
  • Adversarial Red-Teaming: The mutator engine injects typos, prompt injection, and ambiguity into scenarios to test how well agents hold up on edge cases.
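For reference, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of attempts and c the number that passed. The document does not state which estimator the harness uses, so the sketch below is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k attempts drawn from n is a pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # 1 - (ways to draw k failures) / (ways to draw any k attempts)
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 attempts, 3 of which passed.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With k = 1 the estimator reduces to the plain success rate c / n; larger k rewards agents that succeed at least occasionally.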

Ecosystem, Benchmarks & Distribution

Phase 4 elevates the Harness from an isolated tool to an integrated participant in the open AI evaluation ecosystem:

  • Community Benchmark Integration: The harness natively supports downloading and structuring data from major AI benchmarks. Passing URIs like gaia://... to the loader transparently fetches and wraps the datasets into executable Scenario objects with compatible metrics.
  • HuggingFace Distribution: The HFExporter enables a one-click CLI flow (multiagent-eval export --format hf) to transform deterministic internal run.jsonl flight logs into normalized datasets ready for HuggingFace publication and leaderboards.
  • Framework Adapters via Plugins: Supporting frameworks like LangGraph, CrewAI, and Microsoft AutoGen (via autogen://) without "polluting" the core engine. These are implemented as modular BaseEvalPlugin classes that hook into the on_discover_adapters lifecycle to register their custom execution protocols.
  • Ecosystem Hub: A unified registry for LLM providers (OpenAI, Gemini, Claude, Ollama, xAI Grok) and orchestration frameworks. The Ecosystem Hub ensures the core evaluator remains "Zero-Touch": swapping a provider requires zero core code changes. The LLM judge is now configurable via the JUDGE_PROVIDER environment variable, with support for per-scenario judge_config overrides.
  • Industry-Standard Rubrics: The rubrics.py module provides a hot-swappable registry for clinical, fiduciary, and legal scoring logic, enabling precise evaluation without modifying engine code.
  • Judge Calibration: The calibrate command analyzes run.jsonl flight logs to measure alignment between automated judges and human ground truth, calculating Pearson Correlation and Error metrics.
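As an illustration of what judge calibration measures, the sketch below computes the Pearson correlation between judge and human scores, plus a mean absolute error. The scores are made up, and using mean absolute error as the "Error metric" is an assumption; the document does not say which error metrics the calibrate command reports.

```python
from math import sqrt


def pearson(xs, ys):
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def mean_absolute_error(xs, ys):
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


judge_scores = [0.9, 0.4, 0.7, 0.2]   # made-up automated judge scores
human_scores = [1.0, 0.5, 0.5, 0.0]   # made-up human ground truth

print(round(pearson(judge_scores, human_scores), 3))
print(round(mean_absolute_error(judge_scores, human_scores), 3))
```

A high correlation with a large error suggests the judge ranks responses like a human does but is offset or scaled differently, which is a calibration problem, not a ranking problem.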

Key Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| AGENT_API_URL | http://localhost:5001/execute_task | Agent endpoint |
| EVAL_MAX_TURNS | 5 | Max conversation turns per task |
| MAX_ENGINE_ATTEMPTS | 50 | Evaluation security cap |
| JUDGE_PROVIDER | ollama | Multi-model judge provider |
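One way to read these variables with their documented defaults is sketched below. The `load_settings` helper and the dictionary keys are hypothetical, and the `"openai"` provider value in the usage line is a guess at the accepted spelling.

```python
import os


def load_settings(environ=None):
    # Accept an explicit mapping so the function is easy to test.
    env = os.environ if environ is None else environ
    return {
        "agent_api_url": env.get("AGENT_API_URL", "http://localhost:5001/execute_task"),
        "max_turns": int(env.get("EVAL_MAX_TURNS", "5")),
        "max_engine_attempts": int(env.get("MAX_ENGINE_ATTEMPTS", "50")),
        "judge_provider": env.get("JUDGE_PROVIDER", "ollama"),
    }


settings = load_settings({"EVAL_MAX_TURNS": "8", "JUDGE_PROVIDER": "openai"})
print(settings["max_turns"], settings["judge_provider"], settings["max_engine_attempts"])
```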

Test Suite

15+ test files covering core engine, metrics, drift ingestion, and triage. Run with:

python -m pytest

Security Guardrails (Enterprise Audit)

The following mitigations are enforced at the core level:

| # | Threat | Mitigation | Location |
| --- | --- | --- | --- |
| 1 | DoS / CPU Exhaustion | MAX_ENGINE_ATTEMPTS = 50 hard cap | config.py |
| 2 | PII / Token Leakage | sanitize_payload() redacts JWT, AWS, GitHub, Bearer tokens and neutralizes format-string injection | events.py |
| 3 | CLI Command Hijacking | extend_cli removed; plugins use namespaced on_register_commands under multiagent-eval plugin <name> | cli.py, plugins.py |
| 4 | Plugin Halt (Hang) | All hooks wrapped in PLUGIN_TIMEOUT = 5.0s via _invoke_with_timeout() | config.py |
| 5 | Sandbox Escape | Chroot on emitted state keys and values; shell meta-characters (;, \|, &&, `) stripped | config.py |
| 6 | Fork Bomb | MAX_FORK_DEPTH = 3, MAX_FORK_BREADTH = 5 enforced in SessionManager | config.py |
| 7 | RCE via Repro Scripts | Scripts output as inert .txt; os.system/subprocess strings stripped | reporting_plugin.py |
| 8 | Prototype Pollution | EvaluationContext/TurnContext are frozen dataclasses; nested dicts wrapped in MappingProxyType; history stored as tuple | context.py |
| 9 | Plugin GUI Hijacking | JWT-based Secure Handoff (60s tokens) for all Enterprise routes | auth.py, App.jsx |
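Mitigation 8 can be illustrated with standard-library tools. The field names and the `make_context` helper below are assumptions; the actual TurnContext definition in context.py is not shown in this document.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from types import MappingProxyType


@dataclass(frozen=True)
class TurnContext:
    """Illustrative immutable context (field names are assumptions)."""

    turn: int
    metadata: MappingProxyType = field(default_factory=lambda: MappingProxyType({}))
    history: tuple = ()


def make_context(turn, metadata, history):
    # Copy the dict before wrapping so later caller mutations do not leak in.
    return TurnContext(turn, MappingProxyType(dict(metadata)), tuple(history))


ctx = make_context(1, {"agent": "demo"}, ["user: hi"])

try:
    ctx.turn = 2
except FrozenInstanceError:
    print("attribute assignment rejected")

try:
    ctx.metadata["agent"] = "evil"
except TypeError:
    print("metadata mutation rejected")
```

Freezing the dataclass blocks attribute reassignment, the read-only mapping view blocks in-place edits to nested data, and the tuple prevents a plugin from rewriting history.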

Secure GUI Handoff Architecture

The Integrated Visual Suite uses a Token-Exchange Protocol to ensure Enterprise features are never exposed to unauthorized web requests:

  1. Discovery: The Flask backend exposes core and plugin routes via /api/nav with metadata (type: internal | external | component).
  2. Handoff: For internal routes, the React SPA uses standard routing. For component types, it renders a Sandboxed Iframe.
  3. Security: Plugin iframes are constrained via sandbox="allow-scripts allow-forms allow-popups" to prevent top-level session hijacking.
  4. Communication: Components communicate with the core UI via window.postMessage using origin-validated listeners for sanctioned actions like NOTIFY.

Visual Suite: React Flow Implementation

The Visual Suite has been fully migrated to React Flow, enabling:

  • High-Density Trajectories: Fluid zoom and pan for 100+ node traces.
  • Glassmorphic UI: Premium aesthetics with real-time tool overlays.
  • Auto-Centering: Instant focus on tool-calls or identified root cause failure points.
  • Interactive State Inspection: Deep dive into the VFS sandbox at any turn.

High-Fidelity Triage Implementation

The TriageEngine uses a multi-layered forensic approach to identify the root cause of agent failures:

  1. Level 1: Explicit Violations (100% Confidence): Direct policy breaches or evaluation plugin markers.
  2. Level 2: Induced Errors (85% Confidence): Tracing back from system/tool exceptions to the preceding agent decision.
  3. Level 3: Heuristic Divergence (50% Confidence): Identifying the last substantive decision before a run's failure signature.

Every identification includes a reason and confidence score, providing transparency into the automatic diagnostics process.
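The three levels can be sketched as a cascade that returns the first match along with its reason and confidence. This is a simplified illustration, not the TriageEngine itself: the `triage` function, the trace layout, and the `TOOL_ERROR` event type are hypothetical.

```python
def triage(events):
    """Return (index, reason, confidence) for the most likely root cause."""
    # Level 1: an explicit policy violation is treated as certain.
    for index, event in enumerate(events):
        if event.get("type") == "POLICY_VIOLATION":
            return index, "explicit policy violation", 1.0

    # Level 2: walk back from a tool error to the preceding agent decision.
    for index, event in enumerate(events):
        if event.get("type") == "TOOL_ERROR":
            for prior in range(index - 1, -1, -1):
                if events[prior].get("type") == "AGENT_RESPONSE":
                    return prior, "agent decision preceding tool error", 0.85

    # Level 3: fall back to the last agent decision in the run.
    for index in range(len(events) - 1, -1, -1):
        if events[index].get("type") == "AGENT_RESPONSE":
            return index, "last substantive decision before failure", 0.5
    return None


trace = [
    {"type": "AGENT_RESPONSE", "action": "search"},
    {"type": "AGENT_RESPONSE", "action": "delete_record"},
    {"type": "TOOL_ERROR", "detail": "record not found"},
    {"type": "AGENT_RESPONSE", "action": "give_up"},
]
print(triage(trace))
```

Here the tool error is traced back to the `delete_record` decision at index 1, so the result carries the Level 2 confidence of 0.85.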