Skip to content

📦 API Reference: Core Evaluator

The eval_runner core provides the main orchestration logic for agentic evaluations. It is built on a Zero-Touch architecture that uses dynamic registries and plugin hooks for extensibility.


🚀 Core Engine API

engine.run_evaluation

Executes a single evaluation scenario. This is the primary entry point for the harness.

Signature:

async def run_evaluation(
    scenario: dict, 
    attempts: int = 1, 
    metadata: Optional[dict] = None
) -> Union[dict, list]

Parameters: - scenario: The Scenario dictionary (loaded via loader.load_scenario). - attempts: Number of attempts (K) for pass@k scoring. - metadata: Optional dictionary of run-level metadata (e.g., job ID, git commit).

Returns: - If attempts == 1: Returns a single EvaluationResult dictionary. - If attempts > 1: Returns a list of EvaluationResult dictionaries.


AgentAdapterRegistry

Manages agent communication protocols. Plugins can register custom adapters via the on_discover_adapters hook.

Registration:

from eval_runner.engine import AgentAdapterRegistry

async def my_custom_adapter(payload: dict, endpoint: str):
    # Your communication logic here...
    return {"action": "final_answer", "summary": "Success"}

AgentAdapterRegistry.register("my-protocol", my_custom_adapter)

Standard Protocols: - http: Communicates with a JSON REST API. - local: Communicates with a subprocess via Stdin/Stdout. - socket: Communicates via persistent TCP/Unix sockets. - human: Interactive HITL (Human-In-The-Loop) adapter.


📂 Scenario Loading API

loader.load_scenario

Loads scenarios from local files or remote Benchmark URIs.

Signature:

def load_scenario(path: Union[str, Path]) -> Union[dict, list]

Key Features: - Benchmark URIs: Supports gaia://[split], assistantbench://[split], etc. - Path Decoupling: If dataset.path in the scenario is relative (e.g., ./data.csv), it is automatically resolved relative to the scenario file location. - Schema Validation: All loaded JSON scenarios are validated against the official scenario.schema.json.


loader.load_dataset

Batch loads scenarios from a file or directory.

Signature:

def load_dataset(file_path: Union[str, Path], format_type: Optional[str] = None) -> List[dict]
- Directories: Passing a directory path triggers a recursive search for all .json files. - Formats: Auto-detects .json, .jsonl, and .csv.


⚖️ Metrics API

MetricRegistry

A central registry for all evaluation metrics. Metrics are stateless functions that take expected vs actual data and return a float (0.0 to 1.0).

Example:

from eval_runner.metrics import MetricRegistry
metric_func = MetricRegistry.get("state_verification")
score = metric_func(expected_changes, actual_state)

Core Metric Catalog

Metric Key Function Signature Description
tool_call_correctness (expected: list, actual: list) Exact set-match of tool names.
state_verification (expected: list, actual: dict) Parity check using dot-notation (e.g., user.id).
luna_judge_score (criterion: dict, context: dict) Async semantic similarity using an LLM.
policy_compliance (history: list) Detects "status": "policy_violation" in trajectory.
pii_safety (criterion: dict, summary: str) Detects leaked emails or phone numbers.
delegation_loop_risk (agent_sequence: list) Detects infinite cycles in multi-agent handoffs.
refusal_calibration (criterion: dict, summary: str) Measures if the agent refused correctly.

🔌 Plugin & Event System

The core broadcasts events via plugins.manager.trigger. You can register a BaseEvalPlugin subclass to intercept: - on_discover_adapters: Register new agent protocols. - on_register_commands: Add custom CLI subcommands. - on_agent_turn_start: Stream real-time state to the Visual Debugger. - on_tool_result: Capture data for grounding heatmaps.