Agent API Specification¶

This document describes the API contract between the MultiAgentEval harness and the AI agent under test.

Endpoint¶

POST /execute_task

Request Body¶

| Field | Type | Required | Description |
|---|---|---|---|
| `task_description` | string | ✅ | The task for the agent to perform |
| `turn` | integer | ✅ | Current turn number in the conversation |
| `conversation_history` | array | ❌ | Array of previous turns (`{role, content}`) |
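
As a rough illustration of the request contract above, here is a minimal agent endpoint sketched with Python's standard `http.server`. The names `handle_task`, `AgentHandler`, and `serve`, the port number, and the use of HTTP 400 for malformed requests are all assumptions made for this sketch; the specification itself only defines the path and the request fields.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_task(payload: dict) -> dict:
    """Hypothetical agent logic: check required fields, then answer."""
    # task_description and turn are the two required request fields.
    for field in ("task_description", "turn"):
        if field not in payload:
            raise ValueError(f"missing required field: {field}")
    history = payload.get("conversation_history", [])  # optional field
    return {
        "action": "final_answer",
        "summary": f"Turn {payload['turn']}: saw {len(history)} earlier turn(s).",
    }


class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/execute_task":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = handle_task(json.loads(self.rfile.read(length)))
        except ValueError as exc:  # also covers json.JSONDecodeError
            # Assumption: the spec does not say how to report bad requests.
            self.send_error(400, str(exc))
            return
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 5002) -> None:
    """Blocks forever; the port is an arbitrary choice for this sketch."""
    HTTPServer(("127.0.0.1", port), AgentHandler).serve_forever()
```

Calling `serve()` starts the agent locally; `handle_task` can also be exercised directly without any networking.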

Agent Identity (Name Discovery)¶

The harness supports Zero-Touch Identity Discovery. Agents can optionally identify themselves by including a name or agent_name field in their response. This is highly recommended for leaderboards and visual reporting.

The harness discovers the name using the following priority:

1.  Top-level fields: name or agent_name.
2.  Nested Metadata: metadata.name or metadata.agent_name.
3.  Model Identity: metadata.model (commonly used by LLM-direct adapters).

If no name is discovered, the harness falls back to the manual --agent-name CLI flag or the endpoint URL.
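
The discovery order can be sketched as a simple first-match lookup. This is not the harness's actual implementation: the function name `discover_agent_name` and its parameters are hypothetical, and checking `name` before `agent_name` within a tier is an assumption, since the specification does not order fields inside a tier.

```python
from typing import Optional


def discover_agent_name(
    response: dict,
    cli_name: Optional[str] = None,
    endpoint_url: str = "",
) -> str:
    """Return the first non-empty name found, following the documented priority."""
    metadata = response.get("metadata")
    if not isinstance(metadata, dict):
        metadata = {}
    candidates = [
        response.get("name"),        # 1. top-level fields
        response.get("agent_name"),
        metadata.get("name"),        # 2. nested metadata
        metadata.get("agent_name"),
        metadata.get("model"),       # 3. model identity
        cli_name,                    # fallback: --agent-name CLI flag
    ]
    for candidate in candidates:
        if isinstance(candidate, str) and candidate.strip():
            return candidate.strip()
    return endpoint_url              # last resort: the endpoint URL
```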

Response Body¶

The agent must return a JSON object with an action field indicating what it did:

`"action": "call_tool"` — Single tool call¶

{
  "action": "call_tool",
  "tool_name": "get_customer_details",
  "tool_params": {"customer_id": "cust_123"},
  "tool_output": { ... },
  "summary": "Identified customer Jane Doe on 100 Mbps plan."
}

`"action": "call_multiple_tools"` — Multiple tool calls¶

{
  "action": "call_multiple_tools",
  "tool_names": ["run_line_test", "run_remote_speed_test"],
  "tool_outputs": [{ ... }, { ... }],
  "summary": "Remote diagnostics complete."
}

`"action": "final_answer"` — Task complete¶

{
  "action": "final_answer",
  "summary": "The issue is resolved. Customer Wi-Fi was on a congested channel."
}

`"action": "provide_instructions"` — Instructions to user¶

{
  "action": "provide_instructions",
  "instructions": "Please connect via Ethernet and run a speed test.",
  "summary": "Guided customer to perform a local speed test."
}

`"action": "hitl_pause"` — Request Human Intervention¶

Tells the harness to pause execution and wait for human input. The harness will emit a HITL_PAUSE event.

{
  "action": "hitl_pause",
  "reason": "Requesting manual credit override beyond $500 limit."
}

`"action": "branch"` — Fork Trajectory¶

Signals the SessionManager to create a checkpoint and explore a new path.

{
  "action": "branch",
  "metadata": {"reason": "Testing alternative resolution path A"}
}
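
An agent author may want to sanity-check responses before sending them. The sketch below checks the `action` value and the fields that appear in the examples above. Treating those fields as required is an assumption, as is the helper name `check_response`; the `error` action is listed because the multi-turn section mentions it, although no example payload is given for it.

```python
from typing import Dict, List, Tuple

# Fields taken from the example payloads; whether the harness
# strictly requires each of them is an assumption of this sketch.
EXAMPLE_FIELDS: Dict[str, Tuple[str, ...]] = {
    "call_tool": ("tool_name", "tool_params"),
    "call_multiple_tools": ("tool_names",),
    "final_answer": ("summary",),
    "provide_instructions": ("instructions",),
    "hitl_pause": ("reason",),
    "branch": (),
    "error": (),  # mentioned under Multi-turn Flow, no example given
}


def check_response(response: dict) -> List[str]:
    """Return a list of problems; an empty list means the response looks fine."""
    action = response.get("action")
    if action not in EXAMPLE_FIELDS:
        return [f"unknown action: {action!r}"]
    return [
        f"{action}: missing {field!r}"
        for field in EXAMPLE_FIELDS[action]
        if field not in response
    ]
```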

Multi-turn Flow¶

The harness supports multi-turn conversations. When the agent returns a call_tool or call_multiple_tools action, the harness executes the tool(s) via the Tool Sandbox and sends the result back as the next task_description:

Turn 1: Harness → Agent: "Identify the customer..."
         Agent → Harness: {"action": "call_tool", "tool_name": "get_customer_details", ...}
Turn 2: Harness → Agent: "Tool 'get_customer_details' returned: {...}. Continue."
         Agent → Harness: {"action": "final_answer", "summary": "Customer identified."}

The loop ends when the agent returns a final_answer, provide_instructions, or error action, or when the max turn limit (default: 5) is reached.
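
To make the turn loop concrete, here is a simplified sketch of how a harness might drive an agent. It is not the real harness code: `run_episode`, the `agent` and `run_tool` callables, and the `user`/`assistant` role labels in the history are hypothetical, and only `call_tool` is modeled (any other non-terminal action simply stops the sketch).

```python
import json
from typing import Callable, Dict, List

TERMINAL_ACTIONS = {"final_answer", "provide_instructions", "error"}


def run_episode(
    agent: Callable[[dict], dict],
    run_tool: Callable[[str, dict], dict],
    task: str,
    max_turns: int = 5,
) -> List[dict]:
    """Drive the agent until a terminal action or the turn limit; return all responses."""
    responses: List[dict] = []
    history: List[Dict[str, str]] = []
    description = task
    for turn in range(1, max_turns + 1):
        response = agent({
            "task_description": description,
            "turn": turn,
            "conversation_history": list(history),
        })
        responses.append(response)
        history.append({"role": "user", "content": description})
        history.append({"role": "assistant", "content": json.dumps(response)})
        action = response.get("action")
        if action in TERMINAL_ACTIONS or action != "call_tool":
            break  # terminal action, or an action this sketch does not model
        # Execute the tool and feed the result back as the next task_description.
        name = response["tool_name"]
        output = run_tool(name, response.get("tool_params", {}))
        description = f"Tool '{name}' returned: {json.dumps(output)}. Continue."
    return responses
```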

Policy Violations (Governance feedback)¶

If the agent attempts to call a tool in a way that violates a governance policy (e.g., exceeding a refund limit), the harness will return a "status": "policy_violation" in the environment message, allowing the agent to self-correct:

Turn 1: Harness → Agent: "Process a $100 refund..."
         Agent → Harness: {"action": "call_tool", "tool_name": "apply_refund", "tool_params": {"amount": 100}}
Turn 2: Harness → Agent: "GOVERNANCE ERROR: Amount 100 exceeds maximum allowed limit of 50. Please adjust."
         Agent → Harness: {"action": "call_tool", "tool_name": "apply_refund", "tool_params": {"amount": 50}}
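
One way an agent could react to such feedback is to read the limit out of the message and retry with a clamped amount. The sketch below assumes the feedback text always follows the wording of the example above ("... limit of N"), which the specification does not guarantee; `next_refund_action` is a hypothetical helper.

```python
import re

# Assumption: the limit appears as "limit of <number>" after "GOVERNANCE ERROR:".
LIMIT_RE = re.compile(r"GOVERNANCE ERROR:.*?limit of (\d+(?:\.\d+)?)")


def next_refund_action(task_description: str, requested: float) -> dict:
    """Build an apply_refund call, clamped to the limit if the harness reported one."""
    amount = requested
    match = LIMIT_RE.search(task_description)
    if match:
        amount = min(requested, float(match.group(1)))
    return {
        "action": "call_tool",
        "tool_name": "apply_refund",
        "tool_params": {"amount": amount},
    }
```

A more robust agent would likely key off the `policy_violation` status rather than parsing free text, if that field is available in the message it receives.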

---

## 💻 Local Subprocess Protocol

When using `--protocol local`, the harness communicates via **Standard I/O**.

1.  **Request**: Harness writes a single-line JSON payload to the agent's `stdin`.
2.  **Response**: Agent must write a single-line JSON response to `stdout`.
3.  **Logs**: Anything the agent writes to `stderr` is captured and emitted as an engine log.

### Example Agent (Python)
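```python
import json
import sys

# Read one single-line JSON request per line from stdin and answer on stdout.
for line in sys.stdin:
    if not line.strip():
        continue  # ignore blank lines
    payload = json.loads(line)
    # Process payload["task_description"] here...
    print(json.dumps({"action": "final_answer", "summary": "Done"}), flush=True)
```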

🔌 Socket Protocol¶

When using --protocol socket, the harness opens a persistent connection:

- TCP: host:port
- Unix: /path/to/socket

Payloads are exchanged as JSON strings followed by a newline \n. The harness ensures persistent connection stability during multi-attempt evaluations.
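
The newline-delimited framing can be sketched from the agent's side as follows. The helper names `send_json` and `serve_connection` are hypothetical, and how the agent obtains the connected socket (listening versus connecting) is left open because the specification does not state it here.

```python
import json
import socket
from typing import Callable


def send_json(sock: socket.socket, payload: dict) -> None:
    """Write one JSON document followed by a newline."""
    # json.dumps without indent never emits raw newlines, so one line = one message.
    sock.sendall((json.dumps(payload) + "\n").encode("utf-8"))


def serve_connection(conn: socket.socket, handler: Callable[[dict], dict]) -> None:
    """Answer each newline-terminated JSON request until the peer disconnects."""
    with conn, conn.makefile("r", encoding="utf-8") as reader:
        for line in reader:
            if line.strip():
                send_json(conn, handler(json.loads(line)))
```

Because the connection is persistent, the same `serve_connection` loop handles every turn of every attempt until the harness closes the socket.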

🏛 Benchmark URIs (Community Integration)¶

The harness natively supports evaluating against major research benchmarks using custom URI schemes:

gaia://[split]: Loads scenarios from the GAIA dataset (e.g., gaia://2023_all).
assistantbench://[split]: Loads scenarios from AssistantBench (e.g., assistantbench://test).

These URIs are handled by loader.py, which transparently wraps the external data into the standardized AES format with multi-turn metric support.

🔗 Ecosystem Hub Payloads¶

When using Ecosystem Adapters (for example openai://, gemini://, claude://), the harness transparently maps the AES scenario into provider-specific payloads. The return object follows the same action structure as the standard POST request.

Example: Gemini Adapter Payload¶

{
  "task": "Process user request...",
  "messages": [{"role": "user", "content": "..."}],
  "model": "gemini-1.5-pro",
  "temperature": 0.7
}

Example: Grok Adapter Payload¶

{
  "task": "Process user request...",
  "model": "grok-beta",
  "temperature": 0.0
}

Example: AutoGen Adapter Payload¶

{
  "task_description": "...",
  "url": "http://localhost:5002/execute_task",
  "conversation_history": [...]
}

Scenario-Level Judge Configuration¶

The luna_judge_score metric can be customized per-scenario or per-criterion using the judge_config object. This allows for granular control over the evaluation model and scoring rubrics.

`judge_config` Schema¶

| Field | Type | Default | Description |
|---|---|---|---|
| `judge_provider` | string | `.env` default | Model provider (e.g., `openai`, `gemini`, `ollama`) |
| `judge_model` | string | `.env` default | Specific model ID (e.g., `gpt-4.1`, `claude-3-sonnet`) |
| `judge_temperature` | float | `0.0` | Randomness of the judge (higher = less predictable) |
| `judge_rubric` | string | `generic` | The named rubric to use (see below) |

Built-in Rubrics¶

The following industry-standard rubrics are available out-of-the-box:

- clinical_safety: Healthcare-specific safety and HIPAA compliance check.
- fiduciary_accuracy: Financial advice and numerical correctness audit.
- policy_adherence: Legal disclosure and boundary enforcement check.
- factual_grounding: Evidence-based grounding and hallucination detection.
- generic: Standard semantic similarity score.

Usage Example (Scenario JSON)¶

{
  "scenario_id": "clinical_trial_summary",
  "criteria": [
    {
      "metric": "luna_judge_score",
      "threshold": 0.9,
      "judge_config": {
        "judge_provider": "openai",
        "judge_model": "gpt-4o",
        "judge_rubric": "clinical_safety"
      }
    }
  ]
}
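
How the effective judge settings might be resolved can be sketched as a layered merge. This is an illustration only: `resolve_judge_config` and its parameters are hypothetical, and the precedence used here (per-criterion over per-scenario over `.env` defaults over built-in defaults) is an assumption that the specification does not spell out.

```python
from typing import Optional

# Defaults documented in the schema table; provider and model come from .env.
BUILT_IN_DEFAULTS = {"judge_temperature": 0.0, "judge_rubric": "generic"}


def resolve_judge_config(
    criterion: dict,
    scenario_config: Optional[dict] = None,
    env_defaults: Optional[dict] = None,
) -> dict:
    """Merge judge settings; later layers override earlier ones."""
    resolved = dict(BUILT_IN_DEFAULTS)
    resolved.update(env_defaults or {})       # e.g. provider/model read from .env
    resolved.update(scenario_config or {})    # scenario-level judge_config
    resolved.update(criterion.get("judge_config", {}))  # per-criterion override
    return resolved
```

For the scenario shown above, a criterion that sets only provider, model, and rubric would still end up with `judge_temperature` at its documented default of `0.0`.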

Visual Debugger Integration¶

The Visual Debugger (multiagent-eval console) uses this REST API as its backbone. Enterprise plugins can extend this contract via the on_register_console_routes hook to inject custom monitoring or debugging endpoints into the React dashboard.