Ecosystem Verification Guide (Comprehensive)¶
This guide provides step-by-step instructions to manually verify all third-party integrations, including adapters and plugins, to ensure "Partner Production-Ready" status.
🚦 1. Prerequisites & Global Setup¶
[!IMPORTANT] Python Version: Use Python 3.14 (Stable). While 3.14 is the current stable baseline (released Oct 2025), some downstream dependencies (such as `jiter` for `openai`) may still require a local Rust compiler to build on certain Windows configurations if pre-built wheels are not detected.
Troubleshooting: Build Failures (Rust & C++)¶
If you see error: subprocess-exited-with-error mentioning Cargo/Rust or cl.exe:
1. Rust Build Failure (e.g., jiter)¶
- Install Rust: Download and run `rustup-init.exe` from rustup.rs.
- Select Default: Use the default installation (`x86_64-pc-windows-msvc`).
- Restart Terminal: Ensure `cargo` is in your PATH.
2. C++ Extension Build Failure (e.g., regex)¶
This occurs when a package lacks a pre-built wheel for Python 3.14 and requires local C/C++ compilation.
1. Install MSVC Build Tools: Download Visual Studio Build Tools.
2. Select Workload: In the installer, check "Desktop development with C++".
3. Verify Kits: Ensure "Windows 10/11 SDK" and "MSVC v14x - VS 2022 C++ x64/x86 build tools" are selected in the side panel.
4. Restart Terminal and retry: `pip install -e .`
Before verifying individual integrations, ensure the harness is installed in editable mode (`pip install -e .`).
Configure Environment Variables¶
Create or update your .env file with the following keys:
```
# Proprietary Models
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=ant-api-...
GEMINI_API_KEY=...
XAI_API_KEY=...

# Ecosystem Defaults
OLLAMA_HOST=http://localhost:11434
AUTOGEN_API_URL=http://localhost:5002/query
```
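The harness presumably loads these keys at startup (e.g., via a dotenv-style loader); exactly how it does so is not specified here, but a minimal stdlib sketch of the `KEY=VALUE` parsing such a loader relies on looks like this (`parse_env` is an illustrative helper, not a harness function):

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """\
# Ecosystem Defaults
OLLAMA_HOST=http://localhost:11434
"""
print(parse_env(sample)["OLLAMA_HOST"])  # http://localhost:11434
```

Keys with empty placeholder values (e.g. `GEMINI_API_KEY=...`) parse fine but will still fail live API calls until real credentials are supplied.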
🧠 2. Local LLM Verification (Ollama)¶
Goal: Verify that the harness can communicate with local models and use them for evaluation/judging.
- Install Ollama: Download from ollama.com.
- Pull Model:
- Verify Adapter:
- Verify Luna-Judge (Ollama fallback):
    - Ensure `JUDGE_PROVIDER=ollama` in `.env`.
    - Run: `multiagent-eval evaluate --path scenarios/luna_demo.json`
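Before running the adapter, it can help to confirm that Ollama is reachable at the `OLLAMA_HOST` from `.env`. One option is to query Ollama's standard `/api/tags` endpoint, which lists locally pulled models; a minimal sketch (the helper names are illustrative, not part of the harness):

```python
import json
import urllib.request

def parse_model_names(tags_json: bytes) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    payload = json.loads(tags_json)
    return [m["name"] for m in payload.get("models", [])]

def list_local_models(host: str = "http://localhost:11434") -> list[str]:
    """GET /api/tags from a running Ollama server and return the model names."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
        return parse_model_names(resp.read())

# With Ollama running locally:
#   print(list_local_models())  # e.g. ['llama3:latest', ...]
```

An empty list means Ollama is up but no model has been pulled yet; a connection error means the server is not running or `OLLAMA_HOST` is wrong.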
🛠 3. Framework Integrations (Adapters)¶
A. Microsoft AutoGen¶
Goal: Verify the autogen:// protocol and multi-agent interaction.
- Install Dependencies:
- Run Sample AutoGen Agent: (Requires a separate server script or the `quickstart` mock.)
- Run Evaluation:
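If the quickstart mock is unavailable, a throwaway stand-in for the server behind `AUTOGEN_API_URL` can be sketched with the stdlib. The request/response JSON shape below is an assumption for illustration, not the harness's actual contract:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class MockAutoGenHandler(BaseHTTPRequestHandler):
    """Answers POST /query with a canned JSON reply (payload shape is illustrative)."""

    def do_POST(self):
        if self.path != "/query":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"response": f"echo: {request.get('query', '')}"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep console output quiet
        pass

# To serve on the .env default (AUTOGEN_API_URL=http://localhost:5002/query):
#   HTTPServer(("localhost", 5002), MockAutoGenHandler).serve_forever()
```

Run it in a separate terminal before starting the evaluation so the `autogen://` adapter has something to talk to.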
B. LangChain / LangGraph¶
Goal: Verify RemoteRunnable and state-aware graph adapters.
- Install Dependencies:
- Verify LangChain Adapter:
- Verify LangGraph (Structural):
    - Since the LangGraph adapter is currently an Architectural Mock, verification confirms the engine's ability to route requests through the plugin hook without hardcoding.
    - Run: `multiagent-eval evaluate --path industries/telecom --protocol langgraph`
C. CrewAI¶
Goal: Verify task-based multi-agent orchestration.
- Install Dependencies:
- Verify Adapter (Structural Mock):
    - Verification confirms the lifecycle hook `on_discover_adapters` correctly registers the `crewai://` protocol.
    - Run: `multiagent-eval evaluate --path industries/telecom --protocol crewai`
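The structural check above hinges on the `on_discover_adapters` lifecycle hook. As a rough mental model (the hook name comes from this guide; the registry class, method names, and plugin shape are assumptions), such a hook could register a protocol scheme like this:

```python
class AdapterRegistry:
    """Maps protocol schemes like 'crewai' to adapter factories (illustrative)."""

    def __init__(self):
        self._adapters = {}

    def register(self, scheme: str, factory):
        self._adapters[scheme] = factory

    def resolve(self, agent_uri: str):
        """Route 'crewai://my_crew' to the factory registered for 'crewai'."""
        scheme, _, target = agent_uri.partition("://")
        return self._adapters[scheme](target)

class CrewAIPlugin:
    """Toy plugin: registers crewai:// via the on_discover_adapters hook."""

    def on_discover_adapters(self, registry: AdapterRegistry):
        registry.register("crewai", lambda target: f"CrewAIAdapter({target})")

registry = AdapterRegistry()
CrewAIPlugin().on_discover_adapters(registry)
print(registry.resolve("crewai://telecom_crew"))  # CrewAIAdapter(telecom_crew)
```

The point of the structural verification is exactly this: the engine resolves `crewai://` through whatever the plugin registered, with no framework-specific branch hardcoded in the engine.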
💎 4. Proprietary Models (Production Verification)¶
Verify that the `openai`, `claude`, `gemini`, and `grok` adapters are production-ready using live API keys.
| Provider | Protocol | Verification Command |
|---|---|---|
| OpenAI | `openai://` | `multiagent-eval run --protocol openai --agent openai://gpt-4-turbo` |
| Anthropic | `claude://` | `multiagent-eval run --protocol claude --agent claude://claude-3-5-sonnet-20240620` |
| Google | `gemini://` | `multiagent-eval run --protocol gemini --agent gemini://gemini-1.5-pro` |
| xAI | `grok://` | `multiagent-eval run --protocol grok --agent grok://grok-beta` |
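Each row pairs a `--protocol` flag with an agent URI in the matching scheme. Before spending live API credits, a quick sanity check of that pairing can be scripted (`scheme_matches` is a purely illustrative helper, not harness code):

```python
# Protocol/agent pairs taken from the verification table above.
COMMANDS = [
    ("openai", "openai://gpt-4-turbo"),
    ("claude", "claude://claude-3-5-sonnet-20240620"),
    ("gemini", "gemini://gemini-1.5-pro"),
    ("grok", "grok://grok-beta"),
]

def scheme_matches(protocol: str, agent_uri: str) -> bool:
    """True when the agent URI's scheme equals the --protocol value."""
    scheme, sep, _ = agent_uri.partition("://")
    return sep == "://" and scheme == protocol

for protocol, agent in COMMANDS:
    assert scheme_matches(protocol, agent), (protocol, agent)
print("all protocol/agent pairs consistent")
```

A mismatched pair (e.g. `--protocol openai` with a `claude://` agent) is a common copy-paste error during this step.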
🔌 5. Plugin Verification¶
A. RemoteBridgePlugin (Live Debugger)¶
- Launch the console: `multiagent-eval console`.
- Run an evaluation.
- Observe real-time state updates in the Visual Debugger tab. This verifies the `on_agent_turn_start` and `on_turn_end` hooks.
C. CoveragePlugin (Grounding Heatmaps)¶
- Run: `multiagent-eval evaluate --path industries/telecom`.
- Check `reports/coverage/`: Verify that `telecom_coverage.html` is generated. This ensures the `on_tool_result` hook is correctly capturing grounding events.
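The heatmap depends on the `on_tool_result` hook capturing grounding events. A toy sketch of the kind of accumulator such a plugin might maintain (the hook name comes from this guide; the class, payload fields, and paths are assumptions):

```python
from collections import Counter

class CoverageAccumulator:
    """Counts which knowledge-base documents tool results actually grounded on."""

    def __init__(self):
        self.hits = Counter()

    def on_tool_result(self, tool_name: str, result: dict):
        # Assumed payload: each tool result lists the source documents it used.
        for doc in result.get("source_documents", []):
            self.hits[doc] += 1

    def report(self) -> dict:
        return dict(self.hits)

acc = CoverageAccumulator()
acc.on_tool_result("kb_lookup", {"source_documents": ["telecom/billing.md"]})
acc.on_tool_result("kb_lookup", {"source_documents": ["telecom/billing.md", "telecom/roaming.md"]})
print(acc.report())  # {'telecom/billing.md': 2, 'telecom/roaming.md': 1}
```

Documents that never appear in any tool result would show up as cold spots in the rendered heatmap.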
⚖️ 6. Judge Layer & Calibration Verification¶
Goal: Confirm specialized rubric routing and judge-to-human alignment checking.
- Verify Rubric Selection:
    - Run a scenario with a specialized rubric:
    - Check the `run.jsonl` trace to ensure the correct rubric prompt was injected (requires inspecting the event payload).
- Verify Calibration Command:
    - Ensure you have a `run.jsonl` with both `luna_judge_score` and `human_score` fields.
    - Run:
    - Verify that the ASCII "JUDGE CALIBRATION REPORT" is displayed with MAE and Pearson Correlation.
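To sanity-check the calibration report, MAE and Pearson correlation over paired `luna_judge_score` / `human_score` values can be reproduced by hand with the stdlib; the score values below are made up for illustration:

```python
import math

def mae(judge: list[float], human: list[float]) -> float:
    """Mean absolute error between judge and human scores."""
    return sum(abs(j - h) for j, h in zip(judge, human)) / len(judge)

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

judge = [4.0, 3.0, 5.0, 2.0]
human = [4.0, 3.5, 4.5, 2.0]
print(f"MAE={mae(judge, human):.3f}  Pearson={pearson(judge, human):.3f}")
# MAE=0.250  Pearson=0.956
```

Low MAE and high positive correlation together indicate the judge is well aligned with human raters; values computed this way should match the figures in the report.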
✅ 7. Final "Production-Ready" Checklist¶
- [ ] `multiagent-eval doctor` returns all GREEN status.
- [ ] No `ImportError` when running any ecosystem adapter.
- [ ] API keys are correctly masked in log outputs.
- [ ] Timeouts are respected (verified by setting `DEFAULT_ADAPTER_TIMEOUT=1`).
- [ ] Results generated in `reports/` include detailed framework metadata.