AES Dataset Alignment Strategy

The dataproc-engine is designed to provide the verifiable ground truth that underpins agent evaluation. This document outlines how extracted datasets align with existing AES scenarios.

1. Mapping Extraction to Scenarios

| Scenario ID | Task Requirement | dataproc Dataset Support |
| --- | --- | --- |
| scenario-39dd9c3d | Verify audit trail / retrieval. | entity_name and status mapped to SEC CIK and filing status. |
| finance-cf-11255 | Perform 'what-if' modeling. | Historical revenue/sales trends (10-K time-series). |
| Future Scenarios | General Knowledge Retrieval. | Full corporate facts (JSON Knowledge Base). |

2. Output Formatting (AES Standard)

To be useful for agents, the engine supports two primary output formats:

A. Tabular CSV (Legacy Support)

Matches existing industries/{industry}/datasets/ schemas across the 16-sector fleet. Useful for agents using simple table-lookup tools.

* Location: industries/{industry}/datasets/{industry}_records.csv
* Fields: record_id, entity_name, status, value, updated_at.
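As a minimal sketch of the tabular format, the snippet below writes records using the five fields listed above and reads them back with a simple lookup. The helper names (`records_to_csv`, `lookup`) and the sample row are hypothetical and for illustration only; the engine's actual writer may differ.

```python
import csv
import io

# Column order taken from the field list above.
FIELDS = ["record_id", "entity_name", "status", "value", "updated_at"]


def records_to_csv(records):
    """Serialize dict records to CSV text, header row first.

    Hypothetical helper: keys outside FIELDS are ignored and
    missing keys are written as empty strings.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for rec in records:
        writer.writerow({name: rec.get(name, "") for name in FIELDS})
    return buf.getvalue()


def lookup(csv_text, entity_name):
    """Simple table lookup: return every row whose entity_name matches."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["entity_name"] == entity_name]


# Illustrative row only; these values are made up, not real extracted data.
sample = [
    {
        "record_id": "r-001",
        "entity_name": "Example Corp",
        "status": "active",
        "value": "123.4",
        "updated_at": "2024-01-01T00:00:00Z",
    }
]
csv_text = records_to_csv(sample)
matches = lookup(csv_text, "Example Corp")
```

Keeping the column order in a single `FIELDS` constant means the writer and any table-lookup tool agree on the header without further configuration.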

B. High-Signal JSONL (Knowledge Base)

Rich, hierarchical data for RAG-enabled or sophisticated reasoning agents. Contains technical units, units of measure, and deep source provenance for all 16 domains.

* Location: industries/{industry}/datasets/{industry}_kb.jsonl
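The JSONL schema is not specified in this document, so the following is only a sketch of what one hierarchical record and its round trip might look like. Every field name here (`entity_name`, `facts`, `unit`, `source`, and so on) is an assumption chosen to illustrate units and provenance, not the engine's actual schema.

```python
import json


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(rec, sort_keys=True) + "\n" for rec in records)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical record: all keys and values are illustrative assumptions.
record = {
    "entity_name": "Example Corp",
    "facts": [
        {
            "name": "revenue",
            "value": 1000.0,
            "unit": "USD",  # unit of measure
            "source": {"document": "10-K", "period": "FY2023"},  # provenance
        }
    ],
}

kb_text = to_jsonl([record])
restored = from_jsonl(kb_text)
```

Because each line is a complete JSON object, a retrieval pipeline can stream the file line by line instead of loading the whole knowledge base at once.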

3. Direct Integration Workflow

By using the --target industries flag, the engine will:

1. Extract live ground truth.
2. Format it to match the AES target schema.
3. Overwrite the local industries/{industry}/datasets/ files.
4. Allow for immediate re-running of agent evaluations against the latest data.
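The first three steps above can be sketched roughly as follows. This is not the engine's implementation: `extract_ground_truth`, `format_for_aes`, `overwrite_dataset`, and `run` are hypothetical names, the extraction step is a stub returning a fixed illustrative row, and the write-to-temp-then-rename strategy is one assumed way to avoid leaving a half-written file behind.

```python
import csv
import os
import tempfile
from pathlib import Path

FIELDS = ["record_id", "entity_name", "status", "value", "updated_at"]


def extract_ground_truth(industry):
    """Step 1 stand-in: returns a fixed illustrative row (not live data)."""
    return [
        {
            "record_id": f"{industry}-001",
            "entity_name": "Example Corp",
            "status": "active",
            "value": "1.0",
            "updated_at": "2024-01-01T00:00:00Z",
            "raw_extra": "dropped during formatting",
        }
    ]


def format_for_aes(raw_rows):
    """Step 2: keep only the target schema's fields, in a fixed order."""
    return [{name: row.get(name, "") for name in FIELDS} for row in raw_rows]


def overwrite_dataset(root, industry, rows):
    """Step 3: replace industries/{industry}/datasets/{industry}_records.csv."""
    target_dir = Path(root) / "industries" / industry / "datasets"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{industry}_records.csv"
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp_name, target)
    return target


def run(root, industry):
    """Chain steps 1 to 3 and return the path of the refreshed dataset."""
    rows = format_for_aes(extract_ground_truth(industry))
    return overwrite_dataset(root, industry, rows)
```

Calling `run(".", "finance")` from a repository root would refresh `industries/finance/datasets/finance_records.csv` under these assumptions, after which evaluations can be re-run against the new file.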