dataproc-engine User Manual¶

Welcome to the dataproc-engine, a modular and extensible framework for generating high-quality industrial datasets for AI agent evaluation. This manual provides an end-to-end guide for setting up, configuring, and### 📚 Supporting Documentation - Implementation Strategy - Architecture Diagram - Data Veracity Report - Algorithm Overview - Target Schema

Prerequisites¶

Python 3.9+
Pip or Poetry

Setup Steps¶

Clone the Repository:

git clone https://github.com/najeed/ai-agent-eval-harness.git
cd ai-agent-eval-harness

Install Dependencies:
```
pip install -e .
```
Note: This installs the core dataproc-engine and the dataproc-cli utility.

🛠️ 2. CLI Usage Reference¶

The primary interface is the dataproc-cli.

Basic Command Structure¶

python dataproc_engine/cli/main.py extract --industry [INDUSTRY] [FLAGS]

Core Flags¶

Flag	Description	Default / Example
`--industry`	Target industrial sector (16+ sectors available).	`finance`
`--limit`	Maximum number of records to process.	`10`
`--format`	Output format (JSON Lines or CSV).	`jsonl`
`--schema-type`	Specific industry-standard schema variant.	`clinical`, `sec_edgar`
`--source`	Data acquisition strategy (`api` or `file`).	`api`
`--input-uri`	Local path or URL for `file` source ingestion.	`--input-uri data.csv`
`--target-dir`	Custom output directory for results.	`output`
`--output-name`	Override the default generated filename.	`my_dataset.jsonl`
`--llm-strategy`	Transformation approach (`auto`, `cloud`, `ollama`, `mock`).	`auto`
`--allow-simulation`	Fallback to high-fidelity synthetic data on failure.	`True`
`--overwrite`	Bypasses the conflict prompt for existing output.	N/A
`--max-backups`	Maximum number of rolling backups to keep.	`5`

🌐 3. Industrial Data Sectors¶

The engine supports 16 standard industrial sectors with specialized gold-standard schemas. Use the strings below for the --industry flag.

Sector Group	Industry ID (`--industry`)	Primary Schema Types (`--schema-type`)	Gold Source
Finance	`finance`	`sec_edgar`, `credit_risk`, `world_bank`	SEC, FRED
Energy	`energy`	`eia`, `energy_balances`, `standard`	EIA, Energy Provider Agency
Healthcare	`healthcare`	`cms`, `clinical`, `standard`	CMS, WHO
Telecom	`telecom`	`fcc`, `ookla`, `standard`	FCC, Ookla
Public Sector	`demographics`	`standard`, `global_population`	World Bank
Public Sector	`labor`	`standard`, `ilo_employment`	ILO, BLS
Public Sector	`housing`	`standard`, `hud_market`	HUD
Public Sector	`environment`	`standard`, `climate_noaa`	NOAA
Academic	`education`	`standard`, `nces_stats`	NCES, UNESCO
Industrial	`manufacturing`	`standard`, `industrial_stats`	Industrial Statistics Agency
Commercial	`agriculture`	`standard`, `usda_crops`	USDA
Commercial	`ecommerce`	`olist`, `uci_retail`, `standard`	Olist, UCI
Logistics	`transportation`	`standard`, `bts_ontime`	BTB, Eurostat
Foundational	`media_and_entertainment`	`standard`, `imdb_cultural`	IMDb
Foundational	`decision_support`	`standard`, `signal_linking`	Correlator
Foundational	`unstructured`	`standard`, `llm_scraper`	Scrapers

Full Industry List & Schema Types¶

🛡️ 4. Gold Standards & BYOD (Bring Your Own Data)¶

Compliance-First Architecture¶

For restricted datasets (CC BY-NC-SA 4.0, Restricted Clinical Repository, Energy Provider Agency), the engine defaults to High-Fidelity Simulations to ensure 100% Apache 2.0 compliance for redistribution.

Unlocking Live Processing¶

To use actual commercial/restricted benchmarks, follow these steps: 1. Apply for Access: Use the links in the Data Veracity Report to download the data. 2. Execution:

python dataproc_engine/cli/main.py extract --industry healthcare --schema-type clinical --input-uri ./my_clinical_data/labevents.csv

📂 Default Output Layout¶

If --target-dir is not specified, the engine automatically resolves the path to ensure organizational parity: * Path: industries/{industry}/datasets/ * Knowledge Base (JSONL): {industry}_kb.jsonl * Records (CSV): {industry}_records.csv

🚀 Examples¶

Extract Finance Data (Default):

python dataproc_engine/cli/main.py extract --industry finance
# Saved to: industries/finance/datasets/finance_kb.jsonl

🔑 5. Configuration & Secrets¶

Configuration is managed via config.json or Environment Variables.

API Keys & Secrets¶

Service	Environment Variable	Config Key	Industry
EIA	`EIA_API_KEY`	`eia_api_key`	Energy
FRED	`FRED_API_KEY`	`fred_api_key`	Finance
NCES	`NCES_API_KEY`	`nces_api_key`	Education
ILO	`ILO_API_KEY`	`ilo_api_key`	Labor
Industrial Statistics Agency	`Industrial Statistics Agency_API_KEY`	`Industrial Statistics Agency_api_key`	Manufacturing
Copernicus	`CDS_API_KEY`	`cds_api_key`	Environment
Ookla	`OOKLA_API_KEY`	`ookla_api_key`	Telecom
Spotify	`SPOTIFY_CLIENT_ID`	`spotify_client_id`	Media
Backup Retention Policy	`DATAPROC_MAX_BACKUPS`	N/A	CLI Global

📊 6. Output & Veracity¶

Every record produced by the engine adheres to the StandardSchema, including: - Deterministic ID: Hash-based records for cross-dataset linking. - SHA-256 Integrity: Every record includes a checksum of its normalized data. - Provenance: Lineage back to the source URL or local file path.

🔄 8. Rotational Dataset Archiving¶

To prevent accidental data loss (especially when local changes are overwritten by CI/CD triggers), the engine implements a Rolling Backup Policy:

Automatic Archiving: When a file conflict occurs (even with --overwrite), the engine renames the existing file to <filename>.<timestamp>.bak.
Rotation: By default, the engine keeps the last 5 backups and automatically prunes older ones.

Customization: Use the --max-backups flag or set the DATAPROC_MAX_BACKUPS environment variable:

# Keep the last 10 versions via CLI
python dataproc_engine/cli/main.py extract --industry finance --max-backups 10

# Keep the last 10 versions via Env Var (Windows)
$env:DATAPROC_MAX_BACKUPS="10"

⚖️ 9. Licensing & Attribution¶

Code: Apache 2.0.
Shipped Data: Synthetic boilerplate only (Safe for redistribution).
External Data: Must adhere to respective licenses (CC BY, Restricted Clinical Repository, Energy Provider Agency).

For detailed licensing guidance, see the Data Veracity Report.