
Dataproc Engine

The dataproc-engine is a framework for generating high-fidelity industrial datasets for AI agent evaluation. It consolidates 16 distinct sectors into a unified hierarchical fleet, giving a common foundation for benchmarking agents across finance, healthcare, energy, the public sector, and more.

🛡️ Core Pillars

  1. Zero-Bundling Architecture: Guarantees that no PII or restricted commercial records leak into the evaluation stream; high-fidelity simulations stand in for such data where necessary.
  2. Industrial Parity: Achieves a high-fidelity statistical match against standard benchmarks (SEC EDGAR, World Bank, etc.).
  3. Deterministic Integrity: Every extracted record is tagged with a SHA-256 checksum and an immutable ID for lineage tracking.
  4. Multi-Tier Fallback: An extraction pipeline that falls back from cloud APIs to local LLMs and regex heuristics.
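
The last two pillars can be illustrated with a minimal sketch. It is not the engine's actual implementation: the tier functions, the `tag_record` helper, and the record layout (`id`, `sha256`, `tier`, `fields`) are all hypothetical names chosen for illustration. The sketch tries each tier in order, then tags the first successful result with a SHA-256 checksum computed over a canonical JSON encoding.

```python
import hashlib
import json
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical extractor signature: returns extracted fields, or None on failure.
Extractor = Callable[[str], Optional[dict]]


def cloud_api_extract(text: str) -> Optional[dict]:
    # Stand-in for a cloud API tier; always unavailable in this sketch.
    return None


def local_llm_extract(text: str) -> Optional[dict]:
    # Stand-in for a local LLM tier; also unavailable in this sketch.
    return None


def regex_extract(text: str) -> Optional[dict]:
    # Last-resort heuristic: pull the first dollar amount out of the text.
    match = re.search(r"\$(\d+(?:\.\d+)?)", text)
    return {"amount": float(match.group(1))} if match else None


def tag_record(fields: dict, tier: str) -> dict:
    # Sorted keys and fixed separators make the encoding, and so the digest,
    # reproducible for identical fields.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Assumption: the ID is derived from the digest; the engine may do otherwise.
    return {"id": digest[:16], "sha256": digest, "tier": tier, "fields": fields}


def extract_with_fallback(
    text: str, tiers: List[Tuple[str, Extractor]]
) -> Optional[dict]:
    # Try each tier in order; an exception or None moves on to the next tier.
    for name, extractor in tiers:
        try:
            fields = extractor(text)
        except Exception:
            fields = None
        if fields is not None:
            return tag_record(fields, name)
    return None


TIERS = [
    ("cloud_api", cloud_api_extract),
    ("local_llm", local_llm_extract),
    ("regex", regex_extract),
]

record = extract_with_fallback("Q3 revenue was $12.5 million.", TIERS)
print(record["tier"], record["fields"], record["sha256"][:12])
```

Because the checksum is computed over a canonical encoding, re-extracting the same fields yields the same digest, which is what makes it usable for lineage tracking.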

🏗️ Documentation Layout


🚀 Get Started

To initialize the engine and run your first extraction:

python dataproc_engine/cli/main.py extract --industry finance --limit 10
# Results saved to: industries/finance/datasets/finance_kb.jsonl
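
Once the command finishes, the output can be inspected from Python. The snippet below is a small sketch that assumes the file follows the usual JSON Lines convention (one JSON object per line), as the `.jsonl` extension suggests; the record schema is not documented here, so the code makes no assumptions about field names. `read_jsonl` is a hypothetical helper, not part of the engine.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Path taken from the command output above.
    output = Path("industries/finance/datasets/finance_kb.jsonl")
    if output.exists():
        records = read_jsonl(output)
        print(f"Loaded {len(records)} records")
    else:
        print(f"{output} not found; run the extract command first")
```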

For detailed setup instructions, visit the User Manual.