data-to-agentdata: Automated Document Extraction & Formatting
1. Implementation Overview
The data-to-agentdata utility is fully implemented as a core component of the dataproc-engine. It facilitates the transition from raw, unstructured documents (PDF, Text, Web) to high-signal metrics for the Agent Evaluation System (AES).
Technical Capabilities:
- Parsing: Native PDF extraction via `pypdf` and raw text/web retrieval via `aiohttp`.
- Tiered LLM Extraction:
  - Cloud: High-precision extraction (Gemini, Claude, GPT, Grok).
  - Local: Privacy-first extraction (Ollama: DeepSeek, Qwen).
  - Heuristics: Zero-cost regex fallback for structured logs and KV pairs.
- Unified Orchestration: Uses the same `DatasetEngine` and `DataCorrelator` layers as the deterministic API providers.
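To make the heuristics tier concrete, here is a minimal sketch of a regex fallback for key-value pairs. The pattern, the function name `extract_kv_pairs`, and the snake_case key normalization are assumptions for illustration; the utility's actual patterns are not shown in this document.

```python
import re

# Hypothetical pattern: one "key: value" or "key=value" pair per line.
# [ \t]* is used instead of \s* so a match never spans multiple lines.
KV_PATTERN = re.compile(
    r"^[ \t]*([A-Za-z_][\w .-]*?)[ \t]*[:=][ \t]*(\S.*?)[ \t]*$",
    re.MULTILINE,
)


def extract_kv_pairs(text: str) -> dict[str, str]:
    """Return the key/value pairs found in text, with keys in snake_case."""
    pairs = {}
    for key, value in KV_PATTERN.findall(text):
        # Collapse spaces, dots, and hyphens in the key into underscores.
        normalized = re.sub(r"[\s.-]+", "_", key.strip()).lower()
        pairs[normalized] = value
    return pairs


# Illustrative input: two log-style pairs, one free-form line, one spaced key.
sample = "status: ok\nlatency_ms=42\nfree-form sentence without a pair\nRegion Name : us-east\n"
result = extract_kv_pairs(sample)
```

Lines without a `:` or `=` separator are skipped rather than guessed at, which suits a fallback that should only fire on clearly structured input.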
2. Architecture: UnstructuredProvider
The UnstructuredProvider implements the BaseProvider interface with a unified asynchronous transformation pipeline:
1. Extracts: Recovers text from documents or URIs.
2. Transforms: Utilizes the LLMManager tiers to map text to the AES-compliant schema.
3. Correlates: Automatically links extracted entities with existing industry datasets (e.g., matching a document's company name with an SEC CIK).
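The three stages above can be sketched as an asynchronous pipeline. Everything in this block is hypothetical: the class name `UnstructuredProviderSketch`, the `Record` shape, the method signatures, and the placeholder company and identifier are assumptions, and the real `BaseProvider` interface and `LLMManager` are not reproduced here. The correlation step uses `difflib.get_close_matches` from the standard library as a stand-in for whatever fuzzy matcher the engine uses.

```python
import asyncio
import difflib
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for an AES-compliant record."""
    source_uri: str
    fields: dict = field(default_factory=dict)
    links: dict = field(default_factory=dict)


class UnstructuredProviderSketch:
    """Illustrative extract -> transform -> correlate pipeline (not the real class)."""

    def __init__(self, known_companies: dict[str, str]):
        # Maps a lowercase company name to an identifier such as an SEC CIK.
        self.known_companies = known_companies

    async def extract(self, uri: str) -> str:
        # Placeholder: a real implementation would read a PDF or fetch a URL.
        await asyncio.sleep(0)
        return "Company: Example Holdings, Inc.\nSegment: finance"

    async def transform(self, text: str) -> dict:
        # Placeholder for the LLM tiers: split "key: value" lines into a dict.
        pairs = (line.split(":", 1) for line in text.splitlines() if ":" in line)
        return {key.strip().lower(): value.strip() for key, value in pairs}

    async def correlate(self, fields: dict) -> dict:
        # Fuzzy-match the extracted company name against the known names.
        name = fields.get("company", "").lower()
        match = difflib.get_close_matches(
            name, list(self.known_companies), n=1, cutoff=0.8
        )
        return {"cik": self.known_companies[match[0]]} if match else {}

    async def run(self, uri: str) -> Record:
        text = await self.extract(uri)
        fields = await self.transform(text)
        links = await self.correlate(fields)
        return Record(source_uri=uri, fields=fields, links=links)


# Placeholder lookup table; "0000000000" is not a real CIK.
provider = UnstructuredProviderSketch({"example holdings inc": "0000000000"})
record = asyncio.run(provider.run("./reports/sample.pdf"))
```

Keeping each stage as its own coroutine means a slow stage (a network fetch or an LLM call) can be awaited without blocking other documents processed concurrently.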
3. Workflow Example
- Input: Annual Report PDF (internal) or raw text dump.
- Command: `python dataproc_engine/cli/main.py extract --source file --input-uri ./reports/apple_q1.pdf --industry finance`
- Output: A correlated Finance record enriched with infrastructure signals discovered in `/industries/telecom/datasets/`.
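For readers wiring up a similar entry point, the following is a rough sketch of how the `extract` subcommand and its three flags could be parsed with `argparse`. It is not the actual contents of `dataproc_engine/cli/main.py`; the function name `build_parser`, the help strings, and the choice to make every flag required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags used in the workflow command."""
    parser = argparse.ArgumentParser(prog="main.py")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # Assumption: all three flags are mandatory for the extract subcommand.
    extract = subcommands.add_parser("extract", help="Extract a record from a document")
    extract.add_argument("--source", required=True, help="Source kind, e.g. file")
    extract.add_argument("--input-uri", required=True, help="Path or URL of the document")
    extract.add_argument("--industry", required=True, help="Target vertical, e.g. finance")
    return parser


# Parse the same arguments as the workflow command, passed as a list.
args = build_parser().parse_args(
    ["extract", "--source", "file", "--input-uri", "./reports/apple_q1.pdf", "--industry", "finance"]
)
```

Note that `argparse` exposes `--input-uri` as the attribute `args.input_uri`, with the hyphen converted to an underscore.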
4. Operational Status
- [x] v1: Deterministic API sources (SEC/FRED/EIA).
- [x] v2: Local file ingestion and regex heuristics.
- [x] v3: Intelligent LLM-powered extraction with fuzzy cross-vertical correlation.