
dataproc-engine Architecture Diagram

The following diagram illustrates the data flow and component interactions within the hardened 16-sector industrial fleet.

graph TD
    subgraph CLI[User Interface]
        Main[cli/main.py] -->|--industry| Engine[core/engine.py]
        Main -->|--target-dir| Correlator[core/correlator.py]
    end

    subgraph Core[Execution Layer]
        Engine -->|Register| Providers[16x Industrial Providers]
        Engine -->|Secrets| Config[core/config.py]
        Engine -->|Async Execute| Pipeline[ETL Pipeline]
    end

    subgraph ETL[Unified Pipeline]
        direction TB
        Resiliency[Circuit Breaker / Retry] --> Extract[Async Extraction]
        Extract --> Security[PII Scrubber / Zero-Bundling]
        Security --> Transform[Async Logic / Tiered Fallback]
        Transform --> Validate[Schema Integrity / SHA-256]
        Validate --> Correlate[Cross-Sector Correlation]
    end

    subgraph Intelligence[Inference & Fallback]
        Transform -->|LLM Gating| LLM[core/llm_manager.py]
        LLM -->|T1| Cloud[Gemini/GPT/Claude]
        LLM -->|T2| Local[Ollama/Llama-CPP]
        LLM -->|T3| Heuristic[Regex Heuristics]

        Correlate -->|Fuzzy Match| CorrelationLayer[DataCorrelator]
        CorrelationLayer -->|RapidFuzz| Identity[Identity Resolution]
    end

    subgraph Storage[16-Sector Fleet]
        Correlate -->|Save| Output[("industries/**/datasets/*.{jsonl,csv}")]
        Output -->|Metadata| Audit[Data Veracity Logs]
    end
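
The T1/T2/T3 edges in the diagram describe a tiered fallback: try the cloud model, then a local model, then regex heuristics. The following is a minimal sketch of that pattern, not the actual contents of core/llm_manager.py. Every name in it (`run_with_fallback`, `cloud_tier`, `local_tier`, `heuristic_tier`, `TICKER_RE`) is hypothetical, the two model tiers are stubs that always fail so the fallback path is visible, and the code is synchronous for brevity even though the diagram labels this stage as async.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical tier signature: takes raw text, returns a result or raises.
Tier = Callable[[str], str]


def cloud_tier(text: str) -> str:
    # Stand-in for a hosted model call (T1); always fails in this sketch.
    raise ConnectionError("cloud provider unavailable")


def local_tier(text: str) -> str:
    # Stand-in for a local model call (T2); also fails in this sketch.
    raise RuntimeError("local model not loaded")


# Assumed heuristic: the first run of 1-5 uppercase letters is a ticker.
TICKER_RE = re.compile(r"\b[A-Z]{1,5}\b")


def heuristic_tier(text: str) -> str:
    # T3: deterministic regex heuristic that needs no model at all.
    match = TICKER_RE.search(text)
    return match.group(0) if match else ""


def run_with_fallback(text: str, tiers: List[Tuple[str, Tier]]) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, result) from the first success."""
    errors = []
    for name, tier in tiers:
        try:
            return name, tier(text)
        except Exception as exc:  # any failure moves on to the next tier
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all tiers failed: " + "; ".join(errors))


tiers = [("T1", cloud_tier), ("T2", local_tier), ("T3", heuristic_tier)]
used, value = run_with_fallback("shares of ACME rose", tiers)
print(used, value)
```

Keeping the tiers as an ordered list of callables means the ordering can be changed, or a tier disabled, without touching the loop.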

🎯 Industrial Parity Implementation

The engine achieves "Parity" by comparing extracted records against "Gold Standard" benchmarks (e.g., SEC EDGAR, World Bank).
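
One way to picture the comparison against a benchmark is a field-by-field diff between an extracted record and its benchmark counterpart. The sketch below assumes both records are flat dictionaries. The function name `field_parity`, the field names, and the idea of a fractional score are illustrative assumptions and are not taken from the engine's code.

```python
from typing import Any, Dict


def field_parity(extracted: Dict[str, Any], benchmark: Dict[str, Any]) -> Dict[str, Any]:
    """Compare an extracted record with a benchmark record, field by field.

    Only fields present in the benchmark are checked; a field missing from
    the extracted record counts as a mismatch (its value is reported as None).
    """
    mismatched = {
        key: (extracted.get(key), expected)
        for key, expected in benchmark.items()
        if extracted.get(key) != expected
    }
    total = len(benchmark)
    # Fraction of benchmark fields reproduced exactly; 1.0 for an empty benchmark.
    score = 1.0 if total == 0 else (total - len(mismatched)) / total
    return {"score": score, "mismatched": mismatched}


# Hypothetical records; the values are made up for illustration.
benchmark = {"ticker": "ACME", "fiscal_year": 2023, "revenue": 1000}
extracted = {"ticker": "ACME", "fiscal_year": 2023, "revenue": 990}

report = field_parity(extracted, benchmark)
print(report["mismatched"])
```

Here two of the three benchmark fields match, and the report names `revenue` as the field that diverges along with both values.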

Parity Scenarios:

1. Anchor Parity: Direct field-for-field mapping for Finance, Healthcare, Energy (Anchor Sectors).
2. Fidelity Simulation: High-fidelity synthetic records used when commercial/restricted licenses prohibit redistribution (Clinical Database V2, Industrial Statistics Agency).
3. Cross-Sector Linking: Verifying that IDs (e.g., Ticker symbols) resolve consistently across different industrial providers.
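
The cross-sector linking check can be sketched as grouping records by ID and flagging any ID that resolves to more than one entity. This is a simplified illustration that compares normalized names exactly. It does not use the RapidFuzz-based fuzzy matching shown in the diagram, and the function name `find_inconsistent_ids`, the record shape, and the sample data are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (provider, identifier, entity_name).
Record = Tuple[str, str, str]


def find_inconsistent_ids(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Return identifiers that map to more than one normalized entity name."""
    seen = defaultdict(set)
    for _provider, identifier, entity in records:
        # Normalize case and surrounding whitespace before comparing.
        seen[identifier].add(entity.strip().casefold())
    return {
        identifier: sorted(names)
        for identifier, names in seen.items()
        if len(names) > 1
    }


# Made-up sample data: ACME resolves consistently, ZETA does not.
records = [
    ("finance", "ACME", "Acme Corp"),
    ("energy", "ACME", "acme corp "),
    ("healthcare", "ZETA", "Zeta Health"),
    ("energy", "ZETA", "Zeta Power"),
]

conflicts = find_inconsistent_ids(records)
print(conflicts)
```

An empty result means every ID resolved to a single entity across providers. A fuzzy matcher would replace the exact comparison when names differ by more than case or whitespace.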