📄 Unstructured Data Guide¶
High-fidelity extraction from non-tabular sources (PDF, Doc, Web) using LLM-gated scrapers and OCR pipelines. This sector enables the engine to process raw, natural language source material.
Status: HARDENED¶
- Architecture: LLM-Gated Scraper Fleet.
- Verification: 100% Parity.
Data Sources¶
- Web-based Acquisition: Universal scraper fleet (API-compatible).
- PDF/OCR: Textual extraction from institutional reports and documentation.
🛠️ Schema (StandardSchema)¶
source_uri: URL or File path.content_type:text,image, orhybrid.metric:relevance_score,data_density, orsentiment.value: Numerical reading.integrity_hash: SHA-256 hash of the raw source.