Provider Interface Specification
Base Provider Class
All industrial data providers must inherit from the BaseProvider abstract class to ensure compatibility with the DatasetEngine.
from abc import ABC, abstractmethod

# RawArtifact and StandardSchema are the framework's own data types;
# their import path is not shown in this section.


class BaseProvider(ABC):
    def __init__(self, config: dict):
        self.config = config
        self.rate_limit = config.get("rate_limit", 1)  # Default 1 req/s

    @abstractmethod
    async def extract(self) -> list[RawArtifact]:
        """Fetch raw documents from external source."""
        pass

    @abstractmethod
    async def transform(self, raw_data: list[RawArtifact]) -> list[StandardSchema]:
        """Convert raw artifacts into the normalized industry schema (Async)."""
        pass

    @abstractmethod
    def validate(self, normalized_data: list[StandardSchema]) -> bool:
        """Perform industry-specific integrity and business logic checks."""
        pass
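To make the lifecycle concrete, the sketch below shows one way a caller might drive the three methods in order. It is illustrative only: `EchoProvider` and `run_lifecycle` are hypothetical names, plain dicts stand in for `RawArtifact` and `StandardSchema`, and the `BaseProvider` shown is a stand-in that mirrors the interface above rather than the framework's own class. How the DatasetEngine actually invokes providers is not specified here.

```python
import asyncio
from abc import ABC, abstractmethod


class BaseProvider(ABC):
    """Stand-in mirroring the interface above; dicts replace the framework types."""

    def __init__(self, config: dict):
        self.config = config
        self.rate_limit = config.get("rate_limit", 1)  # Default 1 req/s

    @abstractmethod
    async def extract(self) -> list[dict]: ...

    @abstractmethod
    async def transform(self, raw_data: list[dict]) -> list[dict]: ...

    @abstractmethod
    def validate(self, normalized_data: list[dict]) -> bool: ...


class EchoProvider(BaseProvider):
    """Hypothetical provider that serves records passed in through its config."""

    async def extract(self) -> list[dict]:
        return list(self.config.get("records", []))

    async def transform(self, raw_data: list[dict]) -> list[dict]:
        # Normalize names and coerce values to float
        return [
            {"name": r["name"].strip().upper(), "value": float(r["value"])}
            for r in raw_data
        ]

    def validate(self, normalized_data: list[dict]) -> bool:
        # Example business rule (an assumption): values must be non-negative
        return all(row["value"] >= 0 for row in normalized_data)


async def run_lifecycle(provider: BaseProvider) -> list[dict]:
    """Hypothetical driver: extract and transform are awaited, validate is a plain call."""
    raw = await provider.extract()
    normalized = await provider.transform(raw)
    if not provider.validate(normalized):
        raise ValueError("validation failed")
    return normalized


provider = EchoProvider({"records": [{"name": " acme ", "value": "3.5"}]})
result = asyncio.run(run_lifecycle(provider))
```

Note that `validate` is synchronous in the interface, so it is called directly while `extract` and `transform` must be awaited.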
Industrial Implementation Registry
The framework currently supports 16 hardened industrial sectors, each implementing the lifecycle above:
1. Anchor Foundations
- Finance, Healthcare, Energy, Transportation: Deep API integrations (SEC, CMS, EIA, BTS).
- Matching: Uses DataCorrelator for fuzzy identity resolution (e.g., linking CIKs to industry signatures).
- Infrastructure: Fully asynchronous transformation supporting tiered LLM failover.
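As a rough illustration of what fuzzy identity resolution can look like, the sketch below matches a company name against a small lookup table using the standard library's `difflib.SequenceMatcher`. This is not DataCorrelator's implementation, which is not documented here; the function names, the suffix list, the 0.85 threshold, and the identifiers in `registry` are all assumptions made up for the example.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and remove common corporate suffixes."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    suffixes = {"inc", "corp", "corporation", "co", "ltd", "llc"}
    return " ".join(tok for tok in cleaned.split() if tok not in suffixes)


def best_match(name: str, candidates: dict[str, str], threshold: float = 0.85):
    """Return (key, score) for the closest candidate, or None if below threshold."""
    target = normalize(name)
    best_key, best_score = None, 0.0
    for key, candidate_name in candidates.items():
        score = SequenceMatcher(None, target, normalize(candidate_name)).ratio()
        if score > best_score:
            best_key, best_score = key, score
    return (best_key, best_score) if best_score >= threshold else None


# Made-up identifiers and names, for illustration only
registry = {"0000000001": "Acme Energy Corp.", "0000000002": "Apex Health Inc"}
match = best_match("ACME Energy Corporation", registry)
unmatched = best_match("Zenith Logistics", registry)
```

Normalizing before scoring means spelling variants such as "Corp." and "Corporation" do not lower the similarity score, while the threshold keeps unrelated names from being linked.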
2. UnstructuredProvider
- Universal Input: Handles Local Paths, URLs, and PDF files.
- Intelligence: Integrates LLMManager for heuristic or probabilistic metric extraction.
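One way to picture the universal input handling is a small dispatcher that decides which kind of source it has been given before any extraction runs. The sketch below is an assumption about how such routing could work, not the UnstructuredProvider's actual logic; `InputKind` and `classify_input` are hypothetical names.

```python
from enum import Enum
from pathlib import PurePosixPath
from urllib.parse import urlparse


class InputKind(Enum):
    URL = "url"
    PDF = "pdf"
    LOCAL_PATH = "local_path"


def classify_input(source: str) -> InputKind:
    """Classify a source string as a URL, a PDF file, or another local path."""
    if urlparse(source).scheme in ("http", "https"):
        return InputKind.URL
    if PurePosixPath(source).suffix.lower() == ".pdf":
        return InputKind.PDF
    return InputKind.LOCAL_PATH


kinds = [
    classify_input(s)
    for s in ("https://example.com/data", "reports/q1.PDF", "/data/notes.txt")
]
```

In this sketch a PDF served over HTTP is classified as a URL; a fuller handler would likely inspect the response content type after fetching.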
Extensibility for New Industries
Adding a new industry requires zero changes to the DatasetEngine. The developer only needs to:
1. Implement a new class inheriting from BaseProvider.
2. Specify the industry-specific transformation logic (async).
3. Register the new class in the CLI orchestrator.
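The three steps above can be sketched as follows. The registration mechanism of the CLI orchestrator is not described in this section, so the decorator-based `PROVIDER_REGISTRY`, `register_provider`, and `build_provider` are hypothetical, as are `AgricultureProvider` and its sample record. The `BaseProvider` here is a stand-in mirroring the interface defined earlier, with dicts in place of the framework's data types.

```python
import asyncio
from abc import ABC, abstractmethod


class BaseProvider(ABC):
    """Stand-in mirroring the documented interface."""

    def __init__(self, config: dict):
        self.config = config
        self.rate_limit = config.get("rate_limit", 1)

    @abstractmethod
    async def extract(self) -> list[dict]: ...

    @abstractmethod
    async def transform(self, raw_data: list[dict]) -> list[dict]: ...

    @abstractmethod
    def validate(self, normalized_data: list[dict]) -> bool: ...


# Hypothetical registry mapping an industry name to its provider class
PROVIDER_REGISTRY: dict[str, type[BaseProvider]] = {}


def register_provider(industry: str):
    """Class decorator that records a provider under an industry name."""
    def decorator(cls: type[BaseProvider]) -> type[BaseProvider]:
        if not issubclass(cls, BaseProvider):
            raise TypeError(f"{cls.__name__} must inherit from BaseProvider")
        PROVIDER_REGISTRY[industry] = cls
        return cls
    return decorator


@register_provider("agriculture")  # Step 3: register the class
class AgricultureProvider(BaseProvider):  # Step 1: inherit from BaseProvider
    async def extract(self) -> list[dict]:
        # Made-up sample record standing in for a real external fetch
        return [{"crop": "wheat", "yield_t_ha": "3.2"}]

    async def transform(self, raw_data: list[dict]) -> list[dict]:
        # Step 2: industry-specific async transformation
        return [
            {"crop": r["crop"], "yield_t_ha": float(r["yield_t_ha"])}
            for r in raw_data
        ]

    def validate(self, normalized_data: list[dict]) -> bool:
        return all(row["yield_t_ha"] > 0 for row in normalized_data)


def build_provider(industry: str, config: dict) -> BaseProvider:
    """Look up and instantiate a provider without touching engine code."""
    return PROVIDER_REGISTRY[industry](config)


provider = build_provider("agriculture", {"rate_limit": 2})
rows = asyncio.run(provider.transform(asyncio.run(provider.extract())))
```

Because lookup goes through the registry, the code that consumes providers only needs the industry name, which is consistent with the claim that the engine itself stays unchanged.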