Configuration & Secret Management Strategy
Overview
To ensure security and flexibility, dataproc-engine follows a hierarchical configuration strategy that prioritizes environment variables for sensitive secrets (API keys) while using standard configuration files for non-sensitive parameters.
1. Secret Handling (API Keys)
Secrets must never be hardcoded or stored in version-controlled config files.
Recommended Pattern: Environment Variables
The engine will look for industry-specific environment variables:
* SEC_USER_AGENT: Required for SEC EDGAR access.
* FRED_API_KEY: Required for FRED macro data.
Local Development: .env Support
For local runs, the engine will automatically load keys from a .env file in the project root:
# .env file
SEC_USER_AGENT="YourName (your.email@example.com)"
FRED_API_KEY="your_alphanumeric_key_here"
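A minimal sketch of how such a .env file could be read, using only the Python standard library. The function name `load_dotenv_file` is hypothetical, and the choice to leave already-set environment variables untouched is an assumption; the engine's actual loader may behave differently (for example, it may rely on a third-party package).

```python
import os
from pathlib import Path


def load_dotenv_file(path, environ=None):
    """Copy KEY=VALUE pairs from a .env file into `environ`.

    Hypothetical helper: variables that are already set are NOT overridden
    (an assumption, so that a real shell environment wins over the file).
    Returns a dict of the variables that were actually added.
    """
    environ = os.environ if environ is None else environ
    env_path = Path(path)
    if not env_path.is_file():
        return {}

    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        if key not in environ:
            environ[key] = value
            loaded[key] = value
    return loaded
```

Calling `load_dotenv_file(".env")` from the project root would populate `os.environ`; passing a plain dict as `environ` keeps the sketch easy to test without touching the real environment.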
2. Hierarchical Configuration
The DatasetEngine loads a base configuration (e.g., config.yaml) and merges it with industry-specific overrides.
Precedence Order (Lowest to Highest):
- Defaults: Hardcoded in the BaseProvider or DatasetEngine.
- Config File: config.yaml (e.g., rate_limit: 10).
- CLI Arguments: (e.g., --limit 100).
- Environment Variables: (e.g., DATAPROC_LIMIT=50).
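The precedence order above can be sketched as a layered lookup. The function names `resolve_config` and `env_overrides`, the shared key name `limit`, and the rule that `DATAPROC_`-prefixed variables map to lowercase keys are all assumptions made for illustration; the source does not specify how keys are matched across layers. Values read from the environment stay strings here, so type coercion is left out.

```python
from collections import ChainMap


def env_overrides(environ, prefix="DATAPROC_"):
    """Map prefixed variables to config keys, e.g. DATAPROC_LIMIT -> limit.

    The prefix-to-key rule is an assumption, not documented engine behavior.
    """
    return {
        name[len(prefix):].lower(): value
        for name, value in environ.items()
        if name.startswith(prefix)
    }


def resolve_config(defaults, file_config, cli_args, env_vars):
    """Merge four layers; later arguments take precedence over earlier ones."""
    # ChainMap returns the first match, so list the highest-precedence layer first.
    layers = [env_vars, cli_args, file_config, defaults]
    # Drop None values so an unset CLI flag does not shadow a lower layer.
    cleaned = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in layers
    ]
    return dict(ChainMap(*cleaned))
```

With defaults `{"limit": 10}`, a file value of `20`, a CLI value of `100`, and `DATAPROC_LIMIT=50` in the environment, the resolved `limit` is the string `"50"`; without the environment variable, the CLI value `100` wins.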
3. Implementation in Code
The DatasetEngine will utilize a ConfigLoader utility to inject secrets into providers during instantiation.
# Generic injection pattern
class FinanceProvider(BaseProvider):
    def __init__(self, config: dict):
        super().__init__(config)
        # Secrets are injected from ENV or .env by the Orchestrator
        self.api_key = config.get("fred_api_key")
        self.user_agent = config.get("sec_user_agent")
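One way the injection step might look end to end is sketched below. Only the names `ConfigLoader`, `BaseProvider`, and `FinanceProvider` come from this document; the `build` method, the `SECRET_ENV_VARS` tuple, the stub `BaseProvider`, and the lowercase key convention are assumptions chosen to match the `config.get(...)` calls above.

```python
import os

# Environment variables treated as secrets (taken from the list in section 1).
SECRET_ENV_VARS = ("SEC_USER_AGENT", "FRED_API_KEY")


class ConfigLoader:
    """Hypothetical loader that copies secrets from the environment into a config dict."""

    def __init__(self, environ=None):
        self._environ = os.environ if environ is None else environ

    def build(self, base_config):
        # Work on a copy so the non-secret base config is never mutated.
        config = dict(base_config)
        for name in SECRET_ENV_VARS:
            value = self._environ.get(name)
            if value:
                # Assumed convention: FRED_API_KEY -> "fred_api_key".
                config[name.lower()] = value
        return config


class BaseProvider:
    """Stand-in for the engine's real base class, which is not shown here."""

    def __init__(self, config):
        self.config = config


class FinanceProvider(BaseProvider):
    def __init__(self, config):
        super().__init__(config)
        self.api_key = config.get("fred_api_key")
        self.user_agent = config.get("sec_user_agent")
```

A provider would then be created with something like `FinanceProvider(ConfigLoader().build({"rate_limit": 10}))`, so secrets never need to appear in `config.yaml`.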
4. Security Auditing
- Secret Masking: Logs must never print the values of keys.
- Validation: The validate() step in the provider should check for the presence of required keys before starting extraction.
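Both audit rules can be illustrated with the standard `logging` module. This is a sketch under stated assumptions: the class `SecretMaskingFilter`, the `REQUIRED_KEYS` tuple, and the standalone `validate` function are hypothetical, and the real provider's `validate()` signature is not given in this document.

```python
import logging

# Assumed required keys, matching the provider attributes in section 3.
REQUIRED_KEYS = ("fred_api_key", "sec_user_agent")


class SecretMaskingFilter(logging.Filter):
    """Replace known secret values with '***' before a record is emitted."""

    def __init__(self, secrets):
        super().__init__()
        # Ignore empty values, which would otherwise match everywhere.
        self._secrets = [secret for secret in secrets if secret]

    def filter(self, record):
        # Format first so secrets passed as %-style arguments are caught too.
        message = record.getMessage()
        for secret in self._secrets:
            message = message.replace(secret, "***")
        record.msg = message
        record.args = ()
        return True


def validate(config, required=REQUIRED_KEYS):
    """Raise ValueError naming any missing keys; values are never included."""
    missing = [key for key in required if not config.get(key)]
    if missing:
        raise ValueError(
            "Missing required configuration keys: " + ", ".join(missing)
        )
```

Attaching the filter to a handler (`handler.addFilter(SecretMaskingFilter([api_key]))`) masks the message text only; exception tracebacks and other handlers without the filter are not covered by this sketch.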