Data Veracity & Provenance Report (Hardened)
Licensing & Synthesis Compliance Matrix ⚖️
To ensure Apache 2.0 compatibility across the 16-sector fleet, the engine adheres to the following distribution matrix:
| Source Category | Examples | Safe to Embed? | Generator Mode | Licensing Conditions |
|---|---|---|---|---|
| Public Domain | SEC, Census, NOAA, HUD, CMS | ✅ Yes | Moment Parity | Derived/Synthetic twins allowed. |
| CC BY / CC0 | World Bank, WHO, UNESCO | ✅ Yes | Moment Parity | Attribution required (CC BY). |
| NC-SA / NC | Olist, Ookla, Zillow, IMDb | ❌ No | Logic Only | NC/SA terms are incompatible with Apache 2.0 redistribution. |
| Restricted (DUA) | Clinical Repository, Industrial Stats | ❌ No | Logic Only | Contractual gates (credentialed access under a Data Use Agreement). |
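
The matrix above can be read as a lookup from source category to embedding policy. The following is a minimal sketch, not the engine's actual implementation: the names `Policy`, `MATRIX`, and `policy_for`, the category keys, and the fallback for unlisted categories are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    safe_to_embed: bool
    generator_mode: str


# Hypothetical encoding of the matrix rows; the keys are illustrative labels.
MATRIX = {
    "public_domain": Policy(True, "moment_parity"),
    "cc_by": Policy(True, "moment_parity"),
    "cc0": Policy(True, "moment_parity"),
    "nc": Policy(False, "logic_only"),
    "nc_sa": Policy(False, "logic_only"),
    "restricted_dua": Policy(False, "logic_only"),
}

# Assumption: a category missing from the matrix is treated conservatively.
FALLBACK = Policy(False, "logic_only")


def policy_for(category: str) -> Policy:
    """Return the embedding policy for a source category label."""
    return MATRIX.get(category.strip().lower(), FALLBACK)
```

Under these assumptions, `policy_for("cc_by")` allows embedding, while any unrecognized label falls back to logic-only distribution.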
Parity Generation Principles (Conservative Stance)
- Conservative Default: If a license is NC (Non-Commercial) or Restricted (DUA), the engine distributes Logic Only. No synthetic approximations are embedded in the core repo for these sources.
- Provenance Audit: Every synthetic dataset carries a defensible audit trail that links it back to its permissively licensed original.
- Parameter Integrity: Parameters for restricted generators are sourced exclusively from public literature (e.g., Merck Manuals, Energy Statistics Yearbooks), not from gated raw data.
- License Contamination (NC-SA): Because CC BY-NC-SA 4.0 carries a "ShareAlike" clause, synthetic outputs generated from these sources may inherit Non-Commercial restrictions. To protect the Apache 2.0 core, these generators are isolated from the embedded package.
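
One way to picture the provenance audit principle is a small record attached to each embedded synthetic dataset. This sketch is illustrative only: `ProvenanceRecord`, `make_record`, the contents of `PERMISSIVE`, and the idea of fingerprinting generator parameters with SHA-256 are assumptions, not documented engine behavior.

```python
import hashlib
import json
from dataclasses import dataclass

# Assumption: license labels treated as permissive enough to embed.
PERMISSIVE = {"Public Domain", "CC0", "CC BY 4.0"}


@dataclass(frozen=True)
class ProvenanceRecord:
    dataset: str        # name of the synthetic dataset
    source: str         # permissive original it was derived from
    license: str        # license label of that original
    params_sha256: str  # fingerprint of the generator parameters


def make_record(dataset: str, source: str, license: str, params: dict) -> ProvenanceRecord:
    """Build an audit record, refusing sources that are not permissive."""
    if license not in PERMISSIVE:
        raise ValueError(f"{source}: '{license}' is logic-only, nothing to embed")
    # Sort keys so the same parameters always give the same fingerprint.
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return ProvenanceRecord(dataset, source, license, hashlib.sha256(payload).hexdigest())
```

Raising on a non-permissive license mirrors the conservative default: for NC or restricted sources there is no embedded synthetic dataset to audit in the first place.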
⚠️ Licensing Nuances
- FRED (Federal Reserve): Licensing is upstream-dependent. Series sourced from the Federal Reserve are Public Domain; others (BIS, ECB) carry their own terms.
- Copernicus: Licensed under the Copernicus Data Information and Access Services license (CC BY 4.0 equivalent). Mandatory attribution: "Generated using Copernicus Climate Change Service information 2026."
- HUD User: Fair Market Rents (FMR) data is U.S. Government Public Domain. Attribution is best practice for data veracity audit trails.
- Zillow Research: Personal/Academic use only. Prohibits bulk redistribution or commercial derivative services per Developer Terms.
- IMDb: Non-Commercial by ToS. No "ShareAlike" clause; derivative restriction analysis differs from CC BY-NC-SA sources.
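
Two of the nuances above are mechanical enough to sketch in code: the Copernicus attribution string and the upstream-dependent FRED license. The function names, the assumption that the year in the attribution is the year of generation, and the simple agency match are all hypothetical and shown only for illustration.

```python
from datetime import date
from typing import Optional


def copernicus_attribution(year: Optional[int] = None) -> str:
    """Return the attribution line quoted above for a given year."""
    # Assumption: default to the current year when none is supplied.
    y = date.today().year if year is None else year
    return f"Generated using Copernicus Climate Change Service information {y}."


def fred_series_license(upstream: str) -> str:
    """Classify a FRED series by the agency that originally published it."""
    # Assumption: the upstream agency name is available as plain text.
    if upstream.strip().lower() == "federal reserve":
        return "Public Domain"
    return "Upstream terms apply"
```

For example, `fred_series_license("ECB")` returns `"Upstream terms apply"`, which signals that the series needs a manual license check before any embedding decision.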
🏗️ Data Onboarding (BYOD)
To unlock the full "Gold Standard" evaluation, download the following datasets and provide the path via the --input-uri flag. Please review the respective licensing restrictions for use and redistribution.
| Industry | Target Benchmark | Download Link | License |
|---|---|---|---|
| Finance | UCI Credit Card | UCI ML Repository | CC BY 4.0 |
| Healthcare | Clinical Data Example | Restricted Access | Restricted (Credentialed) |
| Manufacturing | Industrial Benchmarking | Production Statistics | Restricted |
| Energy | Global Energy Standard | Global Statistics | Restricted |
| Retail | E-Commerce Parity | Transaction Logs | CC BY-NC-SA 4.0 |
| Agriculture | Global Agri-Stats | FAOStat | CC BY-NC-SA 3.0 |
| Media | Creative Metadata | IMDb Datasets | Non-Commercial |
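
As a rough illustration of how a downloaded dataset might be handed to a tool through `--input-uri`, the sketch below parses the flag with `argparse`. Only the flag name comes from this report; `build_parser`, `classify_input`, and the rule for distinguishing local paths from remote URIs are assumptions, and the engine's real command-line interface may differ.

```python
import argparse
from urllib.parse import urlparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the --input-uri flag."""
    parser = argparse.ArgumentParser(description="BYOD evaluation (illustrative)")
    parser.add_argument(
        "--input-uri",
        required=True,
        help="Path or URI of a dataset you downloaded yourself",
    )
    return parser


def classify_input(uri: str) -> str:
    """Return 'local' for plain paths and file:// URIs, else the URI scheme."""
    scheme = urlparse(uri).scheme
    return "local" if scheme in ("", "file") else scheme


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(classify_input(args.input_uri), args.input_uri)
```

Because the data never ships with the package, the licensing terms in the table continue to apply to whatever file the path points at.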
🛡️ Sector Diagnostics
1. Finance
- SEC EDGAR: Public Domain fundamentals. Synthetic templates embedded.
- FRED: Open (Attribution required) macro data. Live API integration.
2. Housing
- HUD User: Public Domain¹ rental trends. Synthetic parity embedded.
- Zillow Research: Non-Commercial macro housing metrics. Live download required.
3. Environment
- NOAA: Public Domain station data. Climatology synthetic records embedded.
- Copernicus: CC BY 4.0 monitoring data. Embedded with mandatory attribution.
4. Manufacturing
- Industrial Stats: Restricted benchmarking. Logic-only parity synthesis.
- Census ASM: Public Domain manufacturing aggregates. Synthetic templates available.
5. Healthcare
- CMS: Hospital General Info
  - Usage: Quality outcomes and provider metadata. LIVE (Public Domain).
- WHO GHO: Global Health Observatory
  - Usage: Global health trends. LIVE (CC BY 4.0).
- Clinical Database:
  - Usage: Intensive care records. BYOD / SIMULATED (Restricted).
🏛️ Comprehensive Registry (Citations)
| Industry | Primary Source / Citation URL | Format | License |
|---|---|---|---|
| Finance | SEC EDGAR (Fundamentals) | XBRL/CSV | Public Domain |
| Finance | FRED (Federal Reserve Economic Data) | API/CSV | Public Domain (Federal Reserve series); upstream terms otherwise |
| Environment | NOAA Climate Data Online | API/CSV | Public Domain |
| Environment | Copernicus Climate Change Service | GRIB/NetCDF | CC BY 4.0 |
| Healthcare | CMS Hospital General Information | CSV/API | Public Domain |
| Healthcare | WHO Global Health Observatory | API/CSV | CC BY 4.0 |
| Labor | U.S. Bureau of Labor Statistics (BLS) | API/CSV | Public Domain |
| Labor | ILOSTAT (International Labour Organization) | REST API | CC BY 4.0 |
| Agriculture | FAOStat (Food and Agriculture Organization) | CSV/API | CC BY-NC-SA 3.0 |
| Agriculture | USDA NASS (Quick Stats) | CSV/API | Public Domain |
| Commerce | Marketplace Parity Repository (Olist) | CSV | CC BY-NC-SA 4.0 |
| Housing | HUD User (Fair Market Rents) | CSV | Public Domain |
| Housing | Zillow Research (Economic Data) | CSV | Non-Commercial |
| Media | IMDb Dataset Interface | TSV | Non-Commercial |
| Telecom | Ookla Open Data (Speedtest Intelligence) | Parquet | CC BY-NC-SA 4.0 |
| Telecom | FCC Fixed Broadband Deployment | CSV/API | Public Domain |
| Manufacturing | U.S. Census Bureau (ASM) | CSV/API | Public Domain |
Last Updated: 2026-03-24 (Hardened) ⚖️🛡️🏁