Data Veracity & Provenance Report (Hardened)

Licensing & Synthesis Compliance Matrix ⚖️

To ensure Apache 2.0 compatibility across the 16-sector fleet, the engine adheres to the following distribution matrix:

| Source Category | Examples | Safe to Embed? | Generator Mode | Licensing Conditions |
| --- | --- | --- | --- | --- |
| Public Domain | SEC, Census, NOAA, HUD, CMS | ✅ Yes | Moment Parity | Derived/Synthetic twins allowed. |
| CC BY / CC0 | World Bank, WHO, UNESCO | ✅ Yes | Moment Parity | Attribution required (CC BY). |
| NC-SA / NC | Olist, Ookla, Zillow, IMDb | ❌ No | Logic Only | NC/SA prevents redistribution. |
| Restricted (DUA) | Clinical Repository, Industrial Stats | ❌ No | Logic Only | Contractual gates (Credentialed). |
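
The matrix above can be read as a lookup from license category to distribution policy. The following is a minimal sketch, not the engine's actual code: the `Policy` class, the license tags, and `policy_for` are hypothetical names introduced for illustration, and the fallback for unrecognized tags is an assumption that extends the conservative stance to licenses the matrix does not list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    safe_to_embed: bool
    generator_mode: str
    attribution_required: bool = False


# One entry per row of the distribution matrix (tags are hypothetical).
MATRIX = {
    "public-domain": Policy(True, "Moment Parity"),
    "cc0": Policy(True, "Moment Parity"),
    "cc-by": Policy(True, "Moment Parity", attribution_required=True),
    "nc": Policy(False, "Logic Only"),
    "nc-sa": Policy(False, "Logic Only"),
    "restricted-dua": Policy(False, "Logic Only"),
}

# Assumption: anything not in the matrix is treated like a restricted source.
FALLBACK = Policy(False, "Logic Only")


def policy_for(license_tag: str) -> Policy:
    """Return the distribution policy for a license tag (case-insensitive)."""
    return MATRIX.get(license_tag.strip().lower(), FALLBACK)
```

Defaulting unknown tags to "Logic Only" means a missing or misspelled tag can never cause data to be embedded by accident.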

Parity Generation Principles (Conservative Stance)

  1. Conservative Default: If a license is NC (Non-Commercial) or Restricted (DUA), the engine distributes Logic Only. No synthetic approximations are embedded in the core repo for these sources.
  2. Provenance Audit: All synthetic data carries an audit trail that links back to its permissively licensed original.
  3. Parameter Integrity: Parameters for restricted generators are sourced exclusively from public literature (e.g., Merck Manuals, Energy Statistics Yearbooks), not from gated raw data.
  4. License Contamination (NC-SA): Because CC BY-NC-SA 4.0 carries a "ShareAlike" clause, derivatives must be released under the same terms, so synthetic outputs generated from these sources may inherit the Non-Commercial restriction. To protect the Apache 2.0 core, these generators are isolated from the embedded package.
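
As an illustration of principles 2 and 3, a "Logic Only" generator might accept only parameters taken from public literature and attach a provenance record to everything it produces. This is a sketch under stated assumptions: `PublishedParameter`, `generate`, the Gaussian draw, and the record layout are all hypothetical, and the numeric values are placeholders rather than real published figures.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PublishedParameter:
    name: str
    mean: float
    std_dev: float
    citation: str  # public literature reference, never a gated dataset


def generate(param: PublishedParameter, n: int, seed: int = 0) -> dict:
    """Draw n synthetic values and bundle them with an audit record."""
    rng = random.Random(seed)  # seeded so the batch can be reproduced in an audit
    values = [rng.gauss(param.mean, param.std_dev) for _ in range(n)]
    return {
        "name": param.name,
        "values": values,
        "provenance": {
            "citation": param.citation,
            "seed": seed,
            "generator_mode": "Logic Only",
        },
    }


# Placeholder parameter: the numbers and citation are illustrative only.
demo = PublishedParameter(
    name="example_metric",
    mean=10.0,
    std_dev=2.0,
    citation="<public literature reference>",
)
batch = generate(demo, n=5)
```

Because no raw records enter the function, nothing from a gated dataset can leak into the output, and the stored seed and citation let a reviewer regenerate and trace any batch.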

⚠️ Licensing Nuances

  • FRED (Federal Reserve): Licensing is upstream-dependent. Series sourced from the Federal Reserve are Public Domain; others (BIS, ECB) carry their own terms.
  • Copernicus: Licensed under the Copernicus Data Information and Access Services license (CC BY 4.0 equivalent). Mandatory attribution: "Generated using Copernicus Climate Change Service information 2026."
  • HUD User: Fair Market Rents (FMR) data is U.S. Government Public Domain. Attribution is best practice for data veracity audit trails.
  • Zillow Research: Personal/Academic use only. Prohibits bulk redistribution or commercial derivative services per Developer Terms.
  • IMDb: Non-Commercial by ToS. No "ShareAlike" clause; derivative restriction analysis differs from CC BY-NC-SA sources.
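
Two of these nuances lend themselves to small helpers. The sketch below is illustrative only: `copernicus_attribution`, `fred_series_license`, and the upstream labels are hypothetical names, and parameterizing the attribution year is an assumption based on the 2026 string quoted above.

```python
def copernicus_attribution(year: int) -> str:
    """Build the mandatory Copernicus attribution line for a given year."""
    return f"Generated using Copernicus Climate Change Service information {year}."


# Only the Federal Reserve case is stated outright in this report;
# every other upstream publisher carries its own terms.
FRED_UPSTREAM_TERMS = {"Federal Reserve": "Public Domain"}


def fred_series_license(upstream: str) -> str:
    """Resolve the license of a FRED series from its upstream publisher."""
    return FRED_UPSTREAM_TERMS.get(upstream, "Upstream terms apply (review required)")
```

For example, `fred_series_license("ECB")` returns the review-required message instead of assuming Public Domain.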

🏗️ Data Onboarding (BYOD)

To unlock the full "Gold Standard" evaluation, download the following datasets and provide the path via the --input-uri flag. Please review the respective licensing restrictions for use and redistribution.

| Industry | Target Benchmark | Download Link | License |
| --- | --- | --- | --- |
| Finance | UCI Credit Card | UCI ML Repository | CC BY 4.0 |
| Healthcare | Clinical Data Example | Restricted Access | Restricted (Credentialed) |
| Manufacturing | Industrial Benchmarking | Production Statistics | Restricted |
| Energy | Global Energy Standard | Global Statistics | Restricted |
| Retail | E-Commerce Parity | Transaction Logs | CC BY-NC-SA 4.0 |
| Agriculture | Global Agri-Stats | FAOStat | CC BY-NC-SA 3.0 |
| Media | Creative Metadata | IMDb Datasets | Non-Commercial |
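
How the engine interprets --input-uri is not specified in this report. As a rough sketch, a command-line entry point could accept the flag and resolve it to a local path as shown below. The parser description, `resolve_input`, the decision to accept only plain paths and file:// URIs, and the file name `data/benchmark.csv` are all assumptions made for illustration.

```python
import argparse
from pathlib import Path
from urllib.parse import urlparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="BYOD evaluation (illustrative)")
    parser.add_argument(
        "--input-uri",
        required=True,
        help="Path or file:// URI of a dataset you downloaded yourself",
    )
    return parser


def resolve_input(uri: str) -> Path:
    """Turn a plain path or file:// URI into a local Path; reject other schemes."""
    parsed = urlparse(uri)
    if parsed.scheme == "":
        return Path(uri)
    if parsed.scheme == "file":
        return Path(parsed.path)
    raise ValueError(f"unsupported scheme: {parsed.scheme}")


# Hypothetical invocation with a locally downloaded benchmark file.
args = build_parser().parse_args(["--input-uri", "data/benchmark.csv"])
local_path = resolve_input(args.input_uri)
```

Keeping the dataset outside the repository and passing only its location is what lets restricted and Non-Commercial benchmarks be used without being redistributed.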

🛡️ Sector Diagnostics

1. Finance

  • SEC EDGAR: Public Domain fundamentals. Synthetic templates embedded.
  • FRED: Open (Attribution required) macro data. Live API integration.

2. Housing

  • HUD User: Public Domain rental trends. Synthetic parity embedded.
  • Zillow Research: Non-Commercial macro housing metrics. Live download required.

3. Environment

  • NOAA: Public Domain station data. Climatology synthetic records embedded.
  • Copernicus: CC BY 4.0 monitoring data. Embedded with mandatory attribution.

4. Manufacturing

  • Industrial Stats: Restricted benchmarking. Logic-only parity synthesis.
  • Census ASM: Public Domain manufacturing aggregates. Synthetic templates available.

5. Healthcare

  • CMS: Hospital General Info
    • Usage: Quality outcomes and provider metadata. LIVE (Public Domain).
  • WHO GHO: Global Health Observatory
    • Usage: Global health trends. LIVE (CC BY 4.0).
  • Clinical Database:
    • Usage: Intensive care records. BYOD / SIMULATED (Restricted).

🏛️ Comprehensive Registry (Citations)

| Industry | Primary Source / Citation URL | Format | License |
| --- | --- | --- | --- |
| Finance | SEC EDGAR (Fundamentals) | XBRL/CSV | Public Domain |
| Finance | FRED (Federal Reserve Economic Data) | API/CSV | Upstream-dependent (Federal Reserve series: Public Domain) |
| Environment | NOAA Climate Data Online | API/CSV | Public Domain |
| Environment | Copernicus Climate Change Service | GRIB/NetCDF | CC BY 4.0 |
| Healthcare | CMS Hospital General Information | CSV/API | Public Domain |
| Healthcare | WHO Global Health Observatory | API/CSV | CC BY 4.0 |
| Labor | U.S. Bureau of Labor Statistics (BLS) | API/CSV | Public Domain |
| Labor | ILOSTAT (International Labour Organization) | REST API | CC BY 4.0 |
| Agriculture | FAOStat (Food and Agriculture Organization) | CSV/API | CC BY-NC-SA 3.0 |
| Agriculture | USDA NASS (Quick Stats) | CSV/API | Public Domain |
| Commerce | Marketplace Parity Repository (Olist) | CSV | CC BY-NC-SA 4.0 |
| Housing | HUD User (Fair Market Rents) | CSV | Public Domain |
| Housing | Zillow Research (Economic Data) | CSV | Non-Commercial |
| Media | IMDb Dataset Interface | TSV | Non-Commercial |
| Telecom | Ookla Open Data (Speedtest Intelligence) | Parquet | CC BY-NC-SA 4.0 |
| Telecom | FCC Fixed Broadband Deployment | CSV/API | Public Domain |
| Manufacturing | U.S. Census Bureau (ASM) | CSV/API | Public Domain |

Last Updated: 2026-03-24 (Hardened) ⚖️🛡️🏁