Distribution Manifest (Hardened)
This manifest defines the distribution policy for the 16-sector dataset fleet. To preserve Apache 2.0 compatibility, restricted and NC-SA data are never bundled; instead, the engine provides Logic-Only synthesis paths: generator scripts that produce a synthetic equivalent locally.
Bundled Datasets (Embedded)
Safe for immediate redistribution. No raw PII or restricted records. Verified Statistical Parity.
| Sector | Dataset Label | License | File Path |
|---|---|---|---|
| Finance | SEC Fundamentals | Public Domain | industries/finance/datasets/synthetic_parity.jsonl |
| Demographics | Population Trends | CC BY 4.0 | industries/demographics/datasets/synthetic_parity.jsonl |
| Housing | HUD Rental Trends | Public Domain¹ | industries/housing/datasets/housing_kb.jsonl |
| Environment | NOAA Climatology | Public Domain | industries/environment/datasets/noaa_climatology.jsonl |
| Environment | Copernicus Climate | CC BY 4.0² | industries/environment/datasets/copernicus_climate.jsonl |
¹ HUD Fair Market Rents are U.S. Government public domain. Attribution headers are included as best practice.
² Mandatory attribution is embedded: "Generated using Copernicus Climate Change Service information 2026."
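The bundling rule in this section (only public-domain and CC BY data ships with the package) can be expressed as a small check. The following is a minimal sketch, not the harness's own code: the `BundledDataset` type, the `BUNDLE_SAFE_LICENSES` allowlist, and the `bundling_violations` function are hypothetical names introduced for illustration, and the entries are copied from the table above.

```python
from dataclasses import dataclass

# Assumption: the allowlist is inferred from the licenses that appear
# in the Bundled Datasets table; it is not an official policy file.
BUNDLE_SAFE_LICENSES = {"Public Domain", "CC BY 4.0"}


@dataclass(frozen=True)
class BundledDataset:
    sector: str
    label: str
    license: str
    path: str


# Entries transcribed from the Bundled Datasets table.
BUNDLED = [
    BundledDataset("Finance", "SEC Fundamentals", "Public Domain",
                   "industries/finance/datasets/synthetic_parity.jsonl"),
    BundledDataset("Demographics", "Population Trends", "CC BY 4.0",
                   "industries/demographics/datasets/synthetic_parity.jsonl"),
    BundledDataset("Housing", "HUD Rental Trends", "Public Domain",
                   "industries/housing/datasets/housing_kb.jsonl"),
    BundledDataset("Environment", "NOAA Climatology", "Public Domain",
                   "industries/environment/datasets/noaa_climatology.jsonl"),
    BundledDataset("Environment", "Copernicus Climate", "CC BY 4.0",
                   "industries/environment/datasets/copernicus_climate.jsonl"),
]


def bundling_violations(entries):
    """Return the entries whose license is not on the allowlist."""
    return [e for e in entries if e.license not in BUNDLE_SAFE_LICENSES]


if __name__ == "__main__":
    for entry in bundling_violations(BUNDLED):
        print(f"Not bundle-safe: {entry.label} ({entry.license})")
```

A check of this kind could run in CI so that adding an NC-SA or DUA-restricted file to the bundled list fails early rather than at release time.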
Restricted Datasets (Generator-Only)
Redistribution prohibited. Run the local generator to produce your own "Statistical Twin". The generators' statistical parameters are validated for first- and second-moment parity (mean and variance) against the benchmark sources.
| Sector | Benchmark Description | Status | Restriction Type | Generator |
|---|---|---|---|---|
| Healthcare | Standard Clinical Repository | Restricted | DUA / Credentialed | clinical_generator.py |
| Energy | Global Energy Standard | Restricted | DUA / Restricted | energy_generator.py |
| Manufacturing | Industrial Benchmarking | Restricted | DUA / Restricted | industrial_generator.py |
| Retail | E-Commerce Parity | Restricted | CC BY-NC-SA 4.0³ | olist_generator.py |
| Agriculture | Global Agri-Stats | Restricted | CC BY-NC-SA 3.0³ | faostat_generator.py |
| Media | Creative Metadata (IMDb) | Restricted | Non-Commercial⁴ | imdb_generator.py |
| Telecom | Network Performance | Restricted | CC BY-NC-SA 4.0³ | ookla_generator.py |
³ NC-SA Distinction: These datasets are restricted by copyright under NonCommercial-ShareAlike terms. Synthetic outputs may inherit Non-Commercial restrictions.
⁴ IMDb Policy: Non-Commercial by ToS. There is no "ShareAlike" clause, so the derivative restriction analysis differs from that for CC BY-NC-SA sources.
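The first- and second-moment parity mentioned for the restricted datasets can be illustrated for a single numeric column. This is a minimal sketch under stated assumptions: the function names and the 5% tolerance are hypothetical, the manifest does not say how the generators validate parity, and a multivariate check would also need to compare covariances.

```python
import math
import statistics


def moments(values):
    """Return (mean, sample variance) for a sequence of numbers."""
    return statistics.fmean(values), statistics.variance(values)


def has_moment_parity(benchmark, synthetic, rel_tol=0.05):
    """Compare the first and second moments of two samples.

    The mean difference is measured in units of the benchmark standard
    deviation; the variance difference is measured relative to the
    benchmark variance. rel_tol is an illustrative threshold only.
    """
    b_mean, b_var = moments(benchmark)
    s_mean, s_var = moments(synthetic)
    mean_ok = abs(s_mean - b_mean) <= rel_tol * math.sqrt(b_var)
    var_ok = abs(s_var - b_var) <= rel_tol * b_var
    return mean_ok and var_ok


if __name__ == "__main__":
    benchmark = [10.0, 12.0, 14.0, 16.0, 18.0]
    twin = [18.0, 10.0, 14.0, 12.0, 16.0]    # same values, different order
    shifted = [v + 5.0 for v in benchmark]   # same variance, different mean
    print(has_moment_parity(benchmark, twin))     # True
    print(has_moment_parity(benchmark, shifted))  # False
```

Matching the first two moments does not imply that the distributions match in shape, so a check like this is a necessary condition for a "Statistical Twin" rather than a sufficient one.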
Live Datasets (API / URL)
| Sector | Data Provider | Method | License |
|---|---|---|---|
| Finance | FRED | REST API | Upstream-dependent |
| Agri-Tech | USDA NASS | Quick Stats API | Public Domain |
| Housing | Zillow Research | Download | Non-Commercial⁵ |
| Telecom | FCC | Geospatial | Public Domain |
| Labor | ILOSTAT | REST API | CC BY / Restricted |
| Labor | Bureau of Labor Statistics (BLS) | Series API | Public Domain |
⁵ Zillow Research data is for personal/academic use only. Commercial redistribution of bulk data is prohibited.
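For the REST-based providers in the table, a live pull amounts to building a request and parsing the JSON response. The sketch below uses FRED as the example and assumes its public `series/observations` endpoint with the `series_id`, `api_key`, and `file_type` parameters; the function names are hypothetical, a FRED API key is required, and the manifest does not show how the harness itself performs this call.

```python
import json
import urllib.parse
import urllib.request

# Assumption: FRED's public observations endpoint.
FRED_OBSERVATIONS_URL = "https://api.stlouisfed.org/fred/series/observations"


def build_fred_url(series_id, api_key):
    """Build the request URL for one FRED series, asking for JSON output."""
    query = urllib.parse.urlencode(
        {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    )
    return f"{FRED_OBSERVATIONS_URL}?{query}"


def fetch_fred_observations(series_id, api_key, timeout=30):
    """Fetch observations for a series; performs a network request."""
    url = build_fred_url(series_id, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    # Return an empty list if the response carries no observations.
    return payload.get("observations", [])
```

Because the FRED license is listed as upstream-dependent, any caller would still need to check the terms attached to each individual series before storing or redistributing the result.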
Comprehensive Registry (Citations)
| Industry | Primary Source / Citation URL | Format | License |
|---|---|---|---|
| Finance | SEC EDGAR (Fundamentals) | XBRL/CSV | Public Domain |
| Finance | FRED (Federal Reserve Economic Data) | API/CSV | Upstream-dependent |
| Environment | NOAA Climate Data Online | API/CSV | Public Domain |
| Environment | Copernicus Climate Change Service | GRIB/NetCDF | CC BY 4.0² |
| Healthcare | CMS Hospital General Information | CSV/API | Public Domain |
| Healthcare | WHO Global Health Observatory | API/CSV | CC BY 4.0 |
| Labor | U.S. Bureau of Labor Statistics (BLS) | API/CSV | Public Domain |
| Labor | ILOSTAT (International Labour Organization) | REST API | CC BY 4.0 |
| Agriculture | FAOStat (Food and Agriculture Organization) | CSV/API | CC BY-NC-SA 3.0 |
| Agriculture | USDA NASS (Quick Stats) | CSV/API | Public Domain |
| Commerce | Marketplace Parity Repository (Olist) | CSV | CC BY-NC-SA 4.0 |
| Housing | HUD User (Fair Market Rents) | CSV | Public Domain¹ |
| Housing | Zillow Research (Economic Data) | CSV | Non-Commercial |
| Media | IMDb Dataset Interface | TSV | Non-Commercial |
| Telecom | Ookla Open Data (Speedtest Intelligence) | Parquet | CC BY-NC-SA 4.0 |
| Telecom | FCC Fixed Broadband Deployment | CSV/API | Public Domain |
| Manufacturing | U.S. Census Bureau (ASM) | CSV/API | Public Domain |
The full registry and veracity diagnostics are available in the Data Veracity & Provenance Report.
Local Synthesis Guide
To generate a compliant parity dataset, run the dedicated script in the relevant sector's generators/ directory (industries/<sector>/generators/). These scripts enforce the required compliance wrappers and inject legally defensible provenance metadata.
- Clinical: `python industries/healthcare/generators/clinical_generator.py`
- Energy: `python industries/energy/generators/energy_generator.py`
- Industrial: `python industries/manufacturing/generators/industrial_generator.py`
- Olist: `python industries/retail/generators/olist_generator.py`
- FAOStat: `python industries/agriculture/generators/faostat_generator.py`
- IMDb: `python industries/media_entertainment/generators/imdb_generator.py`
- Ookla: `python industries/telecom/generators/ookla_generator.py`
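To run several generators in one pass, the commands listed above can be wrapped in a small driver. This is a sketch only: the `GENERATORS` mapping, `build_command`, and `run_generators` are hypothetical names, and it assumes each script is invoked from the repository root with no arguments, as the listed commands suggest.

```python
import subprocess
import sys

# Script paths copied from the Local Synthesis Guide.
GENERATORS = {
    "clinical": "industries/healthcare/generators/clinical_generator.py",
    "energy": "industries/energy/generators/energy_generator.py",
    "industrial": "industries/manufacturing/generators/industrial_generator.py",
    "olist": "industries/retail/generators/olist_generator.py",
    "faostat": "industries/agriculture/generators/faostat_generator.py",
    "imdb": "industries/media_entertainment/generators/imdb_generator.py",
    "ookla": "industries/telecom/generators/ookla_generator.py",
}


def build_command(name):
    """Return the argv list for one generator, using the current interpreter."""
    return [sys.executable, GENERATORS[name]]


def run_generators(names, repo_root=".", dry_run=True):
    """Run the named generators in order and map each name to its exit code.

    With dry_run=True the commands are printed instead of executed and the
    recorded exit code is None.
    """
    results = {}
    for name in names:
        command = build_command(name)
        if dry_run:
            print(" ".join(command))
            results[name] = None
            continue
        completed = subprocess.run(command, cwd=repo_root, check=False)
        results[name] = completed.returncode
    return results
```

Calling `run_generators(["olist", "ookla"], dry_run=False)` from the repository root would execute those two scripts; the dry-run default is a deliberate choice so that nothing touches restricted sources until the Terms of Use conditions are met.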
Terms of Use - Data & APIs
By using this harness, you agree to adhere to the terms of the respective data providers.
- RESTRICTED Sources: You must have a valid DUA or credentialing agreement with the relevant data provider to use raw restricted data locally.
- NC-SA and Non-Commercial Sources: You agree not to use synthetic outputs for commercial competition or redistribution where prohibited (e.g., Olist, Zillow).
- Attribution: You will maintain all embedded source headers in output artifacts.
Last Updated: 2026-03-24 (Hardened)