Forge, our latest product, is an automated JSON parsing engine for cloud data warehouses that sits between your data ingestion and analytics layers. Use Forge to parse JSON data into a collection of relational tables quickly and efficiently. Forge handles schema evolution gracefully, so a new field never breaks your pipeline again.
The dataset covers 8 resource types: Patient, Encounter, Observation, Condition, Procedure, Immunization, MedicationRequest, and DiagnosticReport.
The raw Synthea output has 459 nested fields per resource, urn:uuid: references, and no column descriptions. We flatten it to clean views with ~15 columns each, pre-extracted IDs, and descriptions sourced from the FHIR R4 OpenAPI spec. Example:
-- Raw FHIR:
SELECT id, code.text
FROM diagnostic_report
WHERE subject.reference = CONCAT("urn:uuid:", patient_id)

-- Forge view:
SELECT report_name, patient_id
FROM v_diagnostic_report

Data scanned per query drops ~90x (450 MB → 5 MB).
Free to subscribe: https://console.cloud.google.com/bigquery/analytics-hub/exch...
Updated weekly. Useful if you're building anything against FHIR data and want a realistic test dataset without standing up your own Synthea pipeline.
Happy to answer questions about the normalization approach or FHIR data modeling tradeoffs.
I've been a data engineer for years and one thing drove me crazy: every time we integrated a new API, someone had to manually write SQL to flatten the JSON into tables. LATERAL FLATTEN for Snowflake, UNNEST for BigQuery, EXPLODE for Databricks — same logic, different syntax, written from scratch every time.
Forge takes an OpenAPI spec (or any JSON schema) and automatically:
1. Discovers all fields across all nesting levels
2. Generates dbt models that flatten nested JSON into a star schema
3. Compiles for BigQuery, Snowflake, Databricks, AND Redshift from the same metadata
4. Runs incrementally: new fields get added via schema evolution, no rebuilds
The key insight is that JSON-to-table is a compilation problem, not a query problem. If you know the schema, you can generate all the SQL mechanically. Forge is essentially a compiler: schema in, warehouse-specific SQL out.
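To make the "compiler" framing concrete, here is a minimal sketch of the idea: one piece of metadata (table, array column) rendered into per-dialect SQL from templates. The function name, metadata shape, and templates are my own illustration, not Forge's actual API; Redshift is omitted since, as noted below, it needs JSON_PARSE plus manual extraction rather than a single template.

```python
# Toy "schema compiler": same metadata in, dialect-specific flatten SQL out.
# Illustrative only; Forge's real metadata model is richer than (table, column).
def compile_flatten(table: str, array_col: str, dialect: str) -> str:
    """Render SQL that explodes a JSON array column for one warehouse dialect."""
    templates = {
        "bigquery": (
            "SELECT t.id, elem FROM {table} t, "
            "UNNEST(JSON_EXTRACT_ARRAY(t.{col})) AS elem"
        ),
        "snowflake": (
            "SELECT t.id, f.value AS elem FROM {table} t, "
            "LATERAL FLATTEN(input => PARSE_JSON(t.{col})) f"
        ),
        "databricks": (
            "SELECT t.id, elem FROM {table} t "
            "LATERAL VIEW EXPLODE(from_json(t.{col}, 'array<string>')) AS elem"
        ),
    }
    return templates[dialect].format(table=table, col=array_col)
```

The point is that adding a warehouse means adding a template, not rewriting every model by hand.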
How it works under the hood:
- An introspection phase scans actual data rows and collects the union of ALL keys (not just one sample record), so sparse/optional fields are always discovered
- Each array-of-objects becomes its own child table with a hierarchical index (idx) linking back to the parent; no manual join keys needed
- Warehouse adapters translate universal metadata into dialect-specific SQL:
  - BigQuery: UNNEST(JSON_EXTRACT_ARRAY(...))
  - Snowflake: LATERAL FLATTEN(input => PARSE_JSON(...))
  - Databricks: LATERAL VIEW EXPLODE(from_json(...))
  - Redshift: JSON_PARSE + manual extraction
- dbt handles incremental loads with on_schema_change='append_new_columns'
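The first two bullets can be sketched in a few lines. This is an assumed, simplified version of the introspection idea (union of key paths across every row) and of the child-table split with a hierarchical index, not Forge's actual code:

```python
# Sketch of the introspection phase: walk every row and union all key paths,
# so a field that appears in only one record is still discovered.
def discover_paths(rows):
    paths = set()

    def walk(obj, prefix=""):
        if isinstance(obj, dict):
            for k, v in obj.items():
                walk(v, f"{prefix}{k}.")
        elif isinstance(obj, list):
            for item in obj:
                walk(item, prefix + "[].")
        else:
            paths.add(prefix.rstrip("."))

    for row in rows:
        walk(row)
    return paths


# Sketch of the array-of-objects split: each element becomes a child row
# carrying (parent_idx, idx), so the parent join needs no manual keys.
def explode_children(rows, array_field):
    children = []
    for parent_idx, row in enumerate(rows):
        for idx, child in enumerate(row.get(array_field, [])):
            children.append({"parent_idx": parent_idx, "idx": idx, **child})
    return children
```

The real system records types and nesting depth alongside each path, but the union-over-all-rows scan is the part that makes sparse fields safe.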
The full pipeline: Bellows (synthetic data generation from OpenAPI specs) → BigQuery staging → Forge (model generation + dbt run) → queryable tables + dbt docs. There's also Merlin (AI-powered field enrichment via Gemini) that auto-generates realistic data generators for each field.
I built this because I watched teams spend weeks writing one-off FLATTEN queries that broke the moment an API added a field. Every Snowflake blog post shows you how to parse 3 fields from a known schema — none of them handle schema evolution, arbitrary nesting depth, or cross-warehouse portability.
Try it: https://forge.foxtrotcommunications.net
Happy to answer questions about the architecture, the cross-warehouse compilation approach, or the AI enrichment layer.
While hand-crafted scripts work once and are fine for a quick look, truly understanding the structure requires a systematic deconstruction and rebuild of the entire JSON object. Some companies have JSON data coming from MongoDB or Firestore that has undergone hundreds or even thousands of changes, from shifting data types to structural rewrites such as turning a JSON object into an array. A simple parsing script won't cut it: you will either sacrifice some data just to get something out, or spend weeks writing dozens of scripts and manipulations to process it correctly. Now repeat that for every API and every schema your company uses.
Forge doesn't stop at unnesting. With the included AI schema classifier, Excalibur, we automatically identify which API your data is coming from, based on tens of thousands of examples. From Stripe to HubSpot to Segment: we detect it, classify it, and automatically apply field mappings. Forge also uses AI and ML techniques to document your data and identify PII fields. No more painstaking scrubbing and parsing of your data; just quick, analytics-ready tables.
How does Forge handle schema changes? Automatic detection and adaptation. When new fields appear, Forge regenerates models while maintaining backward compatibility. Zero downtime.
Does my data leave my warehouse? SaaS: Forge connects via a service account and processes data in place. Only schema fingerprints (not actual data) are sent for AI classification. Enterprise: everything runs in YOUR VPC. Zero data egress.
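To show why a schema fingerprint reveals shape but not values, here is one plausible construction (an assumption on my part; the actual fingerprint format is not documented here): hash the sorted key names and JSON types, never the values.

```python
# Illustrative fingerprint: two records with the same keys and types hash
# identically even when every value differs, so no data content is exposed.
import hashlib
import json

def schema_fingerprint(record: dict) -> str:
    shape = sorted((k, type(v).__name__) for k, v in record.items())
    return hashlib.sha256(json.dumps(shape).encode()).hexdigest()
```

A classifier can match this digest against known API shapes (Stripe, HubSpot, etc.) without ever seeing a row.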
What warehouses do you support? BigQuery, Snowflake, Databricks, and Redshift. One parse generates native models for all four simultaneously.
How accurate is PII detection? Pridwen uses a 3-layer hybrid system (rules + ML + crowd) with 95%+ accuracy. Context-aware and supports 20+ languages.
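For a sense of what the rules layer of such a hybrid detector looks like, here is a toy sketch; the ML and crowd layers are out of scope, and nothing here is Pridwen's actual implementation:

```python
# Minimal rules layer: regex patterns per PII category. Real systems add
# context (column names, locales) and an ML layer on top of rules like these.
import re

PII_RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify_pii(value) -> list:
    """Return the PII categories whose pattern matches the value."""
    return [name for name, pattern in PII_RULES.items() if pattern.search(str(value))]
```

Rules alone are brittle (hence the 3-layer design): they catch well-formatted identifiers but miss free-text names and addresses.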
Do you replace Fivetran/Airbyte? No, we're complementary. Use Fivetran/Airbyte to load raw JSON → Use Forge to transform it into analytics tables.
How much engineering time does this save? Conservative estimate: 2-4 weeks initial build + 10 hours/month maintenance = $50,000-100,000/year for mid-size teams.