Forge, our latest product, is an automated JSON parsing engine for cloud data warehouses that sits between your data ingestion and analytics layers. Use Forge to parse JSON data into a collection of relational tables quickly and efficiently. Forge handles schema evolution gracefully, so a new field never breaks your pipeline again.
The dataset covers 8 resource types: Patient, Encounter, Observation, Condition, Procedure, Immunization, MedicationRequest, and DiagnosticReport.
The raw Synthea output has 459 nested fields per resource, urn:uuid: references, and no column descriptions. We flatten it to clean views with ~15 columns each, pre-extracted IDs, and descriptions sourced from the FHIR R4 OpenAPI spec. Example:
-- Raw FHIR:
SELECT id, code.text
FROM diagnostic_report
WHERE subject.reference = CONCAT("urn:uuid:", patient_id)

-- Forge view:
SELECT report_name, patient_id
FROM v_diagnostic_report

Data scanned per query drops ~90x (450 MB → 5 MB).
Free to subscribe: https://console.cloud.google.com/bigquery/analytics-hub/exch...
Updated weekly. Useful if you're building anything against FHIR data and want a realistic test dataset without standing up your own Synthea pipeline.
Happy to answer questions about the normalization approach or FHIR data modeling tradeoffs.
I've been a data engineer for years and one thing drove me crazy: every time we integrated a new API, someone had to manually write SQL to flatten the JSON into tables. LATERAL FLATTEN for Snowflake, UNNEST for BigQuery, EXPLODE for Databricks — same logic, different syntax, written from scratch every time.
Forge takes an OpenAPI spec (or any JSON schema) and automatically:
1. Discovers all fields across all nesting levels
2. Generates dbt models that flatten nested JSON into a star schema
3. Compiles for BigQuery, Snowflake, Databricks, AND Redshift from the same metadata
4. Runs incrementally: new fields get added via schema evolution, no rebuilds
The key insight is that JSON-to-table is a compilation problem, not a query problem. If you know the schema, you can generate all the SQL mechanically. Forge is essentially a compiler: schema in, warehouse-specific SQL out.
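To make the "compiler" framing concrete, here is a minimal sketch of the idea: one piece of metadata (table, array column) rendered into per-dialect SQL from templates. The function name, metadata shape, and templates are my own illustration, not Forge's actual API; Redshift is omitted since, as noted below, it needs JSON_PARSE plus manual extraction rather than a single template.

```python
# Toy "schema compiler": same metadata in, dialect-specific flatten SQL out.
# Illustrative only; Forge's real metadata model is richer than (table, column).
def compile_flatten(table: str, array_col: str, dialect: str) -> str:
    """Render SQL that explodes a JSON array column for one warehouse dialect."""
    templates = {
        "bigquery": (
            "SELECT t.id, elem FROM {table} t, "
            "UNNEST(JSON_EXTRACT_ARRAY(t.{col})) AS elem"
        ),
        "snowflake": (
            "SELECT t.id, f.value AS elem FROM {table} t, "
            "LATERAL FLATTEN(input => PARSE_JSON(t.{col})) f"
        ),
        "databricks": (
            "SELECT t.id, elem FROM {table} t "
            "LATERAL VIEW EXPLODE(from_json(t.{col}, 'array<string>')) AS elem"
        ),
    }
    return templates[dialect].format(table=table, col=array_col)
```

The point is that adding a warehouse means adding a template, not rewriting every model by hand.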
How it works under the hood:
- An introspection phase scans actual data rows and collects the union of ALL keys (not just one sample record), so sparse/optional fields are always discovered
- Each array-of-objects becomes its own child table with a hierarchical index (idx) linking back to the parent; no manual join keys needed
- Warehouse adapters translate universal metadata into dialect-specific SQL:
  - BigQuery: UNNEST(JSON_EXTRACT_ARRAY(...))
  - Snowflake: LATERAL FLATTEN(input => PARSE_JSON(...))
  - Databricks: LATERAL VIEW EXPLODE(from_json(...))
  - Redshift: JSON_PARSE + manual extraction
- dbt handles incremental loads with on_schema_change='append_new_columns'
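The first two bullets can be sketched in a few lines. This is an assumed, simplified version of the introspection idea (union of key paths across every row) and of the child-table split with a hierarchical index, not Forge's actual code:

```python
# Sketch of the introspection phase: walk every row and union all key paths,
# so a field that appears in only one record is still discovered.
def discover_paths(rows):
    paths = set()

    def walk(obj, prefix=""):
        if isinstance(obj, dict):
            for k, v in obj.items():
                walk(v, f"{prefix}{k}.")
        elif isinstance(obj, list):
            for item in obj:
                walk(item, prefix + "[].")
        else:
            paths.add(prefix.rstrip("."))

    for row in rows:
        walk(row)
    return paths


# Sketch of the array-of-objects split: each element becomes a child row
# carrying (parent_idx, idx), so the parent join needs no manual keys.
def explode_children(rows, array_field):
    children = []
    for parent_idx, row in enumerate(rows):
        for idx, child in enumerate(row.get(array_field, [])):
            children.append({"parent_idx": parent_idx, "idx": idx, **child})
    return children
```

The real system records types and nesting depth alongside each path, but the union-over-all-rows scan is the part that makes sparse fields safe.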
The full pipeline: Bellows (synthetic data generation from OpenAPI specs) → BigQuery staging → Forge (model generation + dbt run) → queryable tables + dbt docs. There's also Merlin (AI-powered field enrichment via Gemini) that auto-generates realistic data generators for each field.
I built this because I watched teams spend weeks writing one-off FLATTEN queries that broke the moment an API added a field. Every Snowflake blog post shows you how to parse 3 fields from a known schema — none of them handle schema evolution, arbitrary nesting depth, or cross-warehouse portability.
Try it: https://forge.foxtrotcommunications.net
Happy to answer questions about the architecture, the cross-warehouse compilation approach, or the AI enrichment layer.
While hand-crafted scripts work once and are fine for a quick look, truly understanding the structure requires a systematic deconstruction and rebuild of the entire JSON object. Some companies have JSON data coming from MongoDB or Firestore that has undergone hundreds or even thousands of changes, from shifting data types to structural rewrites such as turning a JSON object into an array. A simple parsing script won't cut it: you will either sacrifice some data just to get something out, or spend weeks writing dozens of scripts and manipulations to process it correctly. Now repeat that for every API and every schema your company uses.
Forge doesn't stop at unnesting. With the included AI schema classifier, Excalibur, we automatically identify which API your data is coming from, based on tens of thousands of examples. From Stripe to HubSpot to Segment: we detect it, classify it, and automatically apply field mappings. Forge also uses AI and ML techniques to document your data and identify PII fields. No more painstaking scrubbing and parsing of your data; just quick, analytics-ready tables.
How does Forge handle schema changes? Automatic detection and adaptation. When new fields appear, Forge regenerates models while maintaining backward compatibility. Zero downtime.
Does my data leave my warehouse? SaaS: Forge connects via a service account and processes data in place. Only schema fingerprints (not actual data) are sent for AI classification. Enterprise: everything runs in YOUR VPC. Zero data egress.
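To show why a schema fingerprint reveals shape but not values, here is one plausible construction (an assumption on my part; the actual fingerprint format is not documented here): hash the sorted key names and JSON types, never the values.

```python
# Illustrative fingerprint: two records with the same keys and types hash
# identically even when every value differs, so no data content is exposed.
import hashlib
import json

def schema_fingerprint(record: dict) -> str:
    shape = sorted((k, type(v).__name__) for k, v in record.items())
    return hashlib.sha256(json.dumps(shape).encode()).hexdigest()
```

A classifier can match this digest against known API shapes (Stripe, HubSpot, etc.) without ever seeing a row.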
What warehouses do you support? BigQuery, Snowflake, Databricks, and Redshift. One parse generates native models for all four simultaneously.
How accurate is PII detection? Pridwen uses a 3-layer hybrid system (rules + ML + crowd) with 95%+ accuracy. Context-aware and supports 20+ languages.
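For a sense of what the rules layer of such a hybrid detector looks like, here is a toy sketch; the ML and crowd layers are out of scope, and nothing here is Pridwen's actual implementation:

```python
# Minimal rules layer: regex patterns per PII category. Real systems add
# context (column names, locales) and an ML layer on top of rules like these.
import re

PII_RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify_pii(value) -> list:
    """Return the PII categories whose pattern matches the value."""
    return [name for name, pattern in PII_RULES.items() if pattern.search(str(value))]
```

Rules alone are brittle (hence the 3-layer design): they catch well-formatted identifiers but miss free-text names and addresses.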
Do you replace Fivetran/Airbyte? No, we're complementary. Use Fivetran/Airbyte to load raw JSON → Use Forge to transform it into analytics tables.
How much engineering time does this save? Conservative estimate: 2-4 weeks initial build + 10 hours/month maintenance = $50,000-100,000/year for mid-size teams.