Show HN: Nasty, a cross-warehouse, type-checked, unit-testable analytics library (opens in new tab)

(getnasty.dev)

47 pointstdfirth2y ago6 comments

Hey HN - our team wants to open source a project called NASTY (NASTY Abstract Syntax Tree thingY) that we built for ourselves. NASTY was built to maintain testable/composable data pipelines. Our team was ripping our hair out trying to maintain dbt/SQL scripts across different data warehouses (Redshift, BigQuery, Postgres, Snowflake) on top of ever shifting data foundations maintained by our customer's internal data teams. NASTY is the result of our learnings from field experience.

We wanted to write abstractions so that we could reuse code. We wanted to bundle those abstractions into libraries. We wanted to statically analyze our models so that we caught more errors before production. We wanted a fast unit test suite that you could iterate on locally without connecting to a data warehouse.

In short, we wanted to use all the same practices we used for building our other software. Tools like dbt made a great start at importing these kinds of practices into analytics, but there are many great aspects of the software engineering workflow that still aren't easy to replicate in data.

We've found it to be a super productive way to work, and we thought others might want to use it too. It's very early days so we've put up a page for to explain it, and we've made some executable examples for people to have a go. Let us know what you think!

FAQ:

Is NASTY an ORM? A SQL builder?

Neither! NASTY is kinda it's own thing. NASTY is more like a minimal relational algebra programming language shipped as a Typescript library. It borrows a bunch of learnings from other programming languages and applies them to OLAP programming / data engineering. It's an important distinction, because it totally changes the way you can think about building transformations.

Why TypeScript? Why not Python?

The 100% honest answer is we had a ton of typescript experience and we had a node/react app with pulumi's TS apis. We like typescript and by being a library instead of a full programming language, we can leverage things like vitest, eslint, node, TS's language server, TSdoc, etc.

Show HN: Nasty, a cross-warehouse, type-checked, unit-testable analytics library

(getnasty.dev)

47 pointstdfirth2y ago6 comments

FAQ:

Is NASTY an ORM? A SQL builder?

Why TypeScript? Why not Python?

6 comments

6 comments · 4 top-level

gigatexal2y ago· 1 in thread

What a bold name choice. Not sure how well it will go over when I tell the boss I’m doing something Nasty to the codebase. Hahaha

cotera-grant2y ago

I was involved with the naming. We were debating between a couple of choices that had `ast` in them

We made a word list with

   cat /usr/share/dict/words | grep 'ast'

looking back through our slacks, the other top contenders were

1. Astronaut 2. Wrastle 3. Iconoclast 4. asteroid 5. disaster 6. plastic

We settled on NASTY because it was kinda funny. We were debating if it was too gross, but we think it might have positive signalling to other engineers because the name "NASTY" probably didn't come from a Fortune 50 enterprise tech stack.

After a week or so we stopped noticing what a weird name it was, and it only comes up when we tell someone new about it

zurawiki2y ago· 1 in thread

I'm really happy to see we're exploring ways to make SQL more maintainable. TypeScript did a lot for the JS ecosystem, giving us nice autocomplete and making many kinds of bugs impossible to type-check, adding this kind of type-system to SQL makes a bunch of sense.

I selfishly hope this can smooth over all the missing SQL functionality in redshift

Very cool work team!

cotera-grant2y ago

> I selfishly hope this can smooth over all the missing SQL functionality in redshift

Hi! I'm Grant, I work at Cotera and wrote most of the warehouse compatibility stuff for NASTY

Redshift is the bane of my existence. It was definitely the hardest warehouse to write a NASTY compatible SQL gen for.

A couple of annoyances that immediately come to mind.

1. Redshift Query Planner does wild stuff

At Cotera we'll typically develop analytics libraries on one warehouse working closely with a customer and use the same library for other customers afterwards. A library will go to prod on one warehouse and then get start running on others as new customers with different warehouses want the functionality.

Moving a library between Snowflake, BigQuery an Postgres is almost never a problem performance wise. In, Redshift the semantics will be correct but performance can unexpectedly fall off a cliff for innocuous stuff. We typically write a bunch of unit tests so it's pretty easy to refactor, but I've been shocked at the things that Redshift can't optimize that everyone else had no problem with

2. Redshift does silly stuff with types of literals.

  with cte as (select *, 'foo' as "bar" from "cotera_data".foo) select coalesce("bar", 'baz') from cte;

Fails with the error `[XX000] ERROR: failed to find conversion function from "unknown" to text` Because 'foo' is passed as `any` type...

this fixes is but the error is bizarre and shows up way far away from the problem

  with cte as (select *, 'foo'::text as "bar" from "cotera_data".foo) select coalesce("bar", 'baz') from cte;

(NASTY now fixes this for you when it detects it will happen)

3. The `super` type breaks referential transparency

Here's just one head scratching example, but there are many super type limitations

  -- Allowed
  with bar as (select (json_parse('{"a": 1}')) as foo) select foo.a from bar;

  -- Not allowed
  select (json_parse('{"a": 1}')).a as foo

  -- [0A000] ERROR: applying array subscript on complex expression of SUPER type is currently not supported

4. Leader Only vs Compute Node Functions. Basic things like `generate_series` blow up in surprising ways

From the NASTY source code for Redshift

  // Valid redshift
  // ```
  // select generate_series(0, 10);
  // ```
  //
  // Not valid redshift
  // ```
  // -- Inserts run on compute nodes
  // insert into foo (a) (
  //     -- Leader only function
  //     select generate_series(0, 10) as a
  // )
  // ```
  //
  // This is because `generate_series` is a leader only function, so it can’t be run on worker nodes
  // https://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions_leader_node_only.html
  // https://docs.aws.amazon.com/redshift/latest/dg/c_sql-functions-leader-node.html
  // https://stackoverflow.com/questions/62716606/redshift-loading-data-issue-specified-types-or-functions-one-per-info-message
  // https://stackoverflow.com/questions/17282276/using-sql-function-generate-series-in-redshift#comment96402527_22782384
  //
  // Recurive CTEs are NOT supported in subqueries
  // ```
  // -- Not valid
  // select \* from (
  //    with recursive t(n) as (
  //        select 1::integer union all select n + 1 from t where n < 100
  //    ) select n from t
  // );
  // ```
  // To get around this, we can use the approach outlined by how dbt does ansi sql generate_series

  // https://github.com/dbt-labs/dbt-utils/blob/main/macros/sql/generate_series.sql
  const numbers = (upperBound: number) => {
    if (upperBound > 2 ** 11) {
      throw new Error(
        `We only support generating series in Reshift where the upperBound is less than ${
          2 ** 11
        }`
      );
    }

    return `
  (
    with p as (
      select 0::integer as generated_number union all select 1::integer
    ),
      unioned as (
      select
        (   p0.generated_number * power(2, 0) 
  // ... Omitted for brevity
         +  p11.generated_number * power(2, 11) 
        ) as generated_number
      from
        p as p0
        cross join p as p1
  // ... Omitted for brevity

        cross join p as p11
    )
    select generated_number::integer from unioned where generated_number <= ${upperBound} order by generated_number
  )
  `;
  };

cotera-grant2y ago

Hi, I'm Grant, I'm one of the primary NASTY maintainers along with Tom and TJ.

One thing that's on the getnasty.dev but worth calling out explicitly with the link are the ruby koans style TDD learn by examples we have for NASTY. They should just require NodeJS in whatever form is easiest.

https://github.com/coterahq/learn-nasty-by-example

Another small but interesting thing that I think people who work in SQL would find cool are `invariants`, which do runtime data integrity checks

https://getnasty.dev/docs/invariants

statueofibberty2y ago

Super excited for this! We made this because of how hard it was to track how things were breaking downstream and hopefully it makes some data person's life easier/more manageable

j / k navigate · click thread line to collapse

6 comments

6 comments · 4 top-level

gigatexal2y ago· 1 in thread

What a bold name choice. Not sure how well it will go over when I tell the boss I’m doing something Nasty to the codebase. Hahaha

cotera-grant2y ago

I was involved with the naming. We were debating between a couple of choices that had `ast` in them

We made a word list with

   cat /usr/share/dict/words | grep 'ast'

looking back through our slacks, the other top contenders were

1. Astronaut 2. Wrastle 3. Iconoclast 4. asteroid 5. disaster 6. plastic

After a week or so we stopped noticing what a weird name it was, and it only comes up when we tell someone new about it

zurawiki2y ago· 1 in thread

I selfishly hope this can smooth over all the missing SQL functionality in redshift

Very cool work team!

cotera-grant2y ago

> I selfishly hope this can smooth over all the missing SQL functionality in redshift

Hi! I'm Grant, I work at Cotera and wrote most of the warehouse compatibility stuff for NASTY

Redshift is the bane of my existence. It was definitely the hardest warehouse to write a NASTY compatible SQL gen for.

A couple of annoyances that immediately come to mind.

1. Redshift Query Planner does wild stuff

2. Redshift does silly stuff with types of literals.

  with cte as (select *, 'foo' as "bar" from "cotera_data".foo) select coalesce("bar", 'baz') from cte;

Fails with the error `[XX000] ERROR: failed to find conversion function from "unknown" to text` Because 'foo' is passed as `any` type...

this fixes is but the error is bizarre and shows up way far away from the problem

  with cte as (select *, 'foo'::text as "bar" from "cotera_data".foo) select coalesce("bar", 'baz') from cte;

(NASTY now fixes this for you when it detects it will happen)

3. The `super` type breaks referential transparency

Here's just one head scratching example, but there are many super type limitations

  -- Allowed
  with bar as (select (json_parse('{"a": 1}')) as foo) select foo.a from bar;

  -- Not allowed
  select (json_parse('{"a": 1}')).a as foo

  -- [0A000] ERROR: applying array subscript on complex expression of SUPER type is currently not supported

4. Leader Only vs Compute Node Functions. Basic things like `generate_series` blow up in surprising ways

From the NASTY source code for Redshift

  // Valid redshift
  // ```
  // select generate_series(0, 10);
  // ```
  //
  // Not valid redshift
  // ```
  // -- Inserts run on compute nodes
  // insert into foo (a) (
  //     -- Leader only function
  //     select generate_series(0, 10) as a
  // )
  // ```
  //
  // This is because `generate_series` is a leader only function, so it can’t be run on worker nodes
  // https://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions_leader_node_only.html
  // https://docs.aws.amazon.com/redshift/latest/dg/c_sql-functions-leader-node.html
  // https://stackoverflow.com/questions/62716606/redshift-loading-data-issue-specified-types-or-functions-one-per-info-message
  // https://stackoverflow.com/questions/17282276/using-sql-function-generate-series-in-redshift#comment96402527_22782384
  //
  // Recurive CTEs are NOT supported in subqueries
  // ```
  // -- Not valid
  // select \* from (
  //    with recursive t(n) as (
  //        select 1::integer union all select n + 1 from t where n < 100
  //    ) select n from t
  // );
  // ```
  // To get around this, we can use the approach outlined by how dbt does ansi sql generate_series

  // https://github.com/dbt-labs/dbt-utils/blob/main/macros/sql/generate_series.sql
  const numbers = (upperBound: number) => {
    if (upperBound > 2 ** 11) {
      throw new Error(
        `We only support generating series in Reshift where the upperBound is less than ${
          2 ** 11
        }`
      );
    }

    return `
  (
    with p as (
      select 0::integer as generated_number union all select 1::integer
    ),
      unioned as (
      select
        (   p0.generated_number * power(2, 0) 
  // ... Omitted for brevity
         +  p11.generated_number * power(2, 11) 
        ) as generated_number
      from
        p as p0
        cross join p as p1
  // ... Omitted for brevity

        cross join p as p11
    )
    select generated_number::integer from unioned where generated_number <= ${upperBound} order by generated_number
  )
  `;
  };

cotera-grant2y ago

Hi, I'm Grant, I'm one of the primary NASTY maintainers along with Tom and TJ.

https://github.com/coterahq/learn-nasty-by-example

Another small but interesting thing that I think people who work in SQL would find cool are `invariants`, which do runtime data integrity checks

https://getnasty.dev/docs/invariants

statueofibberty2y ago

Super excited for this! We made this because of how hard it was to track how things were breaking downstream and hopefully it makes some data person's life easier/more manageable

j / k navigate · click thread line to collapse