Ask HN: How do you test SQL?

663 pointspcarolan3y ago315 comments

I've been looking for resources for our data team to apply best practices for testing SQL pipelines (we use DBT) but have not found anything. How do you test SQL pipelines? What patterns, tools and best practices would you recommend? Any good reference material you know of?

Ask HN: How do you test SQL?

663 pointspcarolan3y ago315 comments

315 comments

226 comments · 96 top-level

ramenmeal3y ago· 37 in thread

We spin up a docker container running the DB technology we use, run our DB migration scripts on it, and then run integration tests against it. You get coverage of your migration scripts this way too.

MBCook3y ago

This is what we did at my last job. You can catch DB specific issues that a false implementation wouldn’t show and make sure all your code paths work as expected.

Every time new issues cropped up we would put new data in the test data designed to reproduce it. Every edge case we would run into.

It provided so much confidence because it would catch and trigger so many edge cases that testing with mocks or by hand would miss.

Edit: also, it’s great for test/dev environments. You don’t have to worry about losing important data or filling new environments with data. Just start with the full test data and you’re good to go. It’s got stuff for all the corner cases already. Something got screwed up? Blow it away and reload, it’s not precious.

jdc05893y ago

same here. every backend service with a sql datastore runs an real ephemeral db instance during CI, runs the normal migrations against it, and uses it for tests of the direct data access/persistence code. everything else in the service with a dependency on the data-access stuff gets a mocked version, but testing the actual data access code against a real datasource is non-negotiable IMO.

acheton3y ago

We did something similar, being a small company we used batch files (checked into source control) to run database migrations on the different platforms we supported. Most database platforms have command line tools, although you need to be careful as there can be subtle differences in behaviour between the command line tools & those running with a UI.

fatneckbeard3y ago

thats a huge amount of work and money.

at my company they just told us to stop reporting edge cases. much easier, much cheaper.

2 more replies

feike3y ago

We do this too for PostgreSQL: to ensure the tests are really fast:

    - we create a template database using the migrations
    - for *every* integration test we do `CREATE DATABASE test123 TEMPLATE test_template;`
    - we tune the PostgreSQL instance inside Docker to speed up things, for exampling disabling synchronous_commit

On a successful test, we drop the test123 database. On a failed test, we keep the database around, so we can inspect it a bit.

The really great thing about this approach (IMHO), is that you can validate certain constraint violations.

For example, exclusion constraints are great for modelling certain use cases where overlapping ranges should be avoided. In our (go) code, the test cases can use the sqlstate code, or the constraint name to figure out if we hit the error we expect to hit.

This approach is pretty much as fast as our unit tests (your mileage may vary), but it prevents way more bugs from being merged into our codebase.

neonate3y ago

Out of curiosity, how fast is really fast?

1 more reply

pgoggijr3y ago

I've used this same approach as well. Testcontainers is a nice way to help with this!

https://www.testcontainers.org/

JackFr3y ago

I rolled my own with docker for a few years and recently made the switch to testcontainers. So far so good - but if you’re in an environment or language where test containers are difficult, rolling your own really it ant to hard. It also keeps you honest with maintaining good migration scripts.

jve3y ago

What is testcontainer? Reading the first page I don't understand what benefits it buys me:

> Testcontainers for .NET is a library to support tests with throwaway instances of Docker containers for all compatible .NET Standard versions. The library is built on top of the .NET Docker remote API and provides a lightweight implementation to support your test environment in all circumstances.

Edit: Ok, example helps. https://dotnet.testcontainers.org/examples/aspnet/

I can declare docker infrastructure from my code, right?

1 more reply

claytonjy3y ago

Yup, same. Last time i set this up i used Sqitch¹ for migrations, which encourages you to write tests for each migration; caught a lot of bugs early that way, all in a local-first dev environment. Worked especially well for Postgres since plpgsql makes it easy to write tests more imperatively.

¹: https://sqitch.org/

grilledcheez3y ago

At my job, we're breaking down a monolith into services with a hand-me-down database schema. DB changes are manual, every dev runs against a shared test DB, and everybody dreads doing schema changes. I've been looking for a way to transition into version controlled migrations and it looks like sqitch might be a solid option, as the language-specific frameworks are too opinionated. Thanks for recommending!

1 more reply

Dowwie3y ago

I resented writing the verify scripts for my migrations, after writing unit and integration tests, but yes it is valuable

kakwa_3y ago

We do that too, in fact it's not only the DB that run in docker, but the whole build + CI process.

Our overall strategy is to create a master "test" DB with a test dataset, and for each test, copy this master DB to a test specific DB (CREATE DATABASE <testdb> TEMPLATE <master>) so that tests can run in parallel without interfering with each other and without the significant overhead of a "from scratch" DB initialization.

For schema migrations, we build the "branch/PR" version and the "main" version, then we check that 'init DB with "main" + migration with the "branch/PR" version' results in the same schema as 'init the DB directly with the "branch/PR" version' using apgdiff.

This strategy could probably be extended to migrating from every older version by building each tag, but we don't have that need.

We could also probably improve checks on the data itself however as for now, we only check the schemas.

Few things to note:

* it's still possible to run the tests outside of docker and use a local DB instead with some light setup (it's faster than running everything in docker when developing)

* docker argument --tmpfs <dir/of/db> is quite good, assuming you have enough ram for your dataset

* few configuration tweaks on the DB, like max connection might be necessary.

Overall, we are quite happy with this setup as it permits to implement end to end integration tests quite easily without spending too much time mocking dependencies.

As a general pattern, I find instantiating dependencies internal to your service (like a DB or Queue) to be the way to go, with mocking only for external dependencies (like external APIs) or exceptionally to reach a specific code branch (specially error handling sections).

ddyevf6353723y ago

Running sqlite in memory as a test db speeds up your test runner as crazy. You can do this if you use an sql query builder library, because it can translate your queries to the specific database.

whalesalad3y ago

This can be a good fast/local test or maybe a sanity test ... but there are definitely differences between databases that need to be accounted for. You wanna take that green test pass with a bit of skepticism. So you always want to test on the same DB engine that is running your prod workloads. If your surface area is small, you can get by with the approach you mentioned, but it would need to be marked tech debt that should be paid down as soon as possible, or as soon as that bug that manifests itself in PSQL but not sqlite appears :)

2 more replies

danielheath3y ago

You can get the 'in-memory' speed advantage by putting the datastore on a ramdisk (or even just disabling fsync, which is pretty easy to do in postgresql).

nicoburns3y ago

There’s also the eatmydata program which does similar by using LD_PRELOAD to intercept io calls.

rc_mob3y ago

problem is that its not always compatible with features you use on your production database

waste_monk3y ago

Can split test regime so that as much as possible is covered with SQLite, and then have a second test phase with a heavyweight db only if the first phase passes. So code errors, malformed SQL, etc. cause it to fail fast and early, and you only test with the real DB once you know everything else is working.

Or along similar lines you could divide it such that developers can test things locally on their machines with SQLite, but once it gets pushed into CI (and passes code review etc.) it's tested against the heavy db.

1 more reply

gardenhedge3y ago

So you test against a different database technology than the one you software uses? I understand why that works but it seems odd

1 more reply

sicp-enjoyer3y ago

Why should this be faster than a local postgres instance with no traffic?

1 more reply

crabbone3y ago

This may work for something simple, but a typical database cluster setup will be impossible / impractical to try to emulate in containers because you'd need to configure a lot of things not normally available inside containers (s.a. how storage is attached, how memory is allocated).

Since OP mentioned DBT (kind of weird, hopefully, it's at least DBT2, since DBT is very old), they mean to test the productivity of the system rather than correctness of some queries (typical tests that deal with workloads similar to DBT2 are, eg. pgbench). Running pgbench over a database setup in a container will tell you nothing / might just confuse you, if you don't understand the difference between how database is meant to be setup and what happens in container.

atmosx3y ago

Yeah. That’s it. I have seen method with sample data (low fidelity) and/or performed against production copies of RDS DB (high fidelity). It’s still hard to catch some migrations or other operations under workload, but you can emulate that as well to a certain point.

hot_gril3y ago

Same (though I don't use Docker). Did TDD for a while like this, and it wasn't perfect, but it worked better than anything else. I didn't even know how to run the frontend; that was a separate team.

inetknght3y ago

bonus points if you add test data for integration tests with sad paths too

csteubs3y ago

Absolutely this. I stood up a negative test suite for continuous DB queries late in 2020 and it's caught many potential show stopper integration issues since; about 45% more YoY than pre-suite.

Took about a week of duplicating happy path tests, investigating gaps, and switching inputs/assertion types to get everything passing, but less than a week later we had our first verifiable test failure.

allanbreyes3y ago

This is a great way to test for backwards-incompatible changes if your fleet is running canaries or instances on different versions backed by a singleton database. You apply the migrations, checkout the app from the old version, and then re-run your test suite. Any failures are a reasonably high signal that some backwards-incompatible migration was introduced.

MuffinFlavored3y ago

Do you do this in place of unit tests (where you have to mock/stub the DB interactions) or do you do both?

nishs3y ago

Sorry for the digression first. (If anyone has different definitions for the ideas here, I would love to learn.)

I think the answers would depend on the types of tests that the term "this" encompasses. From how I understand it, calling something a unit test or an integration test depends on the context of what is is being tested. For example, if a developer is writing a unit test for a HTTP handler, and the handler implementation includes calls to an external database, then one would have to use a mock for the database, in order for the test to be deemed a true unit test [1]. If the developer is writing an integration test for the same HTTP handler, then the database would have to be a real database implementation [2].

On other hand, if the developer were testing SQL queries for syntactical or logical correctness against a database, these tests would want to use a real database implementation. Note that though the test uses a real database, it is still a true unit test [3]. Additionally, note that using a mocked database here would not serve the purpose of the tests, which is to catch syntactical and logical errors against a database engine. This can, of course, only be achieved by using a real database—or, if you insisted on using a "mock", then, by implementing an entire SQL engine, with the exact same quirks as the real database's engine, inside the mock!

On the original question:

> Do you do this in place of unit tests (where you have to mock/stub the DB interactions) or do you do both?

I guess the answer would be: It would depend on the objectives of and types of tests. Do both of them, because some tests, such as unit tests on the HTTP handler, would use use mocks, while other tests, such as the SQL query correctness tests, would use the real database.

[1] A true unit test is one that has no external interactions, neither directly nor transitively, besides interactions with the system under test (SUT). The SUT here is the HTTP handler.

[2] An integration test should include external system interactions, if any. That's what integration means.

[3] The SUT is each SQL query and the its interaction with the database. There are no interactions in the test with systems outside the SUT, so it is still a true unit test.

1 more reply

sssspppp3y ago

would love to do this, but how does one spin up a redshift cluster inside of a docker container?

ntr--3y ago

try not to cry, cry a lot, and then resolve not to vendor lock yourself to a black box data store next time.

jokes aside, redshift is based on pg^1, you can try an older version to get some semblance of it running locally.

1. https://docs.aws.amazon.com/redshift/latest/dg/c_redshift-an...

nicoburns3y ago

And this is why you don’t rely on a closed source stack if you have any alternative.

danmcs3y ago

Redshift speaks the postgres protocol so you might be able to use postgres. There are a few purpose-built docker images (googleable) that may replicate Redshift slightly better than just `docker run -p 127.0.0.1:5439:5432 postgres:latest`, but if you're at the point of having a test suite for your data warehouse code, you're likely using Redshift-specific features in your code and postgres won't suffice.

I have seen teams give each developer a personal schema on a dev cluster, to ensure their Redshift SQL actually works. The downside is that now your tests are non-local, so it's a real tradeoff. In CI you probably connect to a real test cluster.

contravariant3y ago

Well the first step is to get Amazon to part with their lucrative closed source software.

atwebb3y ago

Would redshift serverless help at all? I realize that it may not have parity with the existing functionality, just a thought.

totalhack3y ago

Same. That's one of the benefits of being able to replicate your environment with something like docker compose and then use a tool like alembic to manage migrations both locally and in production.

unboxingelf3y ago

This is the way

RobinL3y ago· 29 in thread

Try and write any complex SQL as a series of semantically meaningful CTEs. Test each part of the CTE pipeline with an in.parquet and an expected_out.parquet (or in.csv and out.csv if you have simple datatypes, so it works better with git). And similarly test larger parts of the pipeline with 'in' and 'expected_out' files.

If you use DuckDB to run the tests, you can reference those files as if they were tables (select * from 'in.parquet'), and the tests will run extremely fast

One challenge if you're using Spark is that test can be frustratingly slow to run. One possible solution (that I use myself) is to run most tests using DuckDB, and only e.g. the overall test using Spark SQL.

I've used the above strategy with PyTest, but I'm not sure conceptually it's particularly sensitive to the programming language/testrunner you use.

Also I have no idea whether this is good practice - it's just something that seemed to work well for me.

The approach with csvs can be nice because your customers can review these files for correctness (they may be the owners of the metric), without them needing to be coders. They just need to confirm in.csv should result in expected_out.csv.

If it makes it more readable you can also inline the 'in' and 'expected_out' data e.g. as a list of dicts and pass into DuckDB as a pandas dataframe

One gotya is SQL does not guarantee order so you need to somehow sort or otherwise ensure your tests are robust to this

paulmd3y ago

If your database supports it, unit tests are an absolutely ideal use-case for temporary tables or global temporary tables. A global temporary table can be defined with the same schema as the correct table and will "exist" for the purpose of view/CTE definitions, but any data inserted into the table will only ever be visible from that specific thread context. The rules depend but basically either it exists until the thread is closed, or until the session calls commit, but the semantics will (theoretically) be optimized for one thread and rapid inserts/clears.

If you build your unit tests so that a thread will not commit until it's finished, or clear it before/after, it's pretty ideal for that. You can feed in data for that test and it won't be visible anywhere else.

Potentially if you scripted your table creates you could add an additional rule that GLOBAL TEMPORARY TABLE gets added to any other tabledefs but only for unit tests. Or just update both in parallel.

The other useful use-case I've found for this is that you can do an unlimited-size "WHERE myId IN (:1, :2 ...)" by doing a GTT expressed as "INNER JOIN myId = myGtt.myid", and this has the advantage of not thrashing the query planner for every literal sqltext resulting from different lengths of myList (this will fill up your query cache!). Since it's one literal string it always hits the same query plan cache.

I am told this has problems in oracle's OCI when using PDBs, apparently GTTs will shit up the redo-log tablespace (WAL equivalent). But so do many many things, PDBs are apparently a much different and much less capable and incredibly immature implementation that seems essentially abandoned/not a focus from what I've been told. I was very surprised having originally written that code in regular Oracle and not having had problems but PDBs just aren't the same (right down to nulls being present in indexes!).

n0n0n4t0r3y ago

Off topic incoming (sorry )

I used this trick (join temporaryFoo instead of where foo in ...) in production fifteen years ago, using MySQL. The gain was really astonishing. Several instructions can be optimized using joins on specialty craft tables (I know of LIMIT for instance).

This is one of the worst drawbacks of orm everywhere: nobody even seems to think about those optimisations anymore.

2 more replies

MrPowers3y ago

Spark makes it easy to wrap SQL in functions that are easy to test. I am the author of the popular Scala Spark (spark-fast-tests) and PySpark (chispa) testing libraries. Some additional tips to speed up Spark tests (can speed up tests between 70-90%):

* reuse the same Spark session throughout the test suite

* Set shuffle partitions to 2 (instead of default which is 200)

* Use dependency injection to avoid disk I/O in the test suite

* Use fast DataFrame equality when possible. assertSmallDataFrameEquality is 4x faster than assertLargeDataFrameEquality. Some benchmarks here: https://github.com/MrPowers/spark-fast-tests#why-is-this-lib...

* Use column equality to test column functions. Don't compare DataFrames unless you're testing custom DataFrame transformations. See the spark-style-guide for definitions for these terms: https://github.com/MrPowers/spark-style-guide/blob/main/PYSP...

Spark is an underrated tool for testing SQL. Spark makes it really easy to abstract SQL into unit testable chunks. Configuring your tests properly takes some knowledge, but you can make the tests run relatively quickly.

RobinL3y ago

That's super useful, thanks. Could you expand on the 'Use dependency injection to avoid disk I/O in the test suite' point please - I'm not sure I understand what it means but it sounds interesting!

MrPowers3y ago

Sure, this blog post explains the dependency injection design pattern with Spark: https://mrpowers.medium.com/dependency-injection-with-spark-...

You can structure your code to read from paths when run in the production environment, but inject DataFrames you build in memory for your test suite. Spark is designed to read multiple files in parallel, so it's not optimized to read a single tiny file. That's why it's best to avoid I/O in Spark test suites whenever possible.

colanderman3y ago

One caution with PostgreSQL, CTEs under some circumstances (and in all circumstances, prior to PostgreSQL 12) act as optimization barriers. Specify `NOT MATERIALIZED` before the CTE definition to ensure that they are optimized same as a sub-SELECT would be.

timwis3y ago

Just to flag this behaviour changes in v12: https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-...

1 more reply

nicoburns3y ago

This is a good point, although it’s worth noting that there are definitely cases where you want the sun table to be materialised (esp. when that table is small and referenced many times)

dragonwriter3y ago

Yeah, its definitely one of those things where it is worth reading the docs [0], and not just adopting an “always do it this way” rule of thumb.

[0] https://www.postgresql.org/docs/current/queries-with.html#id...

bolanyo3y ago

Is 'sun table' a particular concept here? A typo?

1 more reply

gpvos3y ago

CTE = common table expression, i.e. a WITH clause (possibly recursive) before your SELECT (or other) statement. https://learnsql.com/blog/what-is-common-table-expression/

stormdennis3y ago

Yes, I didn't know what it was so I had looked it up before I saw yours. The definition I got is below:

A common table expression, or CTE, is a temporary named result set created from a simple SQL statement that can be used in subsequent SELECT, DELETE, INSERT, or UPDATE statements.

2 more replies

curiousllama3y ago

> Try and write any complex SQL as a series of semantically meaningful CTEs

I find this helps a heck of a lot with maintainability + debugging as well

atwebb3y ago

It can wind up with more of a procedural thought though than set-based. Not always, just something to pay attention to and pushing filters early / explicitly. It is asking more of the optimizer a lot of the times and that's were a bunch of the cautions come in to play.

1 more reply

shazzdeeds3y ago

A few quick tips to help slow Spark tests.

1. Make sure you're using a shared session between the test suite. So that spin-up only has to occur once per suite and not per test. This has the drawback of not allowing dataframe name reuse across tests, but who cares.

2. If you have any kind of magic-number N in the SparkSQL or dataframe calls (e.g coalesce(N), repartition(N)) change N to be parameterized, and set it to 1 for the test.

3. Make sure the master doesn't have any more than 'local[2]' set. Or less depending on your workstation.

marijnz0r3y ago

I really like this answer because it tests CTEs in a modular way. One question I have is: how do you test a the CTE with an in.csv and out.csv without altering the original CTE? Currently I have a chain of multiple CTEs in a sequence, so how would I be able to take the middle CTE and "mock" the previous one without altering the CTE actually used in production? I prefer not to maintain a CTE for the live query and the same CTE adapted for tests.

jimmygrapes3y ago

My ignorance of the topic (and experience with a mostly unrelated one) is showing, but all I could think of when you said CTE was "chronic traumatic encephalopathy". This made a lot more sense when you generalized answering the question as if it's a given that Python is necessary (I know that's not your intent, but that's how it comes off).

Not much more to say, just observing, sorry if this is irrelevant commentary.

jph003y ago

CTE==Common Table Expression. It's not specific to any particular language. It's basically like a view, except you can define them the same place as you use them, and they can be recursive.

1 more reply

db48x3y ago

Sometimes using databases feels just like repeatedly slamming your head into your desk, so that fits.

collyw3y ago

I used to think that, but once you start thing about them in the right way, the relational model is pretty nice.

1 more reply

osigurdson3y ago

>> I could think of when you said CTE was "chronic traumatic encephalopathy"

The title is "Ask HN: How do you test SQL"

intrasight3y ago

I'm a fan of CTEs too. Here's a pattern that I use in Sql Server for testing/debugging when using CTE pipelines. Same approach would work with "FOR JSON"

   declare @debug bit = 1;

   ;with cte1 as (
     select
       @debug AS Debug1,
       ...
   ),
   cte2 as (
     select 
       @debug AS Debug2,
       ... from cte1
   ),
   cte3 as (
     select 
        @debug AS Debug3,
        ... from cte2
   )
   select
     -- dump intermediate if debug
     (
       select * from cte1 where Debug1=1 for xml raw ('row'), root ('cte1'), type
     )
     ,(
       select * from cte2 where Debug2=1 for xml raw ('row'), root ('cte2'), type
     )
     ,(
       select * from cte3 where Debug3=1 for xml raw ('row'), root ('cte2'), type
     )
     -- final results
     ,(
         select ...
         from cte3 for xml raw ('row'), type
   )
   for xml raw ('results'), type;

devin3y ago

> Try and write any complex SQL as a series of semantically meaningful CTEs.

Could you or anyone else on the post provide an example?

paulmd3y ago

let's say you are doing a paystub rollup - a department has multiple employees, an employee has multiple paystubs.

if you are storing fully denormalized concrete data about the value of salary/medical/retirement both pre- and post-tax that was actually paid to each pay period (because this can vary!), then you can define a view that does salary-per-employee (taxable, untaxable, etc), and then a view that rolls up employees-per-department. And you can write unit tests for all of those.

that's a super contrived example but basically once group aggregate or window functions and other complex sub-queries start coming into the picture it becomes highly desirable to write those as their own views. And you can write some simple unit tests for those views. there are tons of shitty weird sql quirks that come from nullity/etc and you can have very weird specific sum(mycol where condition) and other non-trivial sub-subquery logic, and it's simple to just write an expression that you think is true and validate that it works like you think, that all output groups (including empty/null groups etc) that you expect to be present or not present actually are/aren't, etc.

I'm not personally advocating for writing those as CTEs specifically as a design goal in preference to views, personally I'd rather write views where possible. But recursive CTEs are the canonical approach for certain kinds of queries (particularly node/tree structures) and at minimum a CTE certainly is a "less powerful context" than a generalized WHERE EXISTS (select 1 from ... WHERE myVal = outerVal) or value-select subquery. it's desirable to have that isolation from the outer SQL query cursor imo (and depending on what you're asking, it may optimize to something different in terms of plan).

Writing everything as a single query, where the sub-sub-query needs to be 100% sure not to depend on the outer-outer-cursor, is painful. What even is "DEEP_RANK()" in the context of this particular row/window? If you've got some bizarre (RANK(myId order by timestamp) or whatever, does it really work right? Etc. It's just a lot easier to conceptually write each "function" as a level with its own unit tests. Same as any other unit-testable function block, it's ideal if it's Obviously Correct and then you define compositions of Obvious Correctness with their own provable correctness.

And if it's not Obviously Correct then you need the proof even more. Encapsulate whatever dumb shit you have to do to make it run correctly and quick into a subquery and just do a "inner join where outerQuery.myId = myView.myId". Hide the badness.

ChadMoran3y ago

> Try and write any complex SQL as a series of semantically meaningful CTEs

I used this exact same method. Not only does it help me but those who come after trying to understand what's going on.

tensor3y ago

Decomposing a large query into smaller subsets is the right approach, but I would strongly suggest doing it with VIEWs rather than CTEs. CTEs are a useful tool, but come with their own performance profiles and often using them too much will lead to slower queries.

mritchie7123y ago

Doesn't that mean you have to copy all your data into duckdb? I'd imagine with even a modest data warehouse, loading the entire thing to duckdb would be unfeasible.

RobinL3y ago

The tests are being run against small test datasets of 10s of rows rather than the real data that test whether the behaviour of transforms/joins etc. is as expected. You're right, this approach wouldn't be sensible if the tests have to use large production tables

mritchie7123y ago

How do you get the small datasets? Selecting a random sample from the source?

1 more reply

uticus3y ago· 13 in thread

Related - how is any declarative language tested?

Quick web search confirms suspicions, it is not easy

https://www.metalevel.at/prolog/testing

henryfjordan3y ago

Your link has the answer, declarative testing. You need to think about why you wrote that code and declare some outputs for given inputs that give you confidence in your code.

The other tests from your link are just silly. You did not write code to terminate or to provide an answer for all inputs, you wrote code to provide the right answer.

To test some HTML for instance, think about what information you want the page to convey, load up a browser, and check that the info is displayed. Easy as that.

uticus3y ago

> You did not write code to terminate or provide an answer for all inputs...

Definitely not an expert. But to my eyes checking for non-termination seems very close to things that happen in SQL all the time.

For example, a parent query joins two subqueries that both map data, and then maps on that data. To test, I could just test one specific scenario with specific values in all columns. That would be more of a concrete test case.

But I might want something more robust checking the outer bounds of acceptable data (like a min and a max). That seems much closer to the non-termination test.

Of course, I'm open to being corrected on this.

spamtarget3y ago

SQL is not a declarative language. It is a functional language, and structured language on the top as extensions. HTML is a declarative language.

taeric3y ago

I mean, I get what you are aiming for. HTML, though, is a markup language. You can call it declarative, but you don't get anything other than the HTML that you create. That is, it is not generating anything. You type what you get. Put differently, it is not a program.

SQL is far and away understood as a declarative language for what data you want out of a relational database. I challenge you to find any literature that does not describe it as a declarative language.

Now, can we munge definitions and pull in an odd true scotsman argument about it not being a "true declarative" language? I mean, yeah. But, this is like arguing that LISP is not a functional language by some specific modern view of that term. Certainly true, but far from useful. And almost certainly not what anyone you would talk to expects from those terms.

spamtarget3y ago

"don't get anything other than the HTML that you create" that's the point of declarative languages

"That is, it is not generating anything." if you speak about code generation like code behind in VS, it has nothing do with the paradigm. If you speak about underlying technology, it does not matter. In case of declarative languages, the framework determines what the language is capable of. And that's the point of them, they are generalizing the solution for common problems. The browser do a lot of things in the background when it draws a rectangle the same way how SQL server is doing a lot things when you execute a query, but these things have nothing to do with programming paradigm, which can be understood without computers.

Imperative: cooking recipe

Functional: assembly line in a factory

Declarative: your assistant

(compare the real world examples to the things i mentioned)

> SQL is far and away understood as a declarative language for what data you want out of a relational database. I challenge you to find any literature that does not describe it as a declarative language.

Ok, and I challenge you to compare HTML and SQL. Are they the same by nature. Are the steps, the approach of problem solving is the same? Because if SQL is declarative, than it is really similar to HTML

> Now, can we munge definitions and pull in an odd true scotsman argument about it not being a "true declarative" language? I mean, yeah. But, this is like arguing that LISP is not a functional language by some specific modern view of that term. Certainly true, but far from useful. And almost certainly not what anyone you would talk to expects from those terms.

We are speaking about semantics and taxonomy here. These kind of a questions are always debatable, so i can't really say anything about this, because you are speaking of ways of discussion. What i'm saying that my taxonomy makes the most sense to me, and if someone truly can challenge it, it's gonna make my mind change, but "literature mentions it somehow else" is not an argument makes me change my mind.

2 more replies

luhn3y ago

> Although SQL is essentially a declarative language (4GL), it also includes procedural elements.

https://en.wikipedia.org/wiki/SQL

1 more reply

mLuby3y ago

You tell SQL what you want, not how to get it. That's declarative.

  SQL : CSV :: GraphQL : JSON :: React : HTML

spamtarget3y ago

Really?

SELECT CASE WHEN employee.type = 'contract' THEN salary CASE WHEN employee.type = 'full-time' THEN salary + benefit_costs END CASE FROM employee

Tell me how is this declarative?

-SQL is functional -CSV is not a language, it's a data format -GraphQL, im not familiar -JSON is literally executable javascript code, arguable -React is a javascript framework, binding is a declarative nature (if it has it, i dont know it too well), but it is not a language, it's a framework -HTML is absolutely, 100% declarative, yes

3 more replies

recursive3y ago

The most important requirement for a functional language is that functions are first-class values. i.e. Not SQL.

On the other hand, the details of query execution are left to the planner and optimizer.

What's the case that it's functional, but not declarative?

spamtarget3y ago

You are speak about things which called delegates, typed functions, defined by the parameters it gets, and the data it returns. Using delegates does not make a language functional. If you really want to speak about the underlying technology, than every high level language is declarative. You can draw a triangle by drawing three sides, or you can just put this in a function like drawtriangle(v1,v2,v3), and tadam, your language is declarative. The underlying tech is not important, the approach of the developer is important. See my longer answer above

1 more reply

preseinger3y ago

SQL is absolutely a declarative language. It is practically the textbook definition of a declarative language.

ttfkam3y ago

Very often literally as well.

uticus3y ago

Okay, I'll grant it's not pure declarative.

But the part that is declarative is sure difficult to test.

2 more replies

crabbone3y ago· 7 in thread

Can you please expand on what you mean by DBT?

DBT, specifically, DBT-2 is a suit of tests designed to benchmark a database system. These tests aren't interested in, eg. correctness of an application that is using the database. They are meant to be testing the system as a whole by modeling some sort of an "average business" and defining some sort of an "average business operation" and estimating how many of such operations can a particular deployment of a system perform.

Such tests are rarely of any interest to application developer, and are more geared towards DBAs who execute such tests to estimate the efficiency of a system they deploy or to estimate the amount of hardware necessary to support a business.

MySQL DBT2 suit: https://dev.mysql.com/downloads/benchmarks.html

PostgreSQL DBT2 suit: https://wiki.postgresql.org/wiki/DBT-2

Those tools are typically modeled on TPC-B... And, it would require a separate discussion to describe why these tests are obsolete and why there isn't really any replacement.

----

However, from the rest of your question it seems that you may use DBT acronym in some other way... So, what exactly are you testing? Are you interested in performance? A benchmark? Schema correctness? Are you perhaps trying to simply test the application that is using a SQL database and you want to avoid dealing with the database setup as much as possible?

crabbone3y ago

Oh, someone else below suggested that DBT in OP may be this: https://docs.getdbt.com/ .

Well... I don't know anything about this tool, but from cursory reading, this is the correctness kind of testing that is meant to examine whether the way you defined the schema actually leads to the results you expect in the data. Well... it's not an interesting kind of testing, at least not for me. So, I don't know much about it. Also, while, obviously, I'm not familiar with this newer tool, I wouldn't trust much someone who reused the name of an existing tool to do something completely unrelated in the same domain. It just feels like these people don't have a very good familiarity with the subject because they caused this confusion.

rizz03y ago

Depends on what you mean by "the subject". dbt doesn't claim to be a testing tool. Personally I use dbt all the time, while I didn't know about the existence of DBT-2. Different worlds apparently.

crabbone3y ago

Have you ever been burned by trying to install Docker and getting the whole new desktop manager installed because Docker is a widget in that desktop manager (I believe it's in Gnome)?

Actually happened to me twice (different distros at different times did include Docker, the container management tool as "docker" while more commonly naming it something else, like "docker-engine").

As for dbt. I mean... imagine you are selecting the name for your company, that's in SQL business, and you don't even bother to type into your favorite Web search engine "dbt sql"... I mean, it's definitely a business about DataBases, so, certainly a 3-letter acronym with "D" and "B" in it in that order would be truly unique, right?

I'd bet is that what happened is that these people, just like you, thought "I don't know what that is, so it must be unimportant", and they never stopped to consider that they might not know something important about the field they are in.

vitorbaptistaa3y ago

Not the OP, but I think he/she meant https://www.getdbt.com

crabbone3y ago

Yeah, thanks for pointing this out. The kind of testing this tool does isn't up my line of work as a system / storage automation person, so, I wouldn't immediately know about it. Reusing the name of an older tool in the area with not so many tools for a completely different purpose doesn't instill a lot of trust though.

Vaslo3y ago

I think you are talking about two different DBTs. OP is referring to this one I believe:

https://www.getdbt.com/

crabbone3y ago

Yeah, I went to read more comments and figured as much. I don't know much about this tool, but since it is certainly newer than original DBT, and chosen to use the same name, while operating in the same domain... I wouldn't trust these people. I have a feeling they simply didn't know the older tool existed, and I generally don't want to trust people who don't know stuff about what seems like their primary business.

1 more reply

uticus3y ago· 6 in thread

My favorite interview question. No, I mean when I'm being interviewed. The sheepish grins let me know I'm not alone.

Best ideas IMO (no particular order):

- make SQL dumber, move logic that needs testing out of SQL

- use an ORM that allows composing, disconnect composition & test (ie EF for .NET groups, test the LINQ for correct filtering etc, instead of testing for expected data from a db) (I see this has already been recommended elsewhere)

* edited formatting

pharmakom3y ago

Interesting. We have totally different philosophies here.

1. Moving logic out of SQL can break transactionality and hurt performance (more data leaves the DB)

2. ORMs hide the SQL from you, making all sorts of other things harder

My favoured approach is to test my application against a real local database, built from a fresh snapshot each time.

I do not change my application code to make testing easier. The application code is optimised for maintainability - so simplicity and ease of reading.

zip12343y ago

EF core makes it simple to test against a real sqlite db. It is super fast to run the tests and gives a reasonably realistic outcome.

uticus3y ago

It does and I understand. But in my experience most of what should be tested is not "does the database contain x?" but "does this where clause filter y?"

Those sorts of things are programmed in EF by LINQ queries. All I'm saying is, test the LINQ queries the same way you would test non-EF LINQ queries. LINQ doesn't require a database.

postalrat3y ago

Test it by not testing it! Genius!

javaunsafe20193y ago

You mean using interfaces and integration tests?

uticus3y ago

> using interfaces

Kinda, but personally I describe as using LINQ queries. The dbcontext just isn't hooked up. It's a method that takes in an IQueryable<T> (there's the interface I suppose) and outputs a filtered IQueryable<T>. The unit test (see my next response) provides a test collection and expects a certain result.

> and integration tests

No, unit tests

1 more reply

purerandomness3y ago· 5 in thread

There's pgTAP for Postgres [1], the same approach probably is valid for other databases.

Here's [2] a slide deck by David Wheeler giving an introduction into how it works.

[1] https://pgtap.org/

[2] https://www.slideshare.net/justatheory/unit-test-your-databa...

vogtb3y ago

+1 for pgtap. Surprised it's not been mentioned more in this thread.

I'm using it w/ supabase, and it works really well.

They have a small doc on it that's a better primer than the pgtap docs: <https://supabase.com/docs/guides/database/extensions/pgtap>.

Pretty easy to get started, I'm doing something like this in a Makefile.

  supabase_tests := $(call rfind, supabase/tests/*_tests.sql)
  DB_URL := "postgresql://postgres:postgres@localhost:54322/postgres"
  test-supabase: $(supabase_tests)
    @echo "Testing supabase..." && \
        echo "${supabase_tests}" && \
        psql -f supabase/tests/setup.sql ${DB_URL} && \
        $(foreach t,$(supabase_tests),\
            echo "Test: $(t)..." && psql -f $(t) ${DB_URL} $(__EXEC)) && \
        psql -f supabase/tests/teardown.sql ${DB_URL}

kiwicopple3y ago

the test runner is now built into the supabase CLI: https://supabase.com/docs/reference/cli/supabase-test

Guide is here: https://supabase.com/docs/guides/database/testing

gregplaysguitar3y ago

We used to use pgtap extensively, but ended up removing it in favour of testing queries as part our regular nodejs application tests, via a db client - the ergomonics of the test tooling is far better, and I don't think we really lost anything.

nesarkvechnep3y ago

Unfortunately, I’m not surprised people test queries in their applications’ unit tests. What they’re actually testing is the ORM/query builder. Instead, with pgTAP, you can test specifically your queries.

adulion3y ago

I was going to mention pgtap as i had used it in a previous role and it works but its cumbersome. I was hoping for a better solution by reading the comments

ahakanbaba3y ago· 3 in thread

The hard part about testing SQL is decoupling from infrastructure and big data sources. We use DuckDB, and pandas dataframes mock data sources to unit test SQL. Python testing frameworks (or simple assert statements) can be used to compare inputs and outputs.

When the tests pass, we can change from DuckDB to Spark. This helps decouple testing Spark pipelines from the SparkSession and infrastructure, which saves a lot of compute resources during the iteration process.

This setup requires an abstraction layer to make the SQL execution agnostic to platforms and to make the data sources mockable. We use the open source Fugue layer to define the business logic once, and have it be compatible with DuckDB and Spark.

It is also worth noting that FugueSQL will support warehouses like BigQuery and Snowflake in the near future as part of their roadmap. So in the future, you can unit test SQL logic, and then bring it to BigQuery/Snowflake when ready.

For more information, there is this talk on PyData NYC (SQL testing part): https://www.youtube.com/watch?v=yQHksEh1GCs&t=1766s

Fugue project repo: https://github.com/fugue-project/fugue/

urthor3y ago

fwiw, you can share a spark session between unit tests. Even persist a spark session throughout the day so your tests run against a hot session.

Straight TDD with spark is perfectly fine if you know what you're doing. I'm not saying it's easy or there's an easy guide somewhere, but it's possible.

If you're using Pyspark via the API, it's likely an incredibly important part of your process.

ahakanbaba3y ago

Fair enough, agreed. It is tricky to “mock” as you said.

Our CICD platform and their owners get unhappy if we spawn an ad hoc spark session for testing purposes.

There is also a general expectation that unit tests are self contained and portable. So you could execute them in mac, linux, and arm ISA without much effort.

Another point was that we need to make this mocking or test setup easy because data scientist and ML Modellers are the most important persona who needs to write these tests ideally.

So mocking the data source with an abstraction layer and passing pandas dataframes, worked reasonably well for our use case.

newrotik3y ago

Would you be able to elaborate on your approach to TDD with spark sessions? You can persist them, which is only useful if you are doing multiple tests in a run.

But I find myself running one given test, making some code changes, and then wanting to run it again, over and over. Instantiating a local spark session takes several seconds every iteration. Enough for me to often want to "alt tab" into something else instead of waiting. It's very disruptive.

I did not know about Fugue but will definitely give it a try. Looks almost too good to be true.

davedx3y ago· 3 in thread

SQL being relatively pure, functional, algebraic and whatnot, doesn't require the same rigor of automated test coverage that more "systems" programming languages do. (By "systems" I include all languages that people use to integrate the various parts of a software system - i.e. regular programming languages like Java, C#, TypeScript, C++ and so on. Not just low level languages.)

Stored procedures are a different beast though. Having significantly struggled to debug stored procedures running in MSSQL on a Macbook (on Windows SQL Management Studio lets you set breakpoints, on Mac you're SOL), if I was building an application based on them I'd definitely try to spin up some kind of testing framework around them. I guess what I'd probably do is have a temporary database and some regular testing framework that nukes the db, then calls the stored proc(s) with different inputs and checks what's in the tables after each run. Sounds slow and clunky?

Deukhoofd3y ago

> on Windows SQL Management Studio lets you set breakpoints, on Mac you're SOL

Good news! Since SSMS18 you're SOL on Windows too, as Microsoft completely removed that feature :)

Foobar85683y ago

I believe you are expected to use visual studio to manage all this crap... And debugging in it was awful years ago, SSMS was slightly less worse IMHO.

davedx3y ago

Ugh :)

Sounds like an opportunity though…

pmarreck3y ago· 3 in thread

Excellent question. Not sure why it's getting no traction.

For my own use-cases, I usually test this at the application level and not the DB level. This is admittedly not unit-testing my SQL (or stored procs or triggers) but integration-testing it.

edgyquant3y ago

With ORMs you can get pretty close to this being unit testing for the DB though.

teej3y ago

I haven't seen an ORM that handles analytical queries well. I'd rather write raw SQL than use SQLAlchemy for complex queries with multiple joins, aggregations, and window functions.

2 more replies

AnEro3y ago

With not complex data for sure, I think SQL testing is so much easier once you start having hierarchies and many to many relationships start piling up.

davps3y ago· 3 in thread

This approach didn't use an ORM and run the tests concurrently against the same database.

I follow those steps on my pipeline:

Every time I commit changes the CI/CD pipeline follow those steps, on this order:

- I use sqitch for the database migration (my DB is postgresql).

- Run the migration script `sqitch deploy`. It runs only the items that hasn't been migrated yet.

- Run the `revert all` feature of sqitch to check if the revert action works well too.

- I run `sqitch deploy` again to test if the migration works well from scratch.

- After the schema migration has been applied, I run integration tests with Typescript and a test runner, which includes a mix of application tests and database tests too.

- If everything goes well, then it runs the migration script to the staging environment, and eventually it runs on the production database after a series of other steps on the pipeline.

I test my database queries from Typescript in this way:

-in practice I'm not strict on separating the tests from the database queries and the application code, instead, I test the layers as they are being developed, starting from simple inserts on the database, where I test my application CRUD functions that is being developed, plus to the fixtures generators (the code that generate synthetic data for my tests) and the deletion and test cleanup capabilities.

-having those boilerplate code, then I start testing the complex queries, and if a query is large enough (and assuming there are no performance penalties using CTE for those cases), I write my largue queries on small chunks on a cte, like this (replace SELECT 1 by your queries):

    export const sql_start = `
    WITH dummy_start AS (
        SELECT 1
    )

    export const step_2 = `${sql_start},
    step_2 AS (
        SELECT 1
    )
    `;

    export const step_3 = `${step_2},
    step_3 AS (
        SELECT 1
    )
    `;

    export const final_sql_query_to_use_in_app = ` ${step_3},
    final_sql_query_to_use_in_app AS(
        SELECT 1
    )

    SELECT \* FROM final_sql_query_to_use_in_app

Then on my tests I can quickly pick any step of the CTE to test it

    import {step_2, step_3, final_sql_query_to_use_in_app} from './my-query';

    test('my test', async t => {

        //
        // here goes the code that load the fixtures (testing data) to the database
        //

        //this is one test, repeat for each step of your sql query
        const sql = `${step_3} 
            SELECT * FROM step_3 WHERE .....
        `;
        const {rows: myResult} = await db.query(sql, [myParam]);
        t.is(myResult.length, 3);

        //
        // here goes the code that cleanup the testing data created for this test
        // 
    });

and on my application, I just use the final query:

        import {final_sql_query_to_use_in_app} from './my-query';

        db.query(final_sql_query_to_use_in_app)

The tests start with an empty database (sqitch deploy just ran on it), then each test creates its own data fixtures (this is the more time consuming part of the test process) with UUIDs as synthetic data so I don't have conflicts between each test data, which makes it possible to run the tests concurrenlty, which is important to detect bugs on the queries too. Also, I include a cleanup process after each tests so after finishing the tests the database is empty of data again.

For sql queries that are critical pieces, I was be able to develop thounsands of automated tests with this approach and in addition to combinatorial approaches. In cases where a column of a view are basically a operation of states, if you write the logic in sql directly, you can test the combination of states from a spreadsheet (each colum is an state), and combining the states you can fill the expectations directly on the spreadsheet and give it to the test suites to run the scenarios and expectations by consuming the csv version of your spreadsheets.

If you are interested on more details just ping me, I'll be happy to share more about my approach.

WuxiFingerHold3y ago

This is a beautiful and powerful approach (and great explanation), IMO. Personally, I'd keep all CTEs (your steps) as units and combine them separately. But you've commented on that in your other comment.

zabzonk3y ago

how do you know that what your select statements return is correct? on a real database?

davps3y ago

Yes, it always runs on a real database, without mocks, as integration tests where you can test each CTE's auxiliary statement separately, which acts as a step of our sql pipeline. So the initial data is inserted on the tables and then I can exercise the assertions on the sql queries or views against the real database. In theory it could be easy to rearrange the concatenations of the CTE strings from above so it can be tested as a unit when it's put together with the previous auxiliary step as a temporary table but I never had the need for that since the integrations are simpler and works really well for me. The essential part of my approach is to treat the sql code as concatenated pieces of strings, and call slices of that with just the right concatenation to exercise the test to that slice on the real database, which is valid since the query will always be valid.

There is another pattern too, when I implemented a RLS based multi-tenancy with RBAC support, which needed an relatively large sql codebase and needed to be battle tested because it was critical, I've splited a big part of the sql code in a lot of sql functions instead of views to test the code units or integrations (using something similar to dependency injection but for the data, to switch the tenant RBAC's contexts), because for the sql functions I can pass different Postgresql's Configuration Parameters to test different tenants for example.

efxhoy3y ago· 2 in thread

I write mostly batch ETL stuff. All plain psql and bash. We don’t have a good testing setup to be honest. What we do use instead:

Plenty of constraints, uniques and foreign keys and not nulls. Enum types.

Visuals, dump to csv and plot some graphs. Much easier to find gaps and strange distributions visually.

Asserts in DO blocks, mostly counts being equal.

Build tables in a a _next suffix schema and swap when done.

Never mutating the source data.

Using psqls ON_ERROR_STOP setting.

Avoid all but the most trivial CTEs, preferring intermediate tables that can be inspected. Constraints and assertions on the intermediate tables.

“Wasting” machine resources and always rebuilding from scratch when feasible. CREATE TABLE foo AS SELECT is much simpler than figuring out which row to UPDATE. Also ensures reproducibility, if you’re always reproducing from scratch it’s always easy. State is hard.

Overall i’m quite happy with the workflow and very rarely do we make mistakes that unit tests would have caught. Our source data is complex and not always well understood (10+ years of changing business logic) so writing good tests would be very hard. Because we never touch the raw source data any errors we inevitably make are recoverable.

This talk by Dr Martin Loetzsch helped a lot: https://youtu.be/whwNi21jAm4

toto4443y ago

I admit it's not the best but I do almost like you. Graphs are definitely helpful.

I also have a few 'test' queries that insert there results into a 'test_results' table. Most of the queries check the cardinality of the table since to me wrong cardinality is where the biggest errors come from. I do something like :

insert into test_table

select case when count(*) = count(distinct users) then 'pass' else 'fail' end as result, 'test_cardinality_temporary_table_a' as test_name from temporary_table_a

vivegi3y ago

There's some great advice here for batch ETL work. The Create Table .. As Select (CTAS) pattern is your friend and a great aid in testability.

drx3y ago· 2 in thread

If you're using dbt, dbt tests are a good start: https://docs.getdbt.com/docs/build/tests

You can hook up dbt tests to your CI and Git(hub|lab) for data PRs.

Depending on your needs, you can also look into data observability tools such as Datafold (paid) or re_data (free)

mritchie7123y ago

I'm surprised dbt tests is this far down in the comments. This seems like the obvious place to start.

OP (or others) - If you've used dbt tests, I'm curious where it fell short? Tt doesn't cover everything, but it's pretty good in my experience.

adeelk933y ago

And if you want/need to take it further, I love this package: https://github.com/calogica/dbt-expectations

AnEro3y ago· 2 in thread

Gitlabs has their guide up, I love it and use it all the time. I've been doing data engineering in a small team for about 4 years, helping hospitals with data and hopefully making it easier to understand. Something that is overlooked or undervalued in my opinion, have stark distinctions for separating out technical and business logic tests. It makes it easier communicating what's happening in the event something is 'wrong' vs wrong, and it's easier to bring a non-technically inclined team member up to speed. Also, I think it's good to learn from the SaaS side of things and not bloat up or overengineer with infrastructure as data engineering is the latest development flavour. Keep it simple. Great expectations is a great tool however I think small teams should take really hard looks at their needs and see if a simple orchestration engine and SQL testing is enough. A centralized source for testing is great, however infrastructure isn't free even when it is you are paying for it with you and your teams time.

hampelm3y ago

Could you link to the specific guide you're referring to? I see a couple on quick search -- perhaps this one? https://docs.gitlab.com/ee/development/database_review.html

AnEro3y ago

Enjoy, it's honestly the best resource I've seen on data teams that is open. https://about.gitlab.com/handbook/business-technology/data-t...

1 more reply

munk-a3y ago· 2 in thread

Use a testing framework to mock some database into the DB, run your queries, verify the result. Make sure you have a variety of data you use for tests to fully exercise the surface of logic you expect to hit.

Basically, treat the query and database as a black-box for testing like you would another third party API call.

I would strongly suggest having a layer of code in your application that is exclusively your data access and keeping any logic you can out of it. Data level tests are pretty onerous to write in the best circumstances and the more complexity you allow to grow around the raw SQL the worse of a time you'll have - swapping out where clauses and the like dynamically is a cost you'll need to eat, and sometimes having a semi-generic chunk that you reuse with some different joins can be more efficient than writing ten completely different access functions with completely different internal logic so judgement is required.

At the end of the day a database is like any other third party software component - data goes in, data comes out... the nice thing is that SQL is well defined and you've got all the definitions so it's easier to find the conditional cases you need to really closely tests... but databases are complex beasties and it'll never be easy.

karmasimida3y ago

I think the tricky part is DB itself might get funny as time goes by, downstream logic changes might invalidate top level queries that built on now dated assumption

Overall data integrity is hard.

munk-a3y ago

It is - but that's the specific hardness that you want to capture and cover with your tests. Downstream logic changing and the application being unaware of it is one of the two major dangers these sorts of tests exist to protect against.

jiggywiggy3y ago· 2 in thread

To be honest.

I would love testing to work.

Have set up and maintained several unit test suites in Jest.

Wrote several large e2e test suites in Cypress.

I don't think anyone won time from simply having a manual checklist and testing manually.

Maybe me and my former teammates are doing it wrong. Talking 8+ teams, from corporate to startup.

But loved de proven wrong. E2e def. catched most issues.

monsieurbanana3y ago

I'm not sure I understand your comment, you seem to say that automating testing isn't better than manual testing, yet the e2e tests caught most issues?

jiggywiggy3y ago

Yeah could have been clearer, e2e from all testing suites i mean. My point is not so much that human testing it's better, which it is, but also it ends up being more time efficiënt the maintaining test suites. Which can het quite time consuming.

Also testing with a checklist by human is easy to outsource to cheap labor. Whereas testing engineers are pretty expensive

zabzonk3y ago· 2 in thread

assuming you are asking about sql select statements, the problem is knowing what the correct answer is so you can test against it. for most data, you don't, and probably cannot know this.

not a unique problem with sql, btw.

RobinL3y ago

You can know what the answer is against a small test dataset though. Obviously the challenge is ensuring it's representative, that you hit the edge cases of real data etc. But it's better than nothing

zabzonk3y ago

i have never worked with a small dataset. mocking one doesn't work, for the reasons you suggest, and others.

basically you cannot test queries against a big database. you just have to hope for the best.

1 more reply

jlund-molfese3y ago· 1 in thread

In SQL-heavy ETL pipelines, I normally don't test the SQL queries by themselves, but do black box & performance testing to verify that the output of a certain batch job matches what I expect (automated acceptance testing).

This is easier if you have the same input every time the tests run, like a frozen database image, because then you can basically have snapshot tests.

rozularen3y ago

we do something similar, we run some validation tests against the output regarding file size and line count but not the actual data

Ataraxy3y ago· 1 in thread

So I'm not an expert, but for simplistic use cases I merely make use of https://github.com/oguimbal/pg-mem

It's a lot faster and easier than dealing with containers and the like.

cgh3y ago

Same except we use a different in-memory database (H2). This makes it simple for CI systems like Jenkins to run the tests and you can have various degrees of integration. Really amazing that your comment was the first mention I saw of in-memory databases in this thread.

luckystarr3y ago· 1 in thread

I use a fake object in place of a database connection which gives fake responses when the correct SQL query is sent to it.

Example:

db = Fake().expect_query("SELECT * FROM users", result=[(1, 'Bob'), (2, 'Joe')])

Then you do:

db.query("SELECT * FROM users")

and get back the result.

In Python if you do this in a context manager, you can ensure that all expected queries actually were issued, because the Fake object can track which ones it already saw and throw an exception on exit.

The upside of this is, you don't need any database server running for your tests.

update: This pattern is usually called db-mock or something like this. There are some packages out there. I built it a few times for companies I worked for.

NoThisIsMe3y ago

This is testing the code that emits the SQL code, not the SQL code itself. Sometimes that's what you want, but it's not what OP asked for.

nextlevelwizard3y ago· 1 in thread

Test the queries your application is making. I wouldn't put much effort into this. You have to trust that other people test their stuff anyway so why make a difference with a database? I'd much rather test that your backups work.

And that can be done by dumping the database (possibly verifying the content of that dump), taking a backup, restoring the backup to a fresh container, then comparing dump of that freshly restored database to the one you took at the start.

zx143y ago

He is the one trying to test his stuff though, hence the question.

systems3y ago· 1 in thread

Well, I used to work at a place where they used https://tsqlt.org/ for testing it worked great

I dont know the technical detail of how to set it up, it was already setup when I worked there

But basically, we wrote SQL script that included statements to

1. create the db structure, tables or views

2. insert statement to enter test data (you can insert corner cases etc..)

3. ran the function or procedure

4. ran an assert to confirm if results are to our expectation

test script were ran by the CI/CD process

reactorpres3y ago

tSQLt is a great tool. If only it could get ported to some other dbs.

samsquire3y ago· 1 in thread

From an SQL database implementation perspective, in my toy Python barebones SQL database that barely supports inner joins (https://github.com/samsquire/hash-db) I tested by testing on postgresql and seeing if my query with two joins produces the same results.

I ought to produce unit tests that prove that tuples from each join operation produces the correct dataset. I've only ever tested with 3 join operations in one query.

For a user perspective, I guess you could write some tooling that loads example data into a database and does an incremental join with each part of the join statement added.

AnEro3y ago

Personally I like using dimensions from each set and seeing if the business logic side lines up. So measure sales month over month with 2 different sources and/or have the full join count important fields on matches what makes it in and out of the combination as a score board.

megalan2473y ago· 1 in thread

I experienced this issue in many companies and found there to be no real solution, especially when not testing with the real data. This was one of the reasons I created DrvDB [1] as it allows you to store a copy of the data and very quickly spin up containers to test large databases in CI, and verify the output, performance, etc. is what you expect.

You can achieve the same thing with "docker commit"-ing data into docker images of your dB engine of choice, and firing your queries on them, but that only really works with smaller datasets.

[1] https://devdb.cloud

femto1133y ago

Sorry for the offtopic remarks, but DevDB is relevant to my interests since I've been working on a similar/related and potentially synergistic tech as an on-again/off-again side project for a few years, and would love to chat with you about it. I've got the same username on GitHub and Twitter as here if you want to reach out through either of those channels.

idlewords3y ago· 1 in thread

Live in production!

mobjack3y ago

A production replica if you want to be safe

d01003y ago· 1 in thread

I've been using IntegreSQL and it works pretty well

majodev3y ago

Happy to hear that! When it comes to testing services that depend on PostgreSQL, this is still my preferred solution.

https://github.com/allaboutapps/integresql

disclaimer: author

chrisoldwood3y ago

Back in the mid-noughties I decided to see if I could write SQL in a test-first manner (i.e. TDD). This resulted in me writing a 100% T-SQL based unit testing framework for SQL Server [1] which we then used for the SQL back-end at an investment bank.

On the back on that professional use I wrote a blog post [2] explaining why you might choose to go down this route as it wasn't the way database was developed way back then (SQL wasn't developed in the same way as the other front-end and back-end code).

A few years later I gave a short 20-minute talk (videoed) to show what writing SQL using TDD looked like for me. It's hard to show all the kinds of tests we wrote in practice at the bank but the talk is intended to show how rapid the feedback loop can be using a standard DB query tool and two code windows - production code and tests.

Be kind, it was a long time ago and I'm sure the state of the art has improved a lot in the intervening years :o).

Chris Oldwood

---

[1] SQL Server Unit: https://github.com/chrisoldwood/SS-Unit

[2] You Write Your SQL Unit Tests in SQL?: https://chrisoldwood.blogspot.com/2011/04/you-write-your-sql...

[3] Test-Driven SQL: https://www.youtube.com/watch?v=5-MWYKLM3r0

dagss3y ago

Probably not that relevant for a data team, but this is what we do as a backend team:

We use Microsoft SQL's docker image and spin it up in the background on our laptop/CI server so port 1433 has a database.

Then we have our homegrown migration file runner that will compute a hash of the migrations, make a database template_a5757f7e, and run the hundreds of migrations on it, whenever we add a new SQL migration (todo: make one template build on the previous).

Then we use the BACKUP command to dump the db to disk (within the docker image)

Finally, each test function is able to make a new database and restore that backup from file in less than a second. Populate with some relevant test data, run code, inspect results, drop database.

So our test suite uses hundreds of fresh databases and it still runs in a reasonable time.

(And..our test suite is written in Go, with a lot of embedded SQL strings, even if a lot of our business logic is in SQL)

seanhunter3y ago

One easy method is just to test sql the way you test anything else:

1)Set up a test db instance with controlled data in it as the basis for your test cases. Ideally this data is taken from real data that has caused pipeline problems in the past but scrubbed for PII etc. You can also use or write generators to pad this out with realistic-looking fake data. If you do this the same dataset can be used for demos (once you add data for your demo paths).

2)Write test cases using whatever test framework you use in your main language. Say you code in python, you write pytest cases, java -> junit etc. You can help yourself by writing a little scaffolding that takes a sql query and a predicate, runs the query and asserts the predicate over the result. If you don't have a "main language", just write these test cases in a convenient language.

3)Consider resetting the state of the database (probably by reloading a controlled dump before each test batch) so any tests which involve inserts/deletes etc work. You may actually want to create an entirely new db and load it before each test run so that you can run multiple test batches concurrently against different dbs without contention messing up your results. Depending on your setup you may be able to achieve a similar effect using schemas or (sometimes but not always) transactions. You want each test run to be idempotent and isolated though.

Doing it this way has a number of benefits because it's easy to add your sql test cases into your CI/CD (they just run the same as everything else).

cwp3y ago

Two ideas here:

1) The same way you'd write any other tests. Use your favourite testing framework to write fixtures and tests for the SQL queries:

  - connect to the database
  - create tables
  - load test data
  - run the query
  - assert you get the results you expect

For insert or update queries, that assertion step might involve running another query.

2) DBT has support for testing! It's quite good. See https://docs.getdbt.com/docs/build/tests

hichkaker3y ago

I did data engineering for 6 years and am building a company to automate SQL validation for dbt users.

First, by “testing SQL pipelines”, I assume you mean testing changes to SQL code as part of the development workflow? (vs. monitoring pipelines in production for failures / anomalies).

If so:

1 – assertions. dbt comes with a solid built-in testing framework [1] for expressing assertions such as “this column should have values in the list [A,B,C]” as well checking referential integrity, uniqueness, nulls, etc. There are more advanced packages on top of dbt tests [2]. The problem with assertion testing in general though is that for a moderately complex data pipeline, it’s infeasible to achieve test coverage that would cover most possible failure scenarios.

2 – data diff: for every change to SQL, know exactly how the code change affects the output data by comparing the data in dev/staging (built off the dev branch code) with the data in production (built off the main branch). We built an open-source tool for that: https://github.com/datafold/data-diff, and we are adding an integration with dbt soon which will make diffing as part of dbt development workflow one command away [2]

We make money by selling a Cloud solution for teams that integrates data diff into Github/Gitlab CI and automatically diffs every pull request to tell you the how a change to SQL affects the target table you changed, downstream tables and dependent BI tools (video demo: [3])

I’ve also written about why reliable change management is so important for data engineering and what are key best practices to implement [4]

[1] https://docs.getdbt.com/docs/build/tests [2] https://github.com/calogica/dbt-expectations [3] https://github.com/datafold/data-diff/pull/364 [4] https://www.datafold.com/dbt [5] https://www.datafold.com/blog/the-day-you-stopped-breaking-y...

taeric3y ago

Best tool nowadays has to be to spin up a database and execute the queries against it. If you are on a database setup that spinning up an instance takes a long time, consider docker.

Be wary of too many techniques that are supposed to be making it easier to test, but also make it hard for you to leave a query pipeline. In particular, SQL should be very easy in the "with these as our base inputs, we expect these as our base outputs." Trying to test individual parts of the queries is almost certainly doomed to massive bloat of the system and will cause grief later.

kkleindev3y ago

The teams I've been working on have resorted to data tests instead of code tests. That means that the data produced by your code is tested against a certain set of expectations - in stark contrast to code being tested _before_ its execution.

We've written our own tool to compare different data sources against each other. This allows, for example, to test for invariants (or expected variations) between and after a transformation.

The tool is open source: https://github.com/QuantCo/datajudge

We've also written a blog post trying to illustrate a use case: https://tech.quantco.com/2022/06/20/datajudge.html

sakopov3y ago

.NET Shop using SQL Server here, but I think something similar to what we do can apply to any stack. We use TestContainers [1] to spin up a container with SQL Server engine running on it. Then use FluentMigrator [2] to provision tables and test data to run XUnit integration tests against. This has worked remarkably well.

[1] https://dotnet.testcontainers.org/

[2] https://fluentmigrator.github.io/

dave_infuseai3y ago

dbt does have testing built in, but of course there are only certain cases for which that kind of testing works. dbt can't know if your metrics 'look' right, only you will know.

As others have mentioned, you want to compare the results of your queries against a previously known 'good' state of the data. So, as you're making data model changes, you can regularly check your development environment against production to see how your changes affect the data.

Data profiling is the perfect tool for this, especially when your pipeline reaches a certain size, or you're dealing with very large datasets.

I work on the team creating PipeRider.io, which uses data profiling comparisons as a method of "code review for data".

It becomes particularly useful when you automate generating data profiles of development and production environments in CI, and attach the data profile comparison to the pull request comment. It makes seeing the impact of changes so much easier.

Here's an article that discusses the benefits of this: https://blog.infuseai.io/why-you-lack-confidence-merging-dbt...

Waterluvian3y ago

GitHub Actions trivialized this for us. Spawning a Postgres database for testing is easy and carefree. Spawns. Runs operations and evaluates the state of the database after each operation.

We have two flavours of test: one that drops the transaction each time, ensuring a clean, known state. And one that doesn’t, allowing your tests to avoid lots of overhead by “walking through a series of incremental states”.

Yes, some might call the latter heresy. But it works great.

dwohnitmok3y ago

Given that you're using dbt, it comes with a testing framework out of the box: https://docs.getdbt.com/docs/build/tests

nephton3y ago

Write a toolbox for your tests that is able to spin up and tear down an instance of the database that you use (eg. on a random port and random password). Make sure that the new instance is really independent, especially storing its data in a separate (temporary) ditectory.

During test: - At the start of the test (fixture), run a new DB instance - Apply DB schema. - possibly: Remove constraints that that would disturb your tests (eg. unimportant foreign keys) - possibly: Add default values for columns that are not important for your test (but do with caution) - run you test - Assert results (maybe also directly as access to databse or via a dump of tables). - Tear down database possibly removing all data (except error logs).

I used this pattern to test software that uses MySQL or MariaDB server. For Microsoft SQL server it may be enough to create a new database instead of running a new instance (possible but not as easy as for MySQL/MariaDB).

On CI server this can be used to run tests against all required DB server types and versions.

KronisLV3y ago

I think that when it comes to testing databases... most people just don't.

Look at this JetBrains survey: https://www.jetbrains.com/lp/devecosystem-2021/databases/

Around half of the people never debug stored procedures. Three quarters of people don't have tests in their databases. Only half of the people version their database scripts.

Personally: the answer is containers. Spin up a database in a container (manually or on CI server) and do whatever you need with it. Seed it with some test data, connect an app to it, check that app tests pass when writing to and reading from a live database (as opposed to mock data stores or something like H2), then discard the container.

Even if you don't have a traditional app, throwaway instances of the real type of DB that you'll be using are great, both for development and testing.

gsvclass3y ago

In GraphJin an automatic GraphQL to SQL compiler we use the gnomock library it startups a database instance (docker) then create the schema and tests data and finally our code connects to it and runs a series of tests. We run these across Mysql, Postgres and a few other DB's. Gnomock supports a wide range of them. Right now we don't take down the db for every test only between test runs but its fast enough that we could. This whole thing runs of a simple `go test -v .` command and we run it on every commit using a githook. https://github.com/dosco/graphjin/blob/master/tests/dbint_te...

summerlight3y ago

Google internally has developed a SQL language dialect and some investments have been made to have a first class language supports for SQL. It has several interesting functionalities that made SQL nice for ETL use cases.

  * It has a language level module support, similar to other languages. Thus SQL functions are reusable across multiple codebases without depending on code generation tricks. One of the major blocker for SQL adoption has been complex domain specific business logic and now the situation is better.
  * It has an official unit test support. Google use Blaze (which is known as Bazel externally), so adding a unit test for SQL code is as simple as adding a SQL module (and its test input) dependency to SQL test target, write a test query and its expected output in a format of approval testing. Setting up the DB environment is all handled by the testing framework.
  * It has an official SQL binary support. It's just a fancy name for handling lots of tedious stuffs for running a SQL query (e.g. putting everything needed into a single package, performing type checks, handling input parameters, managing native code dependencies for FFI etc etc).

None of those are technically too sophisticated at least in theory, actually these combined together become pretty handy. Now I can write a simple SQL module which mostly depends on other team's SQL module, do a simple unit test for it then run a SQL binary just as other languages. I haven't worried a single time on how to set up a DB instance. This loop is largely focused on OLAP so it's a bit different for OLTP, which has another type of established testing patterns.

gorgoiler3y ago

Realistically, most of my tests are integration / end to end tests. They typically get written only when it comes to patch time, where you first want proof that the old system works before you tear it apart and rebuild it. I think that’s probably the only SQL testing I’ve ever done and honestly, if they are fast enough, that kind of integration testing is all you will need too.

As the meme say: App worked before. App work afterwards. Can’t explain that.

creakingstairs3y ago

Phoenix/Elixir/Ecto seems to handle this really well. When you write a test in Phoenix, you can use `DataCase` which automatically creates sandboxed transactions [1] for each test. This makes writing integration tests a breeze.

[1] https://hexdocs.pm/ecto_sql/Ecto.Adapters.SQL.html#module-sq...

ufmace3y ago

I don't think there's any perfect universal answer for this. I only have a few things I've done that work.

Rails for Ruby comes with some pretty nice setups for testing the database code. There's a test DB by default with the same schema as Production, and the usual test frameworks (FactoryBot and RSpec) make it easy to set up some data in the actual DB for each spec, run model code that makes actual SQL queries, and assert against the results.

I would have hoped most other web hosting frameworks would make as much effort to making it straightforward to test your database code, but it doesn't really seem to be the case.

In Rust, there's a very handy crate called sqlx. What it does is, at compile time, it runs all of the SQL in your codebase against a copy of your database to both validate that it runs without errors and map the input and output types to typecheck the Rust code.

When it comes to stuff like validating that your queries are performant against production datasets or that there isn't any unexpected data in production that breaks your queries, well I pretty much got nothing. Maybe try a read replica to execute against?

epgui3y ago

We use this and take an example-based tests approach for any non-trivial DBT models: https://github.com/EqualExperts/dbt-unit-testing

More trivial example:

    {%
        call dbt_unit_testing.test(
            'REDACTED',
            'Should replace nullish values with NULL'
        )
    %}
        {% call dbt_unit_testing.mock_source('REDACTED', 'REDACTED', opts) %}

            "id" | "industry"
            1    | 'A'
            2    | 'B'
            3    | ''
            4    | 'Other'
            5    | 'C'
            6    | NULL

        {% endcall %}

        {% call dbt_unit_testing.expect(opts) %}

            "history_id" | "REDACTED"
            1            | 'A'
            2            | 'B'
            3            | NULL
            4            | NULL
            5            | 'C'
            6            | NULL

        {% endcall %}
    {% endcall %}

winrid3y ago

I try to do some kind of compile-time query checking. I really like sqlx with Rust, and other languages have some kind of equivalent (although maybe not as nice) like JOOQ. If you can store the queries in some kind of configuration, like SQL files, then this is easy no matter the language.

jesseryoung3y ago

I started in the software engineering space and move into data engineering, and I was floored with the complete lack of tooling. There is a HUGE gap between software engineering and data engineering when it comes to both tooling and practice. Even the simplest "unit" test of "Is the SQL statement valid" is not all that common in frameworks and tooling but in practice is like 90% of the production failures that I've seen.

Starting with a framework that is programming language first (IE Spark) can help you build your own tooling to help you actually build unit tests. It's frustrating though, that this isn't just common across other ETL tooling.

alphanumeric03y ago

I actively avoid it. The SQL languages have no default libraries for testing. Furthermore, SQL cannot be easily composed, or abstracted into modules, so there is no sense in testing it. It is not designed to be tested.

hansvm3y ago

You get a lot of bang for your buck with an expected input/output setup. In basically every database it's trivial to set up a few tiny tables, and it's cheap to run a query on small input. Pick a few edge cases and a few non-trivial representative examples, and any passing query written by a real person will likely express the logic you care about or will expose an additional input/output pair to add to the tests. Combine the high efficacy with the ease of writing and understanding such a test, and it becomes hard to argue against having at least a few.

AdrianB13y ago

It depends a lot on your use case. In my case we have SQL running against tables with trillions of rows, so we need to take a look at every single SQL query in the code that runs more than tens of milliseconds or often enough to get significant. There is no automation for a good DBA looking at an execution plan; I heard about a guy that works in some financial company where his job for the past 10 days was to tune the same ~ 10 queries to the death, but if your app is working with a database that can be hosted comfortably on a smartphone, none of this is needed.

rowls663y ago

This is a little off topic, but related. Does anyone know of any tools that would allow information about the size and shape of expected data to be provided along with a database schema so that developers could get instant feedback if queries were likely to perform poorly when run against production data sets. Or to perform poorly when a database grows to beyond a certain size. I have seen many instances of SQL going into production databases that works well for a while, but gets much slower as the database grows.

thingsilearned3y ago

This linter can really enforce some best practices https://github.com/sqlfluff/sqlfluff

A list of best practices: https://docs.getdbt.com/guides/legacy/best-practices

And shameless plug but there's a chapter on modeling in my book: https://theinformedcompany.com

cgopalan3y ago

You say you are using dbt, so doesnt "dbt test" provide you with the functionality to test? I assume by testing sql pipelines, you want to test if the data written to intermediate stores conform to what you expect. You should be able to do that with dbt test. If you are using an analytical database like Snowflake you could direct the results of the dbt run and dbt test to a test database and do your testing there.

geocrasher3y ago

SQL pipelines can be tested pretty easily. If they look congested, use a snake (Python, if you like) to try to knock the CRUD out of them.

davvid3y ago

Here's a nifty python module for testing against postgres that a friend wrote. It creates a temporary database on the fly for testing and tears it down at the end. You can then populate the database and run all kinds of integration tests against it.

https://github.com/ugtar/pg_temp

edublancas3y ago

If you use Jupyter, check out JupySQL. It allows you to break down long queries in multiple cells so you can test them individually: https://jupysql.ploomber.io/en/latest/compose.html (Disclosure: my team develops JupySQL)

andy_ppp3y ago

I really love (as with lots of things) how Ecto from the Elixir community handles this, you have an extra database and because Postgres has awesome transactions you can even run all theses tests of your whole data layer in parallel, including any SQL. Ecto is largely a domain specific language for writing modular SQL so that helps test things too.

sam0x173y ago

Side note: I believe it is good form to always have unit tests that test the "up" and "down" for every single migration in your app. It's not always possible but if you're strict about it you can avoid a lot of bad patterns and have a much healthier set of migrations

hans0l0743y ago

On a related note (though it does touch upon testing) mitchellh open sourced Squire[1] recently (today?). Note though, that it's PostgreSQL specific.

[1] https://github.com/mitchellh/squire

gxt3y ago

Abstraction layer between the query you write and the one that gets executed. This way you can mock the schema, run the query on the mock to assert some condition x.

A ref() concept like dbt's is sufficient. When testing, have ref output a different (test-x) name for all your references.

nitwit0053y ago

If your app is API driven, I would create test data with the APIs, and test the queries against what that generates. APIs tend to change less often than the internal data representation, and the app developers can usually figure out how to fix the tests when they do.

iblaine3y ago

If I were building a tool to test SQL, then I'd try to load the SQL into a dataframe, then test it by mocking the tables and the output. This is a tough problem to solve. If testing is important, possibly move away from SQL and towards ORMs.

JenrHywy3y ago

We test it implicitly with integration tests across the API for the most part. We occasionally have unit tests (in C#) that directly call stored procs.

The backbone for this is that we spin up a DB per unit test, so we don't have to worry about shared state.

dangwhy3y ago

I've had great success with this

https://news.ycombinator.com/item?id=34580675

I am always baffled by why this ins't more popular way of writing SQL.

javaunsafe20193y ago

I always did that with integration tests. Put some data in the db. Use the repository aka your sql and validate the results.

Most of the times there is a layer around your sql (a repository, a bash script or whatever) that you can use for integration testing.

devdada3y ago

Creating a separate isolated environment is totally the way to go, and migrate the data from your existing database to it. This way the data turns to "dummy" data but it provides you with something to conduct testing with.

jve3y ago

I regret making an integration in SQL Server Integration Services, with some nice addon tools that can make http requests and such. In the end it is untestable piece of hard to follow solution.

Anyone have any recommendations on testing SSIS ?

pharmakom3y ago

Testing databases with Docker, dummy data, etc. can be very slow so it’s a big win to use Bazel as the test executor. This enables caching between runs and between machines. Saves us about 20 mins of CI time every build.

Gigachad3y ago

Part of the solution could be using tooling which can compile time check SQL is valid like https://github.com/launchbadge/sqlx

nikita3y ago

Disclosure. I'm CEO of Neon (neon.tech).

One of the premises that we have is the ability to instantly create a test environment by creating a branch. I'd love to hear what you think about it.

mikpanko3y ago

Datafold (https://www.datafold.com/) is solving exactly that problem and has integration with dbt.

xmorse3y ago

I do E2E tests for the core features against a staging database and url.

This way i don't waste time with unit tests that quickly get old and no one wants to maintain and run

hparadiz3y ago

Surprised only a single comment in this thread had mentioned the "explain" keyword that exists in MySQL.

Excellent for checking delete queries before running them.

johnthescott3y ago

sqldb testing is historically problematic, for at least correctness AND performance.

in postgresql a cool tool for performance is"hypothetical indexing", which predicts how the optimizer will use indexes in any sql query. i could see an automated testing tool written around "hypothetical indexing".

also, i believe MSServer supports HI.

spikder3y ago

I have done this more than once and it is great. Point django at your db, dump the models, and then use django's test framework.

dirtybirdnj3y ago

in production, on a friday before I leave for vacation

for real though I love tools like SequelPro or TablePlus that let me work out a query before I bake logic or stuff into my apps. Also sometimes I use it to work out the data needed for reports. I am working with salesforce for the first time in my life and apparently there are tools that let me treat it like I'm used to SequelPro.

Andys3y ago

Something to consider (for Go users): Use the sqlc library, which makes whole classes of bugs/mistakes into compile-time errors.

DecoPerson3y ago

In production ;)

But my app is for six users at one site, it’s not mission critical, and the sqlite DB is backed up hourly.

Life’s too short for (unnecessary) testing.

world2vec3y ago

DBT supports testing right out of the gate, you can write little queries and macros to test and validate every column.

EGreg3y ago

Have small databases.

Learn to use IMPORT TABLESPACE in MySQL or just dump and import SQL.

Every time you run a test you set up the mock databases again.

dllthomas3y ago

I've been wanting to put together a property testing framework that lets you specify properties as queries.

kindofabigdeal3y ago

Either a test environment or a mocking library

z3t43y ago

Local Dev dB and/or staging

User233y ago

Like any other integration test.

postalrat3y ago

Git, merge requests, developers.

neural_thing3y ago

Great Expectations and Datafold

distantsounds3y ago

Flyway, https://flywaydb.org/

tommyage3y ago

Relational databases rely on math. You may test your implementation; Or the input to your statements.

karakanb3y ago

Disclaimer: I am building a data platform called Blast, and part of our focus is to make SQL pipelines easier to maintain at every aspect: easier to write, easier to test, easier to ensure it is correct before they are deployed. Link in my bio if you are interested, happy to have a conversation anyway. This is a problem we have been thinking and providing solutions for quite some time.

There are a few types of tests one would like from a SQL pipeline, each with a different value add:

- Quality assurance tests: these are things like DBT tests, they mainly test the accuracy of the result after the tables are produced. Examples of this would be tests like "this column should not contain any `null` values" or "it should have only X, Y and Z values". They are valuable checks, but the matter of the fact is that there are many cases where running this sorts of tests after the data is produced is a bit too late.

- Integration tests: specify an input table and your expected output, and run your queries against it, the end result must match with the expectations at all times. This is useful for running them regularly and serve as "integration tests" for your SQL assets. They allow validating the logic inside the query, provided that the input is covering the cases that needs to be covered, they can be executed in CI/CD pipelines. We are exploring a new way of doing this with Blast CLI, effectively running a BigQuery compatible database in-memory and running tests against every asset in the pipeline locally.

- Validation tests: these tests aim to ensure that the query is syntactically correct on the production DWH, usually using tricks like `EXPLAIN` or dry-run in BigQuery. These sorts of tests would ensure that the tables/fields referenced actually exist, the types are valid, the query has no syntax errors, etc.. These are very useful for running in CI after every change, effectively allowing catching many classes of bugs.

- Sanity checks: these are similar to the quality assurance tests described above, but with a bigger focus on making sense out of the data. They range from "this table has no more rows than this other table" to business-level checks such as "the conversion rate for this week cannot be more than 20% lower compared to last week". They are executed after the data is produced as well, and they would serve as an alerting layer.

There is no silver bullet when it comes to testing SQL, because in the end what is being tested is not just the SQL query but the data asset itself, which makes things more complicated. The fact that SQL has no standardized way of testing things and the language has a lot of dialects make this harder than it could have been. In my experience, I have found the combination of the strategies above to have a very good coverage when it comes to approximating how accurate the queries are and how trustworthy the end result is, provided that a healthy mix of them is being used throughout the whole development lifecycle.

sqldba3y ago

Thoughts and prayers.

Teamteam163y ago

SQL.lite

devdada3y ago

Test

opportune3y ago

I used to work on SQL pipelines, sadly I didn’t get to implement some of these suggestions, but we did implement others.

For testing:

Run your query/pipeline against synthetic/manual data that you can easily verify the correctness of. This is like a unit test.

Run your query/pipeline on sampled actual data (eg 0.1% of the furthest upstream data you care about). This is like an integration test or a canary. Instead of taking 0.1% of all records you might instead want to sample 0.1% of all USERID so that things like aggregate values can be sanity checked.

Compare the results of the new query to the results from the old query/pipeline. This is like a regression test. You may think this wouldn’t help for many changes because the output is expected to change, but you could run this only on e.g. a subset of columns.

Take the output of the new query (or sampled query, or the manual query) and feed it to whatever is downstream. This is like a conformance test.

For reliability:

If the cost is not prohibitive, consider persisting temporary query results (eg between stages of your pipeline) for 1-2 weeks. This way if you catch a bug from a recent change you only need to rerun the part of your pipeline after the breakage. May not make sense to do if your pipeline is not big

If the cost is not prohibitive you could also run both the new and old versions of the pipeline for ~a week so that you can quickly “rollback”. Ofc whether this is viable depends on what you’re doing.

The big failure modes with SQL pipelines IME are

1. unexpected edge cases and bad data causing queries to fail (eg you manually test the new query and it works fine, but in production it fails when handling Unicode)

2. not having a plan for what to do when a bug gets caught after the fact

3. barely ever noticing bugs or lost data because nobody is validating the output (for example, if you have a pipeline that aggregates a user’s records over a day, any USERID that’s in the input data for that day should also be in the output data for that day).

4. This can be very hard to solve depending on your circumstances, but upstream changes in data are the most annoying and intractable to solve. The best case here is you either spec out the input data closely OR have some kind of testing in place that the upstream folks run before shipping changes.

To address these, you need to take the approach of expecting things to fail, rather than hoping they don’t. This is common practice in many SWE shops these days but the culture in the data world hasn’t quite caught up. I think part of the problem is that automating this testing usually requires at least some scripting/programming which is outside the comfort zone for many people who “just write SQL.”

2v35gggg3y ago

by running it in prod duh

sbricks3y ago

you don't?...

chadlavi3y ago

that's the neat thing, you don't

wokwokwok3y ago

Two days ago, everyone: “You should use SQL because everyone knows it and modern SQL is pretty good.

Other languages are too complicated. :(“

Everyone today: “tries using sql

Oh wow, the tooling is quite basic, and you can’t express complex data structures and imperative code. :(“

What did you expect?

Look, I spent 4 years in this rabbit hole, and here’s my advice:

Don’t try to put the square peg in the round hole.

You want easy to write, simple code and pipelines? Just use sql.

Have a dev environment and run everything against that to verify it.

Do not bother with unit testing your CTEs, it’s hard to do, there are no good tools to do it.

If you want Strong Engineering TM, use python and spark and all the python libraries that exist to do all that stuff.

It won’t be as quick to write, or make changes to, but it will be easier to write more verifiably robust code.

If you treat either as something it is not (eg. Writing complex data structures and frameworks in sql) you’re using the wrong tool for the outcome you’re trying to achieve.

It’ll take longer and “feel bad”, not because the tool is bad, but because you’re using it in a bad way.

j / k navigate · click thread line to collapse

315 comments

226 comments · 96 top-level

ramenmeal3y ago· 37 in thread

We spin up a docker container running the DB technology we use, run our DB migration scripts on it, and then run integration tests against it. You get coverage of your migration scripts this way too.

MBCook3y ago

This is what we did at my last job. You can catch DB specific issues that a false implementation wouldn’t show and make sure all your code paths work as expected.

Every time new issues cropped up we would put new data in the test data designed to reproduce it. Every edge case we would run into.

It provided so much confidence because it would catch and trigger so many edge cases that testing with mocks or by hand would miss.

jdc05893y ago

acheton3y ago

fatneckbeard3y ago

thats a huge amount of work and money.

at my company they just told us to stop reporting edge cases. much easier, much cheaper.

2 more replies

feike3y ago

We do this too for PostgreSQL: to ensure the tests are really fast:

    - we create a template database using the migrations
    - for *every* integration test we do `CREATE DATABASE test123 TEMPLATE test_template;`
    - we tune the PostgreSQL instance inside Docker to speed up things, for exampling disabling synchronous_commit

On a successful test, we drop the test123 database. On a failed test, we keep the database around, so we can inspect it a bit.

The really great thing about this approach (IMHO), is that you can validate certain constraint violations.

This approach is pretty much as fast as our unit tests (your mileage may vary), but it prevents way more bugs from being merged into our codebase.

neonate3y ago

Out of curiosity, how fast is really fast?

1 more reply

pgoggijr3y ago

I've used this same approach as well. Testcontainers is a nice way to help with this!

https://www.testcontainers.org/

JackFr3y ago

jve3y ago

What is testcontainer? Reading the first page I don't understand what benefits it buys me:

Edit: Ok, example helps. https://dotnet.testcontainers.org/examples/aspnet/

I can declare docker infrastructure from my code, right?

1 more reply

claytonjy3y ago

¹: https://sqitch.org/

grilledcheez3y ago

1 more reply

Dowwie3y ago

I resented writing the verify scripts for my migrations, after writing unit and integration tests, but yes it is valuable

kakwa_3y ago

We do that too, in fact it's not only the DB that run in docker, but the whole build + CI process.

This strategy could probably be extended to migrating from every older version by building each tag, but we don't have that need.

We could also probably improve checks on the data itself however as for now, we only check the schemas.

Few things to note:

* it's still possible to run the tests outside of docker and use a local DB instead with some light setup (it's faster than running everything in docker when developing)

* docker argument --tmpfs <dir/of/db> is quite good, assuming you have enough ram for your dataset

* few configuration tweaks on the DB, like max connection might be necessary.

Overall, we are quite happy with this setup as it permits to implement end to end integration tests quite easily without spending too much time mocking dependencies.

ddyevf6353723y ago

Running sqlite in memory as a test db speeds up your test runner as crazy. You can do this if you use an sql query builder library, because it can translate your queries to the specific database.

whalesalad3y ago

2 more replies

danielheath3y ago

You can get the 'in-memory' speed advantage by putting the datastore on a ramdisk (or even just disabling fsync, which is pretty easy to do in postgresql).

nicoburns3y ago

There’s also the eatmydata program which does similar by using LD_PRELOAD to intercept io calls.

rc_mob3y ago

problem is that its not always compatible with features you use on your production database

waste_monk3y ago

1 more reply

gardenhedge3y ago

So you test against a different database technology than the one you software uses? I understand why that works but it seems odd

1 more reply

sicp-enjoyer3y ago

Why should this be faster than a local postgres instance with no traffic?

1 more reply

crabbone3y ago

atmosx3y ago

hot_gril3y ago

Same (though I don't use Docker). Did TDD for a while like this, and it wasn't perfect, but it worked better than anything else. I didn't even know how to run the frontend; that was a separate team.

inetknght3y ago

bonus points if you add test data for integration tests with sad paths too

csteubs3y ago

Absolutely this. I stood up a negative test suite for continuous DB queries late in 2020 and it's caught many potential show stopper integration issues since; about 45% more YoY than pre-suite.

allanbreyes3y ago

MuffinFlavored3y ago

Do you do this in place of unit tests (where you have to mock/stub the DB interactions) or do you do both?

nishs3y ago

Sorry for the digression first. (If anyone has different definitions for the ideas here, I would love to learn.)

On the original question:

> Do you do this in place of unit tests (where you have to mock/stub the DB interactions) or do you do both?

[1] A true unit test is one that has no external interactions, neither directly nor transitively, besides interactions with the system under test (SUT). The SUT here is the HTTP handler.

[2] An integration test should include external system interactions, if any. That's what integration means.

[3] The SUT is each SQL query and the its interaction with the database. There are no interactions in the test with systems outside the SUT, so it is still a true unit test.

1 more reply

sssspppp3y ago

would love to do this, but how does one spin up a redshift cluster inside of a docker container?

ntr--3y ago

try not to cry, cry a lot, and then resolve not to vendor lock yourself to a black box data store next time.

jokes aside, redshift is based on pg^1, you can try an older version to get some semblance of it running locally.

1. https://docs.aws.amazon.com/redshift/latest/dg/c_redshift-an...

nicoburns3y ago

And this is why you don’t rely on a closed source stack if you have any alternative.

danmcs3y ago

contravariant3y ago

Well the first step is to get Amazon to part with their lucrative closed source software.

atwebb3y ago

Would redshift serverless help at all? I realize that it may not have parity with the existing functionality, just a thought.

totalhack3y ago

Same. That's one of the benefits of being able to replicate your environment with something like docker compose and then use a tool like alembic to manage migrations both locally and in production.

unboxingelf3y ago

This is the way

RobinL3y ago· 29 in thread

If you use DuckDB to run the tests, you can reference those files as if they were tables (select * from 'in.parquet'), and the tests will run extremely fast

I've used the above strategy with PyTest, but I'm not sure conceptually it's particularly sensitive to the programming language/testrunner you use.

Also I have no idea whether this is good practice - it's just something that seemed to work well for me.

If it makes it more readable you can also inline the 'in' and 'expected_out' data e.g. as a list of dicts and pass into DuckDB as a pandas dataframe

One gotya is SQL does not guarantee order so you need to somehow sort or otherwise ensure your tests are robust to this

paulmd3y ago

Potentially if you scripted your table creates you could add an additional rule that GLOBAL TEMPORARY TABLE gets added to any other tabledefs but only for unit tests. Or just update both in parallel.

n0n0n4t0r3y ago

Off topic incoming (sorry )

This is one of the worst drawbacks of orm everywhere: nobody even seems to think about those optimisations anymore.

2 more replies

MrPowers3y ago

* reuse the same Spark session throughout the test suite

* Set shuffle partitions to 2 (instead of default which is 200)

* Use dependency injection to avoid disk I/O in the test suite

RobinL3y ago

That's super useful, thanks. Could you expand on the 'Use dependency injection to avoid disk I/O in the test suite' point please - I'm not sure I understand what it means but it sounds interesting!

MrPowers3y ago

Sure, this blog post explains the dependency injection design pattern with Spark: https://mrpowers.medium.com/dependency-injection-with-spark-...

colanderman3y ago

timwis3y ago

Just to flag this behaviour changes in v12: https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-...

1 more reply

nicoburns3y ago

This is a good point, although it’s worth noting that there are definitely cases where you want the sun table to be materialised (esp. when that table is small and referenced many times)

dragonwriter3y ago

Yeah, its definitely one of those things where it is worth reading the docs [0], and not just adopting an “always do it this way” rule of thumb.

[0] https://www.postgresql.org/docs/current/queries-with.html#id...

bolanyo3y ago

Is 'sun table' a particular concept here? A typo?

1 more reply

gpvos3y ago

CTE = common table expression, i.e. a WITH clause (possibly recursive) before your SELECT (or other) statement. https://learnsql.com/blog/what-is-common-table-expression/

stormdennis3y ago

Yes, I didn't know what it was so I had looked it up before I saw yours. The definition I got is below:

A common table expression, or CTE, is a temporary named result set created from a simple SQL statement that can be used in subsequent SELECT, DELETE, INSERT, or UPDATE statements.

2 more replies

curiousllama3y ago

> Try and write any complex SQL as a series of semantically meaningful CTEs

I find this helps a heck of a lot with maintainability + debugging as well

atwebb3y ago

1 more reply

shazzdeeds3y ago

A few quick tips to help slow Spark tests.

2. If you have any kind of magic-number N in the SparkSQL or dataframe calls (e.g coalesce(N), repartition(N)) change N to be parameterized, and set it to 1 for the test.

3. Make sure the master doesn't have any more than 'local[2]' set. Or less depending on your workstation.

marijnz0r3y ago

jimmygrapes3y ago

Not much more to say, just observing, sorry if this is irrelevant commentary.

jph003y ago

CTE==Common Table Expression. It's not specific to any particular language. It's basically like a view, except you can define them the same place as you use them, and they can be recursive.

1 more reply

db48x3y ago

Sometimes using databases feels just like repeatedly slamming your head into your desk, so that fits.

collyw3y ago

I used to think that, but once you start thing about them in the right way, the relational model is pretty nice.

1 more reply

osigurdson3y ago

>> I could think of when you said CTE was "chronic traumatic encephalopathy"

The title is "Ask HN: How do you test SQL"

intrasight3y ago

I'm a fan of CTEs too. Here's a pattern that I use in Sql Server for testing/debugging when using CTE pipelines. Same approach would work with "FOR JSON"

   declare @debug bit = 1;

   ;with cte1 as (
     select
       @debug AS Debug1,
       ...
   ),
   cte2 as (
     select 
       @debug AS Debug2,
       ... from cte1
   ),
   cte3 as (
     select 
        @debug AS Debug3,
        ... from cte2
   )
   select
     -- dump intermediate if debug
     (
       select * from cte1 where Debug1=1 for xml raw ('row'), root ('cte1'), type
     )
     ,(
       select * from cte2 where Debug2=1 for xml raw ('row'), root ('cte2'), type
     )
     ,(
       select * from cte3 where Debug3=1 for xml raw ('row'), root ('cte2'), type
     )
     -- final results
     ,(
         select ...
         from cte3 for xml raw ('row'), type
   )
   for xml raw ('results'), type;

devin3y ago

> Try and write any complex SQL as a series of semantically meaningful CTEs.

Could you or anyone else on the post provide an example?

paulmd3y ago

let's say you are doing a paystub rollup - a department has multiple employees, an employee has multiple paystubs.

ChadMoran3y ago

> Try and write any complex SQL as a series of semantically meaningful CTEs

I used this exact same method. Not only does it help me but those who come after trying to understand what's going on.

tensor3y ago

mritchie7123y ago

Doesn't that mean you have to copy all your data into duckdb? I'd imagine with even a modest data warehouse, loading the entire thing to duckdb would be unfeasible.

RobinL3y ago

mritchie7123y ago

How do you get the small datasets? Selecting a random sample from the source?

1 more reply

uticus3y ago· 13 in thread

Related - how is any declarative language tested?

Quick web search confirms suspicions, it is not easy

https://www.metalevel.at/prolog/testing

henryfjordan3y ago

Your link has the answer, declarative testing. You need to think about why you wrote that code and declare some outputs for given inputs that give you confidence in your code.

The other tests from your link are just silly. You did not write code to terminate or to provide an answer for all inputs, you wrote code to provide the right answer.

To test some HTML for instance, think about what information you want the page to convey, load up a browser, and check that the info is displayed. Easy as that.

uticus3y ago

> You did not write code to terminate or provide an answer for all inputs...

Definitely not an expert. But to my eyes checking for non-termination seems very close to things that happen in SQL all the time.

But I might want something more robust checking the outer bounds of acceptable data (like a min and a max). That seems much closer to the non-termination test.

Of course, I'm open to being corrected on this.

spamtarget3y ago

SQL is not a declarative language. It is a functional language, and structured language on the top as extensions. HTML is a declarative language.

taeric3y ago

spamtarget3y ago

"don't get anything other than the HTML that you create" that's the point of declarative languages

Imperative: cooking recipe

Functional: assembly line in a factory

Declarative: your assistant

(compare the real world examples to the things i mentioned)

2 more replies

luhn3y ago

> Although SQL is essentially a declarative language (4GL), it also includes procedural elements.

https://en.wikipedia.org/wiki/SQL

1 more reply

mLuby3y ago

You tell SQL what you want, not how to get it. That's declarative.

  SQL : CSV :: GraphQL : JSON :: React : HTML

spamtarget3y ago

Really?

SELECT CASE WHEN employee.type = 'contract' THEN salary CASE WHEN employee.type = 'full-time' THEN salary + benefit_costs END CASE FROM employee

Tell me how is this declarative?

3 more replies

recursive3y ago

The most important requirement for a functional language is that functions are first-class values. i.e. Not SQL.

On the other hand, the details of query execution are left to the planner and optimizer.

What's the case that it's functional, but not declarative?

spamtarget3y ago

1 more reply

preseinger3y ago

SQL is absolutely a declarative language. It is practically the textbook definition of a declarative language.

ttfkam3y ago

Very often literally as well.

uticus3y ago

Okay, I'll grant it's not pure declarative.

But the part that is declarative is sure difficult to test.

2 more replies

crabbone3y ago· 7 in thread

Can you please expand on what you mean by DBT?

MySQL DBT2 suit: https://dev.mysql.com/downloads/benchmarks.html

PostgreSQL DBT2 suit: https://wiki.postgresql.org/wiki/DBT-2

Those tools are typically modeled on TPC-B... And, it would require a separate discussion to describe why these tests are obsolete and why there isn't really any replacement.

----

crabbone3y ago

Oh, someone else below suggested that DBT in OP may be this: https://docs.getdbt.com/ .

rizz03y ago

Depends on what you mean by "the subject". dbt doesn't claim to be a testing tool. Personally I use dbt all the time, while I didn't know about the existence of DBT-2. Different worlds apparently.

crabbone3y ago

Have you ever been burned by trying to install Docker and getting the whole new desktop manager installed because Docker is a widget in that desktop manager (I believe it's in Gnome)?

Actually happened to me twice (different distros at different times did include Docker, the container management tool as "docker" while more commonly naming it something else, like "docker-engine").

vitorbaptistaa3y ago

Not the OP, but I think he/she meant https://www.getdbt.com

crabbone3y ago

Vaslo3y ago

I think you are talking about two different DBTs. OP is referring to this one I believe:

https://www.getdbt.com/

crabbone3y ago

1 more reply

uticus3y ago· 6 in thread

My favorite interview question. No, I mean when I'm being interviewed. The sheepish grins let me know I'm not alone.

Best ideas IMO (no particular order):

- make SQL dumber, move logic that needs testing out of SQL

* edited formatting

pharmakom3y ago

Interesting. We have totally different philosophies here.

1. Moving logic out of SQL can break transactionality and hurt performance (more data leaves the DB)

2. ORMs hide the SQL from you, making all sorts of other things harder

My favoured approach is to test my application against a real local database, built from a fresh snapshot each time.

I do not change my application code to make testing easier. The application code is optimised for maintainability - so simplicity and ease of reading.

zip12343y ago

EF core makes it simple to test against a real sqlite db. It is super fast to run the tests and gives a reasonably realistic outcome.

uticus3y ago

It does and I understand. But in my experience most of what should be tested is not "does the database contain x?" but "does this where clause filter y?"

Those sorts of things are programmed in EF by LINQ queries. All I'm saying is, test the LINQ queries the same way you would test non-EF LINQ queries. LINQ doesn't require a database.

postalrat3y ago

Test it by not testing it! Genius!

javaunsafe20193y ago

You mean using interfaces and integration tests?

uticus3y ago

> using interfaces

> and integration tests

No, unit tests

1 more reply

purerandomness3y ago· 5 in thread

There's pgTAP for Postgres [1], the same approach probably is valid for other databases.

Here's [2] a slide deck by David Wheeler giving an introduction into how it works.

[1] https://pgtap.org/

[2] https://www.slideshare.net/justatheory/unit-test-your-databa...

vogtb3y ago

+1 for pgtap. Surprised it's not been mentioned more in this thread.

I'm using it w/ supabase, and it works really well.

They have a small doc on it that's a better primer than the pgtap docs: <https://supabase.com/docs/guides/database/extensions/pgtap>.

Pretty easy to get started, I'm doing something like this in a Makefile.

  supabase_tests := $(call rfind, supabase/tests/*_tests.sql)
  DB_URL := "postgresql://postgres:postgres@localhost:54322/postgres"
  test-supabase: $(supabase_tests)
    @echo "Testing supabase..." && \
        echo "${supabase_tests}" && \
        psql -f supabase/tests/setup.sql ${DB_URL} && \
        $(foreach t,$(supabase_tests),\
            echo "Test: $(t)..." && psql -f $(t) ${DB_URL} $(__EXEC)) && \
        psql -f supabase/tests/teardown.sql ${DB_URL}

kiwicopple3y ago

the test runner is now built into the supabase CLI: https://supabase.com/docs/reference/cli/supabase-test

Guide is here: https://supabase.com/docs/guides/database/testing

gregplaysguitar3y ago

nesarkvechnep3y ago

adulion3y ago

I was going to mention pgtap as i had used it in a previous role and it works but its cumbersome. I was hoping for a better solution by reading the comments

ahakanbaba3y ago· 3 in thread

For more information, there is this talk on PyData NYC (SQL testing part): https://www.youtube.com/watch?v=yQHksEh1GCs&t=1766s

Fugue project repo: https://github.com/fugue-project/fugue/

urthor3y ago

fwiw, you can share a spark session between unit tests. Even persist a spark session throughout the day so your tests run against a hot session.

Straight TDD with spark is perfectly fine if you know what you're doing. I'm not saying it's easy or there's an easy guide somewhere, but it's possible.

If you're using Pyspark via the API, it's likely an incredibly important part of your process.

ahakanbaba3y ago

Fair enough, agreed. It is tricky to “mock” as you said.

Our CICD platform and their owners get unhappy if we spawn an ad hoc spark session for testing purposes.

There is also a general expectation that unit tests are self contained and portable. So you could execute them in mac, linux, and arm ISA without much effort.

Another point was that we need to make this mocking or test setup easy because data scientist and ML Modellers are the most important persona who needs to write these tests ideally.

So mocking the data source with an abstraction layer and passing pandas dataframes, worked reasonably well for our use case.

newrotik3y ago

Would you be able to elaborate on your approach to TDD with spark sessions? You can persist them, which is only useful if you are doing multiple tests in a run.

I did not know about Fugue but will definitely give it a try. Looks almost too good to be true.

davedx3y ago· 3 in thread

Deukhoofd3y ago

> on Windows SQL Management Studio lets you set breakpoints, on Mac you're SOL

Good news! Since SSMS18 you're SOL on Windows too, as Microsoft completely removed that feature :)

Foobar85683y ago

I believe you are expected to use visual studio to manage all this crap... And debugging in it was awful years ago, SSMS was slightly less worse IMHO.

davedx3y ago

Ugh :)

Sounds like an opportunity though…

pmarreck3y ago· 3 in thread

Excellent question. Not sure why it's getting no traction.

For my own use-cases, I usually test this at the application level and not the DB level. This is admittedly not unit-testing my SQL (or stored procs or triggers) but integration-testing it.

edgyquant3y ago

With ORMs you can get pretty close to this being unit testing for the DB though.

teej3y ago

I haven't seen an ORM that handles analytical queries well. I'd rather write raw SQL than use SQLAlchemy for complex queries with multiple joins, aggregations, and window functions.

2 more replies

AnEro3y ago

With not complex data for sure, I think SQL testing is so much easier once you start having hierarchies and many to many relationships start piling up.

davps3y ago· 3 in thread

This approach didn't use an ORM and run the tests concurrently against the same database.

I follow those steps on my pipeline:

Every time I commit changes the CI/CD pipeline follow those steps, on this order:

- I use sqitch for the database migration (my DB is postgresql).

- Run the migration script `sqitch deploy`. It runs only the items that hasn't been migrated yet.

- Run the `revert all` feature of sqitch to check if the revert action works well too.

- I run `sqitch deploy` again to test if the migration works well from scratch.

- After the schema migration has been applied, I run integration tests with Typescript and a test runner, which includes a mix of application tests and database tests too.

- If everything goes well, then it runs the migration script to the staging environment, and eventually it runs on the production database after a series of other steps on the pipeline.

I test my database queries from Typescript in this way:

    export const sql_start = `
    WITH dummy_start AS (
        SELECT 1
    )

    export const step_2 = `${sql_start},
    step_2 AS (
        SELECT 1
    )
    `;

    export const step_3 = `${step_2},
    step_3 AS (
        SELECT 1
    )
    `;

    export const final_sql_query_to_use_in_app = ` ${step_3},
    final_sql_query_to_use_in_app AS(
        SELECT 1
    )

    SELECT \* FROM final_sql_query_to_use_in_app

Then on my tests I can quickly pick any step of the CTE to test it

    import {step_2, step_3, final_sql_query_to_use_in_app} from './my-query';

    test('my test', async t => {

        //
        // here goes the code that load the fixtures (testing data) to the database
        //

        //this is one test, repeat for each step of your sql query
        const sql = `${step_3} 
            SELECT * FROM step_3 WHERE .....
        `;
        const {rows: myResult} = await db.query(sql, [myParam]);
        t.is(myResult.length, 3);

        //
        // here goes the code that cleanup the testing data created for this test
        // 
    });

and on my application, I just use the final query:

        import {final_sql_query_to_use_in_app} from './my-query';

        db.query(final_sql_query_to_use_in_app)

If you are interested on more details just ping me, I'll be happy to share more about my approach.

WuxiFingerHold3y ago

zabzonk3y ago

how do you know that what your select statements return is correct? on a real database?

davps3y ago

efxhoy3y ago· 2 in thread

I write mostly batch ETL stuff. All plain psql and bash. We don’t have a good testing setup to be honest. What we do use instead:

Plenty of constraints, uniques and foreign keys and not nulls. Enum types.

Visuals, dump to csv and plot some graphs. Much easier to find gaps and strange distributions visually.

Asserts in DO blocks, mostly counts being equal.

Build tables in a a _next suffix schema and swap when done.

Never mutating the source data.

Using psqls ON_ERROR_STOP setting.

Avoid all but the most trivial CTEs, preferring intermediate tables that can be inspected. Constraints and assertions on the intermediate tables.

This talk by Dr Martin Loetzsch helped a lot: https://youtu.be/whwNi21jAm4

toto4443y ago

I admit it's not the best but I do almost like you. Graphs are definitely helpful.

insert into test_table

select case when count(*) = count(distinct users) then 'pass' else 'fail' end as result, 'test_cardinality_temporary_table_a' as test_name from temporary_table_a

vivegi3y ago

There's some great advice here for batch ETL work. The Create Table .. As Select (CTAS) pattern is your friend and a great aid in testability.

drx3y ago· 2 in thread

If you're using dbt, dbt tests are a good start: https://docs.getdbt.com/docs/build/tests

You can hook up dbt tests to your CI and Git(hub|lab) for data PRs.

Depending on your needs, you can also look into data observability tools such as Datafold (paid) or re_data (free)

mritchie7123y ago

I'm surprised dbt tests is this far down in the comments. This seems like the obvious place to start.

OP (or others) - If you've used dbt tests, I'm curious where it fell short? Tt doesn't cover everything, but it's pretty good in my experience.

adeelk933y ago

And if you want/need to take it further, I love this package: https://github.com/calogica/dbt-expectations

AnEro3y ago· 2 in thread

hampelm3y ago

Could you link to the specific guide you're referring to? I see a couple on quick search -- perhaps this one? https://docs.gitlab.com/ee/development/database_review.html

AnEro3y ago

Enjoy, it's honestly the best resource I've seen on data teams that is open. https://about.gitlab.com/handbook/business-technology/data-t...

1 more reply

munk-a3y ago· 2 in thread

Basically, treat the query and database as a black-box for testing like you would another third party API call.

karmasimida3y ago

I think the tricky part is DB itself might get funny as time goes by, downstream logic changes might invalidate top level queries that built on now dated assumption

Overall data integrity is hard.

munk-a3y ago

jiggywiggy3y ago· 2 in thread

To be honest.

I would love testing to work.

Have set up and maintained several unit test suites in Jest.

Wrote several large e2e test suites in Cypress.

I don't think anyone won time from simply having a manual checklist and testing manually.

Maybe me and my former teammates are doing it wrong. Talking 8+ teams, from corporate to startup.

But loved de proven wrong. E2e def. catched most issues.

monsieurbanana3y ago

I'm not sure I understand your comment, you seem to say that automating testing isn't better than manual testing, yet the e2e tests caught most issues?

jiggywiggy3y ago

Also testing with a checklist by human is easy to outsource to cheap labor. Whereas testing engineers are pretty expensive

zabzonk3y ago· 2 in thread

assuming you are asking about sql select statements, the problem is knowing what the correct answer is so you can test against it. for most data, you don't, and probably cannot know this.

not a unique problem with sql, btw.

RobinL3y ago

zabzonk3y ago

i have never worked with a small dataset. mocking one doesn't work, for the reasons you suggest, and others.

basically you cannot test queries against a big database. you just have to hope for the best.

1 more reply

jlund-molfese3y ago· 1 in thread

This is easier if you have the same input every time the tests run, like a frozen database image, because then you can basically have snapshot tests.

rozularen3y ago

we do something similar, we run some validation tests against the output regarding file size and line count but not the actual data

Ataraxy3y ago· 1 in thread

So I'm not an expert, but for simplistic use cases I merely make use of https://github.com/oguimbal/pg-mem

It's a lot faster and easier than dealing with containers and the like.

cgh3y ago

luckystarr3y ago· 1 in thread

I use a fake object in place of a database connection which gives fake responses when the correct SQL query is sent to it.

Example:

db = Fake().expect_query("SELECT * FROM users", result=[(1, 'Bob'), (2, 'Joe')])

Then you do:

db.query("SELECT * FROM users")

and get back the result.

The upside of this is, you don't need any database server running for your tests.

update: This pattern is usually called db-mock or something like this. There are some packages out there. I built it a few times for companies I worked for.

NoThisIsMe3y ago

This is testing the code that emits the SQL code, not the SQL code itself. Sometimes that's what you want, but it's not what OP asked for.

nextlevelwizard3y ago· 1 in thread

zx143y ago

He is the one trying to test his stuff though, hence the question.

systems3y ago· 1 in thread

Well, I used to work at a place where they used https://tsqlt.org/ for testing it worked great

I dont know the technical detail of how to set it up, it was already setup when I worked there

But basically, we wrote SQL script that included statements to

1. create the db structure, tables or views

2. insert statement to enter test data (you can insert corner cases etc..)

3. ran the function or procedure

4. ran an assert to confirm if results are to our expectation

test script were ran by the CI/CD process

reactorpres3y ago

tSQLt is a great tool. If only it could get ported to some other dbs.

samsquire3y ago· 1 in thread

I ought to produce unit tests that prove that tuples from each join operation produces the correct dataset. I've only ever tested with 3 join operations in one query.

For a user perspective, I guess you could write some tooling that loads example data into a database and does an incremental join with each part of the join statement added.

AnEro3y ago

megalan2473y ago· 1 in thread

You can achieve the same thing with "docker commit"-ing data into docker images of your dB engine of choice, and firing your queries on them, but that only really works with smaller datasets.

[1] https://devdb.cloud

femto1133y ago

idlewords3y ago· 1 in thread

Live in production!

mobjack3y ago

A production replica if you want to be safe

d01003y ago· 1 in thread

I've been using IntegreSQL and it works pretty well

majodev3y ago

Happy to hear that! When it comes to testing services that depend on PostgreSQL, this is still my preferred solution.

https://github.com/allaboutapps/integresql

disclaimer: author

chrisoldwood3y ago

Be kind, it was a long time ago and I'm sure the state of the art has improved a lot in the intervening years :o).

Chris Oldwood

---

[1] SQL Server Unit: https://github.com/chrisoldwood/SS-Unit

[2] You Write Your SQL Unit Tests in SQL?: https://chrisoldwood.blogspot.com/2011/04/you-write-your-sql...

[3] Test-Driven SQL: https://www.youtube.com/watch?v=5-MWYKLM3r0

dagss3y ago

Probably not that relevant for a data team, but this is what we do as a backend team:

We use Microsoft SQL's docker image and spin it up in the background on our laptop/CI server so port 1433 has a database.

Then we use the BACKUP command to dump the db to disk (within the docker image)

Finally, each test function is able to make a new database and restore that backup from file in less than a second. Populate with some relevant test data, run code, inspect results, drop database.

So our test suite uses hundreds of fresh databases and it still runs in a reasonable time.

(And..our test suite is written in Go, with a lot of embedded SQL strings, even if a lot of our business logic is in SQL)

seanhunter3y ago

One easy method is just to test sql the way you test anything else:

Doing it this way has a number of benefits because it's easy to add your sql test cases into your CI/CD (they just run the same as everything else).

cwp3y ago

Two ideas here:

1) The same way you'd write any other tests. Use your favourite testing framework to write fixtures and tests for the SQL queries:

  - connect to the database
  - create tables
  - load test data
  - run the query
  - assert you get the results you expect

For insert or update queries, that assertion step might involve running another query.

2) DBT has support for testing! It's quite good. See https://docs.getdbt.com/docs/build/tests

hichkaker3y ago

I did data engineering for 6 years and am building a company to automate SQL validation for dbt users.

First, by “testing SQL pipelines”, I assume you mean testing changes to SQL code as part of the development workflow? (vs. monitoring pipelines in production for failures / anomalies).

If so:

I’ve also written about why reliable change management is so important for data engineering and what are key best practices to implement [4]

taeric3y ago

Best tool nowadays has to be to spin up a database and execute the queries against it. If you are on a database setup that spinning up an instance takes a long time, consider docker.

kkleindev3y ago

We've written our own tool to compare different data sources against each other. This allows, for example, to test for invariants (or expected variations) between and after a transformation.

The tool is open source: https://github.com/QuantCo/datajudge

We've also written a blog post trying to illustrate a use case: https://tech.quantco.com/2022/06/20/datajudge.html

sakopov3y ago

[1] https://dotnet.testcontainers.org/

[2] https://fluentmigrator.github.io/

dave_infuseai3y ago

dbt does have testing built in, but of course there are only certain cases for which that kind of testing works. dbt can't know if your metrics 'look' right, only you will know.

Data profiling is the perfect tool for this, especially when your pipeline reaches a certain size, or you're dealing with very large datasets.

I work on the team creating PipeRider.io, which uses data profiling comparisons as a method of "code review for data".

Here's an article that discusses the benefits of this: https://blog.infuseai.io/why-you-lack-confidence-merging-dbt...

Waterluvian3y ago

GitHub Actions trivialized this for us. Spawning a Postgres database for testing is easy and carefree. Spawns. Runs operations and evaluates the state of the database after each operation.

Yes, some might call the latter heresy. But it works great.

dwohnitmok3y ago

Given that you're using dbt, it comes with a testing framework out of the box: https://docs.getdbt.com/docs/build/tests

nephton3y ago

On CI server this can be used to run tests against all required DB server types and versions.

KronisLV3y ago

I think that when it comes to testing databases... most people just don't.

Look at this JetBrains survey: https://www.jetbrains.com/lp/devecosystem-2021/databases/

Around half of the people never debug stored procedures. Three quarters of people don't have tests in their databases. Only half of the people version their database scripts.

Even if you don't have a traditional app, throwaway instances of the real type of DB that you'll be using are great, both for development and testing.

gsvclass3y ago

summerlight3y ago

  * It has a language level module support, similar to other languages. Thus SQL functions are reusable across multiple codebases without depending on code generation tricks. One of the major blocker for SQL adoption has been complex domain specific business logic and now the situation is better.
  * It has an official unit test support. Google use Blaze (which is known as Bazel externally), so adding a unit test for SQL code is as simple as adding a SQL module (and its test input) dependency to SQL test target, write a test query and its expected output in a format of approval testing. Setting up the DB environment is all handled by the testing framework.
  * It has an official SQL binary support. It's just a fancy name for handling lots of tedious stuffs for running a SQL query (e.g. putting everything needed into a single package, performing type checks, handling input parameters, managing native code dependencies for FFI etc etc).

gorgoiler3y ago

As the meme say: App worked before. App work afterwards. Can’t explain that.

creakingstairs3y ago

[1] https://hexdocs.pm/ecto_sql/Ecto.Adapters.SQL.html#module-sq...

ufmace3y ago

I don't think there's any perfect universal answer for this. I only have a few things I've done that work.

I would have hoped most other web hosting frameworks would make as much effort to making it straightforward to test your database code, but it doesn't really seem to be the case.

epgui3y ago

We use this and take an example-based tests approach for any non-trivial DBT models: https://github.com/EqualExperts/dbt-unit-testing

More trivial example:

    {%
        call dbt_unit_testing.test(
            'REDACTED',
            'Should replace nullish values with NULL'
        )
    %}
        {% call dbt_unit_testing.mock_source('REDACTED', 'REDACTED', opts) %}

            "id" | "industry"
            1    | 'A'
            2    | 'B'
            3    | ''
            4    | 'Other'
            5    | 'C'
            6    | NULL

        {% endcall %}

        {% call dbt_unit_testing.expect(opts) %}

            "history_id" | "REDACTED"
            1            | 'A'
            2            | 'B'
            3            | NULL
            4            | NULL
            5            | 'C'
            6            | NULL

        {% endcall %}
    {% endcall %}

winrid3y ago

jesseryoung3y ago

alphanumeric03y ago

hansvm3y ago

AdrianB13y ago

rowls663y ago

thingsilearned3y ago

This linter can really enforce some best practices https://github.com/sqlfluff/sqlfluff

A list of best practices: https://docs.getdbt.com/guides/legacy/best-practices

And shameless plug but there's a chapter on modeling in my book: https://theinformedcompany.com

cgopalan3y ago

geocrasher3y ago

SQL pipelines can be tested pretty easily. If they look congested, use a snake (Python, if you like) to try to knock the CRUD out of them.

davvid3y ago

https://github.com/ugtar/pg_temp

edublancas3y ago

andy_ppp3y ago

sam0x173y ago

hans0l0743y ago

On a related note (though it does touch upon testing) mitchellh open sourced Squire[1] recently (today?). Note though, that it's PostgreSQL specific.

[1] https://github.com/mitchellh/squire

gxt3y ago

Abstraction layer between the query you write and the one that gets executed. This way you can mock the schema, run the query on the mock to assert some condition x.

A ref() concept like dbt's is sufficient. When testing, have ref output a different (test-x) name for all your references.

nitwit0053y ago

iblaine3y ago

JenrHywy3y ago

We test it implicitly with integration tests across the API for the most part. We occasionally have unit tests (in C#) that directly call stored procs.

The backbone for this is that we spin up a DB per unit test, so we don't have to worry about shared state.

dangwhy3y ago

I've had great success with this

https://news.ycombinator.com/item?id=34580675

I am always baffled by why this ins't more popular way of writing SQL.

javaunsafe20193y ago

I always did that with integration tests. Put some data in the db. Use the repository aka your sql and validate the results.

Most of the times there is a layer around your sql (a repository, a bash script or whatever) that you can use for integration testing.

devdada3y ago

jve3y ago

I regret making an integration in SQL Server Integration Services, with some nice addon tools that can make http requests and such. In the end it is untestable piece of hard to follow solution.

Anyone have any recommendations on testing SSIS ?

pharmakom3y ago

Gigachad3y ago

Part of the solution could be using tooling which can compile time check SQL is valid like https://github.com/launchbadge/sqlx

nikita3y ago

Disclosure. I'm CEO of Neon (neon.tech).

One of the premises that we have is the ability to instantly create a test environment by creating a branch. I'd love to hear what you think about it.

mikpanko3y ago

Datafold (https://www.datafold.com/) is solving exactly that problem and has integration with dbt.

xmorse3y ago

I do E2E tests for the core features against a staging database and url.

This way i don't waste time with unit tests that quickly get old and no one wants to maintain and run

hparadiz3y ago

Surprised only a single comment in this thread had mentioned the "explain" keyword that exists in MySQL.

Excellent for checking delete queries before running them.

johnthescott3y ago

sqldb testing is historically problematic, for at least correctness AND performance.

also, i believe MSServer supports HI.

spikder3y ago

I have done this more than once and it is great. Point django at your db, dump the models, and then use django's test framework.

dirtybirdnj3y ago

in production, on a friday before I leave for vacation

Andys3y ago

Something to consider (for Go users): Use the sqlc library, which makes whole classes of bugs/mistakes into compile-time errors.

DecoPerson3y ago

In production ;)

But my app is for six users at one site, it’s not mission critical, and the sqlite DB is backed up hourly.

Life’s too short for (unnecessary) testing.

world2vec3y ago

DBT supports testing right out of the gate, you can write little queries and macros to test and validate every column.

EGreg3y ago

Have small databases.

Learn to use IMPORT TABLESPACE in MySQL or just dump and import SQL.

Every time you run a test you set up the mock databases again.

dllthomas3y ago

I've been wanting to put together a property testing framework that lets you specify properties as queries.

kindofabigdeal3y ago

Either a test environment or a mocking library

z3t43y ago

Local Dev dB and/or staging

User233y ago

Like any other integration test.

postalrat3y ago

Git, merge requests, developers.

neural_thing3y ago

Great Expectations and Datafold

distantsounds3y ago

Flyway, https://flywaydb.org/

tommyage3y ago

Relational databases rely on math. You may test your implementation; Or the input to your statements.

karakanb3y ago

There are a few types of tests one would like from a SQL pipeline, each with a different value add:

sqldba3y ago

Thoughts and prayers.

Teamteam163y ago

SQL.lite

devdada3y ago

Test

opportune3y ago

I used to work on SQL pipelines, sadly I didn’t get to implement some of these suggestions, but we did implement others.

For testing:

Run your query/pipeline against synthetic/manual data that you can easily verify the correctness of. This is like a unit test.

Take the output of the new query (or sampled query, or the manual query) and feed it to whatever is downstream. This is like a conformance test.

For reliability:

The big failure modes with SQL pipelines IME are

1. unexpected edge cases and bad data causing queries to fail (eg you manually test the new query and it works fine, but in production it fails when handling Unicode)

2. not having a plan for what to do when a bug gets caught after the fact

2v35gggg3y ago

by running it in prod duh

sbricks3y ago

you don't?...

chadlavi3y ago

that's the neat thing, you don't

wokwokwok3y ago

Two days ago, everyone: “You should use SQL because everyone knows it and modern SQL is pretty good.

Other languages are too complicated. :(“

Everyone today: “tries using sql

Oh wow, the tooling is quite basic, and you can’t express complex data structures and imperative code. :(“

What did you expect?

Look, I spent 4 years in this rabbit hole, and here’s my advice:

Don’t try to put the square peg in the round hole.

You want easy to write, simple code and pipelines? Just use sql.

Have a dev environment and run everything against that to verify it.

Do not bother with unit testing your CTEs, it’s hard to do, there are no good tools to do it.

If you want Strong Engineering TM, use python and spark and all the python libraries that exist to do all that stuff.

It won’t be as quick to write, or make changes to, but it will be easier to write more verifiably robust code.

If you treat either as something it is not (eg. Writing complex data structures and frameworks in sql) you’re using the wrong tool for the outcome you’re trying to achieve.

It’ll take longer and “feel bad”, not because the tool is bad, but because you’re using it in a bad way.

j / k navigate · click thread line to collapse