We use BQ and Metabase heavily at work. Our BQ analytics pipeline is several hundred TBs. In the beginning we had data (engineer|analyst|person)s run amok and run up a BQ bill of around $4,000 per month.
By far the biggest wins were:
- partition key was optional -> fix: required
- bypass the BQ caching layer -> fix: make queries use deterministic inputs [2]
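Roughly what those two fixes look like in BigQuery SQL -- a sketch with a hypothetical analytics.events table partitioned on event_date ([2] has the exact rules for what defeats the cache):

    -- Fix 1: refuse queries that scan every partition
    ALTER TABLE analytics.events
    SET OPTIONS (require_partition_filter = TRUE);

    -- Fix 2: non-deterministic functions bypass the result cache
    SELECT COUNT(*) FROM analytics.events
    WHERE event_date = CURRENT_DATE();   -- never cached

    SELECT COUNT(*) FROM analytics.events
    WHERE event_date = '2024-01-01';     -- cached on repeat runs; have the
                                         -- client resolve the date first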
It took a few weeks to go through each query using the metadata tables [1], but it was worth it. In the end our BQ analysis spend was down to something like $10 per day.
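For the curious, this is the kind of audit query the metadata tables [1] make possible -- a sketch, with an assumed US region and 30-day window:

    -- Top 20 queries by bytes billed over the last 30 days
    SELECT user_email, total_bytes_billed, query
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE job_type = 'QUERY'
      AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    ORDER BY total_bytes_billed DESC
    LIMIT 20;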
[1] https://cloud.google.com/bigquery/docs/information-schema-jo...
[2] https://cloud.google.com/bigquery/docs/cached-results#cache-...
This is a $20k/year savings. Maybe I'm just not aware of pricing in the Indian market (where this startup is), but that simply doesn't seem like a good use of time. There's a real cost to doing these implementations, both in hard financial dollars (the salaries of the people doing the work) and in the trade-off of deprioritizing other work.
It's approximately 4 annual salaries (non-dev).
For example, think of the value generated from a fraud detection system that alerts you on a nightly batch job, vs. a fraud detection system that runs once per minute. If you're spending $20k/y on the batch job, you're never even going to consider the 1/min option. If you can cut costs enough to run it once per minute, whole new market opportunities open up.
If you don't have good staging data after running extraction (i.e., a 1:1 view of the source system data available in your database), there is nothing you can do to help with this downstream. You should stop right there and keep digging.
Extracting the data should be the most challenging aspect of an ETL pipeline. It can make a lot of sense to write custom software to handle this part. It is worth the investment because if you do the extraction really well, the transform & load stages can happen as a combined afterthought [0,1,2,3] in many situations.
This also tends to be one of the fastest ways to deal with gigantic amounts of data. If you are doing things like pulling 2 different tables and joining them in code as part of your T/L stages, you are really missing out on the power of views, CTEs, TVFs, merge statements, etc.
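To illustrate the "combined afterthought" point: if extraction lands a faithful staging table, transform + load can collapse into one merge statement. A sketch with made-up staging_customers / dim_customers tables (syntax varies slightly across the engines in [0-3]):

    MERGE INTO dim_customers AS t
    USING staging_customers AS s
      ON t.customer_id = s.customer_id
    WHEN MATCHED THEN
      UPDATE SET name = s.name, email = s.email
    WHEN NOT MATCHED THEN
      INSERT (customer_id, name, email)
      VALUES (s.customer_id, s.name, s.email);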
[0] https://learn.microsoft.com/en-us/sql/t-sql/statements/merge...
[1] https://www.postgresql.org/docs/current/sql-merge.html
[2] https://docs.oracle.com/database/121/SQLRF/statements_9017.h...
[3] https://www.ibm.com/docs/en/db2/12.1?topic=statements-merge
Why should this be difficult? It’s the easiest part. You run SELECT * and you’re done.
The difficult part is transforming all the disparate upstream systems and their evolving schemas into a useful analytical model for decision support.
If the source data is already in a SQL store, then the solution should be obvious. You don't need any other tools to produce the desired view of the business at that point. Transforming for an upstream schema is a select statement per target table. This doesn't need to be complicated.
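For example, "a select statement per target table" can be as plain as a view over the staging schema (schema and column names invented for illustration):

    -- One view per target table, reading the 1:1 staging copy
    CREATE OR REPLACE VIEW analytics.orders AS
    SELECT o.order_id,
           o.ordered_at,
           c.region,
           o.amount_cents / 100.0 AS amount
    FROM staging.orders AS o
    JOIN staging.customers AS c
      ON c.customer_id = o.customer_id;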
Continuously refactoring when adding features, fixing bugs, or dealing with external changes keeps code fresh and free of cruft. Removing dead code, deprecating and removing little-used features, and refining behavior to drop unused edge cases also keeps systems focused and useful.
But generally rolling your own has other benefits.
Not for startups it doesn't. The only rolling-your-own they should be doing is their main product.
Once you get bigger with many hundreds of employees, and existing software starts becoming a measurable blocker, then you can gradually build your own stuff once the tradeoffs actually make sense. But it generally takes a while for that to become the case. For startups, premature optimization isn't the root of all evil, but it's the root of a lot of it.
I'm not a data engineer but was tasked with building an ETL pipeline for a large company. It's all just Step Functions, looping through file streams in a Lambda, transforming, then putting the data into Snowflake for the analytics team to view. My pipeline processes billions of rows from many different sources. Each pipeline runs daily on a cron job (maybe that's the key differentiator -- we don't need live streaming data; it's a lot of point-of-sale data).
Whenever I hear actual data engineers talk about pipelines, there are always a million different tools and complicated-sounding processes. What am I missing?
Hi, data engineer here (of sorts). I'll take a crack.
> It's all just Step Functions, looping through file streams in a Lambda, transforming, then putting the data into Snowflake for the analytics team to view.
So, you're right. Data going from A -> B is the most important part of the job. The pipeline you've got is the 60-80% critical bit: get data from systems A through Y and put it into system Z. Without it, there's no point talking about anything else.
But what happens when you need to roll back all of the data from system F for the last week? What if you need to roll back system C for the last 2 days as well? How do you keep track of whether data from systems G, H and I is still "good" quality? How do you handle frequently changing source systems, which could drop, rename, or add columns on a daily basis? Do you ignore those columns forever, or siphon that data off to a special place for manual review?

Do you process all previously seen data every single night? If you only process the newest subset of data, what do you do when you notice a bug in the pipeline code? Process everything again and spend a whole bunch of money every time you fix buggy pipeline code, or do a "migration" on already-processed data? How do you track which subsets were processed with which version of your pipeline? How did that run get triggered -- was it a Glue crawler on a scheduled cron job, or did someone trigger it manually?

Does any of this actually matter? Does anyone even look at the dashboards any more?
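Most of those questions don't have a standard answer, but they all assume you keep run metadata somewhere queryable. A minimal, hypothetical run-log table, just to make the shape concrete:

    CREATE TABLE etl_run_log (
        run_id           BIGINT PRIMARY KEY,
        source_system    TEXT,       -- e.g. 'system_f'
        data_interval    DATE,       -- which slice of data this run covered
        pipeline_version TEXT,       -- which code processed this subset
        triggered_by     TEXT,       -- 'schedule' or 'manual'
        status           TEXT,       -- 'running' / 'succeeded' / 'failed'
        started_at       TIMESTAMP
    );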
Getting the pipeline running is the first bit. With non-volatile/slow-changing source data/systems, you're fine with just the pipeline most of the time. Depending on the volume of the data, you could probably get away with just rerunning the pipeline over all previously seen data when there's been a bugfix and just eat the cost. You don't need anything more complicated than that.
But if the source data/systems are volatile [0], then keeping it working, keeping it accurate, and keeping it cheap comes next. And the complexity of that depends on how complex A -> B is.
A simple "back of a napkin" way to determine complexity is to sit down and ask some questions about the 6 V's: Volume; Velocity; Variety; Veracity; Value; Variability [1]. It sounds like [2] you have high volume, some velocity (nightly), some variety, low veracity, high value and low variability data. There's a lot of data coming from multiple source systems, and it's important data. But it doesn't sound like the changes that much and is all probably all in the same or similar formats [3] and good quality. So... a simple pipeline triggered nightly works in that case.
When things get more complicated, that's when the system moving data from A -> B gets more complicated.
---
Having said all that, there has been a lot of resume-driven development in the industry. At $LAST_JOB, the previous team thought it would be a great idea to roll their own job orchestrator product [4]. From scratch. In Django. Not a library that Django called, not some Django API endpoints that forwarded requests to an Airflow backend. Nope. A job orchestrator written from scratch in Django.
Because "bare metal" apparently.
They were also obsessed with this idea of reproducibility, when what they actually wanted was idempotency... but that's a story for another reply.
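The short version of that story: idempotency means a rerun over the same slice of data converges to the same end state. The classic delete-and-reload sketch (fact_sales / staging_sales are invented names; :run_date is a bind parameter supplied by the orchestrator):

    -- Rerunning for the same :run_date always yields the same rows
    BEGIN;
    DELETE FROM fact_sales WHERE business_date = :run_date;
    INSERT INTO fact_sales (business_date, store_id, amount)
    SELECT business_date, store_id, amount
    FROM staging_sales
    WHERE business_date = :run_date;
    COMMIT;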
---
[0]: even if each upstream is slow to change, there may still be at least one upstream change breaking something every day; the upstreams together are volatile even if individually they are not
[1]: https://www.geeksforgeeks.org/5-vs-of-big-data/
[2]: I have no idea if any of this is accurate, this is literally me taking a punt based on like a sentence in an HN comment.
[3]: by formats I don't mean different table structures; I mean tables vs. raw binary files vs. JSON vs. raw text with no delimiters
[4]: all those engineers had left by this point due to covid furlough and some other stuff
Even if you needed to invent a new feature, you could invent a month's worth of features every year and still save money.
I do have a question on the BigQuery part. If you were experiencing unpredictable query costs or customization issues, that sounds like user error. There are ways to optimize, or to commit to slots to reduce the cost. Did you try that?
It's always a trade-off, and we made the call that best fit our scale, workloads, and long-term plans.
I am not sure that managing a Kafka Connect cluster is too expensive in the long term. This solution might work for you based on your needs, but I would suggest looking at alternatives.
That said, streaming events into Kafka (or something like it), rather than pushing change events directly to destinations like Snowflake, has big advantages when it comes to reprocessing data, setting up additional consumers, etc.
I've seen Postgres CDC written directly to Snowflake.
https://docs.redpanda.com/redpanda-connect/components/inputs...
Reading carefully: the result of this work is an expected $21,000 in annual operating cost savings on infrastructure services.
Is this resume driven development?
What was the opportunity cost of this work? Is the resulting system more or less maintainable by future employees/teammates?