There are also some technical terms I don't know at all, and when I've searched for them, the top results are all more Azure stuff. Like wtf is a datalake?
Snowflake and Databricks are companies that operate in this space, providing ways to ingest, transform and analyze large volumes of data.
It separates compute and storage, so there's just a big ol' pile of data and tables, then it spins up large machines to crunch the data on demand.
Data storage is cheap, and while the machines are expensive per hour, they only run for short stretches; with little to no ops work required, it can be a cheap system overall.
Bunch of other features that are handy or vital depending on your use case (instant data sharing across accounts, for example).
I've used it to transform terabytes of JSON into nice relational tables for analysts to use with very little effort.
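For a rough idea of what that JSON-to-tables step can look like, here's a minimal sketch using the snowflake-connector-python package; the stage, table and column names are made up, and the real work is just Snowflake SQL run from Python:

    import snowflake.connector

    # connection details are placeholders; the named warehouse is the on-demand compute
    conn = snowflake.connector.connect(
        account="...", user="...", password="...",
        warehouse="XL_WH", database="RAW", schema="PUBLIC",
    )
    cur = conn.cursor()

    # land raw JSON into a single VARIANT column, then flatten it into a relational table
    cur.execute("CREATE TABLE IF NOT EXISTS raw_events (v VARIANT)")
    cur.execute("COPY INTO raw_events FROM @my_json_stage FILE_FORMAT = (TYPE = 'JSON')")
    cur.execute("""
        CREATE OR REPLACE TABLE events AS
        SELECT v:id::number         AS id,
               v:user.email::string AS email,
               v:ts::timestamp      AS event_ts
        FROM raw_events
    """)

The warehouse only bills while it's running; the data underneath is just cheap storage, which is the compute/storage split described above.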
Hopefully that's a useful overview of what kind of thing it is and where it sits.
Databricks is a vendor of hosted Spark (and is operated by the creators of Spark). Spark is software for coordinating data processing jobs on multiple machines. The jobs are written using a SQL-like API that allows fairly arbitrary transformations. Databricks also offers storage using their custom virtual cloud filesystem that exposes stored datasets as DB tables.
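As a sketch of that SQL-like API (the bucket path, column names and table name here are invented; on Databricks a `spark` session already exists in a notebook):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # read raw JSON from cloud storage and aggregate it
    events = spark.read.json("s3://my-raw-bucket/events/")
    daily = (events
             .withColumn("event_date", F.to_date("event_ts"))
             .groupBy("event_date", "event_type")
             .agg(F.count("*").alias("n_events")))

    # persist as a Delta table, partitioned by date, so it shows up as a queryable table
    (daily.write.format("delta")
          .mode("overwrite")
          .partitionBy("event_date")
          .saveAsTable("analytics.daily_events"))

Outside Databricks, the Delta write needs the open-source Delta Lake package configured on the cluster.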
Both vendors also offer interactive notebook functionality (although Databricks has spent more time on theirs). They're both getting into dashboarding (I think).
Ultimately, they're both selling cloud data services, and their product offerings are gradually converging.
So they can collect data from different sources, like SQL databases, images, etc. I think a better question would be what type of data can't they ingest?
Once you have your data, I guess you can run some analytics to find out what it tells you.
This all really ties back to the "old" Hadoop days; it's an evolution of running compute over data that isn't in a fixed, managed format/schema.
I've evaluated Databricks. It works with the above-mentioned structured and semi-structured data, and I suspect it could also process unstructured data. My understanding is that it runs Python (and some other languages), so you can do any "Python stuff, but in the cloud, and on 1000s of computers".
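A hand-wavy illustration of the "any Python stuff, on lots of machines" point, using a plain Spark UDF (the dataset and column are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()

    # any ordinary Python function...
    def normalize_email(email):
        return email.strip().lower() if email else None

    # ...wrapped as a UDF so Spark runs it on every row, across all executors
    normalize_udf = udf(normalize_email, StringType())

    df = spark.read.json("s3://my-raw-bucket/events/")
    (df.withColumn("email", normalize_udf(col("email")))
       .write.mode("overwrite").parquet("s3://my-clean-bucket/events/"))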
ELI5 is for Reddit; generally here we expect you to google the ELI5 explanation before giving us your hot take in a comment.
It's also proof that lakehouse and spot-compute price/performance economics are here to stay, which is good for customers.
Otherwise, as a vendor blog post with nothing but self-reported performance, this is worthless.
Disclaimer: I work at Databricks but I admire Snowflake's product for what it is - iron sharpens iron.
If I'm reading what Databricks published correctly, it seems that they've only used 1 driver node for this benchmark, in other words it's a dev setup. If they want to compare apples-to-apples then they should configure, and price, a multi-AZ HA set-up.
I'm not sure if this is still applicable to Photon, however - can anyone confirm?
Databricks started with the cloud data lake, sitting natively on Parquet and using cloud-native tools, fully open. Recently they added SQL to help democratize the data in the data lake, rather than moving it back and forth into a proprietary data warehouse.
The selling point of Databricks is: why move the data around when you can just keep it in one place, IF performance is the same or better?
This is what led to the latest benchmark, which, as written, appears to be unbiased.
In Snowflake's response, however, they condemn it but then submit their own findings. Sounds a lot like Trump telling everyone he had billions of people attend his inauguration, doesn't it?
Anyhow, I trust independent studies more than ones coming from vendors; they can't really be argued with or debated unless they were unfairly conducted. I think we are all smart enough to be careful with studies of any kind, but I can see why Databricks was excited about the findings.
With all these, the data sits on cloud storage and compute is done by cloud machines - the difference between Databricks and the others is that with Databricks, you can take a look at that bucket. But you're not going to be able to do much with that data without paying for Databricks compute, since the open source Delta library is not usable in the real world.
Since commercial data warehouses are an enterprise product for enterprise companies (small companies can stick with normal databases or SaaS, and unicorns nowadays seem to roll their own with Presto/Trino, Iceberg, Spark and k8s), the vendor and the product need, above all, to be a reliable partner. And Databricks' behavior does not inspire confidence that they are one.
If I'm outsourcing my analytical platform to a vendor, I want them to be almost boring, not some growth-hacking, guerrilla-marketing, sketchy-benchmark-posting techbros.
At the end of the day, anyone making multi-year, million-dollar decisions in this space should run their own evaluation. Our evaluation showed that there's a noticeable gap between what Databricks promises and what they deliver. I have not worked with Snowflake to compare.
The rest of this is vague claims of Databricks being unreliable techbros, blah blah, which is just emotionally charged hot air rather than anything substantiated.
RE who to pick: run them side by side. Use Snowflake for non-technical staff/BI loads on prepared cuts of data; it's batteries-included, with fewer knobs to twiddle for optimisation. Databricks/Spark has a learning curve and isn't suitable for non-technical staff, but it gives a lot more options for processing all the stuff that doesn't fit neatly into data clustering.
* Nobody should benchmark anymore, just focus on customers instead
* But hey, we just did some benchmarks and we look better than what Databricks claims
* Btw, please sign up and do some benchmarks on Snowflake, we actually ship the TPC-DS dataset with Snowflake
* Btw, we agree with Databricks, let's remove the DeWitt clause, vendors should be able to benchmark each other!
* Consistency is more important than anything else!!!
They are both billion-dollar companies; we're hardly talking David and Goliath here.
EASY SQL, data sharing (they have a marketplace), simple scaling
* Elapsed time: 3108s (Databricks) vs 3760s (Snowflake)
* Price/Performance: $242 (Databricks) vs $267 (Snowflake)
Needless to say, these numbers seriously need verification by independent 3rd parties, but it seems that Databricks is still roughly 18% faster and 10% cheaper than Snowflake?
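For what it's worth, a quick sanity check of those headline percentages from the numbers quoted above (the exact figure depends on which system you take as the baseline):

    db_time, sf_time = 3108, 3760   # elapsed seconds
    db_cost, sf_cost = 242, 267     # on-demand price/performance, USD

    print(1 - db_time / sf_time)    # ~0.17 -> Databricks takes ~17% less time
    print(sf_time / db_time - 1)    # ~0.21 -> equivalently, Snowflake is ~21% slower
    print(1 - db_cost / sf_cost)    # ~0.09 -> Databricks is ~9% cheaper on demand
    print(sf_cost / db_cost - 1)    # ~0.10 -> or Snowflake is ~10% pricier

So the "faster and cheaper" framing holds up on these self-reported numbers, give or take the choice of baseline.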
The official review process is significantly more complicated than just offering a static dataset that's been highly optimized for answering the exact set of queries. It includes data loading, data maintenance (insert and delete data), sequential query test, and concurrent query test.
You can see the description of the official process in this 141 page document: http://tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v3....
Consider the following analogy: Professional athletes compete in the Olympics, and there are official judges and a lot of stringent rules and checks to ensure fairness. That's the real arena. That's what we (Databricks) have done with the official TPC-DS world record. For example, in data warehouse systems, data loading, ordering and updates can affect performance substantially, so it’s most useful to compare both systems on the official benchmark.
But what's really interesting to me is that even Snowflake's self-reported numbers ($267) are still more expensive than Databricks' numbers ($143 on spot, and $242 on demand). This is despite Databricks' cost being calculated on our enterprise tier, while Snowflake used their cheapest tier without any enterprise features (e.g. disaster recovery).
Edit: added link to audit process doc
If you're in networking, it's throughput, latency or fairness. If you're in graphics, it's your shaders or polygons or hashes. If you're in CPUs, it's your clock speed. If it's cameras, it's megapixels (but nobody talks about lenses or real measures of clarity). If you're in silicon, it's your die size (none of that has mattered for years; those numbers are like versions, not the largest block on your die). If you're in finance, it's about your returns or your drawdowns or your Sharpe ratios.
I'm a little bit surprised how seriously Databricks is taking this, but maybe it's because one of the cofounders made this claim. Ultimately what you find is that one company is not very good at setting up the other company's system, so the resulting benchmarks are less than ideal.
So why not have a showdown? Both founders, streamed live, running their benchmarks on the data. NETFLIX SPECIAL!
Disclaimer: Databricks cofounder who authored the original blog post.
DB1. Databricks generated the TPC-DS datasets from the TPC-DS kit before the clock started. Databricks then started the clock and generated all queries. Then Databricks loaded from CSV to Delta format (some Delta tables were partitioned by date) and computed statistics. Then all of queries 1-99 were executed for TPC-DS 100TB.
SF1. Databricks generated the TPC-DS datasets from the TPC-DS kit before the clock started. Databricks then started the clock and generated all queries. Then the data was loaded from S3 into Snowflake tables by - (I'm not sure about these next parts) - creating external stages and then running "COPY INTO" statements, I guess? Or maybe just using COPY INTO straight from an S3 bucket; that part doesn't matter much. But it's not clear whether they allowed the target tables to have partitioning/clustering keys at all. Then all of queries 1-99 were executed for TPC-DS 100TB.
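For reference, the external-stage-plus-COPY-INTO path typically looks roughly like this; a hedged sketch with made-up bucket, stage and credential placeholders (the blog post doesn't spell out the exact statements used), driven from Python via snowflake-connector-python:

    import snowflake.connector

    conn = snowflake.connector.connect(account="...", user="...", password="...",
                                       warehouse="LOAD_WH", database="TPCDS", schema="PUBLIC")
    cur = conn.cursor()

    # point an external stage at the generated flat files in S3
    cur.execute("""
        CREATE OR REPLACE STAGE tpcds_stage
        URL = 's3://some-bucket/tpcds-100tb/'
        CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
    """)

    # bulk-load one of the fact tables; repeat per table
    cur.execute("""
        COPY INTO store_sales
        FROM @tpcds_stage/store_sales/
        FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '|')
    """)

    # adding a clustering key like this is the kind of "optimization" being debated
    cur.execute("ALTER TABLE store_sales CLUSTER BY (ss_sold_date_sk)")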
It's just hard to say what "They were not allowed to apply any optimizations that would require deep understanding of the dataset or queries (as done in the Snowflake pre-baked dataset, with additional clustering columns)" means exactly. At a glance this looks very impressive for Databricks, but I just want to be sure before I commit to an opinion.
The higher editions of Snowflake include features like materialised views, dynamic data masking, BYOK, PCI & HIPAA compliance etc., none of which are required for the benchmark.
For the more technically inclined, don’t let any corporate blog post / comms piece live in your head rent-free. If you’re a customer, make them show you value for their money. If you’re not, make them provide you tools / services for free. Just don’t help them fuel the pissing contest, you’ll end up a bag holder (swag holder?).
I'm interested in using Databricks, but I haven't done it yet. I've heard good things about their product.
"Posting benchmark results is bad because it quickly becomes a race to the wrong solution. Someone misrepresented our performance in a benchmark, here are the actual results."
Spark has had SQL engines (SparkSQL/Hive on Spark) for a long time. Photon is just a new, faster one. Photon tasks also run on Spark executors only, so it's not independent of Spark[1]. Also, while it's proprietary now, I wouldn't be surprised if Databricks open-sources it in the future, like they did with Delta Lake.
1. https://databricks.com/blog/2021/06/17/announcing-photon-pub...
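For anyone unfamiliar, "SQL engine on Spark" just means you can hand Spark a SQL string and it plans and runs it on the same executors as DataFrame code; a minimal generic example (the table name is invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # the query is parsed and optimized by Spark SQL and executed on Spark executors;
    # per the comment above, Photon is a faster engine underneath this same interface
    top_types = spark.sql("""
        SELECT event_type, count(*) AS n
        FROM analytics.daily_events
        GROUP BY event_type
        ORDER BY n DESC
    """)
    top_types.show()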
Spark does take a lot of tuning, but then I'm guessing Databricks offer that service as part of your licensing fee? (I'd hope so; if they're selling a product based on FOSS code, there has to be a value-add to justify it.)
They have some proprietary features like DBIO [1]. They also have some cloud-specific features like storage autoscaling [2] that would not be available in OSS Spark. Even Delta Lake [3] used to be proprietary, but I suspect the rise of open-source frameworks like Iceberg led them to open-source it.
Shameless plug - when working at a since-shutdown competitor to Databricks, I'd come up with storage autoscaling long before them [4], so it's not unlikely that they were "inspired" by us :-) .
1. https://docs.databricks.com/spark/latest/spark-sql/dbio-comm...
2. https://databricks.com/blog/2017/12/01/transparent-autoscali...
4. https://www.qubole.com/blog/auto-scaling-in-qubole-with-aws-...
The geometric mean? Really? Feels a lot easier to think in terms of arithmetic mean, and perhaps percentiles.
Consider 4 queries. Two run for 1sec, and the other two 1000sec. If we look at arithmetic mean, then we are really only taking into account the large queries. But improving geometric mean would require improving all queries.
Note that I'm on the opposite side (Databricks cofounder here), so when I say that Snowflake didn't make a mistake here, you should trust me :)
No. Improving the geometric mean only requires reducing the product of their execution times. So if you can make the two 1 ms queries execute in 0.5 ms at the expense of the two 1000 ms queries taking 1800 ms each then that’s an improvement in terms of geometric mean.
So… kind of QED. The geometric mean is not easy to reason about.
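A quick check of the numbers in that example, using the reply's millisecond figures (statistics.geometric_mean is in the Python standard library from 3.8):

    from statistics import geometric_mean, mean

    before = [1, 1, 1000, 1000]      # two fast queries, two slow ones
    after  = [0.5, 0.5, 1800, 1800]  # fast ones halved, slow ones much worse

    print(mean(before), mean(after))                      # 500.5 -> 900.25  (arithmetic mean worsens)
    print(geometric_mean(before), geometric_mean(after))  # ~31.6 -> 30.0    (geometric mean improves)

So the regression in the slow queries is hidden by the geometric mean but not by the arithmetic mean, which is the point being made.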
I really hope this is not the case again.
(yes, I understand my sarcasm is unneeded, I couldn't help myself)
[1]: https://www.ververica.com/blog/curious-case-broken-benchmark...