While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller) and the perf expectedly tanked compared to our other solutions (Postgres, Redshift, pandas etc.), never mind the immense costs to migrate everything and train everyone up.
Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).
To overcome this, they make use of caching, and if the small data is frequently accessed, performance is generally pretty good and acceptable for most use cases.
Same playbook - show that you’re better in a key metric that’s easy to understand (performance) to get the attention, but then pitch the paradigm change.
In Snowflake’s case, that was separation of storage and compute.
In Databricks’ case, it’s the Lakehouse Architecture.
I think the reason Snowflake is so nervous is that they know they can’t win this game.
Databricks was founded by Spark's creators before Spark 1.0 was released.
Hadoop was created at a time when network and disk were much slower and RAM was less abundant. Bringing compute to the data made sense, but it typically doesn't anymore.
Isn't Databricks' delta.io, which their Data Lakehouse product builds on top of, open source? Couldn't Snowflake take the best parts from it and run with them?
I understand the appeal over having lake and warehouse as separate components, but with those native cloud warehouses, you can already do everything a lake does.
With the lakehouse, you can use Python, R and Scala (not just SQL) to interface with your data. You can use multiple compute engines (Spark, Databricks, Presto), so you are not locked into one compute engine.
I recall being a junior programmer and wishing I could talk to my MySQL database in Python code to do some processing that was difficult to express in SQL. That day is finally here.
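For what it's worth, here is a rough sketch of that "SQL for the easy part, Python for the awkward part" idea. It uses the standard-library sqlite3 module purely as a stand-in for a real engine, and the `orders` table, its columns, and the sample rows are all made up for illustration.

```python
import sqlite3
from statistics import median


def median_amount_by_region(conn):
    """Filter in SQL, then compute a per-group median in Python."""
    rows = conn.execute(
        "SELECT region, amount FROM orders WHERE amount > 0"
    ).fetchall()

    # Group the amounts by region on the Python side.
    by_region = {}
    for region, amount in rows:
        by_region.setdefault(region, []).append(amount)

    return {region: median(values) for region, values in by_region.items()}


# Hypothetical sample data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("eu", 10.0), ("eu", 30.0), ("eu", 20.0),
        ("us", 5.0), ("us", -1.0), ("us", 15.0),
    ],
)
result = median_amount_by_region(conn)
```

A median is just one example of something that is clumsy in plain SQL on many engines; the same pattern applies to any logic that is easier to write as ordinary code.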
Snowflake and Databricks are both multicloud. The difference is that Snowflake is more like a SaaS solution and only does SQL. Databricks is more than just SQL: it has data science and machine learning built into it. Snowflake has Snowpark, but it’s very limited, so you are more likely to have to buy more products to build out your capabilities and integrate them with Snowflake. With Databricks, more capabilities come out of the box. Databricks also runs in your cloud account, which has trade-offs. It can be harder to get going and more complex, but you end up with a lot more flexibility, you own your data, and you have complete control over it. While Snowflake gives you control of your data with their tools, everything has to go through Snowflake and incur their tax to get to it. You pay for simplicity, which many customers are OK with because they see value in it. On the other hand, a lot of customers see value in having more control and options. This market is big enough for everyone - it’s really just about market share.
The blog wars seem extremely ridiculous to me. I don't recall ever choosing one over another based on how fast it runs on some imaginary arbitrary dataset.
One of the biggest FUDs for a data lake architecture is performance - and this benchmark should put that concern to rest.
Databricks says their solution is better because it's open (though they keep the optimizations you need to run this at scale to themselves, i.e. it is ultimately proprietary). Snowflake says theirs is better because it's a fully managed service, meaning there's no infrastructure to procure or manage, it's fully HA across multiple data centers by default, etc.
Databricks push 'open' but really still want you to use their proprietary tech for first transforming into something usable (Parquet/Delta) and then querying with Photon/SQL, though you can also use other tech. With Snowflake you can just ingest and query, but it has to be through their engine.
Customers should do their own validation and see which one fits their needs best.
Both Databricks and Snowflake have inflated marketing budgets, and marketing feels they have to "beat" the other one or they'll lose the market.
I really wish I could block all of Snowflake's domain from my inbox. Sadly, Google encourages spammers to just create a new email address. So I get a few emails each month from Snowflake who ask me to try their products. I've never done business with them and there's no unsubscribe link.
Fuck Snowflake for thinking it has any room to talk about integrity.
This is funny and interesting to watch but also a distraction I feel. Amazon says it best when they say, “Leaders start with the customer and work backwards. They work vigorously to earn and keep customer trust. Although leaders pay attention to competitors, they obsess over customers.”
Really can’t see what they can do now short of “bending” to Databricks and entering the competition. And naturally it’s no longer enough that they show comparable performance. They have to hit their games stats somehow, otherwise any news, even if they beat Databricks, will be reported as “see, we told you they were cheating”.
In general though, I'm still not complaining. It's interesting to see a dispute like this unfold.
I think Databricks is overly enthusiastic about their results, as they have been trying to be competitive with cloud DWs on these benchmarks for a number of years now. They have finally caught up (by building Delta Lake and their Photon query engine, which implement a number of standard DW features).
[1] http://www.vldb.org/pvldb/vol13/p1206-dreseler.pdf
[2] https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
[3] https://web.stanford.edu/class/cs245/readings/c-store.pdf
[4] http://sites.computer.org/debull/A12mar/vectorwise.pdf
The public pissing contest is entertaining while also being silly and slightly cringe, but I think it's a nice story for Databricks nonetheless. They now have a performant SQL-based analytics engine that can credibly compete with the best DWs in the market today, and it's just one part of their overall platform.
The sense I get is that Snowflake wants the conversation to be "no matter what you do, you need a data warehouse, and we're the best in the business at that." Databricks' Lakehouse approach is a fundamental challenge to that, and if they're getting this kind of performance from their analytics engine against the market-leading data warehouses today, that's a big momentum shift in their favour.
I believe the co-founders have addressed this in the blog.
> Our goal was to dispel the myth that Data Lakehouse cannot have best-in-class price and performance. Rather than making our own benchmarks, we sought the truth and participated in the official TPC benchmark.
I'm sure anybody seriously looking at evaluating data platforms would want to look at things holistically. There are different dimensions like open ecosystem, support for machine learning, performance etc. And different teams evaluating these platforms would stack rank them in different orders.
These blogs, I believe, show that Databricks is a viable choice for customers when performance is a top priority (along with other dimensions). That IMO is customer obsession.
* I haven't executed the test suite, but fraud seems likely.
Both participants in a fight can win by implicitly excluding their real competitors.
EDIT: I forgot lying about how open they are when all their interesting technologies (like the new SQL engine and the good parts of Delta) are proprietary.
I love working with Databricks and Snowflake. They both knock it out of the park for their respective use case. They’re amazing products.
It makes no sense to fall out about this though.
For a 100TB dataset with a funky calculation, Spark will trounce Snowflake. For a 1-row dataset, Snowflake will return before the Spark job has been serialised.
Also what kind of queries are we talking about?
These are the slides from a talk one of the co-founders (@rxin) gave at Stanford. https://web.stanford.edu/class/cs245/slides/LakehouseGuestTa...
It goes into the details of how this performance is achieved (and not just at 100TB). Part of this could be attributed to innovations in the storage layer (Delta Lake), and part of it is just the new query engine design itself.
Some objective third party sets the goal and then each company submits automation (selenium?) that configures their own app to achieve the goal. Entrants are scored by:
- time
- storage
- compute
- config complexity
No need to waste time making your opponent look bad, just focus on making yourself look good, and do it on a level playing field.
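To make the scoring idea concrete, here is one possible way to combine those four criteria. The comment above does not say how they would be weighted, so this sketch assumes equal weights and normalises each metric against the best entrant. The `Entry` fields, the vendor names, and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    vendor: str
    seconds: float       # time to reach the goal
    storage_gb: float    # storage used
    compute_cost: float  # compute consumed, in whatever unit the judge picks
    config_steps: int    # crude proxy for config complexity


METRICS = ("seconds", "storage_gb", "compute_cost", "config_steps")


def rank(entries):
    """Return (vendor, score) pairs, best first.

    Each metric is divided by the best (lowest) value among entrants,
    so a perfect entry scores 1.0 per metric. Lower totals are better.
    Assumes every metric value is strictly positive.
    """
    best = {m: min(getattr(e, m) for e in entries) for m in METRICS}
    scores = {
        e.vendor: sum(getattr(e, m) / best[m] for m in METRICS)
        for e in entries
    }
    return sorted(scores.items(), key=lambda kv: kv[1])


# Made-up entrants, only to show the mechanics.
entries = [
    Entry("vendor_a", seconds=100, storage_gb=50, compute_cost=20, config_steps=10),
    Entry("vendor_b", seconds=200, storage_gb=25, compute_cost=20, config_steps=5),
]
ranking = rank(entries)
```

With these made-up numbers, vendor_b wins despite being twice as slow, which is exactly the kind of trade-off a single headline number hides. A real judge would presumably want to publish the weights up front.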
Atlassian? Adobe? ExxonMobil? PagerDuty? McAfee? HSBC? Starbucks? AstraZeneca? GlaxoSmithKline? Comcast? FINRA? Regeneron? Riot Games? Nielsen? HP? Conde Nast? Viacom? McGraw-Hill? Cisco? NBCUniversal?
Hopefully they can scale to the enterprise soon.
Aside from the Azure/GCP/AWS internal offerings, I know about Snowflake and Firebolt; Databricks is new to me.
I heard Google BigQuery is good. It is completely SaaS (like AWS Athena that works).
Unicorns often run their own stack and you could replicate that, if you have the appetite. Netflix and Apple run Trino + Spark on k8s + Iceberg. Uber used their own Hudi thing, not sure if they still do.
"Databricks is an enterprise software company founded by the creators of Apache Spark. [...] Databricks develops a web-based platform for working with Spark, that provides automated cluster management and IPython-style notebooks."
Something like Snowflake works much better when you're building a platform that you can give to two hundred data analysts of various skills spread over fifty teams, so they can build their own stuff. The nice UI, broad feature set (materialized views, time travel, automatic backups, superfast scaling up and down, ...) and general just-work-iness makes it nice for that, but you're going to pay for the privilege.
Databricks is somewhere in the middle - things are way less polished, features don't always work and you still have to figure out things like backups and partitions on S3 on your own, but some people like that. Expect to also pay a pretty penny for hundreds of Spark clusters nobody knows who uses.
* Databricks pivoted from analytics to ML and it's not just marketing. ClickHouse is all about OLAP use cases.
* ClickHouse competes with Druid/Pinot/Timescale, Spark competes with Flink.
Everyone technical knew they would game every environment to come out with superior results. I suppose it worked, as the top executives buy big system software and ignore the IT crowd who could easily point out the flaws in the methodology of the "studies".
Breakdown of one of those example ads:
https://db2news.wordpress.com/2011/06/08/a-closer-examinatio...
[0] A solution to DeWitt clauses. https://danluu.com/anon-benchmark/
This is kind of understandable. Benchmarking complex software is complicated. It’s easy to give a totally wrong picture of things, either accidentally or deliberately.
The "Unbreakable" Marketing Campaign:
https://www.oreilly.com/library/view/the-oracle-hackers/9780...
https://www.zdnet.com/article/invincible-oracle-not-so-secur...
Snowflake has shown NOTHING close to this.
[1] http://tpc.org/results/fdr/tpcds/databricks~tpcds~100000~dat...
Databricks is an F1 car - everything is built out. You get in and drive - FAST.
F1 cars are really unreliable, need a lot of engineers to keep running, are very expensive, and are completely impractical in normal use. They are fast, but only on very specific roads; they couldn't survive on normal roads.
What do you know, you might be right! :D
found the databricks employee
I'm sick and tired of these companies Snake Oiling the Data industry by offering "the easiest" platform to satisfy your Data Lake + Warehouse solution only to fall hard whenever you hook it up with your production data (big dataset).
PS: Anyone selling Data Lakehouse (Data Lake + Warehouse as one platform) is on meth.
Data Lake + Merge support + DW performance is now possible.
That is the game changer.
As of today, these companies are not good enough to take on the Data Warehouse part.