Databricks response to Snowflake's accusation of lacking integrity

(databricks.com)

217 points · rxin · 4y ago · 156 comments

scapecast4y ago· 10 in thread

The irony here is that what Databricks is doing to Snowflake is exactly what Snowflake did to AWS and Redshift.

Same playbook - show that you’re better in a key metric that’s easy to understand (performance) to get the attention, but then pitch the paradigm change.

In Snowflake’s case, that was separation of storage and compute.

In Databricks’ case, it’s the Lakehouse Architecture.

I think the reason Snowflake is so nervous is that they know they can’t win this game.

falaki4y ago

To be fair Apache Spark, which started long before either company existed, was built on the assumption that compute and storage should be separate. Unlike Hadoop, Spark did not come with any storage system and could read from any source.

d-d-d4y ago

> To be fair Apache Spark, which started long before either company existed

Databricks was founded by Spark's creators before Spark 1.0 was released.

Hadoop was created at a time when network and disk were much slower and RAM was less abundant. Bringing compute to the data made sense, but it typically doesn't anymore.

doppelganger14y ago

Hadoop was built on the notion that commodity hardware, when pooled together, can be extremely cheap and powerful. The problem is that managing it is a nightmare. Cloudera/HWX and others were unable to reduce the management burden, and their inability to pivot to a cloud-based architecture really sank their ship.

doppelganger14y ago

SF spreads a lot of FUD saying that DB can’t perform, and it was true. DB then went out and hired a lot of engineering talent with a diverse background and has been investing a lot of money in being a best in class SQL offering, so what do you do? You do something to get people’s attention. They’re saying, “hey, we have great performance too, you should also look at us for your SQL workloads.”

ignoramous4y ago

> I think the reason Snowflake is so nervous is that they know they can’t win this game.

Isn't Databricks' delta.io, which their Data Lakehouse product builds on top of, open source? Snowflake could take the best parts from it and run with them?

bpaneural4y ago

They could in principle. GCP, for instance, does do that. So does HP. And Databricks don't mind that as they have a strong open source legacy. But that takes away the proprietary lock-in strategy of Snowflake.

buttaphingas4y ago

Delta is open source, but Databricks keeps optimizations for themselves as proprietary. I'm not sure why it would be any better than Snowflake's solution, which is automatically deployed across multiple AZs as a fully HA system and gives full ACID transaction compliance across any number of tables (not just per-table).

glogla4y ago

In what way is lakehouse architecture beneficial over something like Snowflake or BigQuery?

I understand the appeal over having lake and warehouse as separate components, but with those native cloud warehouses, you can already do everything a lake does.

turk-4y ago

With a data warehouse, you can only interface with your data in SQL. With BigQuery and Snowflake, your data is locked away in a proprietary format not accessible by other compute platforms. You need to export/copy your data to a different system to train an ML model in Python or R.

With the lakehouse, you can use Python, R, and Scala (not just SQL) to interface with your data. You can use multiple compute engines (Spark, Databricks, Presto), so you are not locked into one compute engine.

I recall being a junior programmer and wishing I could talk to my MySQL database in Python code to do some processing that was difficult to express in SQL. That day is finally here.

doppelganger14y ago

BigQuery & Dataproc, Redshift & EMR, Synapse & HDR are tied to the cloud vendors. You can’t move easily from an AWS stack to GCP without refactoring. Switching costs are higher.

Snowflake and Databricks are multicloud. The difference is that Snowflake is more like a SaaS solution and only does SQL. Databricks is more than just SQL. It has all the data science and machine learning capabilities built into it. Snowflake has Snowpark, but it’s very limited, so you are more likely to have to buy more products to build out your capabilities and integrate them with Snowflake. With Databricks it is more out of the box in terms of capabilities. Databricks also runs in your cloud account, which has trade-offs. It can be harder to get going and more complex, but you end up with a lot more flexibility, and you own your data and have complete control over it. While Snowflake gives you control of your data with their tools, everything has to go through Snowflake and incur their tax to get to it. You pay for simplicity, which many customers are ok with because they see value in it. On the other hand, a lot of customers see value in having more control and options. This market is big enough for everyone - it’s really just about market share.

1cvmask4y ago· 8 in thread

This reminds me of the old performance ads of Oracle where they would show you how everything ran better on Oracle. They used to put those ads at airports, business lounges and the back cover of newspapers and magazines read by non-technical executives like the FT and Economist.

Everyone technical knew they would game every environment to come out with superior results. I suppose it worked, as the top executives buy big system software and ignore the IT crowd who could easily point out the flaws in the methodology of the "studies".

Breakdown of one of those example ads:

https://db2news.wordpress.com/2011/06/08/a-closer-examinatio...

initplus4y ago

A key part of the Oracle strategy is making it a breach of license to publish any benchmarking data. No performance data about Oracle's database is allowed to be published without their approval, which means no negative results are published.

1cvmask4y ago

They also sue you for so many other reasons. It's like the management hierarchy joke that Oracle is a litigious law firm with a sales team.

https://palisadecompliance.com/oracle-org-chart/

laserlight4y ago

Here's some background for those who are interested [0].

[0] A solution to DeWitt clauses. https://danluu.com/anon-benchmark/

doppelganger14y ago

Oracle Exadata is very fast but expensive. I bet it would beat a similarly sized cluster from these 2 vendors. The problem is price to performance and elasticity. Because DB and SF are in the cloud, they have a lot more options that Oracle doesn’t have. This is why Kurian left Oracle to go to Google, because LE would not allow Oracle to make cloud native products that would run in other clouds. The SF cofounders are ex Oracle engineers and LE was not interested in creating a cloud native DB from scratch. If he did, we wouldn’t have a SF computing right now.

jpalomaki4y ago

I think this has been quite a common clause in license contracts. Databricks has a blog post about it: https://databricks.com/blog/2021/11/08/eliminating-the-dewit...

This is kind of understandable. Benchmarking complex software is complicated. It’s easy to give a totally wrong picture of things, either accidentally or deliberately.

belter4y ago

Who could forget the Unbreakable and Unhackable Campaign...

The "Unbreakable" Marketing Campaign:

https://www.oreilly.com/library/view/the-oracle-hackers/9780...

https://www.zdnet.com/article/invincible-oracle-not-so-secur...

doppelganger14y ago

The first thing unbreakable Linux did was break.

supercanuck4y ago

Similar to how SAP is still showing growth even though their core product (ERP Financials) hasn't changed much.

Normal_gaussian4y ago· 7 in thread

so, alternatives?

Aside from the Azure/GCP/AWS internal offerings, I know about Snowflake and Firebolt. Databricks is new to me.

glogla4y ago

Redshift is pretty terrible, stay away. AWS is even worse at delivering promises than Databricks and that's saying something.

I heard Google BigQuery is good. It is completely SaaS (like AWS Athena that works).

Unicorns often run their own stack and you could replicate that, if you have the appetite. Netflix and Apple run Trino + Spark on k8s + Iceberg. Uber used their own Hudi thing, not sure if they still do.

falaki4y ago

Apple is a big Deltalake (and Databricks) customer: https://www.youtube.com/watch?v=SFeBJxI4Q98

ethbr04y ago

https://en.m.wikipedia.org/wiki/Databricks

"Databricks is an enterprise software company founded by the creators of Apache Spark. [...] Databricks develops a web-based platform for working with Spark, that provides automated cluster management and IPython-style notebooks."

tyingq4y ago

Oracle and Teradata still have data warehouse pitches ;)

kofejnik4y ago

maybe clickhouse?

glogla4y ago

Clickhouse is good if you're building an application. It has a lot of great features and incredible performance, but there's an expectation that people using it know what they're doing and can work around its limitations (like limited support for joins and SQL in general).

Something like Snowflake works much better when you're building a platform that you can give to two hundred data analysts of various skills spread over fifty teams, so they can build their own stuff. The nice UI, broad feature set (materialized views, time travel, automatic backups, superfast scaling up and down, ...) and general just-work-iness make it nice for that, but you're going to pay for the privilege.

Databricks is somewhere in the middle - things are way less polished, features don't always work and you still have to figure out things like backups and partitions on S3 on your own, but some people like that. Expect to also pay a pretty penny for hundreds of Spark clusters nobody knows who uses.

629514134y ago

* Apples and oranges: Clickhouse is a query engine while Databricks is a SaaS product/company. Apache Spark could be compared to Clickhouse, Databricks to clickhouse.com/company. The latter is barely a couple months old.

* Databricks pivoted from analytics to ML and it's not just marketing. Clickhouse is all about OLAP use cases.

* Clickhouse competes with Druid/Pinot/Timescale, Spark competes with Flink.

drej4y ago· 6 in thread

What I find hilarious is that companies argue who can query 100 TB faster and try to sell this to people. I've been on the receiving end of offers by both of the companies in question and used both platforms (and sadly migrated some data jobs to them).

While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller) and the perf expectedly tanked compared to our other solutions (Postgres, Redshift, pandas etc.), never mind the immense costs to migrate everything and train everyone up.

Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).

sanketsarang4y ago

I did work on making a database myself, and I must say that querying 100TB fast, let alone storing 100TB of data, is a real problem. Some companies (very few) don't have much choice but to use a DB that works on 100TB. If you do have small data, then you have a lot of options. But if your data is large, then you have very few options. So it is correct to be competing on how fast a DB can query 100TB of data; while at the same time being slow if you have just 10GB of data. Some databases are designed only for large data, and should not be used if your data is small.

doppelganger14y ago

The larger your data, the more that indexing and maintaining them hurt you. This is why they do much better at larger datasets vs small data sets. It’s all about trade offs.

To overcome this, they make use of cache and if the small data is frequently accessed, the performance is generally pretty good and acceptable for most use cases.

tshanmu4y ago

Resume driven development FTW!

StephenJGL4y ago

Very true. You have to understand the actual capabilities and your actual requirements. We work with petabyte size datasets and BigQuery is hard to beat. Our other reporting systems are still all in MySQL though.

autokad4y ago

It's my experience that if it's just tens of GBs, then use 'normal' solutions. If it's TBs, then Spark is great for that. Note I have only used Databricks & Spark, not Snowflake.

jeltz4y ago

PostgreSQL and MySQL can handle a few TB just fine. It is when you reach over 10TB that you need something else.

avip4y ago· 6 in thread

I've used both products in production. Both are good++.

The blog wars seem extremely ridiculous to me. I don't recall ever choosing one over another based on how fast it runs on some imaginary arbitrary dataset.

paxys4y ago

Manufactured rivalries can be a great thing for business. We have been debating Coke vs Pepsi, Nike vs Reebok, McDonald's vs Burger King for decades now while these companies laugh all the way to the bank.

javajosh4y ago

Like the post but I would add "Ford v Ferrari" there. A synthetic 100T test is much like an F1 course - not something you deal with during your commute, but it's nice to know what the limit is, and that there are people pushing that limit.

kartoonhero4y ago

It's not ridiculous at all. This is the coming of age for a brand new data architecture.

One of the biggest FUDs for a data lake architecture is performance - and this benchmark should put that concern to rest.

buttaphingas4y ago

I actually see them as variations on the same architecture. Databricks keeps their metadata in files, Snowflake keeps theirs in a database, but they both, ultimately, are querying data stored in a columnar format on blob store (and, to be fair, Snowflake have been doing that with ACID-compliant SQL for a lot longer than Databricks). So using SQL over blob at high performance has been around for a while.

Databricks say their solution is better because it's open (though they keep the optimizations you need to run this at scale to themselves, i.e. it is ultimately proprietary). Snowflake says theirs is better because it's a fully managed service, meaning no infrastructure to procure or manage, fully HA across multiple data centers by default, etc.

Databricks push 'open' but really still want you to use their proprietary tech for first transforming into something usable (Parquet/Delta) and then querying with Photon/SQL, though you can also use other tech. With Snowflake you can just ingest and query, but it has to be through their engine.

Customers should do their own validation and see which one fits their needs best.

syntaxfree4y ago

I don’t know, “coming of age” seems to imply that there’s some pre-maturity period out of which something is emerging.

CactusOnFire4y ago

It was inevitable.

Both Databricks and Snowflake have inflated marketing budgets, and marketing feels they have to "beat" the other one or they'll lose the market.

redwood4y ago· 6 in thread

As much as I love seeing competition in the space and am enjoying my popcorn, I really don't understand what Databricks is doing here: this feels like a childish food fight rather than an obsession with the customer...

saj1th4y ago

:) That is a good question. Why spend eng cycles to submit results to the TPC council - why not just focus on customers?

I believe the co-founders have addressed this in the blog.

> Our goal was to dispel the myth that Data Lakehouse cannot have best-in-class price and performance. Rather than making our own benchmarks, we sought the truth and participated in the official TPC benchmark.

I'm sure anybody seriously looking at evaluating data platforms would want to look at things holistically. There are different dimensions like open ecosystem, support for machine learning, performance etc. And different teams evaluating these platforms would stack rank them in different orders.

These blogs, I believe, show that Databricks is a viable choice for customers when performance is a top priority (along with other dimensions). That IMO is customer obsession.

kf6nux4y ago

I'd say helping customers spot fraud* is serving the customers' interests.

* I haven't executed the test suite, but fraud seems likely.

jjoonathan4y ago

All publicity is good publicity.

Both participants in a fight can win by implicitly excluding their real competitors.

glogla4y ago

Yes, the tone of those blogposts, the likelihood of fake benchmarks submitted on someone else's behalf and especially the deluge of new accounts supporting them makes me want to trust Databricks even less than the PoC my company ran with them last year and spending time with their terrible, terrible salespeople.

EDIT: I forgot lying about how open they are when all their interesting technologies (like the new sql engine and the good parts of delta) are proprietary.

mostdataisnice4y ago

What fake benchmarks are you talking about?

vgt4y ago

I think Snowflake cultivates a very careful public image, but in private their sales people use.. how do you say.. aggressive techniques.. databricks is addressing the source of market confusion head-on

falaki4y ago· 4 in thread

tl;dr: The data warehouse company used a pre-baked TPC-DS dataset and claimed they have similar performance to Databricks. Turns out if you use the official TPC-DS data generation scripts, you get much worse performance.

slownews454y ago

Even worse, they claimed to have similar performance to Databricks AND claimed databricks "lacked integrity". WOW, talk about chutzpah!

tyingq4y ago

I read the original post, the Snowflake response, and this. From that I gather that both of them aren't being completely honest or fair when making comparisons. A fair amount of truth, but also some clever wording and omission on both their parts. Which is not surprising or particularly new in this space :)

slownews454y ago

Databricks results are available at tpc.org [1]

Snowflake has shown NOTHING close to this.

[1] http://tpc.org/results/fdr/tpcds/databricks~tpcds~100000~dat...

arnon4y ago

That's altering the methods - and generally considered a violation of the validity of the results.

dreyfan4y ago· 4 in thread

Databricks is rapidly approaching an IPO. Trying to justify their valuation with their overpriced in-memory Hadoop.

kartoonhero4y ago

Databricks is way more than Hadoop or Spark. A great analogy - Spark is a great engine but you need to design and build all of the other subsystems.

Databricks is an F1 car - everything is built out. You get in and drive - FAST.

dreyfan4y ago

Databricks is a shit platform that encourages terrible data practices and accretion of technical debt.

glogla4y ago

> Databricks is an F1 car

F1 cars are really unreliable and need a lot of engineers to keep running, are very expensive, and completely impractical in normal use. They are fast but only on very specific roads, they couldn't survive on normal roads.

What do you know, you might be right! :D

fs1114y ago

> Databricks is an F1 car - everything is built out. You get in and drive - FAST.

found the databricks employee

benjaminwootton4y ago· 3 in thread

I’ve been following this and it’s kind of embarrassing to watch.

I love working with Databricks and Snowflake. They both knock it out of the park for their respective use case. They’re amazing products.

It makes no sense to fall out about this though.

For a 100TB dataset with a funky calculation, Spark will trounce Snowflake. For a 1 row dataset, Snowflake will return before the spark job has been serialised.

imslowbutnice4y ago

What are you talking about. Spark isn't even used, and TPC DS is not a funky calculation at all. It's supposed to be a collection of typical datawarehouse type queries. Although I'm not really sure what funky means, but why would Spark trounce Snowflake on "funky" calculation at all. Do you mean an ML algorithm, and are you implying that TPC-DS has anything close to an ML Algorithm? And why would Snowflake perform better on returning one row, they are columnar stored.

nojvek4y ago

Why would Spark trounce Snowflake. What makes it inherently so much faster at 100TB jobs?

Also what kind of queries are we talking about?

saj1th4y ago

> Why would Spark trounce Snowflake. What makes it inherently so much faster at 100TB jobs?

These are the slides from a talk one of the co-founders (@rxin) gave at Stanford. https://web.stanford.edu/class/cs245/slides/LakehouseGuestTa...

It goes into the details of how this performance is achieved (and not just at 100TB). Part of this could be attributed to innovations in the storage layer (Delta Lake), and part of it is just the new query engine design itself.

__MatrixMan__4y ago· 3 in thread

Instead of blog posts written by experts in app A based on their experience with app B, I wish there were a platform for this kind of comparison.

Some objective third party sets the goal and then each company submits automation (selenium?) that configures their own app to achieve the goal. Entrants are scored by:

- time

- storage

- compute

- config complexity

No need to waste time making your opponent look bad, just focus on making yourself look good, and do it on a level playing field.
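
The scoring idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Entry` fields, the use of a step count as a stand-in for config complexity, and the equal weighting of the four metrics are all hypothetical and not taken from TPC or any existing benchmark platform.

```python
from dataclasses import dataclass

# Hypothetical metric names mirroring the four criteria in the list above.
METRICS = ("seconds", "storage_gb", "compute_hours", "config_steps")


@dataclass(frozen=True)
class Entry:
    name: str
    seconds: float        # wall-clock time to reach the goal
    storage_gb: float     # storage consumed
    compute_hours: float  # compute consumed
    config_steps: int     # crude proxy for config complexity (assumption)


def rank(entries):
    """Rank entrants by the sum of each metric relative to the best entrant.

    Every metric is divided by the lowest (best) value seen for that metric,
    so a perfect entrant scores 4.0. Lower totals are better. Assumes all
    metric values are strictly positive.
    """
    best = {m: min(getattr(e, m) for e in entries) for m in METRICS}

    def total(entry):
        return sum(getattr(entry, m) / best[m] for m in METRICS)

    return sorted(((e.name, total(e)) for e in entries), key=lambda pair: pair[1])


if __name__ == "__main__":
    submissions = [
        Entry("vendor_b", seconds=200.0, storage_gb=25.0, compute_hours=10.0, config_steps=10),
        Entry("vendor_a", seconds=100.0, storage_gb=50.0, compute_hours=10.0, config_steps=5),
    ]
    for name, total_score in rank(submissions):
        print(name, total_score)
```

Normalizing against the best entrant keeps the metrics unit-free, but the equal weighting is arbitrary; a real scoreboard would have to decide how much time matters relative to cost or setup effort.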

rxinOP4y ago

Isn’t that what the official TPC does?

falaki4y ago

That is exactly the role of tpc.org.

renewiltord4y ago

If you want some information like this quick, you're gonna have to pay to run it.

hello_moto4y ago· 3 in thread

Serious question: Databricks, Snowflake, Dremio. All these "Data" platform companies => which one do you have for your Data Lake and Data Warehouse solution?

I'm sick and tired of these companies Snake Oiling the Data industry by offering "the easiest" platform to satisfy your Data Lake + Warehouse solution only to fall hard whenever you hook it up with your production data (big dataset).

PS: Anyone selling Data Lakehouse (Data Lake + Warehouse as one platform) is on meth.

kartoonhero4y ago

Please read up on Lakehouse.

Data Lake + Merge support + DW performance is now possible.

That is the game changer.

hello_moto4y ago

It'll take a few more years until these companies fix all the bugs and address all the scalability issues.

As of today, these companies are not good enough to take on the Data Warehouse part.

strongbond4y ago

Do you work for Databricks?

bloodyplonker224y ago· 2 in thread

Databricks is trying to punch up at the market leader. Every decent marketer knows that you should never do the opposite and punch down.

djbusby4y ago

I'm crap at marketing and know the only-punch-up rule.

aliswe4y ago

what differences in size (or height) are we talking about?

jchw4y ago· 2 in thread

Before the Snowflake blog post, I did not know what Snowflake or Databricks were. I can only imagine that this rivalry is great for both of them, even if Databricks is somewhat on the advantage end, at least from a tactical standpoint; I admit though that they seem to be a bit unnecessarily defensive considering the position they're in with the exchange.

In general though, I'm still not complaining. It's interesting to see a dispute like this unfold.

qaq4y ago

Snowflake is the 120B market cap darling of cloud data warehouses. I doubt obscurity is a problem they are trying to solve.

jchw4y ago

Of course they’re known among their pre-existing customer base of people and entities who already solve problems using tools like this. But it’s a subset of the multi-trillion dollar cloud industry, which itself is not the entire software engineering industry.

michaelhartm4y ago· 2 in thread

Data Wars: Snowflake vs Databricks (0 - 2)?

drawturkey4y ago

Snowflake has way more revenue, is worth 3 times more than Databricks and is growing faster. I'd say Snowflake is still in the lead. Plus, just look at Snowflake's customer list. It's a "who's who", Databricks is a "Who's that?".

thrtlvlmidnight4y ago

I took a look at Databricks public customer case studies[1] and haven't a clue who any of these companies are:

Atlassian? Adobe? ExxonMobil? PagerDuty? McAfee? HSBC? Starbucks? AstraZeneca? GlaxoSmithKline? Comcast? FINRA? Regeneron? Riot Games? Nielsen? HP? Conde Nast? Viacom? McGraw-Hill? Cisco? NBCUniversal?

Hopefully they can scale to the enterprise soon.

[1] https://databricks.com/customers

inetknght4y ago· 1 in thread

Snowflake accuses other companies of lacking integrity?

I really wish I could block all of Snowflake's domain from my inbox. Sadly, Google encourages spammers to just create a new email address. So I get a few emails each month from Snowflake who ask me to try their products. I've never done business with them and there's no unsubscribe link.

Fuck Snowflake for thinking it has any room to talk about integrity.

doppelganger14y ago

What I find comical is they accuse Databricks of lacking integrity but they don’t actually call out anything except that their benchmark was faster than what Databricks did in Snowflake. Databricks then reruns the benchmark and says the only reason that Snowflake’s was faster was because of the built-in dataset they used. Databricks was able to match Snowflake’s numbers using it, but when they loaded the actual data set, it was much slower, which is how a proper TPC benchmark is supposed to happen. They then said that Databricks’ blog doesn’t match the TPC results, but when I looked at them, they do match. I guess Snowflake just expects people to take arguments at face value. Then I saw someone on LinkedIn complaining that Databricks must have used some beta version. I didn’t see a beta version being used, but that kind of goes out the window when Databricks follows up and then posts that they matched Snowflake when they used their built-in TPC data set.

This is funny and interesting to watch but also a distraction I feel. Amazon says it best when they say, “Leaders start with the customer and work backwards. They work vigorously to earn and keep customer trust. Although leaders pay attention to competitors, they obsess over customers.”

AdamProut4y ago· 1 in thread

I would say that TPC-DS and TPC-H are really table stakes benchmarks for data warehouses at this point in time (maybe they weren't 10 years ago). How to build a database that does well on them is well documented in the literature now[1][2][3][4] (maybe a few other papers). It's not easy to build such a database, but it's "just" hard work and many companies have the $$ necessary to do that work. There isn't any magic or technical moat in the results for Databricks (or Snowflake, or Redshift, etc.).

I think Databricks is overly enthusiastic about their results as they have been trying to be competitive with cloud DWs on these benchmarks for a number of years now. They have finally caught up (by building deltalake and their photon query engine which implement a number of standard DW features).

  [1] http://www.vldb.org/pvldb/vol13/p1206-dreseler.pdf
  [2] https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
  [3] https://web.stanford.edu/class/cs245/readings/c-store.pdf
  [4] http://sites.computer.org/debull/A12mar/vectorwise.pdf

thrtlvlmidnight4y ago

I agree with everything above. The main advantage the newer data warehouses have over the legacy on-prem incumbents is that they had the chance to build from scratch having learned from all of the challenges that the original players encountered.

The public pissing contest is entertaining while also being silly and slightly cringe, but I think it's a nice story for Databricks nonetheless. They now have a performant SQL-based analytics engine that can credibly compete with the best DWs in the market today, and it's just one part of their overall platform.

The sense I get is that Snowflake wants the conversation to be "no matter what you do, you need a data warehouse, and we're the best in the business at that." Databricks' Lakehouse approach is a fundamental challenge to that, and if they're getting this kind of performance from their analytics engine against the market-leading data warehouses today, that's a big momentum shift in their favour.

gnabgib4y ago

Related post (2 days ago, 95 comments): [Snowflake’s response to Databricks’ TPC-DS post](https://news.ycombinator.com/item?id=29206959)

boublepop4y ago

Snowflake must be kicking themselves hard now for letting a story that was “Databricks is a viable alternative” turn into “Snowflake has absolutely no integrity and will fling mud even while they are gaming the statistics”

Really can’t see what they can do now short of “bending” to Databricks and entering the competition. And naturally it’s no longer enough that they show comparable performance. They have to hit their gamed stats somehow, otherwise any news, even if they beat Databricks, will be reported as “see, we told you they were cheating”.

naattee4y ago

snowflake should just pony up and do a TPC-DS audited benchmark

maslam4y ago

Everyone wins when data platforms submit audited benchmarks...

boringg4y ago

And how soon is the S-1 for Databricks dropping?

funstuff0074y ago

I guess if anyone suggests "sampling" the data in a meeting these days, they get their head blown off.

xiaodai4y ago

Spark compares itself to Hadoop only on the front page. I wonder how Spark compares to Firebolt.

uvdn74y ago

Now I see that getting rid of the DeWitt clause is indeed great. Kudos to both companies.

xiaodai4y ago

Lol

j / k navigate · click thread line to collapse

156 comments

93 comments · 25 top-level

scapecast4y ago· 10 in thread

The irony here is that what Databricks is doing to Snowflake is exactly what Snowflake did to AWS and Redshift.

Same playbook - show that you’re better in a key metric that’s easy to understand (performance) to get the attention, but then pitch the paradigm change.

In Snowflake’s case, that was separation of storage and compute.

In Databrick’s case, it’s the Lakehouse Architecture.

I think the reason why Snowflake is so nervous because they know they can’t win this game.

falaki4y ago

d-d-d4y ago

> To be fair Apache Spark, which started long before either company existed

Databricks was founded before Spark 1.0 released by Spark's creators.

Hadoop was created at a time when network and disk were much slower, RAM was less abundant. Bringing compute to the data made sense, but it typically doesn't anymore.

doppelganger14y ago

ignoramous4y ago

> I think the reason why Snowflake is so nervous because they know they can’t win this game.

Isn't Databricks' delta.io, which their Data Lakehouse product builds on top of, open source? Snowflake could take the best parts from and run with it?

bpaneural4y ago

buttaphingas4y ago

1 more reply

glogla4y ago

In what way is lakehouse architecture beneficial over something like Snowflake or BigQuery?

I understand the appeal over having lake and warehouse as separate components, but with those native cloud warehouses, you can already do everything a lake does.

turk-4y ago

I recall being a junior programmer, and wishing I could talk to my MySQL database in python code to do some processing that was difficult to express in SQL, that day is finally here.

3 more replies

doppelganger14y ago

Big Query&Data Proc, Redshift&EMR, Synapse&HDR are tied to the cloud vendors. You can’t move easily from AWS stack to GCP without refactoring. Switching costs are higher.

1cvmask4y ago· 8 in thread

Breakdown of one of those example ads:

https://db2news.wordpress.com/2011/06/08/a-closer-examinatio...

initplus4y ago

1cvmask4y ago

They also sue you for so many other reasons. It's like the management hierarchy joke that Oracle is a litigious law firm with a sales team.

https://palisadecompliance.com/oracle-org-chart/

laserlight4y ago

Here's some background for those who are interested [0].

[0] A solution to DeWitt clauses. https://danluu.com/anon-benchmark/

doppelganger14y ago

jpalomaki4y ago

I think this has been quite a common clause in license contracts. Databricks has a blog post about it: https://databricks.com/blog/2021/11/08/eliminating-the-dewit...

This is kind of understandable. Benchmarking complex software is complicated. It’s easy to give a totally wrong picture of things, either accidentally or deliberately.

belter4y ago

Who could forget the Unbreakable and Unhackable Campaign...

The "Unbreakable" Marketing Campaign:

https://www.oreilly.com/library/view/the-oracle-hackers/9780...

https://www.zdnet.com/article/invincible-oracle-not-so-secur...

doppelganger14y ago

The first thing unbreakable Linux did was break.

supercanuck4y ago

Similar to how SAP is still showing growth even though their core product (ERP Financials) hasn't changed much.

Normal_gaussian4y ago· 7 in thread

so, alternatives?

Aside from the Azure/GCP/AWS internal offerings, I know about Snowflake and Firebolt; Databricks is new to me.

glogla4y ago

Redshift is pretty terrible, stay away. AWS is even worse at delivering promises than Databricks and that's saying something.

I heard Google BigQuery is good. It is completely SaaS (like an AWS Athena that actually works).

falaki4y ago

Apple is a big Delta Lake (and Databricks) customer: https://www.youtube.com/watch?v=SFeBJxI4Q98

ethbr04y ago

https://en.m.wikipedia.org/wiki/Databricks

tyingq4y ago

Oracle and Teradata still have data warehouse pitches ;)

kofejnik4y ago

maybe clickhouse?

glogla4y ago

629514134y ago

* Databricks pivoted from analytics to ML and it's not just marketing. Clickhouse is all about OLAP use cases.

* Clickhouse competes with Druid/Pinot/Timescale, Spark competes with Flink.

drej4y ago· 6 in thread

Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).

sanketsarang4y ago

doppelganger14y ago

The larger your data, the more indexing and index maintenance hurt you. This is why they do much better on large datasets than on small ones. It’s all about trade-offs.

To overcome this, they make use of caching, and if the small data is frequently accessed, the performance is generally pretty good and acceptable for most use cases.

tshanmu4y ago

Resume driven development FTW!

StephenJGL4y ago

autokad4y ago

In my experience, if it's just tens of GB, use 'normal' solutions. If it's TBs, Spark is great for that. Note that I have only used Databricks & Spark, not Snowflake.

jeltz4y ago

PostgreSQL and MySQL can handle a few TB just fine. It is when you reach over 10TB that you need something else.

avip4y ago· 6 in thread

I've used both products in production. Both are good++.

The blog wars seem extremely ridiculous to me. I don't recall ever choosing one over another based on how fast it runs on some imaginary arbitrary dataset.

paxys4y ago

javajosh4y ago

kartoonhero4y ago

It's not ridiculous at all. This is the coming of age for a brand new data architecture.

One of the biggest FUDs for a data lake architecture is performance - and this benchmark should put that concern to rest.

buttaphingas4y ago

Customers should do their own validation and see which one fits their needs best.

syntaxfree4y ago

I don’t know, “coming of age” seems to imply that there’s some pre-maturity period out of which something is emerging.

CactusOnFire4y ago

It was inevitable.

Both Databricks and Snowflake have inflated marketing budgets, and marketing feels they have to "beat" the other one or they'll lose the market.

redwood4y ago· 6 in thread

saj1th4y ago

:) That is a good question. Why spend eng cycles to submit results to the TPC council - why not just focus on customers?

I believe the co-founders have addressed this in the blog.

These blogs, I believe, show that Databricks is a viable choice for customers when performance is a top priority (along with other dimensions). That IMO is customer obsession.

kf6nux4y ago

I'd say helping customers spot fraud* is serving the customers' interests.

* I haven't executed the test suite, but fraud seems likely.

jjoonathan4y ago

All publicity is good publicity.

Both participants in a fight can win by implicitly excluding their real competitors.

glogla4y ago

EDIT: I forgot lying about how open they are when all their interesting technologies (like the new sql engine and the good parts of delta) are proprietary.

mostdataisnice4y ago

What fake benchmarks are you talking about?

vgt4y ago

falaki4y ago· 4 in thread

slownews454y ago

Even worse, they claimed to have similar performance to Databricks AND claimed databricks "lacked integrity". WOW, talk about chutzpah!

tyingq4y ago

slownews454y ago

Databricks results are available at tpc.org [1]

Snowflake has shown NOTHING close to this.

[1] http://tpc.org/results/fdr/tpcds/databricks~tpcds~100000~dat...

arnon4y ago

That's altering the methods - and generally considered a violation of the validity of the results.

dreyfan4y ago· 4 in thread

Databricks is rapidly approaching an IPO. They're trying to justify their valuation with their overpriced in-memory Hadoop.

kartoonhero4y ago

Databricks is way more than hadoop or spark. A great analogy - Spark is a great engine but you need to design and build all of the other subsystems.

Databricks is an F1 car - everything is built out. You get in and drive - FAST.

dreyfan4y ago

Databricks is a shit platform that encourages terrible data practices and accretion of technical debt.

glogla4y ago

> Databricks is an F1 car

What do you know, you might be right! :D

fs1114y ago

> Databricks is an F1 car - everything is built out. You get in and drive - FAST.

found the databricks employee

benjaminwootton4y ago· 3 in thread

I've been following this and it’s kind of embarrassing to watch.

I love working with Databricks and Snowflake. They both knock it out of the park for their respective use cases. They’re amazing products.

It makes no sense to fall out about this though.

For a 100TB dataset with a funky calculation, Spark will trounce Snowflake. For a 1 row dataset, Snowflake will return before the spark job has been serialised.

imslowbutnice4y ago

nojvek4y ago

Why would Spark trounce Snowflake? What makes it inherently so much faster at 100TB jobs?

Also, what kind of queries are we talking about?

saj1th4y ago

> Why would Spark trounce Snowflake? What makes it inherently so much faster at 100TB jobs?

These are the slides from a talk one of the co-founders (@rxin) gave at Stanford. https://web.stanford.edu/class/cs245/slides/LakehouseGuestTa...

__MatrixMan__4y ago· 3 in thread

Instead of blog posts written by experts in app A based on their experience with app B, I wish there were a platform for this kind of comparison.

Some objective third party sets the goal and then each company submits automation (selenium?) that configures their own app to achieve the goal. Entrants are scored by:

- time

- storage

- compute

- config complexity

No need to waste time making your opponent look bad; just focus on making yourself look good, and do it on a level playing field.
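
A rough sketch of how such a platform might combine the four criteria into one ranking. Everything here is hypothetical: the `Entry` fields, the vendor names, the numbers, the equal weights, and the "normalize by the best entry, then sum" rule are assumptions for illustration, not how TPC or any real benchmark scores results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    vendor: str
    seconds: float        # wall-clock time to reach the goal
    storage_gb: float     # storage consumed
    compute_hours: float  # compute consumed
    config_lines: int     # crude proxy for config complexity


METRICS = ("seconds", "storage_gb", "compute_hours", "config_lines")


def rank(entries, weights):
    """Rank entries, best first. Lower is better on every metric.

    Each metric is divided by the best (smallest) value among the entries,
    so the leader on a metric scores 1.0 there; weighted ratios are summed.
    Assumes every metric value is strictly positive.
    """
    best = {m: min(getattr(e, m) for e in entries) for m in METRICS}

    def score(entry):
        return sum(weights[m] * getattr(entry, m) / best[m] for m in METRICS)

    return sorted(entries, key=score)


# Made-up submissions: identical except that vendor_b takes twice as long.
entries = [
    Entry("vendor_b", seconds=200.0, storage_gb=50.0, compute_hours=10.0, config_lines=20),
    Entry("vendor_a", seconds=100.0, storage_gb=50.0, compute_hours=10.0, config_lines=20),
]
weights = {m: 1.0 for m in METRICS}
ranking = rank(entries, weights)
```

Normalizing by the best entry keeps the metrics unit-free, but the choice of weights is exactly the part an objective third party would have to defend.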

rxinOP4y ago

Isn’t that what the official TPC does?

falaki4y ago

That is exactly the role of tpc.org.

renewiltord4y ago

If you want some information like this quick, you're gonna have to pay to run it.

hello_moto4y ago· 3 in thread

Serious question: Databricks, Snowflake, Dremio. All these "Data" platform companies => which one do you have for your Data Lake and Data Warehouse solution?

PS: Anyone selling Data Lakehouse (Data Lake + Warehouse as one platform) is on meth.

kartoonhero4y ago

Please read up on Lakehouse.

Data Lake + Merge support + DW performance is now possible.

That is the game changer.

hello_moto4y ago

It'll take a few more years until these companies fix all the bugs and address all the scalability issues.

As of today, these companies are not good enough to take on the Data Warehouse part.

strongbond4y ago

Do you work for Databricks?

bloodyplonker224y ago· 2 in thread

Databricks is trying to punch up at the market leader. Every decent marketer knows that you should never do the opposite and punch down.

djbusby4y ago

I'm crap at marketing and know the only-punch-up rule.

aliswe4y ago

what differences in size (or height) are we talking about?

jchw4y ago· 2 in thread

In general though, I'm still not complaining. It's interesting to see a dispute like this unfold.

qaq4y ago

Snowflake is the $120B market-cap darling of cloud data warehouses. I doubt obscurity is a problem they are trying to solve.

jchw4y ago

michaelhartm4y ago· 2 in thread

Data Wars: Snowflake vs Databricks (0 - 2)?

drawturkey4y ago

thrtlvlmidnight4y ago

I took a look at Databricks public customer case studies[1] and haven't a clue who any of these companies are:

Hopefully they can scale to the enterprise soon.

[1]https://databricks.com/customers

inetknght4y ago· 1 in thread

Snowflake accuses other companies of lacking integrity?

Fuck Snowflake for thinking it has any room to talk about integrity.

doppelganger14y ago

AdamProut4y ago· 1 in thread

  [1] http://www.vldb.org/pvldb/vol13/p1206-dreseler.pdf
  [2] https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
  [3] https://web.stanford.edu/class/cs245/readings/c-store.pdf
  [4] http://sites.computer.org/debull/A12mar/vectorwise.pdf

thrtlvlmidnight4y ago

gnabgib4y ago

Related post (2 days ago, 95 comments): [Snowflake’s response to Databricks’ TPC-DS post](https://news.ycombinator.com/item?id=29206959)

boublepop4y ago

naattee4y ago

snowflake should just pony up and do a TPC-DS audited benchmark

maslam4y ago

Everyone wins when data platforms submit audited benchmarks...

boringg4y ago

And how soon is the S-1 for Databricks dropping?

funstuff0074y ago

I guess if anyone suggests "sampling" the data in a meeting these days, they get their head blown off.

xiaodai4y ago

Spark compares itself to Hadoop only on the front page. I wonder how Spark compares to Firebolt.

uvdn74y ago

Now I see that getting rid of the DeWitt clause is indeed great. Kudos to both companies.

xiaodai4y ago

Lol
