Uses and abuses of cloud data warehouses (opens in new tab)

(materialize.com)

156 pointsMalp2y ago76 comments

76 comments

I remeber one time I was working as a Data & Analytics Lead (almost a Chief Data Officer but without the title) in a company were I don't work anymore and I was "challenged" by our parent company CDO about our data tech stack and operations. Just for context, my team at the time was me working as the lead and main Data Engineer plus 3 Data Analysts that I was coaching/teaching to convert into DEngs/DScientists.

At the time we were mostly a batch data shop, based on Apache Airflow + K8S + BigQuery + GCS in Google Cloud Platform, with BigQuery + GCS as the central datalake techs for analytics and processing. We still had RT capabilities due to having also some Flink processes running in the K8S cluster, and also having time-critical (time, not latency) processes running in microbatches of minutes for NRT. It was pretty cheap and sufficiently reliable, with both Airflow and Flink having self-healing capabilities at least at the node/process level (and even cluster/region level should we need it and be willing to increase the costs), while also allowing for some changes down the road like moving out of BQ if the costs scaled up too much.

What they wanted us to implement what according to them was the industry "best practices" circa 2021: a Kafka-based Datalake (KSQL and co.), at least other 4 engines (Trino, Pinot, Postgres and Flink) and an external object storage with most of the stuff running inside Docker containers orchestrated by Ansible in N compute instances manually controlled from a bastion instance. For some reason, they insisted on having a real time datalake based on Kafka. It was an insane mix of cargo cult, FOMO, high operational complexity and low reliability in one package.

I resisted the idea until the last second I was in that place. I reunited with some of my team members for drinks months later after my departure and they told me the new CDO was already convinced that said "RT-based" datalake was the way to go forward. I still shudder every time I remember the architectural diagram and I hope they didn't finally follow that terrible advice.

tl;dr: I will never understand the cargo cult around real time data and analytics but it is a thing that appeals to both decision makers and "data workers". Most businesses and operations (especially those whose main focus is not IT by itself) won't act or decide in hours, but rather in days. Build around your main use case and then make exceptions, not the other way around.

chuckhend2y ago

I agree that is a great approach - build around the main use cases and then make exceptions. I think a lot of companies have legitimate use cases for real-time analytics (outside of their internal decision making), but as you mention, preemptively optimize for the aspiration and leads them towards unnecessary tool and tech sprawl. For example, a marketplace application that shows you the quantity of an item currently available -- you as a consumer use that information to make a decision in seconds, so its a great use-case. Internally, the org probably uses that data for weekly or quarterly forecasting. I've seen use cases like that lead to the "let's make everything real-time", but not every use case benefits the same from real-time.

jgraettinger12y ago

> they told me the new CDO was already convinced that said "RT-based" datalake was the way to go forward

Is the desire for "RT-based datalake" itself misplaced? Or just that the implementation isn't up to the job? Nobody _wants_ slow data, and reports that are usually fine with T+1 delay can become time critical (for example, a "what's selling?" report on black friday).

mrbungie2y ago

Well, it's an engineering decision, so there is no direct answer.

But as an engineering manager, at least I must ask and answer the following question: even when nobody wants _slow_ data, how fast is fast enough? I don't see decision makers choosing and thinking better with a 10 min latency vs 20 min latency, as they are not looking at the reports all the time, even for big events like Black Friday (they have meetings and stuff you know, even their supporting analyst teams do).

For more time-critical matters (i.e. real time BI or real time automatic microdecision making for fraud detection), as I said, we did have the capability to run both more frequent microbatches or do RT processing using Flink connected directly to our app backend messaging system (ironically Confluent, Kafka as a Service). But that is very different to using a complex real time log as Kafka running on "pet" servers as the cornerstone of your data platform and then propagating said data to different engines/datastores (at least 4 as I said) for downstream processing. That's a lot of moving parts running in a low reliability environment.

Overengineering is a thing, and I think it was my responsability at the time to limit the level of complexity considering the reality of the business and the resources we had in the team, even if that meant 20 minutes of latency for a business report. That's my point and why I say I think it was a bad decision to use a Kafka based stack. YMMV obviously.

albert_e2y ago

Arent a lot of businesses being sold on "real time analytics" these days?

That mixes the uses cases of analytics and operations because everyone is led to believe that things that happened in last 10 minutes must go through the analytics lens and yield actionable insights in real time so their operational systems can react/adapt instantly.

Most business processes probably don't need anywhere near such real time analytics capability but it is very easy to think (or be convinced that) we do. Especially if I am a owner of a given business process (with an IT budget) why wouldn't I want the ability to understand trends in real-time and react to it if not get ahead of them and predict/be prepared. Anything less than that is seen as being shamefully behind on the tech curve.

In this context-- the section in article where it says present data is of virtually zero importance to analytics is no longer true. We need a real solution even if we apply those (presumably complex and costly) solutions to only the most deserving use cases (and not abuse them).

What is the current thinking in this space? I am sure there are technical solutions here but what is the framework to evaluate which use case actually deserves pursuing such a setup.

Curious to hear.

lmkg2y ago

I work as a web analyst (think Google Analytics).

One time I ran an A/B test on the color of a button. After the conclusion of the test, with a clear winner in hand, it took eleven months for all involved stakeholders to approve the change. The website in question got a few thousand visits a month and was not critical to any form of business.

This organization does not benefit from real-time analytics.

Now that's an extreme outlier, but my experience is that most organizations are in that position. The feedback loop from collecting data to making a decision is long, and real-time analytics shortens a part that's already not the bottleneck. The technical part of real-time analytics provides no value unless the org also has the operational capacity to use that data quickly.

I have seen this! I have, for example, seen a news site that looked at web analytics data from the morning and was able to publish new opinion pieces that afternoon if something was trending. They had a dedicated process built around that data pipeline. Critically, they had a specific idea of what they could do with that data when the received it.

So if you want a framework, I would start from a single, simple question: What can you actually do with real-time data? Name one (1) action your organization could take based on that data.

I think it's also useful to separate what data benefits from realtime and which users can make use of it. Even if you have real-time data, some consumers don't benefit from immediacy.

coredog642y ago

Generally speaking “What questions do you hope to answer with this data?” is a good filter for all kinds of operational data.

iamacyborg2y ago

Hate to say it but if your site was only getting a few thousand visitors a month your test was likely vastly underpowered and therefore irrelevant anyway

1 more reply

slotrans2y ago

Just gonna keep linking this til the heat death of the universe: https://mcfunley.com/whom-the-gods-would-destroy-they-first-...

Real-time analytics are worse than useless. At best they are a distracting resource sink, at worst they directly harm the quality of decision-making.

RyanHamilton2y ago

I find saying "X is worse than useless" a bad approach to technology. I recommend you try and think, what is the pivot point to decide between these options? e.g. php and node js, when would I pick one over the other. It's rare for one technology to completely dominate another.

Eumenes2y ago

From my experience (mostly startups), real time analytics is generally overkill, esp. from a BI perspective. Unless your business is very focused on real time data and transactional processing, you can generally get away with ETL/batch jobs. Show executives, product, and downstream teams some metrics that update a few times per day saves a ton of money over things like Snowflake/Databricks/Redshift stuff. While cloud services can be pricey, tools like dbt are really useful and can be administered by savvy business people or analyst types. Those candidates are way easier to hire compared to data engineers, sql experts, etc.

ozim2y ago

For me it mostly is that business people don't understand OLAP vs OLTP and that if they add 5 items to database and they are visible in the system their "dashboard" will not update instantly but only after when data pipes run.

Which is hard to explain because if it is not instant everywhere they think it is a bug and system is crappy. Later on they will use dashboard view once a week or once a month so 5 items update is not relevant at all.

higeorge132y ago

I work in a real time subscription analytics company (chartmogul.com). We fetch, normalize and aggregate various billing systems data and eventually visualize them into graphs and tables.

I had this discussion with key people and i would say it depends on multiple factors. Small companies really like and require real-time analytics: they want to see how a couple invoices translate into updated saas metrics or why they didn’t get a slack/email notification as soon asit happened. Larger ones will check their data less frequently per day or week, but again it depends on the people and their role. Most of them are happy with getting their data once per day into their mailboxes or warehouses.

But we try to make everyone happy so we aim for real time analytics.

mrbungie2y ago

I think GP's point is that is not about the perceived value of real time data/analytics, but rather, its actual value. Decision makers may ask for RT or NRT, but most of the time won't make a decision or action in a timeframe that actually justifies RT/NRT data/analytics.

For most operations RT/NRT data stuff normally is about novelty/vanity rather than a real existing business need.

1 more reply

jandrewrogers2y ago

The term "real-time" is much abused in marketing copy. It is often treated like a technical metric but it is actually a business metric: am I making operational decisions with the most recent data available? For many businesses, "most recent data available" can be several days old and little operational efficiency would be gained by investing in reducing that latency.

For some businesses, "real-time" can be properly defined as "within the last week". While there are many businesses where reducing that operational tempo to seconds would have an impact, it is by no means universal.

chimerasaurus2y ago

Disclaimer - Snowflake here.

I will just point out that when my team and I talk about streaming, we are focused on not real-time because in many cases, the value to a customer is not there. Not every "streaming" use case is fraud detection. In fact, we have been saying for awhile that for many streaming use cases, the value is 60 seconds < [value here] < 60 minutes.

Example: (and yes, this is a Snowflake video but has a visual) https://youtu.be/Ou04UZWwxgg?t=64

RyanHamilton2y ago

The way I think about this is a typical up and to the right graph, of cost vs Speed. As you increase the speed, you increase the cost. So for real-time to be benefical to your business, you need to have be able to make more profit with data that is 1 second fresh vs 1 minute delayed. Looking at it like this, you can roughly group them: 50 milliseconds, 1 second, 1 minute, 1 hour, 1 day. In which area is your business for making profit? Uber showing a taxi location = 1 minute (perhaps fake it moving in between). Large electricity substation monitoring = 1 minute - assuming power down takes 5 minutes to commence. Trading with user interaction = 50ms. Then at each of those points are technology systems for deliverying that speed. I guess what some of these vendors are trying to do is change the shape of the graph. If they can bring the cost down massively, then more it may be worth Uber showing a 1 second update :) I know some users that watch the litle car obsessively.

hintymad2y ago

Real-time analytics for human is not that useful. Human can't make decisions and take actions in minutes anyway, let alone seconds. A notable exception can be log analytics for operations, but I'd argue in that case throughput is more more important than a few seconds of latency. Case in point, CloudWatch Insights can consistently drive about 1GB/s of log scan. It's good enough for log search in practice.

On the other hand, real-time analytics for machines can be critical to a business, which is why Yandex built Clickhouse and ByteDance deployed more than 20K nodes of Clickhouse.

Just like using any technology, we need to figure out what problems we solve for real-time analytics first.

atwebb2y ago

Real-time generally means near-real-time and even then I liken it to availability.

If asked people would say "I need to always be up" until they see the costs associated with it, then being out for a few hours a year tends to be ok.

datadrivenangel2y ago

This is a great way looking at it. The cost starts going up rapidly from daily and approaches infinity as you get to ultra-low latency realtime analytics.

There is a minimum cost though (systems, engineers, etc), so for medium data there's often very little marginal cost up until you start getting to hourly refreshes. This is not true for larger datasets though.

riordan2y ago

> In this context-- the section in article where it says present data is of virtually zero importance to analytics is no longer true. We need a real solution even if we apply those (presumably complex and costly) solutions to only the most deserving use cases (and not abuse them).

Totally agreed, though where real-time data is being put through an analytics lens is where CDW's start to creak and get costly. In my experience, these real-time uses shift the burden from being about human-decision-makers to automated decision-making and it becomes more a part of the product. And that's cool, but it gets costly, fast.

It also makes perfect sense to fake-it-til-you-make-it for real-time use cases on an existing Cloud Data Warehouse/dbt style _modern data stack_ if your data team's already using it for the rest of their data platform; after all they already know it and it's allowed that team to scale.

But a huge part of the challenge is that once you've made it, the alternative for a data-intensive use case is a bespoke microservice or a streaming pipeline, often in a language or on a platform that's foreign to the existing data team who's built the thing. If most of your code is dbt sql and airflow jobs, working with Kafka and streaming spark is pretty foreign (not to mention entirely outside of the observability infrastructure your team already has in place). Now we've got rewrites across languages/platforms, and leave teams with the cognitive overhead of multiple architectures & toolchains (and split focus). The alternative would be having a separate team to hand off real-time systems to and only that's if the company can afford to have that many engineers. Might as well just allocate that spend to your cloud budget and let the existing data team run up a crazy bill on Snowflake or BigQuery as long as it's less than the cost of a new engineering team.

------

There's something incredible about the ruthless efficiency of sql data platforms that allows data teams to scale the number of components/engineer. Once you have a Modern-Data-Stack system in place, the marginal cost of new pipelines or transformations is negligible (and they build atop one another). That platform-enabled compounding effect doesn't really occur with data-intensive microservices/streaming pipelines and means only the biggest business-critical applications (or skunk works shadow projects) will get the data-intensive applications[1] treatment, and business stakeholders will be hesitant to greenlight it.

I think Materialize is trying to build that Modern-Data-Stack type platform for real-time use cases: one that doesn't come with the cognitive cost of a completely separate architecture or the divide of completely separate teams and tools. If I already had a go-to system in place for streaming data that could be prototyped with the data warehouse, then shifted over to a streaming platform, the same teams could manage it and we'd actually get that cumulative compounding effect. Not to mention it becomes a lot easier to then justify using a real-time application the next time.

[1]: https://martin.kleppmann.com/2014/10/16/real-time-data-produ...

spullara2y ago

These reasons are why Snowflake is building hybrid tables (under the Unistore umbrella). Those tables keep recent data in an operational store and historical data in their typical data warehouse storage systems. Best of both worlds. Still in private preview but definitely the answer to how you build applications that need both without using multiple databases and syncing.

https://www.snowflake.com/guides/htap-hybrid-transactional-a...

datavirtue2y ago

Conveniently leave out the issue of cost. Snowflake is piling on features that encourage more compute. Customers abuse the system and they (Snowflake) respond by helping cement them into continuing the abuse (spending more) by developing features to make bad habits and horrible engineering decisions look like something they should be doing. Typical.

disgruntledphd22y ago

Snowflake are the Oracle of the cloud.

1 more reply

andrenotgiant2y ago

It seems like Snowflake is going all-in on building features and doing marketing that encourage their customers to build applications, serving operational workloads, etc... on them. Things like in-product analytics, usage-based billing, personalization, etc...

Anyone here taking them up on it? I'm genuinely curious how it's going.

munchor2y ago

Disclaimer: I work at SingleStoreDB.

Building a database that can handle both analytics and operations is what we've been working on for the past 10+ years. Our customers use us to build applications with a strong analytical component to them (all of the use cases you mentioned and many more).

How's it going? It's going really well! And we're working on some really cool things that will expand our offering from being a pure data storage solution to much more of a platform[1].

If you want to learn more about our architecture, we published this paper at SIGMOD in late 2022 about it[2].

[1]: https://davidgomes.com/databases-cant-be-just-databases-anym...

[2]: https://dl.acm.org/doi/pdf/10.1145/3514221.3526055

weego2y ago

After a series of calls, examples and explanations with them we never managed to get close to a reasonable projection of what our monthly costs would be like on Snowflake. I understand why companies in this field use abstract notions of 'processing' /'compute' units but it's a no go finance wise.

Without some close to real world projections we don't have time to consider implementation to find out for ourselves.

benjaminwootton2y ago

Snowflake is one of the easier tools to measure because it’s a simple function of region, instance size, uptime. If you can simulate some real loads and understand the usage then you do have a shot at forecasting.

Of course the number is going to be high, but you have to remember it rolls up compute and requires less manpower. This is also a win for finance if they are comfortable with usage based billing.

1 more reply

politelemon2y ago

I've noticed that too. I think the marketing is definitely working, I'm seeing a few organisations starting to shift more and more workloads onto them, and some are also publishing datasets on their marketplace.

One of their most interesting offerings coming up is Snowpark which lets you run a Python function as a UDF, within Snowflake. This way you don't have to transfer data around everywhere, just run it as part of your normal SQL statements. It's also possible to pickle a function and send it over... so conceivably one could train a data science model and run that as part of a SQL statement. This could get very interesting.

lokar2y ago

Also containers:

https://www.snowflake.com/blog/snowpark-container-services-d...

jamesblonde2y ago

In theory, fine. Then you look at the walled garden that is Snowpark - only "approved" python libraries are allowed there. It will be a very constrictive set of models you can train, and very constrictive feature engineering in Python. And, wait, aren't Python UDFs super-slow (GIL) - what about Pandas UDFs (wait that's PySpark.....)

2 more replies

atwebb2y ago

> run a Python function as a UDF

Is that a differentiator? I'm unfamiliar with Snowpark's actual implementation but know SQL Server introduced Python/R in engine in 2016? something like that.

ramraj072y ago

Snowflake is capturing a large market share in analytics industries thanks to its “just works” feature. I’m a massive fan.

But in the end, snowflake stores the data in S3 as partitions. If you want to update a single value you have to replace the entire s3 partition. Similarly you need to read a reasonable amount of s3 data to retrieve even a single record. Thus you’re never going to get responses shorter than half a second (at best). As long as you don’t try and game around that limitation it works great.

Materialize up here also follows the same model in the end FWIW.

GeneralAntilles2y ago

Yeah, they're providing a path-of-least-resistance for getting stuff done in your existing data environment.

A common challenge in a lot of organizations is IT as a roadblock to deployment of internal tools coming from data teams. Snowflake is answering this with Streamlit. You get an easy platform for data people to use and deploy on and it can all be done within the business firewall under data governance within Snowflake.

kwillets2y ago

It's really bad for in-product analytics. Slow and expensive to keep running on a 24/7 SLA.

Even with Vertica doing that, we're seeing 10x costs just doing back-office DWH. My job now is keeping Vertica running so we can pay our Snowflake bill.

datadrivenangel2y ago

I assume they're angling for a salesforce acquisition as they move towards being a micro-hosting service like salesforce.

rubiquity2y ago

Snowflake is worth at least 25% of Salesforce so such an acquisition is very unlikely unless Salesforce has $60 billion or more burning a hole in their pocket.

hodgesrm2y ago

This article uses an either or definition that leaves out a big set of use cases that combine operational and analytic usage:

> First, a working definition. An operational tool facilitates the day-to-day operation of your business. Think of it in contrast to analytical tools that facilitate historical analysis of your business to inform longer term resource allocation or strategy.

Security event and incident management (SEIM) is a typical example. You want fast notification on events combined with the ability to sift through history extremely quickly to assesss problems. This is precisely the niche occupied by real-time analytic databases like ClickHouse, Druid, and Pinot.

mritchie7122y ago

I caught myself wondering how Google, Microsoft and Amazon let Snowflake win. You can argue they haven't won, but lets assume they have. Two things:

1. SNOW's market cap is $50B. GOOGL, MSFT, AMZN are all over $1T. Owning Snowflake would be a drop in the bucket for any of them (let alone if they were splitting the revenue).

2. Snowflake runs on AWS, GCP or Azure (customers choice), so a good chunk of their revenue goes back to these services.

Looking at these two points as the CEO of GOOGL, MSFT, or AMZN, I'd shrug away Snowflake "beating us". It's crazy that you can build a $50B company that your largest competitors barely care about.

hnthrowaway03282y ago

I agree. The cloud providers are basically the guys who sell shovels in gold rush. Snowflake still needs to build on top the clouds so MAG never lose. I heard that SNOW is offering its own cloud services but I could be wrong -- and even if I'm correct they have a super long way to catch up.

pm902y ago

What I heard is that AWS got there first with Redshift but then didn’t really invest as much as was required by users so Snowflake found an opening and pounced on it.

BigQuery in GCP is a pretty great alternative and I know that GCP invests/promotes it heavily, but they were slightly late to the market.

datadrivenangel2y ago

BigQuery is pretty great. The serverless by default setup works very well for most BI use cases. There are some weird issues when you're a heavy user and start hitting the normally hidden quotas.

1 more reply

riku_iki2y ago

> 1. SNOW's market cap is $50B. GOOGL, MSFT, AMZN are all over $1T. Owning Snowflake would be a drop in the bucket for any of them (let alone if they were splitting the revenue).

FAANG can't utilize their market cap to buy SNOW, they would need to pay cash, and 50B is very large amount for any of these companies (its about annual Google net income).

Also, snow stock is very inflated now, it is heavily income negative, and revenue not that high, stock price is very high on growth expectations.

mritchie7122y ago

My point is that they wouldn't want to buy it (or have focused much on building a competitive product) if it's only worth $50B.

atwong2y ago

There are other databases today that do real time analytics (ClickHouse, Apache Druid, StarRocks along with Apache Pinot). I'd look at the ClickHouse Benchmark to see who are the competitors in that space and their relative performance.

slotrans2y ago

Yeah ClickHouse is definitely the way to go here. Its ability to serve queries with low latency and high concurrency is in an entirely different league from Snowflake, Redshift, BigQuery, etc.

biggestdummy2y ago

StarRocks handles latency and concurrency as well as Clickhouse but also does joins. Less denormalization, and you can use the same platform for traditional BI/ad-hoc queries.

2 more replies

RyanHamilton2y ago

For real-time and large historical data, open source there's tdengine/questdb, commercial DolphinDB and kdb+. If you only need fast recent data and not large historical embedding is a good solution which means h2/duckdb/sqlite if open source, extremedb if commercial. I've benchmarked and ran applications on most these databases including running real-time analytics.

qoega2y ago

Open-source ClickHouse also allows both real-time and large historical data.

1 more reply

bob10292y ago

> Operational workloads have fundamental requirements that are diametrically opposite from the requirements for analytical systems, and we’re finding that a tool designed for the latter doesn’t always solve for the former.

We aren't even going to consider the other direction? Running your analytics on top of a basic-ass SQL database?

In our shop, we aren't going for a separation between operational and analytical. The scale of our business and the technology available means we can use one big database for everything [0]. The only remaining challenge is to construct the schema such that consumers of the data are made aware of the rates of change and freshness of the rows (load interval, load date, etc).

If someone wants to join operational with analytical, I think they shouldn't have to reach for a weird abstraction. Just write SQL like you always would and be aware that certain data sources might change faster than others.

Sticking everything onto one target might sound like a risky thing, but I find many of these other "best practices" DW architectures to be far less palatable (aka sketchier) than one big box. Disaster recovery of 100% of our data is handled with replication of a single transaction log and is easy to validate.

[0]: https://learn.microsoft.com/en-us/azure/azure-sql/database/h...

debarshri2y ago

15 years ago when I joined workforce business intelligence was all the rage. Data world was pretty much straight forward. You had transactional data in OLTP databases which would be shipped to Operational data stores, then rolled into the data warehouse. Datawarehouses were actual specialised hardware appliances (netezza et al) reporting tools were robust too.

Everytime I moved from one org to another, these concepts of data warehouse somehow got muddled.

dontupvoteme2y ago

The random bolding of words reeks of adtech.

is the usage of such an old html tag itself now a trigger to send something to /dev/null?

xiasongh2y ago

Why does bolding words imply adtech?

dontupvoteme2y ago

They are bad.

They indicate poorly written text.

j / k navigate · click thread line to collapse

76 comments

mrbungie2y ago

chuckhend2y ago

jgraettinger12y ago

> they told me the new CDO was already convinced that said "RT-based" datalake was the way to go forward

mrbungie2y ago

Well, it's an engineering decision, so there is no direct answer.

albert_e2y ago

Arent a lot of businesses being sold on "real time analytics" these days?

What is the current thinking in this space? I am sure there are technical solutions here but what is the framework to evaluate which use case actually deserves pursuing such a setup.

Curious to hear.

lmkg2y ago

I work as a web analyst (think Google Analytics).

This organization does not benefit from real-time analytics.

So if you want a framework, I would start from a single, simple question: What can you actually do with real-time data? Name one (1) action your organization could take based on that data.

I think it's also useful to separate what data benefits from realtime and which users can make use of it. Even if you have real-time data, some consumers don't benefit from immediacy.

coredog642y ago

Generally speaking “What questions do you hope to answer with this data?” is a good filter for all kinds of operational data.

iamacyborg2y ago

Hate to say it but if your site was only getting a few thousand visitors a month your test was likely vastly underpowered and therefore irrelevant anyway

1 more reply

slotrans2y ago

Just gonna keep linking this til the heat death of the universe: https://mcfunley.com/whom-the-gods-would-destroy-they-first-...

Real-time analytics are worse than useless. At best they are a distracting resource sink, at worst they directly harm the quality of decision-making.

RyanHamilton2y ago

Eumenes2y ago

ozim2y ago

higeorge132y ago

I work in a real time subscription analytics company (chartmogul.com). We fetch, normalize and aggregate various billing systems data and eventually visualize them into graphs and tables.

But we try to make everyone happy so we aim for real time analytics.

mrbungie2y ago

For most operations RT/NRT data stuff normally is about novelty/vanity rather than a real existing business need.

1 more reply

jandrewrogers2y ago

chimerasaurus2y ago

Disclaimer - Snowflake here.

Example: (and yes, this is a Snowflake video but has a visual) https://youtu.be/Ou04UZWwxgg?t=64

RyanHamilton2y ago

hintymad2y ago

On the other hand, real-time analytics for machines can be critical to a business, which is why Yandex built Clickhouse and ByteDance deployed more than 20K nodes of Clickhouse.

Just like using any technology, we need to figure out what problems we solve for real-time analytics first.

atwebb2y ago

Real-time generally means near-real-time and even then I liken it to availability.

If asked people would say "I need to always be up" until they see the costs associated with it, then being out for a few hours a year tends to be ok.

datadrivenangel2y ago

This is a great way looking at it. The cost starts going up rapidly from daily and approaches infinity as you get to ultra-low latency realtime analytics.

riordan2y ago

------

[1]: https://martin.kleppmann.com/2014/10/16/real-time-data-produ...

spullara2y ago

https://www.snowflake.com/guides/htap-hybrid-transactional-a...

datavirtue2y ago

disgruntledphd22y ago

Snowflake are the Oracle of the cloud.

1 more reply

andrenotgiant2y ago

Anyone here taking them up on it? I'm genuinely curious how it's going.

munchor2y ago

Disclaimer: I work at SingleStoreDB.

How's it going? It's going really well! And we're working on some really cool things that will expand our offering from being a pure data storage solution to much more of a platform[1].

If you want to learn more about our architecture, we published this paper at SIGMOD in late 2022 about it[2].

[1]: https://davidgomes.com/databases-cant-be-just-databases-anym...

[2]: https://dl.acm.org/doi/pdf/10.1145/3514221.3526055

weego2y ago

Without some close to real world projections we don't have time to consider implementation to find out for ourselves.

benjaminwootton2y ago

Of course the number is going to be high, but you have to remember it rolls up compute and requires less manpower. This is also a win for finance if they are comfortable with usage based billing.

1 more reply

politelemon2y ago

lokar2y ago

Also containers:

https://www.snowflake.com/blog/snowpark-container-services-d...

jamesblonde2y ago

2 more replies

atwebb2y ago

> run a Python function as a UDF

Is that a differentiator? I'm unfamiliar with Snowpark's actual implementation but know SQL Server introduced Python/R in engine in 2016? something like that.

ramraj072y ago

Snowflake is capturing a large market share in analytics industries thanks to its “just works” feature. I’m a massive fan.

Materialize up here also follows the same model in the end FWIW.

GeneralAntilles2y ago

Yeah, they're providing a path-of-least-resistance for getting stuff done in your existing data environment.

kwillets2y ago

It's really bad for in-product analytics. Slow and expensive to keep running on a 24/7 SLA.

Even with Vertica doing that, we're seeing 10x costs just doing back-office DWH. My job now is keeping Vertica running so we can pay our Snowflake bill.

datadrivenangel2y ago

I assume they're angling for a salesforce acquisition as they move towards being a micro-hosting service like salesforce.

rubiquity2y ago

Snowflake is worth at least 25% of Salesforce so such an acquisition is very unlikely unless Salesforce has $60 billion or more burning a hole in their pocket.

hodgesrm2y ago

This article uses an either or definition that leaves out a big set of use cases that combine operational and analytic usage:

mritchie7122y ago

I caught myself wondering how Google, Microsoft and Amazon let Snowflake win. You can argue they haven't won, but lets assume they have. Two things:

1. SNOW's market cap is $50B. GOOGL, MSFT, AMZN are all over $1T. Owning Snowflake would be a drop in the bucket for any of them (let alone if they were splitting the revenue).

2. Snowflake runs on AWS, GCP or Azure (customers choice), so a good chunk of their revenue goes back to these services.

Looking at these two points as the CEO of GOOGL, MSFT, or AMZN, I'd shrug away Snowflake "beating us". It's crazy that you can build a $50B company that your largest competitors barely care about.

hnthrowaway03282y ago

pm902y ago

What I heard is that AWS got there first with Redshift but then didn’t really invest as much as was required by users so Snowflake found an opening and pounced on it.

BigQuery in GCP is a pretty great alternative and I know that GCP invests/promotes it heavily, but they were slightly late to the market.

datadrivenangel2y ago

BigQuery is pretty great. The serverless by default setup works very well for most BI use cases. There are some weird issues when you're a heavy user and start hitting the normally hidden quotas.

1 more reply

riku_iki2y ago

> 1. SNOW's market cap is $50B. GOOGL, MSFT, AMZN are all over $1T. Owning Snowflake would be a drop in the bucket for any of them (let alone if they were splitting the revenue).

FAANG can't utilize their market cap to buy SNOW, they would need to pay cash, and 50B is very large amount for any of these companies (its about annual Google net income).

Also, snow stock is very inflated now, it is heavily income negative, and revenue not that high, stock price is very high on growth expectations.

mritchie7122y ago

My point is that they wouldn't want to buy it (or have focused much on building a competitive product) if it's only worth $50B.

atwong2y ago

slotrans2y ago

Yeah ClickHouse is definitely the way to go here. Its ability to serve queries with low latency and high concurrency is in an entirely different league from Snowflake, Redshift, BigQuery, etc.

biggestdummy2y ago

StarRocks handles latency and concurrency as well as Clickhouse but also does joins. Less denormalization, and you can use the same platform for traditional BI/ad-hoc queries.

2 more replies

RyanHamilton2y ago

qoega2y ago

Open-source ClickHouse also allows both real-time and large historical data.

1 more reply

bob10292y ago

We aren't even going to consider the other direction? Running your analytics on top of a basic-ass SQL database?

[0]: https://learn.microsoft.com/en-us/azure/azure-sql/database/h...

debarshri2y ago

Everytime I moved from one org to another, these concepts of data warehouse somehow got muddled.

dontupvoteme2y ago

The random bolding of words reeks of adtech.