> Materialize lets you ask questions about your data, and then get the answers in real time.
> Why not just use your database’s built-in functionality to perform these same computations? Because your database often acts as if it’s never been asked that question before, which means it can take a long time to come up with an answer, each and every time you pose the query.
> Materialize instead keeps the results of the queries and incrementally updates them as new data comes in. So, rather than recalculating the answer each time it’s asked, Materialize continually updates the answer and gives you the answer’s current state from memory.
> Importantly, Materialize supports incrementally updating a much broader set of views than is common in traditional databases (e.g. views over multi-way joins with complex aggregations), and can do incremental updates in the presence of arbitrary inserts, updates, and deletes in the input streams.
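To make that concrete, here is a minimal sketch (table and column names are hypothetical) of the kind of view being described: a multi-way join with aggregation that Materialize would keep current as rows are inserted, updated, or deleted in any of the inputs:

```sql
-- Hypothetical schema: orders, customers, and regions.
-- The aggregate stays up to date incrementally; it is not
-- recomputed from scratch on each query.
CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT r.name AS region,
       sum(o.amount) AS total_revenue,
       count(DISTINCT o.customer_id) AS buyers
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN regions r ON c.region_id = r.id
GROUP BY r.name;
```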
The biggest difference is one of intended use. Noria is, first and foremost, a research prototype, intended to explore new ideas in systems research. Materialize, by contrast, is intended to be a rock-solid piece of production infrastructure. (Much of the interesting research, in timely and differential dataflow, is already done.) We've invested a good bit in supporting the thornier bits of SQL, like full joins, nested subqueries, correlated subqueries, variable-precision decimals, and so on. Noria's support for SQL is less extensive; I think the decisions have been guided mostly by what's necessary to run its lobste.rs and HotCRP benchmarks. Make no mistake: Noria is an impressive piece of engineering, but, as an enterprise looking to deploy Noria, there's no one you can pay for support or to implement feature requests.
One area where Noria shines is in partial materialization. Details are in the Noria paper [1], but the tl;dr is that Noria has a lot of smarts around automatically materializing only the subset of the view that is actually accessed, while presently Materialize requires that you explicitly declare what subsets to materialize. We have some plans for how to bring these smarts to Materialize, but we haven't implemented them yet.
Also worth noting is that Materialize's underlying dataflow engine, differential dataflow, has the ability to support iterative computation, while Noria's engine requires an acyclic dataflow graph. We don't yet expose this power in Materialize, but will soon. Put another way: `WITH RECURSIVE` queries are a real and near possibility in Materialize, while (as I understand it) `WITH RECURSIVE` queries would require substantial retooling of Noria's underlying dataflow engine.
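For reference, this is the shape of query being discussed: a transitive-closure traversal that an acyclic dataflow graph can't express directly (the `edges` table here is hypothetical):

```sql
-- Reachability over a graph stored as an edges(src, dst) table.
-- The recursive term feeds back into itself, which is why the
-- underlying engine needs support for iterative computation.
WITH RECURSIVE reachable(node) AS (
    SELECT dst FROM edges WHERE src = 1
  UNION
    SELECT e.dst
    FROM edges e
    JOIN reachable r ON e.src = r.node
)
SELECT node FROM reachable;
```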
One of the creators of Noria, Jon Gjengset, did an interview on Noria [2] that covered some of the differences between Noria and differential dataflow from his perspective, which I highly recommend you check out as well!
[0]: https://twitter.com/frankmcsherry/status/1056957760435376129...
[1]: https://jon.tsp.io/papers/osdi18-noria.pdf
[2]: https://notamonadtutorial.com/interview-with-norias-creator-...
Do you think it's valuable to have this as a service rather than roll your own solution?
It's a surprisingly difficult problem. I wouldn't roll my own.
Additionally, given it will be Apache licensed in 4 years, I think it'd be good to ask "Can I wait?" before going all in
I just want to say this is a very dangerous assumption to make.
I run a company that helps our customers consolidate and transform data from virtually anywhere in their data warehouses. When we first started, the engineer in me made the same declaration, and I worked to get data into warehouses seconds after an event or record was generated in an origin system (website, app, database, Salesforce, etc).
What I quickly learned was that analysts and data scientists simply didn't want or need this. Refreshing the data every five minutes in batches was more than sufficient.
Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data in the warehouse in sub-minute latency, the jobs to transform that data ran every 5 minutes.
To be clear, I don't discount the need for telemetry and _some_ data to be actionable in a smaller time frame; I'm just wary of a data warehouse fulfilling that obligation.
In any event, I do think this direction is the future (an overwhelming number of data sources allow change data capture almost immediately after an event occurs); I just don't think it's the only architecture that can satisfy most analysts'/data scientists' needs today.
I would love to hear the use cases that your customers have that made Materialize a good fit!
The round-trip requirement was dropped before we launched, reducing the usage of the technology stack to pure telemetry gathering.
The analysts are all perfectly happy with 5-10 minute delays.
Link to my GDC talk, in case people are interested: https://www.youtube.com/watch?v=o098roxWAkA
I think we're actually arguing the same points here. It's not that every use case needs single-digit millisecond latencies! There are plenty of use cases that are satisfied by batch jobs running every hour or every night.
But when you do need real-time processing, the current infrastructure is insufficient. When you do need single-digit latency, running your batch jobs every second, or every millisecond, is computationally infeasible. What you need is a reactive, streaming infrastructure that's as powerful as your existing batch infrastructure. Existing streaming infrastructure requires you to make tradeoffs on consistency, computational expressiveness, or both; we're rapidly evolving Materialize so that you don't need to compromise on either point.
And once you have a streaming data warehouse in place for the use cases that really demand single-digit latencies, you might as well plug your analysts and data scientists into that same warehouse, so you're not maintaining two separate data warehouses. That's what we mean by ideal: not only does it work for the systems with real-time requirements, but it works just as well for the humans with looser requirements.
To give you an example, let me respond to this point directly:
> Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data in the warehouse in sub-minute latency, the jobs to transform that data ran every 5 minutes.
The idea is that you would have your analysts write these ETL pipelines directly in Materialize. If you can express the cleaning/de-duplication/aggregation/projection in SQL, Materialize can incrementally maintain it for you. I'm familiar with a fair few ETL pipelines that are just SQL, though there are some transformations that are awkward to express in SQL. Down the road we might expose something closer to the raw differential dataflow API [0] for power users.
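A minimal sketch of that idea, with hypothetical table and column names, and assuming Postgres-style `DISTINCT ON` is available: the cleaning, dedup, and aggregation live in layered view definitions, and Materialize maintains both as raw events arrive:

```sql
-- Dedupe raw events (keep the latest record per event_id);
-- maintained incrementally instead of rescanning history.
CREATE MATERIALIZED VIEW clean_events AS
SELECT DISTINCT ON (event_id)
       event_id, user_id, amount, created_at
FROM raw_events
WHERE amount IS NOT NULL
ORDER BY event_id, created_at DESC;

-- Aggregate the cleaned data; this view updates as soon as
-- clean_events changes, with no 5-minute batch job in between.
CREATE MATERIALIZED VIEW daily_totals AS
SELECT user_id,
       date_trunc('day', created_at) AS day,
       sum(amount) AS total
FROM clean_events
GROUP BY user_id, date_trunc('day', created_at);
```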
[0]: https://github.com/TimelyDataflow/differential-dataflow
With sufficiently expressive SQL and UDF support there are whole classes of stateful services that are performing lookups, aggregations, etc, that could be written as just views on streams of data. Experts who model systems in SQL, but aren't experts in writing distributed stateful streaming services would basically be able to start deploying services.
Are there any plans to support partitioned window functions, particularly lag(),lead(),first(),last() OVER() ? That would be remarkably powerful.
That said, I hope we as technologists can find some better use cases than just stealing more of people's attention.
The point of Materialize, from my understanding, is that you don't put things into the data warehouse and then, as a separate step, run these enrichment/reporting jobs on it.
Instead, you register persistent, stateful enrichment "streaming jobs" (i.e. incrementally-materialized views) into the data warehouse; and then, when data comes into a table upstream of these views, it gets streamed into and through the related job to incrementally populate the matview.
I believe you end up with a consistent MVCC snapshot between the source data and its dependent inc-matviews, where you can never see new source data and old derived data in the same query; sort of as if the inc-matview were being updated by a synchronous ON INSERT trigger on the original raw-data ingest transaction. (Or—closer analogy—sort of like the raw table has a computed-expression index visible as a standalone relation.)
Also, the job itself is written as a regular OLAP query over a fixed-snapshot dataset, as a data-scientist would expect; but this gets compiled by the query planner into a continuous CTE/window query sorta thing, that spits out running totals/averages/etc. as it gets fed incremental row-batches—but only at the granularity that consumers are demanding from it, either by directly querying the inc-matview, or by sourcing it in their own streaming jobs. Which, of course, you do by just using a regular OLAP query on inc-matview X in your definition of inc-matview Y.
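A sketch of that layering (names hypothetical): view Y is defined with an ordinary OLAP query over view X, and changes flow through both incrementally:

```sql
-- X: per-user running totals maintained from a raw table.
CREATE MATERIALIZED VIEW user_totals AS
SELECT user_id, sum(amount) AS total
FROM purchases
GROUP BY user_id;

-- Y: a plain query over X; changes to purchases propagate
-- through user_totals into big_spenders automatically.
CREATE MATERIALIZED VIEW big_spenders AS
SELECT user_id, total
FROM user_totals
WHERE total > 1000;
```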
> I would love to hear the use cases that your customers have that made Materialize a good fit!
We haven't used Materialize yet, but it's a very promising fit for our use case: high-level financial-instrument analytics on blockchain data. We need to take realtime-ish batches of raw transaction data and extract, in essence, arbitrarily-formatted logs from arbitrary program runs embedded in the transaction data (using heavy OLAP queries); recognize patterns in those that correspond to abstract financial events (more heavy OLAP queries); enrich those pattern-events with things like fiscal unit conversions; and then emit the enriched events as, among other things, realtime notifications, while also ending up with a data warehouse populated with both the high- and low-level events, amenable to arbitrary further OLAP querying by data scientists.
This is correct. There is an example in the docs of dumping json records into a source and then using materialized views to normalize and query them - https://materialize.io/docs/demos/microservice/
From my perspective, batching is more complicated than streaming. (Batching requires you to define parameters like batch size and interval, while streaming does not, for example.) But maybe batching tools are simpler than streaming tools; I'm not so sure.
Batching in general also has high(er) latency. That's why I usually don't prefer it unless:
That said, batching has an advantage over streaming: it can amortize a cost that you pay only once per batch. With streaming you would pay that cost for each item as it arrives.
Further, the mindset required of engineers who work with batching is different from the one streaming demands.
Each of these items can be a valid concern in the batching-vs-streaming decision. However, I find it hard to put much weight on statements like "batching is the default because the industry has been doing it for years."
I think the industry as a whole benefits when engineers in these kinds of discussions spell out why certain conditions lead to a choice like batching.
Not OP, but I'm guessing because most of that data is not actionable in real-time. There's zero point to get real-time data to analysts or decision makers if they're not going to use it to make real-time decisions; arguably, it can be even counterproductive, leading to an organizational ADHD, where people fret over minute-to-minute changes, where they should be focusing on daily or monthly running averages.
Seems that it's perhaps lacking the richness of how ksqldb uses Kafka Connectors (sinks and sources), but I don't see any reason you couldn't use Materialize in conjunction with ksqldb.
Eg:
KC-source --> ksql --> materialize --> kafka --> KC-sink
Question to Materialize...
What connectors (sinks and sources) do you have or plan to develop? Seems like it's mostly Kafka in and out at the moment.
Why would I use this over KSqlDB?
Can I snapshot and resume from the stream? Or do I need to rehydrate to re-establish state?
I'm not sure I'd say it's "more OLAP." ksqlDB is about as OLAP as it gets, considering it doesn't support any sort of transactions or consistency. We think Materialize is quite a bit more powerful than what ksqlDB offers, thanks to the underlying technologies (timely/differential). For example, our joins are proper SQL joins, and don't require you to reason about the complicated and confusing difference between a stream and a table (https://docs.ksqldb.io/en/latest/developer-guide/joins/join-...). We also have preliminary support for maintaining the consistency properties of upstream OLTP data sources, and we'll be rolling out a more complete story here shortly.
> Seems that it's perhaps lacking the richness of how ksqldb uses Kafka Connectors (sinks and sources), but I don't see any reason you couldn't use Materialize in conjunction with ksqldb.
Is there something in particular about ksqlDB connectors that we don't seem to support? Our CREATE SOURCE command is quite powerful: https://materialize.io/docs/sql/create-source/.
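For a rough illustration of what that looks like (the broker, topic, and registry addresses are placeholders, and the exact clauses may differ across Materialize versions, so check the linked docs):

```sql
-- Placeholder addresses throughout; syntax sketched from the
-- CREATE SOURCE docs linked above, not guaranteed current.
CREATE SOURCE purchases
FROM KAFKA BROKER 'kafka:9092' TOPIC 'purchases'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://registry:8081';
```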
> What connectors (sinks and sources) do you have or plan to develop? Seems like it's mostly Kafka in and out at the moment.
We already support file sources in a variety of formats, and support for Amazon Kinesis is on the short-term roadmap: https://github.com/MaterializeInc/materialize/issues/1239.
Are there other connector types you'd like to see?
> Can I snapshot and resume from the stream? Or do I need to rehydrate to re-establish state?
At the moment you can't snapshot and resume, but support for this is planned.
Important "sources" to me are obviously Kafka, but also MySQL and Mongo. Important "sinks" would be Snowflake (maybe through S3, or directly though PUT) and ElasticSearch. Although I imagine you might soon be telling me that you don't need a data warehouse once you have Materialize :)
It's also written in rust instead of Java, so there's no JVM RAM penalty or GC to contend with.
1. It has fairly rich support for types. These new-ish SQL engines often lack quite a lot of things, but this seems pretty decent.
2. I don't see any comparisons to KSQL, which seems to be the primary competitor.
3. Read the license. Read it carefully. It has a weird "will become open source in four years" clause, so keep that in mind. It also disallows hosting it for clients to use (essentially as a DBaaS).
KSQL has a distinction between streams and tables, effectively giving you control over how views are materialized but also forcing you to do that work yourself. In Materialize you just write the same SQL that you would for a batch job and the planner figures out how to transform it into a streaming dataflow.
KSQL is also eventually consistent - late arriving records might cause changes to previously emitted answers. Materialize won't give you an answer for a given timestamp until it can guarantee that the answer is correct and won't be retroactively updated later.
Expect to see some performance comparisons soon too.
It's an Elixir (Phoenix) server that listens to PostgreSQL's native replication, transforms it into JSON, then blasts it over websockets.
I see that Materialize are using Debezium, which will give you a similar result, just with connectors to Kafka etc
I highly doubt this, given that the query engine is interpreted and non-vectorized. Without compilation or vectorization, queries are 10x to 100x slower on a simple query, and 100x to 1000x slower on a query with large aggregations and joins.
> Full SQL Exploration
Except for window functions it seems. These actually matter to data analysts.
We definitely have some performance engineering work to do in Materialize, but don’t let the lack of vectorization scare you off. It’s just not as important for a streaming engine.
I work in an org with > 100 data scientists. I bet that 50% have never used window functions. I would guess that fewer than 20% know how to write one.
Your intent in this comment is unclear, but if you were looking to provide actionable feedback, you might want to reconsider your tone. This project looks like a pretty impressive feat of applied CS theory to doing useful stuff.
Snowflake had incredible timing, as they hit the market just before CFOs and non-tech business leaders realized the cost and talent needed to pull off a data lake successfully was more than they'd like. Those that were sick of the management jumped to Snowflake fast, and AWS/Azure never really responded until recently.
Awesome to see all the innovative takes on solving these extremely technical problems! I love it!
Snowflake still doesn't have an answer to streaming data (other than their fragile Kafka connector), and Azure Synapse still isn't publicly available even months after their announcement.
AWS had a good head start, but they keep piling on so many products and features that it's now a big mess and requires yet another tool (AWS Lake Formation) just to wire it all up.
There are a pile of known limitations, and our goal is to be 100% clear about them.
This is an interesting mix between the (now obsolete) PipelineDB, TimescaleDB with continuous aggregates, Kafka and other message systems with KSQL/ksqlDB/KarelDB, stream processing engines like Spark, and typical RDBMS like SQL Server with materialized views.
The amount of research to support complex and layered queries definitely sets this apart.
I'm sure Materialize is better at this since it's purpose-built but if you're on Redshift you can get at least some of the benefits of incremental materialize.
[0] https://pulsar.apache.org/docs/en/adaptors-kafka/
EDIT: I don't think this adaptor will work after all, it works by replacing the Kafka Java client library with its own, so is only applicable to Java applications.
* timestamped changes eg "record foo was deleted at time 42"
* watermarks eg "I've now sent you all the changes up to time 42"
Here's the implementation of the kafka source - https://github.com/MaterializeInc/materialize/blob/master/sr.... (There's also a bit of code in the parser/planner to wire up the CREATE SOURCE syntax, but it's fairly trivial.)
An OLAP cube that is automatically & incrementally kept in sync with the changes in the source data sounds promising.
Is that a potential use case?
Absolutely. Even aside from safety, rust has so many quality of life improvements over C++ - algebraic data types, pattern matching, checked errors instead of exceptions, sane standard library, easy packaging, readable error messages, #[must_use], immutable utf8 strings, traits instead of templates, hygienic macros etc.
Other than compile times and editor tooling, which are being actively worked on, the only real pain point I can think of is the poor support for custom allocators.
As for other alternatives, I don't personally have strong opinions on go but I think it's notable that none of the ex-cockroachdb folks at materialize suggested using it instead of rust.
That's always nice to see, since Rust jobs are somewhat rare.
> Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data. However, you can union streaming and file sources in views, which accomplishes a similar outcome.
What I find odd is that the documentation doesn't show UNION support in the SELECT documentation : https://materialize.io/docs/sql/select/
Perhaps just an oversight in the documentation?
Though it's not yet documented, we do support all of the SQL set operations [0]: UNION [ALL], INTERSECT [ALL], and EXCEPT [ALL].
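For example, the streaming-plus-static "seeding" pattern quoted upthread can be expressed with a union (source names here are hypothetical):

```sql
-- Seed a streaming source with static data by unioning a
-- file source (historical dump) with a Kafka source (live
-- changes) inside a single maintained view.
CREATE MATERIALIZED VIEW all_orders AS
SELECT id, amount FROM historical_orders_file
UNION ALL
SELECT id, amount FROM live_orders_stream;
```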
[0]: https://www.postgresql.org/docs/current/queries-union.html
After all, in BQ you're only able to query what has already been ingested or is currently being ingested, and that is somewhat by definition not 'real-time'.
There's a new feature called "Dataflow SQL" that lets you submit a declarative SQL query through the BQ UI as a job to Dataflow which can join both BigQuery tables and PubSub topics in the same job.
https://cloud.google.com/dataflow/docs/guides/sql/dataflow-s...
This system is more like BigQuery with materialized views that are constantly updated as new data arrives.
Since you're new and likely didn't know this, I've restored your post. (And welcome!)
What I'd like to know is if that would enable basically implementing social networks as just 3 tables and one materialized view, and how it would scale and perform.
Users, Relationships, Post, and a Feed materialized view that simply joins them together with an index of user_id and post_timestamp.
As relationships and messages are created or deleted, the feed view is nearly instantly updated. The whole entire view service logic then is just one really fast query. "select user,post,post_timestamp from feed where user_id = current_user and post_timestamp <= last_page_post_timestamp order by post_timestamp desc limit page_size"
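The view itself might be declared along these lines (the schema is hypothetical, matching the three tables described above):

```sql
-- Feed = posts by people the user follows, maintained
-- incrementally as rows are added to or removed from
-- relationships and posts.
CREATE MATERIALIZED VIEW feed AS
SELECT rel.follower_id AS user_id,
       u.name           AS author,
       p.body           AS post,
       p.created_at     AS post_timestamp
FROM relationships rel
JOIN posts p ON p.author_id = rel.followee_id
JOIN users u ON u.id = p.author_id;
```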
In my experience the most successful approach to this is a midpoint: you materialize/denormalize enough to feed your app endpoints and search engines but retain flexibility in searching those fat but instantly available docs, and relatedly you also don't always need to preemptively materialize absolutely everything in any particular view; see https://engineering.fb.com/data-infrastructure/dragon-a-dist... . Without being able to transparently operate on arbitrarily partially populated matviews, you are locked into a self-defeating all-or-nothing system that is likely to culturally do more harm than good with its rigidity. Imagine, for example, if there were no 'caches', just a binary choice of precomputing everything ahead of time or recomputing everything every time. Neither extreme is sufficient for all cases, and real applications are composed of many different points on that spectrum.