> Materialize lets you ask questions about your data, and then get the answers in real time.
> Why not just use your database’s built-in functionality to perform these same computations? Because your database often acts as if it’s never been asked that question before, which means it can take a long time to come up with an answer, each and every time you pose the query.
> Materialize instead keeps the results of the queries and incrementally updates them as new data comes in. So, rather than recalculating the answer each time it’s asked, Materialize continually updates the answer and gives you the answer’s current state from memory.
> Importantly, Materialize supports incrementally updating a much broader set of views than is common in traditional databases (e.g. views over multi-way joins with complex aggregations), and can do incremental updates in the presence of arbitrary inserts, updates, and deletes in the input streams.
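To make that concrete, here is a minimal sketch (table and column names are hypothetical) of the kind of view being described: a multi-way join with aggregation that Materialize would keep current as rows are inserted, updated, or deleted in any of the inputs:

```sql
-- Hypothetical schema: orders, customers, and regions.
-- The aggregate stays up to date incrementally; it is not
-- recomputed from scratch on each query.
CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT r.name AS region,
       sum(o.amount) AS total_revenue,
       count(DISTINCT o.customer_id) AS buyers
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN regions r ON c.region_id = r.id
GROUP BY r.name;
```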
The biggest difference is one of intended use. Noria is, first and foremost, a research prototype, intended to explore new ideas in systems research. Materialize, by contrast, is intended to be a rock-solid piece of production infrastructure. (Much of the interesting research, in timely and differential dataflow, is already done.) We've invested a good bit in supporting the thornier bits of SQL, like full joins, nested subqueries, correlated subqueries, variable-precision decimals, and so on. Noria's support for SQL is less extensive; I think the decisions have been guided mostly by what's necessary to run its lobste.rs and HotCRP benchmarks. Make no mistake: Noria is an impressive piece of engineering, but, as an enterprise looking to deploy Noria, there's no one you can pay for support or to implement feature requests.
One area where Noria shines is in partial materialization. Details are in the Noria paper [1], but the tl;dr is that Noria has a lot of smarts around automatically materializing only the subset of the view that is actually accessed, while presently Materialize requires that you explicitly declare what subsets to materialize. We have some plans for how to bring these smarts to Materialize, but we haven't implemented them yet.
Also worth noting is that Materialize's underlying dataflow engine, differential dataflow, has the ability to support iterative computation, while Noria's engine requires an acyclic dataflow graph. We don't yet expose this power in Materialize, but will soon. Put another way: `WITH RECURSIVE` queries are a real and near possibility in Materialize, while (as I understand it) `WITH RECURSIVE` queries would require substantial retooling of Noria's underlying dataflow engine.
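For reference, this is the shape of query being discussed: a transitive-closure traversal that an acyclic dataflow graph can't express directly (the `edges` table here is hypothetical):

```sql
-- Reachability over a graph stored as an edges(src, dst) table.
-- The recursive term feeds back into itself, which is why the
-- underlying engine needs support for iterative computation.
WITH RECURSIVE reachable(node) AS (
    SELECT dst FROM edges WHERE src = 1
  UNION
    SELECT e.dst
    FROM edges e
    JOIN reachable r ON e.src = r.node
)
SELECT node FROM reachable;
```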
One of the creators of Noria, Jon Gjengset, did an interview on Noria [2] that covered some of the differences between Noria and differential dataflow from his perspective, which I highly recommend you check out as well!
[0]: https://twitter.com/frankmcsherry/status/1056957760435376129...
[1]: https://jon.tsp.io/papers/osdi18-noria.pdf
[2]: https://notamonadtutorial.com/interview-with-norias-creator-...
Do you think it's valuable to have this as a service rather than roll your own solution?
It's a surprisingly difficult problem. I wouldn't roll my own.
Additionally, given it will be Apache licensed in 4 years, I think it'd be good to ask "Can I wait?" before going all in
I just want to say this is a very dangerous assumption to make.
I run a company that helps our customers consolidate and transform data from virtually anywhere in their data warehouses. When we first started, the engineer in me made the same declaration, and I worked to get data into warehouses seconds after an event or record was generated in an origin system (website, app, database, Salesforce, etc).
What I quickly learned was that analysts and data scientists simply didn't want or need this. Refreshing the data every five minutes in batches was more than sufficient.
Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data in the warehouse in sub-minute latency, the jobs to transform that data ran every 5 minutes.
To be clear, I don't discount the need for telemetry and _some_ data to be actionable in a smaller time frame; I'm just wary of a data warehouse fulfilling that obligation.
In any event, I do think this direction is the future (an overwhelming number of data sources allow change data capture almost immediately after an event occurs); I just don't think it's the only architecture that can satisfy most analysts'/data scientists' needs today.
I would love to hear the use cases that your customers have that made Materialize a good fit!
The round-trip requirement was dropped before we launched, reducing the usage of the technology stack to pure telemetry gathering.
The analysts are all perfectly happy with 5-10 minute delays.
Link to my GDC talk, in case people are interested: https://www.youtube.com/watch?v=o098roxWAkA
I think we're actually arguing the same points here. It's not that every use case needs single-digit millisecond latencies! There are plenty of use cases that are satisfied by batch jobs running every hour or every night.
But when you do need real-time processing, the current infrastructure is insufficient. When you do need single-digit latency, running your batch jobs every second, or every millisecond, is computationally infeasible. What you need is a reactive, streaming infrastructure that's as powerful as your existing batch infrastructure. Existing streaming infrastructure requires you to make tradeoffs on consistency, computational expressiveness, or both; we're rapidly evolving Materialize so that you don't need to compromise on either point.
And once you have a streaming data warehouse in place for the use cases that really demand single-digit latencies, you might as well plug your analysts and data scientists into that same warehouse, so you're not maintaining two separate data warehouses. That's what we mean by ideal: not only does it work for the systems with real-time requirements, but it works just as well for the humans with looser requirements.
To give you an example, let me respond to this point directly:
> Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data in the warehouse in sub-minute latency, the jobs to transform that data ran every 5 minutes.
The idea is that you would have your analysts write these ETL pipelines directly in Materialize. If you can express the cleaning/de-duplication/aggregation/projection in SQL, Materialize can incrementally maintain it for you. I'm familiar with a fair few ETL pipelines that are just SQL, though there are some transformations that are awkward to express in SQL. Down the road we might expose something closer to the raw differential dataflow API [0] for power users.
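A minimal sketch of that idea, with hypothetical table and column names, and assuming Postgres-style `DISTINCT ON` is available: the cleaning, dedup, and aggregation live in layered view definitions, and Materialize maintains both as raw events arrive:

```sql
-- Dedupe raw events (keep the latest record per event_id);
-- maintained incrementally instead of rescanning history.
CREATE MATERIALIZED VIEW clean_events AS
SELECT DISTINCT ON (event_id)
       event_id, user_id, amount, created_at
FROM raw_events
WHERE amount IS NOT NULL
ORDER BY event_id, created_at DESC;

-- Aggregate the cleaned data; this view updates as soon as
-- clean_events changes, with no 5-minute batch job in between.
CREATE MATERIALIZED VIEW daily_totals AS
SELECT user_id,
       date_trunc('day', created_at) AS day,
       sum(amount) AS total
FROM clean_events
GROUP BY user_id, date_trunc('day', created_at);
```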
[0]: https://github.com/TimelyDataflow/differential-dataflow
With sufficiently expressive SQL and UDF support there are whole classes of stateful services that are performing lookups, aggregations, etc, that could be written as just views on streams of data. Experts who model systems in SQL, but aren't experts in writing distributed stateful streaming services would basically be able to start deploying services.
Are there any plans to support partitioned window functions, particularly lag(),lead(),first(),last() OVER() ? That would be remarkably powerful.
That said, I hope we as technologists can find some better use cases than just stealing more of people's attention.
The point of Materialize, from my understanding, is that you don't put things into the data warehouse and then, as a separate step, run these enrichment/reporting jobs on it.
Instead, you register persistent, stateful enrichment "streaming jobs" (i.e. incrementally-materialized views) into the data warehouse; and then, when data comes into a table upstream of these views, it gets streamed into and through the related job to incrementally populate the matview.
I believe you end up with a consistent MVCC snapshot between the source data and its dependent inc-matviews, where you can never see new source data and old derived data in the same query; sort of as if the inc-matview were being updated by a synchronous ON INSERT trigger on the original raw-data ingest transaction. (Or—closer analogy—sort of like the raw table has a computed-expression index visible as a standalone relation.)
Also, the job itself is written as a regular OLAP query over a fixed-snapshot dataset, as a data-scientist would expect; but this gets compiled by the query planner into a continuous CTE/window query sorta thing, that spits out running totals/averages/etc. as it gets fed incremental row-batches—but only at the granularity that consumers are demanding from it, either by directly querying the inc-matview, or by sourcing it in their own streaming jobs. Which, of course, you do by just using a regular OLAP query on inc-matview X in your definition of inc-matview Y.
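A sketch of that layering (names hypothetical): view Y is defined with an ordinary OLAP query over view X, and changes flow through both incrementally:

```sql
-- X: per-user running totals maintained from a raw table.
CREATE MATERIALIZED VIEW user_totals AS
SELECT user_id, sum(amount) AS total
FROM purchases
GROUP BY user_id;

-- Y: a plain query over X; changes to purchases propagate
-- through user_totals into big_spenders automatically.
CREATE MATERIALIZED VIEW big_spenders AS
SELECT user_id, total
FROM user_totals
WHERE total > 1000;
```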
> I would love to hear the use cases that your customers have that made Materialize a good fit!
We haven't used Materialize yet, but it's a very promising fit for our use case: high-level financial-instrument analytics on blockchain data. We need to take realtime-ish batches of raw transaction data and extract, in essence, arbitrarily-formatted logs from arbitrary program runs embedded in the transaction data (using heavy OLAP queries); recognize patterns in those that correspond to abstract financial events (more heavy OLAP queries); enrich those pattern-events with things like fiscal unit conversions; and then emit the enriched events as, among other things, realtime notifications, while also ending up with a data warehouse populated with both the high- and low-level events, amenable to arbitrary further OLAP querying by data scientists.
This is correct. There is an example in the docs of dumping json records into a source and then using materialized views to normalize and query them - https://materialize.io/docs/demos/microservice/
From my perspective, batching is more complicated than streaming. (Batching requires you to define parameters like batch size and interval, while streaming does not, for example.) But maybe batching tools are simpler than streaming tools; I'm not so sure.
Batching in general also has high(er) latency. That's why I usually don't prefer it unless:
That said, batching has an advantage over streaming: it can amortize a cost that you pay only once per batch. With streaming you would pay that cost for each item as it arrives.
Further, the mindset required of engineers who work with batching is different from the one streaming demands.
Each of these items can be a valid concern in the batching-vs-streaming decision. However, I find it hard to put much weight on statements like "batching is the default because the industry has been doing it for years."
I think the industry as a whole benefits when engineers in these kinds of discussions spell out why certain conditions lead to a choice like batching.
Not OP, but I'm guessing because most of that data is not actionable in real-time. There's zero point to get real-time data to analysts or decision makers if they're not going to use it to make real-time decisions; arguably, it can be even counterproductive, leading to an organizational ADHD, where people fret over minute-to-minute changes, where they should be focusing on daily or monthly running averages.
Seems that it's perhaps lacking the richness of how ksqldb uses Kafka Connectors (sinks and sources), but I don't see any reason you couldn't use Materialize in conjunction with ksqldb.
Eg:
KC-source --> ksql --> materialize --> kafka --> KC-sink
Question to Materialize...
What connectors (sinks and sources) do you have or plan to develop? Seems like it's mostly Kafka in and out at the moment.
Why would I use this over KSqlDB?
Can I snapshot and resume from the stream? Or do I need to rehydrate to re-establish state?
I'm not sure I'd say it's "more OLAP." ksqlDB is about as OLAP as it gets, considering it doesn't support any sort of transactions or consistency. We think Materialize is quite a bit more powerful than what ksqlDB offers, thanks to the underlying technologies (timely/differential). For example, our joins are proper SQL joins, and don't require you to reason about the complicated and confusing difference between a stream and a table (https://docs.ksqldb.io/en/latest/developer-guide/joins/join-...). We also have preliminary support for maintaining the consistency properties of upstream OLTP data sources, and we'll be rolling out a more complete story here shortly.
> Seems that it's perhaps lacking the richness of how ksqldb uses Kafka Connectors (sinks and sources), but I don't see any reason you couldn't use Materialize in conjunction with ksqldb.
Is there something in particular about ksqlDB connectors that we don't seem to support? Our CREATE SOURCE command is quite powerful: https://materialize.io/docs/sql/create-source/.
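For a rough illustration of what that looks like (the broker, topic, and registry addresses are placeholders, and the exact clauses may differ across Materialize versions, so check the linked docs):

```sql
-- Placeholder addresses throughout; syntax sketched from the
-- CREATE SOURCE docs linked above, not guaranteed current.
CREATE SOURCE purchases
FROM KAFKA BROKER 'kafka:9092' TOPIC 'purchases'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://registry:8081';
```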
> What connectors (sinks and sources) do you have or plan to develop? Seems like it's mostly Kafka in and out at the moment.
We already support file sources in a variety of formats, and support for Amazon Kinesis is on the short-term roadmap: https://github.com/MaterializeInc/materialize/issues/1239.
Are there other connector types you'd like to see?
> Can I snapshot and resume from the stream? Or do I need to rehydrate to re-establish state?
At the moment you can't snapshot and resume, but support for this is planned.
Important "sources" to me are obviously Kafka, but also MySQL and Mongo. Important "sinks" would be Snowflake (maybe through S3, or directly though PUT) and ElasticSearch. Although I imagine you might soon be telling me that you don't need a data warehouse once you have Materialize :)
It's also written in rust instead of Java, so there's no JVM RAM penalty or GC to contend with.
1. It has fairly rich support for types. These new-ish SQL engines often lack quite a lot of things, but this seems pretty decent.
2. I don't see any comparisons to KSQL, which seems to be the primary competitor.
3. Read the license. Read it carefully. It has a weird "will become open source in four years" clause, so keep that in mind. It also disallows hosting it for clients to use (essentially as a DBaaS).
KSQL has a distinction between streams and tables, effectively giving you control over how views are materialized but also forcing you to do that work yourself. In Materialize you just write the same SQL that you would for a batch job and the planner figures out how to transform it into a streaming dataflow.
KSQL is also eventually consistent - late arriving records might cause changes to previously emitted answers. Materialize won't give you an answer for a given timestamp until it can guarantee that the answer is correct and won't be retroactively updated later.
Expect to see some performance comparisons soon too.
It's an Elixir (Phoenix) server that listens to PostgreSQL's native replication, transforms it into JSON, then blasts it over websockets.
I see that Materialize are using Debezium, which will give you a similar result, just with connectors to Kafka etc
I highly doubt this, given that the query engine is interpreted and non-vectorized. Without compilation or vectorization, queries are 10x to 100x slower on a simple query, and 100x to 1000x slower on a query with large aggregations and joins.
> Full SQL Exploration
Except for window functions it seems. These actually matter to data analysts.
We definitely have some performance engineering work to do in Materialize, but don’t let the lack of vectorization scare you off. It’s just not as important for a streaming engine.
I work in an org with > 100 data scientists. I bet that 50% have never used window functions. I would guess that fewer than 20% know how to write one.
Your intent in this comment is unclear, but if you were looking to provide actionable feedback, you might want to reconsider your tone. This project looks like a pretty impressive feat of applied CS theory to doing useful stuff.
Snowflake had incredible timing, as they hit the market just before CFOs and non-tech business leaders realized the cost and talent needed to pull off a data lake successfully was more than they'd like. Those that were sick of the management jumped to Snowflake fast, and AWS/Azure never really responded until recently.
Awesome to see all the innovative takes on solving these extremely technical problems! I love it!
Snowflake still doesn't have an answer to streaming data (other than their fragile Kafka connector), and Azure Synapse still isn't publicly available even months after their announcement.
AWS had a good head start, but they keep piling on so many products and features that it's now a big mess and requires yet another tool (AWS Lake Formation) just to wire it all up.
There are a pile of known limitations, and our goal is to be 100% clear about them.
This is an interesting mix between the (now obsolete) PipelineDB, TimescaleDB with continuous aggregates, Kafka and other message systems with KSQL/ksqlDB/KarelDB, stream processing engines like Spark, and typical RDBMS like SQL Server with materialized views.
The amount of research to support complex and layered queries definitely sets this apart.
I'm sure Materialize is better at this since it's purpose-built but if you're on Redshift you can get at least some of the benefits of incremental materialize.
[0] https://pulsar.apache.org/docs/en/adaptors-kafka/
EDIT: I don't think this adaptor will work after all, it works by replacing the Kafka Java client library with its own, so is only applicable to Java applications.
* timestamped changes eg "record foo was deleted at time 42"
* watermarks eg "I've now sent you all the changes up to time 42"
Here's the implementation of the kafka source - https://github.com/MaterializeInc/materialize/blob/master/sr.... (There's also a bit of code in the parser/planner to wire up the CREATE SOURCE syntax, but it's fairly trivial.)
An OLAP cube that is automatically & incrementally kept in sync with the changes in the source data sounds promising.
Is that a potential use case?
Absolutely. Even aside from safety, rust has so many quality of life improvements over C++ - algebraic data types, pattern matching, checked errors instead of exceptions, sane standard library, easy packaging, readable error messages, #[must_use], immutable utf8 strings, traits instead of templates, hygienic macros etc.
Other than compile times and editor tooling, which are being actively worked on, the only real pain point I can think of is the poor support for custom allocators.
As for other alternatives, I don't personally have strong opinions on go but I think it's notable that none of the ex-cockroachdb folks at materialize suggested using it instead of rust.
That's always nice to see, since Rust jobs are somewhat rare.
> Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data. However, you can union streaming and file sources in views, which accomplishes a similar outcome.
What I find odd is that the documentation doesn't show UNION support in the SELECT documentation : https://materialize.io/docs/sql/select/
Perhaps just an oversight in the documentation?
Though it's not yet documented, we do support all of the SQL set operations [0]: UNION [ALL], INTERSECT [ALL], and EXCEPT [ALL].
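For example, the streaming-plus-static "seeding" pattern quoted upthread can be expressed with a union (source names here are hypothetical):

```sql
-- Seed a streaming source with static data by unioning a
-- file source (historical dump) with a Kafka source (live
-- changes) inside a single maintained view.
CREATE MATERIALIZED VIEW all_orders AS
SELECT id, amount FROM historical_orders_file
UNION ALL
SELECT id, amount FROM live_orders_stream;
```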
[0]: https://www.postgresql.org/docs/current/queries-union.html
After all, in BQ you're only able to query what has already been ingested or is currently being ingested, and that is somewhat by definition not 'real-time'.
There's a new feature called "Dataflow SQL" that lets you submit a declarative SQL query through the BQ UI as a job to Dataflow which can join both BigQuery tables and PubSub topics in the same job.
https://cloud.google.com/dataflow/docs/guides/sql/dataflow-s...
This system is more like BigQuery with materialized views that are constantly updated as new data arrives.
Since you're new and likely didn't know this, I've restored your post. (And welcome!)
What I'd like to know is if that would enable basically implementing social networks as just 3 tables and one materialized view, and how it would scale and perform.
Users, Relationships, Post, and a Feed materialized view that simply joins them together with an index of user_id and post_timestamp.
As relationships and messages are created or deleted, the feed view is nearly instantly updated. The whole entire view service logic then is just one really fast query. "select user,post,post_timestamp from feed where user_id = current_user and post_timestamp <= last_page_post_timestamp order by post_timestamp desc limit page_size"
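The view itself might be declared along these lines (the schema is hypothetical, matching the three tables described above):

```sql
-- Feed = posts by people the user follows, maintained
-- incrementally as rows are added to or removed from
-- relationships and posts.
CREATE MATERIALIZED VIEW feed AS
SELECT rel.follower_id AS user_id,
       u.name           AS author,
       p.body           AS post,
       p.created_at     AS post_timestamp
FROM relationships rel
JOIN posts p ON p.author_id = rel.followee_id
JOIN users u ON u.id = p.author_id;
```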
In my experience the most successful approach to this is a midpoint: you materialize/denormalize enough to feed your app endpoints and search engines but retain flexibility in searching those fat but instantly available docs, and relatedly you also don't always need to preemptively materialize absolutely everything in any particular view; see https://engineering.fb.com/data-infrastructure/dragon-a-dist... . Without being able to transparently operate on arbitrarily partially populated matviews, you are locked into a self-defeating all-or-nothing system that is likely to culturally do more harm than good with its rigidity. Imagine, for example, if there were no 'caches', just a binary choice of precomputing everything ahead of time or recomputing everything every time. Neither extreme is sufficient for all cases, and real applications are composed of many different points on that spectrum.