Writing a Time Series Database from Scratch (opens in new tab)

(fabxc.org)

180 pointsspektom9y ago38 comments

38 comments

32 comments · 6 top-level

jnordwick9y ago· 17 in thread

I still can't figure out why people can't even come close to KDB+. It is a real conundrum. I've been waiting patiently for something to show up, but the gap seems to keep getting bigger instead of smaller.

Is it that people want to make the problem more complex that it needs to be? Is it that those who know most about these issues don't share their secrets so implemented from the outside often don't have a good understanding of how to do things properly? If you were to asked the guy behind Prometheus if he's looked at the commercial offerings and what he's learned from them, would even be able to speak about them intelligently?

There seems to be a huge skills gap on these things that I can't put my finger on. I'd love to be able to use a real TSDB, even at only half the speed and usefulness. It would be great for these smaller firms that cant or wont pay the license fees for a commercial offering until they get larger.

jandrewrogers9y ago

This observation applies to database engines generally. It is straightforward to explain: almost everyone with deep expertise in sophisticated database engine internals are contractually prohibited from disclosing anything about the design of such things. It is an industry steeped in trade secrets. Sophisticated database engines are littered with novel algorithms and designs that have never been published. Because basic performance superiority is a key market differentiator, database companies have invested heavily in computer science R&D for decades to get an edge they are loath to share. Sadly, academia is increasingly in the role of independently re-discovering what has been known for 10-20 years but treated as a secret.

The other big factor is that almost all open source database engines are the product of someone who is basically designing their first database engine. Prometheus is now on its third(?) redesign, and while you can see the growth of the designers' skill it is still a relatively naive design. This is not a jab at the designers, it just takes many creative iterations over many years to discover all the tricks that the experts know but nobody publishes. I probably spent a decade producing database engines that in hindsight were pretty mediocre but at the time I thought they were sophisticated -- I didn't know what I didn't know.

As a last observation, the other issue is that building a genuinely sophisticated database engine requires a commitment to writing an enormous amount of code before the barest database kernel can even bootstrap, on the order of 20kLoC of dense C++, never mind provide any database functionality. I've noticed that most open source projects need to see the payoff of code running and doing something minimally useful with much lower levels of investment. For database engines, getting to running code as quickly as possible compromises the design but the pressure to get to running code is understandable. The amount of time and effort required to design and build a state-of-the-art database engine goes far beyond what most people are willing to invest in what is essentially a hobby.

wickawic9y ago

> Sadly, academia is increasingly in the role of independently re-discovering what has been known for 10-20 years but treated as a secret.

This has been true of network programming for a long time as well. Private companies find novel ways of switching and routing packets, reducing latency, etc. and academia is left to pick up the pieces.

zeveb9y ago

> This observation applies to database engines generally. It is straightforward to explain: almost everyone with deep expertise in sophisticated database engine internals are contractually prohibited from disclosing anything about the design of such things. It is an industry steeped in trade secrets. Sophisticated database engines are littered with novel algorithms and designs that have never been published. Because basic performance superiority is a key market differentiator, database companies have invested heavily in computer science R&D for decades to get an edge they are loath to share. Sadly, academia is increasingly in the role of independently re-discovering what has been known for 10-20 years but treated as a secret.

This makes an interesting case for software patents: what if database companies patented those insights instead of keeping them secret, and thus could keep others from using them for, say, 3-4 years? At the end of that time anyone could use those techniques.

This seems to me better than either trade secrets are the current absurdly-long patent lifetimes.

geocar9y ago

KDB+ is not 20kLoC of dense C++.

1 more reply

stingraycharles9y ago

You should look into proprietary technologies, there are some very compelling offerings there. I have personally used QuasarDB ( https://www.quasardb.net/ ), and can say that it's on par with kdb+ in terms of performance, scales horizontally and has support for transactions and secondary indexes. They have a single node free edition as well.

Typically these solutions tend to follow the money, but if you ignore the list price, you can usually get a decent deal for a startup.

jnordwick9y ago

I know there are other proprietary dbs out there. I was speaking of FOSS/open databases any cant they come even without an order of magnitude of commercial offerings. Quasar would be included in that commercial group.

I'm only slightly familiar with Quasar, but I believe it is just a key-value store with document characterists, not a general database. The comparisons on their website are mostly to other KV stores and document systems.

StreamBright9y ago

>> I still can't figure out why people can't even come close to KDB+

You can have a look what happend with extremely simple code bases over time. Imagine the same effect on larger code.

https://www.computer.org/cms/Computer.org/ComputingNow/issue...

zeptomu9y ago

I have to admit that I did not know time-series databases are a thing and just recently realized that because they came up more often. Unfortunately I do not have an answer for you, but I know traditional DB systems are hard to build, as we expect more reliability and guarantees from them compared to other (daemon) software.

However I am interested to know why this kind of data and/or problems require specific software and why it can't be handled by traditional RDBMs? Obviously you could model the domain with a classic database, but seemingly there exist important queries that can't be satisfied (at least timely) through classic systems - what kind of queries are these, and why do they fail using a general-purpose DB approach?

nickpeterson9y ago

I'm no expert, but I believe the Crux of the issue is how data is naturally organized and stored. In a row oriented database, most data is stored in pages that contain rows. There are often indexes with ordering, but unless a secondary index contains all the values needed (often called a covering index), the entire row must be retrieved to answer any query that uses that information.

Most Time-series databases are columnar in nature, and often have the concept of time baked into the ordering of values (think vectors not sets). Because they are columnar, they are more trivial to retrieve just the data needed by a query. Suddenly instead of loading a billion rows and averaging the value in one column, you're just accesing the column itself to answer the question. From an IO perspective that's a huge savings.

Now imagine you have special, dark arts for working on compressed data, and a query optimizer you've been tuning for a decade for demanding clients. It does not surprise me that kdb is much faster than the open source competitors. And to be fair, even with excellent traditional databases like postgres, I bet Oracle, db2, and Ms SQL are still generally faster in most queries.

1 more reply

nunb9y ago

There is also timescale which is built atop Postgres. Or rather, modifies Postgres' query engine etc., this is not some rails app!

https://blog.timescale.com/when-boring-is-awesome-building-a...

ash9y ago

> ... why this kind of data and/or problems require specific software and why it can't be handled by traditional RDBMs?

There's Tgres, that stores time series in Postgres, efficiently (using Postgres arrays and views):

https://github.com/tgres/tgres

https://grisha.org/blog/2017/01/21/storing-time-seris-in-pos...

shaklee39y ago

What about extremedb? It's not free, but I believe is on par. Or did you mean only free?

jnordwick9y ago

I was talking mostly about FOSS.

I've actually never heard of ExtremeDB. It's only in-memory though, so it can it even handle the huge datasets that timeseries/column dbs handle? I don't see any performance tests for it either.

1 more reply

rodionos9y ago

In-memory, K/Q, expensive, old, closed source?

angersock9y ago

> It would be great for these smaller firms that cant or wont pay the license fees for a commercial offering until they get larger.

Why should we give that work away for free, especially if it's not that hard to roll your own for most small-scale needs?

derefr9y ago

Small company doesn't mean small scale, if the data is purchased/licensed instead of primary data. There are one-man startups trying to do build their business on analysis of Facebook-sized data-sets.

1 more reply

branchless9y ago

If someone could get somewhere near kdb and have it be easier to write / debug I'm sure banks would be interested as there are very few kdb heroes out there. The amount of time wasted by strats writing code once then another one writing the same thing some time later must be stupendous.

bongonewhere9y ago· 6 in thread

Is like everyone creating a time series database from scratch?

rodionos9y ago

The fact that many companies (FB, Uber, Google, Netflix, SO) roll their own TSDBs for metrics collection suggests that there is a real need. Or maybe there is not. It could a way to make boring system monitoring jobs fun and fancy again.

metaobject9y ago

Perhaps these companies have such varied requirements that none of the existing TSDB systems fully support? We built our own custom time series db, along with a suite of tools for accessing, slicing, plotting, etc bc (at the time, at least) there was no support for bitpacking data, and storing/calculating certain spatial data operations.

weq9y ago

My first job was in 2005 writing a TSDB abstraction over sql server used for BI with defense.

we merged ~30Gb of data per night in about 8hrs on a 32core machine which was severly limited by a platter SAN. We would get about 1 disk failure /6months.

So do u need an actual TSDB? Or are other companies doing what we did?

Rapzid9y ago

I was under the impression that influxdb's storage engine was going to be a viable solution for Prometheus at some point. Now they are writing their own; not sure where that interest went.

sagichmal9y ago

That was never true AFAIK. I recall one of the core devs did a comparison of storage engines years ago and found Influx unsuitable; that work led to the current-gen storage engine.

1 more reply

pas9y ago

That was about remote storage. And that was around the time when Influx itself used pluggable backends. Now influx has its own storage engine, and prometheus too.

nicolaslem9y ago· 2 in thread

The description of this new storage engine does not explain how it manages the durability of the data.

When you compare with the extreme efforts traditional databases take to ensure that unplugging a server will never ever result in data loss[0], silencing this problem makes me wonder.

Is it that at this ingest rate even trying to ensure durability is a vain effort?

[0] https://www.sqlite.org/atomiccommit.html

bbrazil9y ago

Durability is not a requirement in that sense.

Consider that a regular scrape has happened and that data has been accepted by the DB but not yet flushed to disk.

Whether the database dies just before or just after the scrape produces the same result: The data for that scrape isn't present when the server restarts.

There plenty of other ways a scrape might not succeed that we have no control over (e.g. other end is overloaded, network blip), so there's not much point obsessing over this particular failure mode.

> Is it that at this ingest rate even trying to ensure durability is a vain effort?

It's not in vain, but it'd be a bad engineering tradeoff in terms of throughput.

tveita9y ago

They mention using a write-ahead-log, which should be sufficient for durability if implemented correctly.

bogomipz9y ago· 1 in thread

I had a question about the following statement from the post:

>"Prometheus's storage layer has historically shown outstanding performance, where a single server is able to ingest up to one million samples per second as several million time series"

How are there one million samples per second equating to several million time series? Is a single sample not equivalent to a single data point in a time series db for a particular metric in Prometheus?

ah-9y ago

I think this means that there are e.g. 10 million different time series, that each get a new sample appended every 10 seconds.

iksaif9y ago

You may also want to check https://github.com/criteo/biggraphite/wiki/BigGraphite-Annou... which is also about how to write a TSDB from Scatch but with different goals.

ah-9y ago

Exciting times in database land! It certainly seems like the good systems are converging on very similar storage architectures. This design is so similar to how Kafka and Kudu work internally.

As the raw storage seems pretty optimal now, I suspect next we'll see a comeback of indices for more precise queries to get another jump in performance.

j / k navigate · click thread line to collapse

38 comments

32 comments · 6 top-level

jnordwick9y ago· 17 in thread

jandrewrogers9y ago

wickawic9y ago

> Sadly, academia is increasingly in the role of independently re-discovering what has been known for 10-20 years but treated as a secret.

zeveb9y ago

This seems to me better than either trade secrets are the current absurdly-long patent lifetimes.

geocar9y ago

KDB+ is not 20kLoC of dense C++.

1 more reply

stingraycharles9y ago

Typically these solutions tend to follow the money, but if you ignore the list price, you can usually get a decent deal for a startup.

jnordwick9y ago

StreamBright9y ago

>> I still can't figure out why people can't even come close to KDB+

You can have a look what happend with extremely simple code bases over time. Imagine the same effect on larger code.

https://www.computer.org/cms/Computer.org/ComputingNow/issue...

zeptomu9y ago

nickpeterson9y ago

1 more reply

nunb9y ago

There is also timescale which is built atop Postgres. Or rather, modifies Postgres' query engine etc., this is not some rails app!

https://blog.timescale.com/when-boring-is-awesome-building-a...

ash9y ago

> ... why this kind of data and/or problems require specific software and why it can't be handled by traditional RDBMs?

There's Tgres, that stores time series in Postgres, efficiently (using Postgres arrays and views):

https://github.com/tgres/tgres

https://grisha.org/blog/2017/01/21/storing-time-seris-in-pos...

shaklee39y ago

What about extremedb? It's not free, but I believe is on par. Or did you mean only free?

jnordwick9y ago

I was talking mostly about FOSS.

I've actually never heard of ExtremeDB. It's only in-memory though, so it can it even handle the huge datasets that timeseries/column dbs handle? I don't see any performance tests for it either.

1 more reply

rodionos9y ago

In-memory, K/Q, expensive, old, closed source?

angersock9y ago

> It would be great for these smaller firms that cant or wont pay the license fees for a commercial offering until they get larger.

Why should we give that work away for free, especially if it's not that hard to roll your own for most small-scale needs?

derefr9y ago

1 more reply

branchless9y ago

bongonewhere9y ago· 6 in thread

Is like everyone creating a time series database from scratch?

rodionos9y ago

metaobject9y ago

weq9y ago

My first job was in 2005 writing a TSDB abstraction over sql server used for BI with defense.

we merged ~30Gb of data per night in about 8hrs on a 32core machine which was severly limited by a platter SAN. We would get about 1 disk failure /6months.

So do u need an actual TSDB? Or are other companies doing what we did?

Rapzid9y ago

I was under the impression that influxdb's storage engine was going to be a viable solution for Prometheus at some point. Now they are writing their own; not sure where that interest went.

sagichmal9y ago

That was never true AFAIK. I recall one of the core devs did a comparison of storage engines years ago and found Influx unsuitable; that work led to the current-gen storage engine.

1 more reply

pas9y ago

That was about remote storage. And that was around the time when Influx itself used pluggable backends. Now influx has its own storage engine, and prometheus too.

nicolaslem9y ago· 2 in thread

The description of this new storage engine does not explain how it manages the durability of the data.

When you compare with the extreme efforts traditional databases take to ensure that unplugging a server will never ever result in data loss[0], silencing this problem makes me wonder.

Is it that at this ingest rate even trying to ensure durability is a vain effort?

[0] https://www.sqlite.org/atomiccommit.html

bbrazil9y ago

Durability is not a requirement in that sense.

Consider that a regular scrape has happened and that data has been accepted by the DB but not yet flushed to disk.

Whether the database dies just before or just after the scrape produces the same result: The data for that scrape isn't present when the server restarts.

There plenty of other ways a scrape might not succeed that we have no control over (e.g. other end is overloaded, network blip), so there's not much point obsessing over this particular failure mode.

> Is it that at this ingest rate even trying to ensure durability is a vain effort?

It's not in vain, but it'd be a bad engineering tradeoff in terms of throughput.

tveita9y ago

They mention using a write-ahead-log, which should be sufficient for durability if implemented correctly.

bogomipz9y ago· 1 in thread

I had a question about the following statement from the post:

>"Prometheus's storage layer has historically shown outstanding performance, where a single server is able to ingest up to one million samples per second as several million time series"

ah-9y ago

I think this means that there are e.g. 10 million different time series, that each get a new sample appended every 10 seconds.

iksaif9y ago

You may also want to check https://github.com/criteo/biggraphite/wiki/BigGraphite-Annou... which is also about how to write a TSDB from Scatch but with different goals.

ah-9y ago

Exciting times in database land! It certainly seems like the good systems are converging on very similar storage architectures. This design is so similar to how Kafka and Kudu work internally.

As the raw storage seems pretty optimal now, I suspect next we'll see a comeback of indices for more precise queries to get another jump in performance.

j / k navigate · click thread line to collapse