Citus and pg_partman: Creating a scalable time series database on Postgres (opens in new tab)

(citusdata.com)

144 pointsspathak8y ago19 comments

19 comments

16 comments · 6 top-level

est8y ago· 4 in thread

The problem with Citus and TimescaleDB is that users can not install extensions on AWS RDS PostgreSQL.

pg3148y ago

That's a problem with AWS RDS PostgreSQL, not with Citus and TimescaleDB.

mslot8y ago

The Citus Cloud service is similar to RDS, but for Citus clusters. Every cluster runs on AWS in its own VPC and it lets you peer with the VPC. It also provides auto-failover, PITR, forking, read-only followers and scaling out without any downtime.

takeda8y ago

AWS RDS has tons of other problems. It's perfect for developement, but for production usage it might bite you.

manigandham8y ago

The answer to that is to use Redshift which is a proper distributed column-oriented data warehouse. Partion on time and let the database do the rest.

Using standard Postgres with sharding for OLTP workloads is great but there are better options for OLAP, especially if you’re using managed services. Also there is the Citus cloud offering if you want to stay with Postgres.

jnordwick8y ago· 3 in thread

I wouldn't call this a time series database at all. To me, tsdb implies analytics over a time dimension such as weather sensors or stock market data.

This is just patitioning a log on time so you can query the most recent and delete the old stuff.

I doesn't even really seem to me you necessarily want to partition on time since your load distribution is going to be terrible.

Edit: too add a little. There is a thing called a temporal database that is a little more general in usage i feel in that it is more about facts at specific points in time (such as your address last year) that i think this is more about.

There is even a bitemporal database that has two time dimensions (what do we think your last year's address is right now and what did we think your last year's address was yesterday - and in those you don't ever delete data that is wrong, you just update your belief about that point in time) and they are really interesting to work with. Those would seem much more similar to this.

smilliken8y ago

Thanks for mentioning bitemporal databases, I hadn't heard of anything like that before. Time-based mutable facts are so hard to represent well.

I think their definition of time-series database fits the common usage I've seen everywhere: the data has a time dimension and is append-only/immutable (well, ok, you can mutate the data in a postgresql table, but nobody's forcing you to).

Given the choice between selecting a specialized time-series only database or using a time-series pattern in your existing postgresql database, postgresql is often (usually?) the more pragmatic choice. That's what we do at mixrank with time-series tables approaching the 100 billions of rows.

sahil-kang8y ago

I also feel that using the bitemporal pattern on a Postgres DB is the most pragmatic choice. What are some advantages to using a specialized timeseries DB? I can’t really think of any.

1 more reply

onderkalaci8y ago

> I doesn't even really seem to me you necessarily want to partition on time since your load distribution is going to be terrible.

I think that's not accurate. The tables mentioned in the post are first sharded/distributed on `repo_id`. Later, each shard is also partitioned on time dimension (i.e., `created_at `). Thus, the load should be distributed proportionally with the activity for each `repo_id`.

truth_seeker8y ago· 2 in thread

How well it performs compared to TimescaleDB ?

mslot8y ago

In general I'd expect native partitioning to have similar performance characteristics as TimescaleDB. On the insert side lot of the benefit comes from partitions having small indexes, on the delete side from the ability to drop partitions quickly, and on the select side from skipping partitions based on filters.

Postgres 10 partitioning does have a few limitations and inefficiencies that will be resolved in Postgres 11. Partitioning is the most actively developed area of postgres.

Note that Citus shards across multiple nodes, and can then partition on disk using native partitioning, which is automated by pg_partman. TimescaleDB so far only works on a single node.

Citus can also run parallel, distributed SQL queries, perform distributed transactions, and build rollups tables in parallel, and is used in Postgres clusters with up to a petabyte of data.

brightball8y ago

From everything I've seen on it, Citus makes horizontally scaling a multi-tenant database a transparent operation. This is ideal just because so many applications emphasize recency in the data that you don't want rarely accessed, 3 year old information cluttering up your indexes that are looking up things from this month 99% of the time.

Timescale seems to me more about the analytics side of things, focusing on ingestion speed and aggregation. Correct me if I'm wrong somebody.

polskibus8y ago· 1 in thread

There was a post about timescale DB recently. Can you combine citusdb and timescale somehow to get super fast timeseries data and citus flexibility?

craigkerstiens8y ago

Craig from Citus here. From our cursory looks it does look like they could work together, though personally we've not tried it and I'm not aware of users that have (though it's entirely plausible some have already). What you'd have to do is setup timescale on each of your distributed nodes and tables. You'd still talk to Citus as the primary point and it would re-write and push the queries down to the nodes which could then leverage timescale.

mslot8y ago

Big shout-out to Keith Fiske for developing and maintaining pg_partman. It makes Postgres 10's native partitioning really easy to use.

brightball8y ago

To me, this seems like the ideal solution to so many multitenant applications...

j / k navigate · click thread line to collapse

19 comments

16 comments · 6 top-level

est8y ago· 4 in thread

The problem with Citus and TimescaleDB is that users can not install extensions on AWS RDS PostgreSQL.

pg3148y ago

That's a problem with AWS RDS PostgreSQL, not with Citus and TimescaleDB.

mslot8y ago

takeda8y ago

AWS RDS has tons of other problems. It's perfect for developement, but for production usage it might bite you.

manigandham8y ago

The answer to that is to use Redshift which is a proper distributed column-oriented data warehouse. Partion on time and let the database do the rest.

jnordwick8y ago· 3 in thread

I wouldn't call this a time series database at all. To me, tsdb implies analytics over a time dimension such as weather sensors or stock market data.

This is just patitioning a log on time so you can query the most recent and delete the old stuff.

I doesn't even really seem to me you necessarily want to partition on time since your load distribution is going to be terrible.

smilliken8y ago

Thanks for mentioning bitemporal databases, I hadn't heard of anything like that before. Time-based mutable facts are so hard to represent well.

sahil-kang8y ago

I also feel that using the bitemporal pattern on a Postgres DB is the most pragmatic choice. What are some advantages to using a specialized timeseries DB? I can’t really think of any.

1 more reply

onderkalaci8y ago

> I doesn't even really seem to me you necessarily want to partition on time since your load distribution is going to be terrible.

truth_seeker8y ago· 2 in thread

How well it performs compared to TimescaleDB ?

mslot8y ago

Postgres 10 partitioning does have a few limitations and inefficiencies that will be resolved in Postgres 11. Partitioning is the most actively developed area of postgres.

Note that Citus shards across multiple nodes, and can then partition on disk using native partitioning, which is automated by pg_partman. TimescaleDB so far only works on a single node.

Citus can also run parallel, distributed SQL queries, perform distributed transactions, and build rollups tables in parallel, and is used in Postgres clusters with up to a petabyte of data.

brightball8y ago

Timescale seems to me more about the analytics side of things, focusing on ingestion speed and aggregation. Correct me if I'm wrong somebody.

polskibus8y ago· 1 in thread

There was a post about timescale DB recently. Can you combine citusdb and timescale somehow to get super fast timeseries data and citus flexibility?

craigkerstiens8y ago

mslot8y ago

Big shout-out to Keith Fiske for developing and maintaining pg_partman. It makes Postgres 10's native partitioning really easy to use.

brightball8y ago

To me, this seems like the ideal solution to so many multitenant applications...

j / k navigate · click thread line to collapse