Edit: the PostGIS extension is GPL, and that license choice has been very successful. Hopefully the AGPL works out at least as well for Citus; I'm just not familiar enough to know what the implications will be in this context.
[0] https://en.wikipedia.org/wiki/Affero_General_Public_License
> With the release of newly open sourced Citus v5.0, pg_shard's codebase has been merged into Citus...
This is fantastic, sounds like the setup process is much simpler.
I wonder if they have introduced the Active/Active Master solution they were working on? I know that before, there was one Master and multiple Worker nodes, and the solution was to have a passive backup of the Master.
If, say, they release the Active/Active Master later this year, that's huge. I can pretty much think of my DB solution as done at that point.
We're working on making Citus masterless. In all openness, we evaluated two different approaches to this in the past six months, and wrapped up the design for one. This design works well on the cloud, and we already demonstrated a working version: https://youtu.be/_nun2S6EdWo?t=411
For on-premise deployments, the primary challenge is set-up complexity. We're now prototyping one of those designs to know more: https://github.com/citusdata/citus/issues/389
We expect to share all the details and a concrete timeline in April.
Or is Citus taking over the master/master replication? (or is it doing something different?)
also in Turkish: kolaylıklar dilerim ("may your work go easily") :)
Several things have changed over the last two years that allowed us to make this happen. Most importantly, we've continued building out the product for a broader user base, grown with more customers and users, received further funding as validation, and expanded both our team and product to offer additional revenue-generating services. All put together, open sourcing Citus is something we've always wanted to do, and we are excited to continue building on it for many years to come, with the help of the community and our enterprise customers.
Beyond that, we have a few other things in the works for the future that will cover other revenue models.
Companies of any appreciable size will be happy to pay for support if they choose to make Citus a part of their critical infrastructure. And the industry has reached an inflection point where enough companies want as much of their infrastructure as possible to be open source that you can run a company where most of your stuff is open source while still making a ton of money (like Red Hat, CoreOS, Docker, etc.).
AGPL means the only people using it have to be licensed the same.
AGPL only requires open sourcing any modifications you make to the software when you give users direct access to a server running the software, which seems like something you would never want to do in the case of a database.
You can use database servers running AGPL software in a closed source SaaS: http://www.gnu.org/licenses/why-affero-gpl.en.html
"For customers with large production deployments, we also offer an enterprise edition that comes with additional functionality"
Software isn't free to produce, and the need to make money off software isn't something companies should be ashamed of. In fact, nowadays I'm leaning towards trusting OSS with clear financial sustainability over software whose long term existence seems shaky.
I use a major open source system with enterprise features and support but don't pay for any of those options. I've used it for 3 years and it's been invaluable. No pressure to start paying for anything. Some of its premium features have actually become free over that period. But I wouldn't decide that all open source systems with premium features are safe based on that experience.
Can you publish competitive positioning of Citus vs Actian Matrix (nee ParAccel) and Vertica? I'd love to compare them side by side - even if it's just from your point of view :-)
While this might appear to be an implementation detail at first, it has important product implications, and might even affect how you think about your database stack. By not forking the core database, you are choosing to always stay with the core PostgreSQL product. For starters, you get the uber-cool (and uber-fast) JSONB type that came with 9.4, or the recently checked in UPSERTs, or the popular PostGIS extension for geospatial capabilities. More philosophically, the moment you use forks of a database, you know you'll be diverging over time. And when you introduce new databases and/or piece together many different ones to build one application, your development cycles will only get costlier and more complex over time.
This was a long answer to a short question, but hopefully useful. Let me know if you have questions, or any feedback using Citus – would love to hear your thoughts!
Precisely because cstore_fdw is built for PostgreSQL, and Citus is PostgreSQL (see Part 3), however, you can still choose to use cstore_fdw as the storage engine for your Citus cluster. Citus will still parallelize the queries as you'd expect it to, but instead of hitting row-based tables, they will hit columnar ones. cstore_fdw has certain limitations; importantly, it is not updatable, so we don't consider it an alternative to a data warehouse. Rather, it is useful if you are archiving your quickly growing timeseries / event data on PostgreSQL or Citus.
First, Citus is not a traditional data warehouse. We position Citus as the real-time, scalable database that serves your application under a mix of high-concurrency short requests and ad-hoc SQL analytics (i.e. think both random and sequential scans for a customer-facing analytics app). The default storage engine for Citus is the PostgreSQL storage engine, which is row-based. This is in contrast to many data warehouses, which often use a column store and/or batch data loads, and are focused purely on analytics. The trade-offs you get are:
- Citus vs. DWH performance: DWH and Citus both have a similar parallelization for analytics queries (multi-core, multi-machine), but most data warehouses use a columnar storage engine instead of a row-based one. Columnar storage is designed for faster analytics queries, so that makes columnar DWH generally faster on longer-running analytics queries. However, this comes at the expense of (1) concurrency and (2) short-request performance (think simple lookups, updates, real-time data ingest) vs. Citus' row-based storage. If you've tried having tens of concurrent connections to Redshift for short lookups, or performing hundreds or thousands of inserts/updates to power your application, these limitations will be familiar. This is to be expected, as Redshift is not designed as a real-time operational database, but as an offline data warehouse.
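To make the "multi-core, multi-machine" parallelization concrete, here is a minimal Python sketch of a scatter-gather average: each worker computes a partial (sum, count) over its own shard and a coordinator merges them. This is an illustration of the general technique only, not Citus code; the `shards` data and the function names are made up for the example, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list stands in for the rows one worker holds.
shards = [
    [12.0, 7.5, 3.0],
    [8.0, 1.5],
    [4.0, 4.0, 4.0, 6.0],
]

def partial_aggregate(rows):
    """Worker side: scan the local shard and return (sum, count)."""
    return sum(rows), len(rows)

def distributed_avg(shards):
    """Coordinator side: fan out to all shards, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(partial_aggregate, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    # An average cannot be merged directly, so it is rebuilt from sum/count.
    return total / count if count else None

result = distributed_avg(shards)
```

The point of shipping (sum, count) rather than per-shard averages is that the partial results stay small and can be combined exactly, no matter how unevenly rows are spread across shards.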
In essence, the two classes of products are more complementary than substitutes, even while they have some overlaps in their analytic capabilities. Something like Redshift will give you fast offline analytics after you move your data in batch (via S3); Citus will directly power your analytic apps in real time, without ETL'ing your event/user data back and forth between separate OLTP and OLAP databases. Both can be extremely fast: Redshift can run complex data warehousing queries that take an hour in a few minutes, while Citus can scan and aggregate 100 million records in a few seconds while simultaneously ingesting your events in real time.
I hope that provides some clarification on the workloads. There is a lot more, including columnar storage and product approach (re: implications of extending Postgres 9.5 vs. forking Postgres 8.x), and I’ll dive into those in separate comments as well.
It's a good product, and it was even fairly easy to do a major version upgrade / cluster relocation. At least as easy as such a thing can be. :-)
Which of these are supported:
1. Full PostgreSQL SQL language
2. All isolation levels including Serializable (in the sense that they actually provide the same guarantees as normal PostgreSQL)
3. Never losing any committed data on sub-majority failures (i.e. synchronous replication)
4. Ability to automatically distribute the data (i.e. sharding)
5. Ability to replicate the data instead or in addition to sharding
6. Transactionally-correct read scalability
7. Transactionally-correct write scalability where possible (i.e. multi-master replication)
8. Automatic configuration only requiring to specify some sort of "cluster identifier" the node belongs to
On PostgreSQL language support, we're updating our FAQ to have more information: https://www.citusdata.com/frequently-asked-questions Since the PostgreSQL manual (and its feature set) spans 4K+ pages, we found that the best way to think about Citus' capabilities is from a use-case standpoint. If your workload needs distributed transactions that span across machines, or large ETL jobs, Citus currently isn't the best fit.
Citus supports sharding and replication out of the box (#4, #5). On #6, reads go through a master node (metadata server) and you see what you write.
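For readers unfamiliar with how sharding and replication (#4, #5) fit together, here is a rough Python sketch of the general idea: hash the distribution key to pick a shard, then place each shard on more than one worker. It is not Citus's implementation. The worker names, shard count, replication factor, and the use of CRC32 as the hash are all assumptions chosen for illustration.

```python
import zlib

WORKERS = ["worker-1", "worker-2", "worker-3"]  # hypothetical node names
SHARD_COUNT = 8          # assumed value for the example
REPLICATION_FACTOR = 2   # assumed value for the example

def shard_for(key):
    """Map a distribution-key value to a shard id by hashing it.

    CRC32 is only a deterministic stand-in hash for this sketch.
    """
    return zlib.crc32(key.encode("utf-8")) % SHARD_COUNT

def placements(shard_id):
    """Place a shard on REPLICATION_FACTOR distinct workers, round-robin."""
    return [WORKERS[(shard_id + i) % len(WORKERS)]
            for i in range(REPLICATION_FACTOR)]

def route(key):
    """Return the shard id and the workers holding copies of that shard."""
    shard_id = shard_for(key)
    return shard_id, placements(shard_id)

shard_id, nodes = route("customer-42")
```

Because the mapping depends only on the key, every query for the same key lands on the same shard, and any of the listed workers can serve it if another one is down.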
We don't have #7. The way in which we implement this also has implications on your other questions. Multi-master (no single metadata server) is by far the biggest feature request that we receive: https://news.ycombinator.com/item?id=11353866
If we go with the approach in https://github.com/citusdata/citus/issues/389, you will be able to configure #3, #6, #7 through PostgreSQL's streaming replication settings. We still won't support distributed transactions that span across multiple machines.
On #8, could you elaborate a bit more? Do you mean a logical identifier for the node?
Also, it's hard to write a concise reply on a topic that requires so much context. I'd love to grab coffee with anyone who's interested in diving deep into distributed databases. Feel free to shoot me an email at ozgun@citusdata.com
Do you know when you're planning to release Citus 5.0 deb/rpm packages?
Have a donut and look at our marketing spreadsheets.
I'm so tired of "seamless" "effortless" "simple" distributed database lies. There are mathematical theorems as to why there is no free lunch.
To use open source code, the more permissive the license the better. But to actually open your own code, BSDL is a very tough sell.
That's also why they use the AGPL. With database systems, even if they were under the GPL, some competitors could just modify the system and run it on their own server with improvements, and offer just the service to their clients. Again, the improvements go one way only: since the competitor would not distribute the modified system, as it's running on their servers, they would not need to distribute source changes. With the AGPL, that loophole is closed.
If this were true then Cloudera, Hortonworks and a whole bunch of other companies would be out of business, yet in reality they are doing really well. All that AGPL is doing for Citus is:
1. Turning away people (customers) who are religious about licenses.
2. Eliminating any possibility of this code ever being integrated into PostgreSQL
They are of course free to release their code under any license they wish. I just think releasing code under the *GPL when you profited from a liberal BSDL is a douche nozzle thing to do. But knock yourself out! This tells me all I need to know about the company.
https://www.citusdata.com/blog/15-marco-slot/402-interactive...
I work for a BI consultancy and we don't even bat an eye until we hit billions of records in a primary fact table.
Certainly the DB server does need to scale vertically to some extent as you pass through the orders of magnitude > 10M. A good columnstore engine is also worthwhile to consider.
https://blog.cloudflare.com/scaling-out-postgresql-for-cloud...
Yes, that's what you're seeing right now, but in the past Citus (used to be "CitusDB") was a superset of the entire PostgreSQL codebase. During the lead-up to the open source release, we removed the use of any static methods or internal machinery and rewrote the installation process to use the PostgreSQL CREATE EXTENSION command. Additionally, we moved all of pg_shard's DML functionality into Citus to unify the product line.
So ultimately CitusDB was a fork but is now entirely an extension.
This would be complementary to what Citus does, which is distributing the load across multiple shard instances (each with their own cores, benefiting from the parallel work in 9.6).
"for customers with large production deployments, we also offer an enterprise edition that comes with additional functionality"
One thing I'm having trouble with is finding information about transactional semantics. If I make several updates (to differently sharded keys) in a single transaction, will the transaction boundaries be preserved (committed "locally" first, then replicated atomically to shards)? Or will they fan out to different shards with separate begin/commit statements? Or without transactional boundaries at all?
In fact, I can't really find any information on how CitusDB achieves its transparent sharding for queries and writes. Does it add triggers to distributed tables to rewrite inserts, updates and deletes? Or are tables renamed and replaced with foreign tables? I wish the documentation was a bit more extensive.
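To illustrate why the question matters, here is a small Python sketch of the second possibility raised above: updates fanned out to different shards, each with its own begin/commit. It does not describe what Citus actually does. Two in-memory SQLite databases stand in for shards, and the table, values, and `fan_out` helper are hypothetical.

```python
import sqlite3

def make_shard():
    """Create an in-memory SQLite database standing in for one shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER CHECK (balance >= 0))"
    )
    return conn

shard_a, shard_b = make_shard(), make_shard()
shard_a.execute("INSERT INTO accounts VALUES (1, 100)")
shard_b.execute("INSERT INTO accounts VALUES (2, 0)")
shard_a.commit()
shard_b.commit()

def fan_out(updates):
    """Apply each update in its own per-shard transaction; return failures."""
    failed = []
    for conn, sql, params in updates:
        try:
            with conn:  # commits on success, rolls back on exception
                conn.execute(sql, params)
        except sqlite3.IntegrityError:
            failed.append(sql)
    return failed

# Two debits the application intends to succeed or fail together.
failed = fan_out([
    (shard_a, "UPDATE accounts SET balance = balance - ? WHERE id = 1", (50,)),
    (shard_b, "UPDATE accounts SET balance = balance - ? WHERE id = 2", (50,)),
])

balance_a = shard_a.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
balance_b = shard_b.execute("SELECT balance FROM accounts WHERE id = 2").fetchone()[0]
```

Here the first shard commits its debit while the second rejects its own (the CHECK constraint fails), so the application sees a half-applied "transaction". Preserving the boundary across shards would need some form of coordinated commit, which is exactly what the question is asking about.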
Since I heard last year at PgConfSV that you will be releasing CitusDB 5.0 as open source, I've been waiting for this moment to come.
It augments 9.5's awesome capabilities with sharding and distributed queries. While this targets real-time analytics and OLAP scenarios, being an open source extension to 9.5 means that a whole lot of users will benefit from this, even under more OLTP-like scenarios.
Now that Citus is open source, ToroDB will add a new CitusDB backend soon, to scale out the Citus way, rather than in a Mongo way :)
Keep up the good work!