Our Journey to PostgreSQL 12

(tech.coffeemeetsbagel.com)

215 points · tommyzli · 5y ago · 114 comments

82 comments · 16 top-level

0xbadcafebee · 5y ago · 15 in thread

Am I the only one who thinks it's bizarre that a structured query language defines so much of how we choose to architect and operate our systems?

Think about it for a sec: SQL is literally just a language to query and manipulate data. There's no reason that schema changes and data changes have to happen only through the one language, and only through one interface on one piece of software.

For whatever reason, this has just been how the most popular products have done it, and they largely just never changed their designs in 40 years. I like the language, and the general organization of the data is handy. But everything else about it is archaic.

Why fumble around with synchronization? 99% of the data in big datasets doesn't change. This doesn't even have to be "log-based", we just need to be able to ship the old, stable data and treat it almost like "cold storage".

Why is there a single point of entry into the data? You have to use the one database cluster to access the one database and the one set of tables. Why can't we expose that same data in multiple ways, using multiple pieces of software, on multiple endpoints?

Other protocols and languages have ways of dealing with these kinds of things. LDAP can refer you to a different endpoint to process what you need. Web servers can store, process, and retrieve the same content across many different endpoints in a variety of ways. Lots of technology exists that can easily replicate, snapshot, version-control, etc arbitrary pieces of data and expose them to any application using standard interfaces.

Why haven't we created a database yet which works more like the Unix operating system?

sk5t · 5y ago

Are you kidding, LDAP referrals as a model of how to do it? I mean, I did a lot of LDAP work back in the day, and that's not a feature that saw a lot of action outside academia. Just write your own thing on top of HTTP, that's got referrals too!

In practice there are many ways of talking to database systems, if it isn't too troubling that some SQL is often happening somewhere. Like, there's Hasura, PostgREST, etc.; or Mongo has a variety of drivers that support different inputs.

One might consider the most unix'y database to be Berkeley DB/Sleepycat, but that is probably not what you wanted. ;)

outworlder · 5y ago

> Why haven't we created a database yet which works more like the Unix operating system?

Not to be overly snarky, but have you tried? Database design is full of trade-offs.

0xbadcafebee · 5y ago

When I was a kid and learning to program, I wrote some shitty databases for fun. I learned about the trade-offs and that it was easy to write a database that out-performed RDBMSes in specific criteria. But I hadn't thought of making them extensible.

I have a pet project I'm working on, which is a generic distributed system where each component is a microservice. It turns out there are lots of these things built already, mainly by systems engineers for obscure things (Airflow, Rundeck, Stackstorm being some examples). I'll probably think about how I can redesign my project with composable databases in mind. I don't expect I'll ever have a working product, but it'll be useful to think about this problem.

Dylan16807 · 5y ago

You're basically telling them to put in months of work to find out. Even if it's not too snarky, it's a ridiculous way to learn something that could be conveyed pretty well in a blog post or a chapter of a book.

tyingq · 5y ago

I think that's a fair observation, but in this case, a dating app, a relational database and SQL seem like a great fit. The end users are fairly literally SELECTING and JOINING with LIMITS and newly FOREIGN relationships and so forth :)

paulryanrogers · 5y ago

Why are we still using ASCII or Unicode character interfaces in shells? Because like SQL they work and are moderately well understood.

There are many query languages and having one common one as a base is useful to transfer skills. Think of it as an on ramp to more specific dialects or technologies.

rubiquity · 5y ago

You just described distributed databases -- which are overwhelmingly now deciding to use SQL as their interface of choice. You're completely hand waving over the fact that data is just a bunch of bits on disk grouped into pages. Everything above that fact is a tradeoff.

0xbadcafebee · 5y ago

Actually not really. A Unix operating system can do everything I described with regular-old data, and it's not a distributed operating system. It simply has extensible standard interfaces.

Do you need a distributed database to read a .txt file with cat, Emacs, and Firefox? No. Why not? Because there's an I/O standard they all use. Does that .txt file have to live on a single filesystem, or disk? No. Why not? Because the storage mediums all have a standard virtual filesystem interface.

There is no reason databases cannot do exactly the same thing. It's just that nobody has made them do it yet. They've stuck with the exact same paradigm, and that then drives how all of us build our systems, with this archaic 40 year old model that requires heavy-lifting maintenance windows and a lot of praying.

curryst · 5y ago

You're interweaving several different issues here.

> Why fumble around with synchronization? 99% of the data in big datasets doesn't change. This doesn't even have to be "log-based", we just need to be able to ship the old, stable data and treat it almost like "cold storage".

This is not a feature of SQL, this is a feature of the database. Also, this sounds exactly like doing full-table replication to get the "old" data and then turning on log-based replication. You can do key-based replication if you really want to avoid log-based, but it's generally just a less efficient version of log-based replication.

> Why is there a single point of entry into the data? You have to use the one database cluster to access the one database and the one set of tables. Why can't we expose that same data in multiple ways, using multiple pieces of software, on multiple endpoints?

You can. Postgres supports both Perl and Python extensions that run in the RDBMS process, iirc. Very few people use them because running in the RDBMS process means that you can break the RDBMS process in really bad ways, and it is very difficult to gain any benefits over just running a separate process that communicates over SQL.

So if you consider other processes that communicate with the database and then show views of that over other protocols, that describes most of the backend apps in the world.

There's also stuff like Presto[1] that allows you to run queries distributed over multiple databases, multiple types of databases, etc, etc, etc. In that case, conceptually, Presto is "the database" and all the records you refer to are remote.

1: https://prestodb.io/

0xbadcafebee · 5y ago

> This is not a feature of SQL, this is a feature of the database

Yet they always seem tied together eh? Somehow the conventions are stuck together, and that then affects how our systems work.

> Postgres supports both Perl and Python extensions that run in the RDBMS process

But I'm talking about not having to use the RDBMS process. If I have a text file on the disk, I can use a million different programs to access it. I don't have to run one program through another program just to open, read, write, and close the file with any program. Why don't we design our databases to work this way?

> Very few people use them because running in the RDBMS process means that you can break the RDBMS process in really bad ways

Yes, it does sound bad. That's why I'd prefer an indirect method rather than having to wedge access through the RDBMS

> So if you consider other processes that communicate with the database and then show views of that over other protocols, that describes most of the backend apps in the world.

Yep! We architect entire systems-of-systems just because the model for our data management in an RDBMS is too rigid. We're building spaceships to get to the grocery store because we haven't yet figured out roads and cars.

lmm · 5y ago

You're not the only one. There are lots of better alternatives to SQL databases for most use cases (I'm lucky enough to have worked in some places where SQL datastores were the exception rather than the rule). But it takes a long time for cultural change to happen.

strokirk · 5y ago

Would you mind mentioning some good options? I've always been interested in databases, but find it hard to know which ones to learn more about and when they'd actually be worth investing in (especially since it's hard to build knowledge from toy projects).

kall · 5y ago

Putting data into cold storage, spinning up multiple flexible access points with different datasets... Sounds like what Snowflake is doing, right? I don't really use it, but it looks neat from the outside. It may be nice to bring some of that to OLTP.

marcinzm · 5y ago

Life is about tradeoffs. Complexity, latency, cost and so on. Things in general are much harder to implement correctly (see Jepsen tests) than to talk about in broad terms.

valenterry · 5y ago

Not sure why you are downvoted, you made a lot of very valid points and I agree.

People get very comfortable very quickly, even tech-savvy folks. Having to learn another language will scare many away, even though the effort might be the same; it's just perceived as harder.

lmarcos · 5y ago · 14 in thread

Silly question: they have 5.7TB in their database... How come? It's a dating app founded in 2012. I can understand that one can accumulate that much data in 11 years, but surely you can periodically archive "unused" data and move it out of your primary database, right? I mean, are the 5.7TB of data actually needed on a daily basis by their app?

(I assume data for analytical purposes is not stored in their primary DB, which is fair to assume I believe)

YuriNiyazov · 5y ago

As someone who manages an order of magnitude greater database than that for an app founded in 2008 (and that's after multiple archivings of unused data, and no severe crimes like storing image blobs in the DB), I can tell you that, uhm, Coffee Meets Bagel is either not that successful, or is doing a very good job at managing their DB size.

jamesmishra · 5y ago

5.7 TB is small by database standards. I work at a much smaller company and deal with "proportionally" much more data.

I don't know if it would be worth the engineering effort to try and archive old data in a way that still makes it transparently accessible to users that go looking for it--especially when modern databases ought to be able to scale up and out without manual archiving.

arcticfox · 5y ago

5.7 TB for an OLTP database is small?! I must be living in a different world. Obviously I know you can go that big, but I thought the number of use-cases would be limited.

systems · 5y ago

it's huge by database standards, i worked in large multinationals and dealt with some of their largest databases

5.7 TB is enormous by database standards, there is no way you can get good query performance on a 5.7 TB database without solid physical partitioning and heavily optimized queries, and most normal companies, even with a 200-500 GB database, use data marts to get good performance without a super complex architecture and geniuses working for you in the db admin department

the more i think about it, the more i think that 5.7 TB would be unusably huge, and if you have this much data, most won't even bother to partition, the db will be broken into several (hundreds of) smaller databases

diziet · 5y ago

Imagine there are 10M users. That's roughly 600 KB per user.
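
For anyone who wants to check that back-of-the-envelope number, here is a quick sketch. The 5.7 TB figure is from the article; the 10M user count is only the hypothetical above, and decimal units (1 TB = 10^12 bytes) are an assumption.

```python
# Back-of-the-envelope: database size divided by an assumed user count.
DB_SIZE_BYTES = 5.7e12       # 5.7 TB from the article, decimal units assumed
ASSUMED_USERS = 10_000_000   # hypothetical, not a known figure


def bytes_per_user(db_size_bytes: float, users: int) -> float:
    """Average number of bytes stored per user."""
    return db_size_bytes / users


per_user_kb = bytes_per_user(DB_SIZE_BYTES, ASSUMED_USERS) / 1e3
print(f"{per_user_kb:.0f} KB per user")
```

So the exact figure under these assumptions is a bit below 600 KB, and it includes indexes and any other overhead, not just row data.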

mbyio · 5y ago

And you have to account for indexes, temporary tables used for data analysis, etc. And most of it is probably not compressed. So with that perspective it isn't that much data at all.

stickfigure · 5y ago

That's an incredibly large amount per user? I have worked on a couple online dating sites, including one that was fairly popular (Let's Date - which stiffed me for my last invoice before they went belly up grrr). Unless you're storing images in the database, it's really hard to generate 600k for a dating workload - even with indexes.

The only thing I can imagine generating 600k per user is putting something like "hit tracking" in the database. Which I've done - yes it adds up - but it's also relatively easy to move to some other kind of store.

mrweasel · 5y ago

That's really a good way of looking at it. I thought it sounded like a lot of data; well, 600 KB is a lot of textual data, but who knows what they have stuffed into the database.

I worked for an e-commerce site with a few million customers, even more orders, and data duplication all over the place, and still we were using perhaps 200 GB of database storage.

sharadov · 5y ago

Not a lot. There may be options to partition, but again, you can't comment unless you know the design.

hermanradtke · 5y ago

Agreed. I ran a 1 billion dollar GMV e-commerce company and our primary OLTP database was around 60 GB. Everything else was moved to an OLAP database.

philipwhiuk2 · 5y ago

Purely at a guess, people's images are stored in the app as blobs because it's "easier"

kevas · 5y ago

What was the strategy for moving things over? By age? Monitoring queries and determining what data isn’t being queried? Something else?

tommyzli (OP) · 5y ago

Nope.

wejick · 5y ago

I think it's because they use Django, and naturally it's a monolithic architecture. You will hardly find a database this big in the microservice world; instead there will be several smaller databases.

Now, whether several smaller DBs are easier to manage than one big one is debatable. However, with a DB that huge, I would prefer having several smaller ones.

outworlder · 5y ago · 7 in thread

> As I mentioned earlier we run Postgres on i3.8xlarge instances in EC2, which come with about 7.6TB of NVMe storage.

Wait a second. You run your production database on ephemeral storage? Wow.

I see the replication setup and the S3 WAL archiving and whatnot but still... that's brave.

craigkerstiens · 5y ago

It's really a question of how many replicas you have, if you're running with sync rep or not, and what your DR story is like. We've tested it before a few times at previous employers and are exploring rolling it out for Crunchy Bridge currently. The NVMe storage is really nice: it's great performance, and the price balance of it is good as well. But it does come with nuances... I wouldn't let a user provision without HA, for example. In cases for a standard app without crazy uptime requirements, having a standby or 2 is wasted cash. So it isn't for everyone, but can be for some people.

snissn · 5y ago

hi! Can you share how you do HA on postgres? Master/slave with monitoring and manual failover, or is that automatic? If so, is it reliable? What tooling do you use? Thanks!

tommyzli (OP) · 5y ago

We are living life on the edge to an extent, but we have 5 hot standbys across AZs and regular backups + WAL archives to S3.

May not be as durable as EBS, but it's enough for me to sleep soundly at night. And with a highly concurrent WAL-G download, it takes like an hour to catch up a new replica from scratch.

throwdbaaway · 5y ago

Fine, with enough replicas, you can sleep well at night. But how about the 3 years uptime without reboot? Can you really enjoy your morning coffee without thinking about it? :)

Netflix went full ephemeral storage for their Cassandra clusters since the beginning, at the time when they were just spinning disks. Years later, they still insist on doing this, and had to come up with a creative solution to fix the uptime issue: https://netflixtechblog.medium.com/datastore-flash-upgrades-...

bcrosby95 · 5y ago

This was pretty common in AWS back in the late 00s. Performance usually sucked too much otherwise.

paulryanrogers · 5y ago

Even with prioritized IOPS I once had to resort to RAID0 and replicas to get needed performance under budget on EBS. Probably should have just bumped instance size and used local storage.

eropple · 5y ago

It's funny--we used to run Vertica on ephemeral nodes and actually found a performance improvement going to EBS, but that was pre-NVMe in AWS.

I wonder how big the delta was for CMB between EBS and ephemeral?

ants_a · 5y ago · 6 in thread

pg_upgrade would have worked fine given your requirements. Just using normal streaming replication to move the database over to new systems and then performing an in-place pg_upgrade there would be doable, most likely with a couple of minutes of downtime, and would be a much quicker and more robust process.

tommyzli (OP) · 5y ago

How would that have worked with multiple replicas cascading from the new primary? Streaming replication doesn't work across versions, so would we have had to build out a tree of new instances, then pg_upgrade them all at the same time?

craigkerstiens · 5y ago

I believe it is possible to repoint the replicas with re-wind and then repoint to a new timeline. This is something we looked at at Heroku a long time back. It wasn't trivial to fix, but Heroku eventually improved some of this. This post drills into some of that - https://blog.keikooda.net/2017/10/18/battle-with-a-phantom-w...

cconstantine · 5y ago

I've done a few postgres upgrades, the first using pg_upgrade, and the last doing effectively what you did (it was even 9.x -> 12).

My experience was that it needs to rewrite all the data for some tables/indexes under some circumstances, and the db won't be available while that happens. So, unless your db can be down for the time it takes to rewrite all that data it isn't really an option. After having done the upgrade through streaming logical replication I'm not sure I would try pg_upgrade again.

I did the pg_upgrade style update a long time ago, so most details are fuzzy, but I remember setting up a string of replicas something like:

primary -> [read_replica, backups_replica]

read_replica -> [upgrade_replica]

upgrade_replica -> [read_replica_upgraded, backups_replica_upgraded]

This allowed us to do multiple practice rounds without putting any unnecessary load on the primary. I think we needed to re-initiate replication off 'upgrade_replica' after the upgrade, but we did the live update during low-load so the extra read load wasn't an issue.

ants_a · 5y ago

The pg_upgrade documentation describes a way to use rsync to quickly replicate the upgraded contents of the new primary to replicas. So you would first move to upgraded base VMs running the old version streaming from the old primary, which can be done one host at a time if need be.

The new cluster can then be pg_upgraded and rsynced all at once.

paulryanrogers · 5y ago

Good questions. As logical replication matures it may someday be possible to replicate among versions.

sa46 · 5y ago

We tried pg_upgrade going from Postgres 10 to Postgres 12 and it didn't work. An individual instance was about 8 TiB in size. We left it running for over a day to see if it would complete. Instead, we used an approach similar to logical replication described in the article.

To be fair, our use case was probably close to pathological for pg_upgrade. We had lots of TOAST data and dozens to hundreds of indexes per table.

mbyio · 5y ago · 5 in thread

This is a very difficult thing to do. Very impressive. I have so many questions but my number one is: were you able to evaluate alternatives to your existing vertical scaling based setup? For example, cockroachdb, multi-master postgres, using sharding instead of a single DB, etc. At that database size, you are well past the point in which a more advanced DB technology would theoretically help you scale and simplify your architecture, so I'm curious why you didn't go that route.

paulryanrogers · 5y ago

Distributed systems are hard. Multi master is particularly sticky, especially if the data doesn't have natural boundaries.

Once solved, though, horizontal is nice, if more involved to maintain.

namibj · 5y ago

CockroachDB is pretty good at encapsulating the complexity of multi-master.

You'll have to accept that transactions can fail due to conflicts, so if they are interactive, you'll have to retry manually.
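
A minimal sketch of what retrying manually tends to look like on the client side. `RetryableTxnError` and `run_with_retries` are hypothetical names, not CockroachDB or driver APIs; in a real client you would catch the driver's serialization-failure error (SQLSTATE 40001) instead, and the backoff numbers are arbitrary.

```python
import random
import time


class RetryableTxnError(Exception):
    """Hypothetical stand-in for a driver's serialization-failure error."""


def run_with_retries(txn_fn, max_attempts=5, base_delay=0.01):
    """Call txn_fn until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except RetryableTxnError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter so competing clients spread out.
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.random())


# Usage: a fake transaction that conflicts twice, then commits.
attempts = {"count": 0}


def transfer():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RetryableTxnError("conflict, please retry")
    return "committed"


result = run_with_retries(transfer)
print(result, "after", attempts["count"], "attempts")
```

The important part is that the whole transaction body is re-run from the start on each attempt, so it has to be safe to execute more than once.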

Edit: I'd like to hear criticism, instead of just seeing disapproval.

tommyzli (OP) · 5y ago

basically what paulryanrogers said.

We thought about migrating to Citus, but I don't have a good idea of how to shard our dataset efficiently.

If we were to shard by user id, then creating a match between two people would require cross-shard transactions and joins. Sharding by geography is also tough because people move around pretty frequently.
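
To make the user-id problem concrete, here is a toy sketch. The modulo placement and the shard count are made up for illustration and are not how Citus actually assigns shards.

```python
NUM_SHARDS = 4  # hypothetical shard count


def shard_for(user_id):
    """Toy placement: modulo on the user id (real systems hash the key)."""
    return user_id % NUM_SHARDS


def shards_touched_by_match(user_a, user_b):
    """A match involves both users, so it touches each user's shard."""
    return {shard_for(user_a), shard_for(user_b)}


def cross_shard_fraction(user_ids):
    """Fraction of all possible pairs whose users live on different shards."""
    pairs = [(a, b) for i, a in enumerate(user_ids) for b in user_ids[i + 1:]]
    crossing = sum(1 for a, b in pairs if len(shards_touched_by_match(a, b)) > 1)
    return crossing / len(pairs)


print(shards_touched_by_match(240, 350))
print(cross_shard_fraction(list(range(1000))))
```

With evenly spread users and N shards, roughly (N - 1) / N of arbitrary pairs end up on two shards, so adding shards makes the cross-shard share larger, not smaller.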

skinkestek · 5y ago

What kind of data is this?

My best guesses are

- either it is SaaS, in which case it should make sense to shard by customer

- or it is something-to-consumer (social networking?), in which case I guess you'll have to take a step back and see if you can sacrifice one of your current assumptions

... but I feel I'm missing something since what I am saying feels a bit trivial.

mslot · 5y ago

Sharding a matching engine is indeed pretty hard, and requires redundancy and very deliberate data modelling choices.

That does seem like a fun exercise :).

The general rules of the game are: You can only scale up throughput of queries/transactions that only access 1 shard (some percentage going to 2 shards can be ok). You can only scale down response time of large operations that span across shards, since they are parallelized. You should only join distributed tables on their distribution column. You can join reference tables on any column.

The thing that comes to mind is to use a reference table for any user data that is used to find/score matches. Reference tables are replicated to every node and can be joined with distributed tables and each other using arbitrary join clauses, so joining by score or distance is not a problem, but you need to store the data multiple times.

One of the immediate benefits of reference tables is that reads can be load-balanced across the nodes, either by using a setting (citus.task_assignment_policy = 'round-robin') or using a distributed table as a routing/parallelization scheme.

    CREATE TABLE profiles (
        user_id bigint primary key,
        location geometry,
        profile jsonb
    );
    SELECT create_reference_table('profiles');
    
    CREATE TABLE users (
        user_id bigint primary key references profiles (user_id),
        name text,
        email text
    );
    SELECT create_distributed_table('users', 'user_id');
    
    -- replicate match_score function to all the nodes
    SELECT create_distributed_function('match_score(jsonb,jsonb)');
    
    -- look up profile of user 350, goes to 1 shard
    SELECT * FROM users u, profiles p WHERE u.user_id = p.user_id AND u.user_id = 350;
    
    -- find matches for user #240 within 5km, goes to 1 shard
    SELECT b.user_id, match_score(a.profile, b.profile) AS score
    FROM users u, profiles a, profiles b
    WHERE u.user_id = 240 AND u.user_id = a.user_id 
    AND match_score(a.profile,b.profile) > 0.9 AND st_distance(a.location,b.location) < 5000 
    ORDER BY score DESC LIMIT 10;

The advantage of having the distributed users table in the join is mainly that you divide the work in a way that keeps each worker node's cache relatively hot for a specific subset of users, though you'll still be scanning most of the data to find matches.

Where it gets a bit more interesting is if your dating site is opinionated / does not let you search, since you can then generate matches upfront in batches in parallel.

    CREATE TABLE match_candidates (
        user_id_a bigint references profiles (user_id),
        user_id_b bigint references profiles (user_id),
        score float,
        primary key (user_id_a, user_id_b)
    );
    SELECT create_distributed_table('match_candidates', 'user_id_a', colocate_with :='users');
    
    -- generate match candidates for all users in a distributed, parallel fashion
    -- will generate a match candidate in both directions, assuming score is commutative
    INSERT INTO match_candidates
    SELECT a.user_id, b.user_id, match_score(a.profile,b.profile) AS score
    FROM users u, profiles a, profiles b
    WHERE u.user_id = a.user_id
    -- no ORDER BY ... LIMIT here: it would cap the total rows inserted, not the rows per user
    AND match_score(a.profile,b.profile) > 0.9 AND st_distance(a.location,b.location) < 5000;

For interests/matches, it might make sense to have some redundancy in order to achieve reads that go to 1 shard as much as possible.

    CREATE TABLE interests (
        user_id_a bigint references profiles (user_id),
        user_id_b bigint references profiles (user_id),
        initiated_by_a bool,
        mutual bool,
        primary key (user_id_a, user_id_b)
    );
    SELECT create_distributed_table('interests', 'user_id_a', colocate_with :='users');
    
    -- 240 is interested in 350, insert into 2 shards (uses 2PC)
    BEGIN;
    INSERT INTO interests VALUES (240, 350, true, false);
    INSERT INTO interests VALUES (350, 240, false, false);
    END;
    
    -- people interested in #350, goes to 1 shard
    SELECT * FROM interests JOIN profiles ON (user_id_b = user_id) WHERE user_id_a = 350 AND NOT initiated_by_a;
    
    -- it's a match! update 2 shards (uses 2PC)
    BEGIN;
    UPDATE interests SET mutual = true WHERE user_id_a = 240 AND user_id_b = 350;
    UPDATE interests SET mutual = true WHERE user_id_a = 350 AND user_id_b = 240;
    END;
    
    -- people #240 is matched with, goes to 1 shard
    SELECT * FROM interests JOIN profiles ON (user_id_b = user_id) WHERE user_id_a = 240 AND mutual;

For data related to a specific match, you can perhaps use the smaller of the two user IDs as the distribution column to avoid the redundancy.

    CREATE TABLE messages (
        user_id_a bigint,
        user_id_b bigint,
        from_a bool,
        message_text text,
        message_time timestamptz default now(),
        message_id bigserial,
        primary key (user_id_a, user_id_b, message_id),
        foreign key (user_id_a, user_id_b) references interests (user_id_a, user_id_b) on delete cascade
    );
    SELECT create_distributed_table('messages', 'user_id_a', colocate_with :='interests');

    -- user 350 sends a message to 240, goes to 1 shard
    INSERT INTO messages VALUES (240, 350, false, 'hi #240!');
    
    -- user 240 sends a message to 350, goes to 1 shard
    INSERT INTO messages VALUES (240, 350, true, 'hi!');
    
    -- user 240 looks at chat with user 350, goes to 1 shard
    SELECT from_a, message_text, message_time
    FROM messages 
    WHERE user_id_a = 240 AND user_id_b = 350
    ORDER BY message_time DESC LIMIT 100;
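
The smaller-ID trick can be sketched on the application side like this. `message_row` is a hypothetical helper, and the tuple layout simply mirrors the first four columns of the `messages` table above.

```python
def message_row(sender_id: int, recipient_id: int, text: str) -> tuple:
    """Build (user_id_a, user_id_b, from_a, message_text) for one message.

    user_id_a is always the smaller ID, so both directions of a chat share
    the same distribution key and land on the same shard.
    """
    if sender_id == recipient_id:
        raise ValueError("sender and recipient must differ")
    user_id_a = min(sender_id, recipient_id)
    user_id_b = max(sender_id, recipient_id)
    from_a = sender_id == user_id_a
    return (user_id_a, user_id_b, from_a, text)


row_1 = message_row(350, 240, "hi #240!")  # 350 writes to 240
row_2 = message_row(240, 350, "hi!")       # 240 writes back
print(row_1)
print(row_2)
```

Both rows carry 240 as `user_id_a`, which is why the two INSERT statements above go to a single shard.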

This exercise goes on for a while. You still get the benefits of PostgreSQL and the ability to scale up throughput of common operations or scale down response time of batch operations, but it does require careful data model choices.

(Citus engineer who enjoys distributed systems puzzles)

philipwhiuk2 · 5y ago · 5 in thread

The fact that the answer isn't "move to RDS where Amazon solves the problem for us, which isn't our core business as a relationships app" seems to me to be a massive failing of the RDS offering and cloud services in general.

ramraj07 · 5y ago

RDS can be.. expensive? Like by a lot?

aeyes · 5y ago

If you want to do upgrades like this on RDS with minimal downtime you will end up doing the same process: Set up new servers, do logical replication, switch over.

The RDS update process is a single button and you have no way of knowing how long it will take. There are some tricks like turning off Multi AZ and taking a snapshot manually before starting the process but still - for large instances you could be waiting anywhere from 30 minutes to 3 hours for RDS to finish. With large instance types I have seen RDS take a full hour just to provision an instance, in the meantime you'll sit there hitting F5 not knowing if it will ever finish.

cwyers · 5y ago

Only if they evaluated RDS and found it wanting. They don't even mention testing it.

tommyzli (OP) · 5y ago

It's not in the post, but I answered this in a separate thread. RDS doesn't let us provision as many IOPS as we need.

Apparently Aurora behaves differently, but I wasn't aware of that when we specced out the project.

tyingq · 5y ago

I suspect the RDS limitations are left there to push you to Aurora. They control that and would be better equipped to make the most of their infrastructure and margins with it.

temp667 · 5y ago · 5 in thread

Very nice.

Did they migrate into Amazon RDS while doing this? For smaller projects I've stopped doing the self managed postgresql thing. The pricing is higher (75%?) for RDS for some use cases but can be worth it.

Going to try RDS Proxy next.

tommyzli (OP) · 5y ago

Thanks! We stuck with plain EC2. RDS has a limit of 80,000 provisioned IOPS and our read replicas on Postgres 9.6 would regularly hit near double that during peak

sk5t · 5y ago

Did you consider lowering those IOPS with application-level and/or distributed in-memory cache and/or pub-sub notifications to let your app nodes not pester the database so much? Reasonably performant hand-written SQL (no ORM!), review of query plans, maybe shift the hot path into functions/procs?

throwdbaaway · 5y ago

That's an impressive amount of IOPS. I have always been an EBS / pd-ssd guy, and mostly rely on memory to reduce the IOPS requirement. But as cloud providers typically charge a ridiculous amount of money for memory, a setup like yours with instance storage / local-ssd is an intriguing option.

orf · 5y ago

That limit doesn’t apply to Aurora - did you consider that?

0xbadcafebee · 5y ago

So.......... caching?

shoo · 5y ago · 3 in thread

> We then made the following changes to the subscriber database in order to speed up the synchronization: [...] Set fsync to off

I'm curious how much risk of data loss this added.

I guess the baseline is "we need to migrate before we run out of disk", i.e. you're either going to have data loss or a long period of unavailability if the migration cannot be carried out fast enough.

ants_a · 5y ago

If fsync is turned back on and a manual sync call is issued before considering the replica valid there will be no risk from this.

tommyzli (OP) · 5y ago

ants_a is correct. Also, our NVMe storage is ephemeral so you aren't recovering from a power loss anyways :)

rubiquity · 5y ago

Disclaimer: I work at AWS, not on EC2.

Locally attached disks are not ephemeral to instance reboots/power failures. However, the disks are wiped after instance terminations. On the official EC2 product pages this is called "instance storage" not "ephemeral storage."

mooreds · 5y ago · 2 in thread

Amazing that the process went so smoothly and that there were so many resources for them to draw from. Jumping from 9 to 12 is quite a few major versions!

Also liked the couple of gotchas which go to show no matter how smooth a data migration is, there'll be some bumps.

hans_castorp · 5y ago

> Jumping from 9 to 12 is quite a few major versions!

Just a little side note.

They were jumping from 9.6 to 12, not from 9.0 to 12.

Before Postgres 10 was released, the first two numbers defined a "major" version. So from 9.6 to 12 it's three major releases (9.6 -> 10, 10 -> 11, 11 -> 12).
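
The numbering rule can be sketched as a small helper. `major_version` is a hypothetical function written for this comment, not something shipped with Postgres.

```python
def major_version(version: str) -> str:
    """Return the major version of a PostgreSQL version string.

    Before 10, the first two numbers form the major version (9.6.20 -> 9.6).
    From 10 onward, only the first number does (12.5 -> 12).
    """
    parts = version.split(".")
    if int(parts[0]) >= 10:
        return parts[0]
    return ".".join(parts[:2])


for v in ("9.6.20", "10.15", "12.5"):
    print(v, "->", major_version(v))
```

So 9.5 and 9.6 are different major versions, while 12.4 and 12.5 are minor releases of the same one.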

foxhill · 5y ago

still, changes between 9.6 and 12 are _numerous_, both in features and performance: llvm based query compilation, CTE de-materialisation, proper procedures, and that's just off the top of my head.

i wish the process for upgrading postgres were easier/more dynamic. i'm sure plenty of people are still using versions 9.6 or earlier.

justinclift · 5y ago · 2 in thread

Reading over this, it seems like there isn't an offsite backup done of the database? eg to have a copy of the data in a "safe place" off AWS infrastructure

If something goes wrong with their relationship with AWS, that could be business ending. :(

tyingq5y ago

The termination clauses in their T&C's say they will give you access, post "for cause" termination, so long as you've paid your bill. Though I'm mindful that pulling a lot of data could take a long time.

Dylan168075y ago

That's still a lot of trust that nothing else wipes the account.

But this post wasn't about backups, so there might be a whole lot excluded from the diagram.

u678u5y ago· 1 in thread

I love RDBMS over NoSQL, but the whole upgrade and schema change process is always stressful. I miss the days when we could ask our DBA to deal with it. :)

jabo5y ago

At least with RDBMS, the database engine takes care of the actual data movement for you, after you issue a SQL command. With NoSQL, when you need to update your document format, you now need to handle the data migration yourself.

esseti5y ago· 1 in thread

Is pglogical better than a minimal downtime with pg_dump/psql? It seems like a lot of work to set up pglogical to migrate versions (or am I missing anything?)

darkr5y ago

With 5.7 TB data, you're probably looking at something like 24 hours for a dump/restore including index rebuilds
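To put that estimate in perspective, the short Python sketch below works out the sustained end-to-end rate that a 24-hour dump/restore of 5.7 TB implies. It assumes decimal units (1 TB = 10^12 bytes), which the thread does not specify, and the function name is only illustrative.

```python
TB = 10**12  # decimal terabyte; an assumption, the thread does not say TB vs TiB


def implied_throughput_mb_s(size_bytes: float, hours: float) -> float:
    """Average rate in MB/s needed to move `size_bytes` in `hours`."""
    return size_bytes / (hours * 3600) / 10**6


rate = implied_throughput_mb_s(5.7 * TB, 24)
print(f"{rate:.0f} MB/s sustained, including index rebuilds")
```

That works out to roughly 66 MB/s averaged over the whole run, so the 24-hour figure is an end-to-end estimate and not a raw disk or network limit.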

haltingproblem5y ago

Love your app, easily the best experience of all dating apps. However, stability and notifications are atrocious. App notifications arrive before the data shows up in the app. Sometimes notifications fail altogether. You guys can dominate this space if you can fix these issues.

sharadov5y ago

Given the limitations that you had (move to a larger instance, downtime restrictions), you took the optimal path. Fantastic work! I am in the process of doing what you did, but across a couple hundred instances (I am using pg_upgrade for most, but will be using an approach similar to yours where we can't afford downtime).

rubiquity5y ago

Using async replication and read replicas in a relational DB is a great way to play reverse Wheel of Fortune and go from ACID to just C. You must get some fun bug reports. A poster below mentions doing actions in the app and their side effects vanishing. At the end of the day it's a business decision, but that would not be fun to program against, though maybe some of it can be handled app/client-side with caching and causal consistency.

edit: For more on the nuances of Postgres tradeoffs for replication and transaction isolation: https://www.postgresql.org/docs/9.1/high-availability.html

latchkey5y ago

I read this and all I can think about is all that private information on some unmaintained database server.

0xbadcafebee5y ago· 15 in thread

Am I the only one who thinks it's bizarre that a structured query language defines so much of how we choose to architect and operate our systems?

Why haven't we created a database yet which works more like the Unix operating system?

sk5t5y ago

One might consider the most unix'y database to be Berkeley DB/Sleepycat, but that is probably not what you wanted. ;)

outworlder5y ago

> Why haven't we created a database yet which works more like the Unix operating system?

Not to be overly snarky, but have you tried? Database design is full of trade-offs.

paulryanrogers5y ago

Why are we still using ASCII or Unicode character interfaces in shells? Because like SQL they work and are moderately well understood.

There are many query languages and having one common one as a base is useful to transfer skills. Think of it as an on ramp to more specific dialects or technologies.

0xbadcafebee5y ago

Actually not really. A Unix operating system can do everything I described with regular-old data, and it's not a distributed operating system. It simply has extensible standard interfaces.

curryst5y ago

You're interweaving several different issues here.

So if you consider other processes that communicate with the database and then show views of that over other protocols, that describes most of the backend apps in the world.

1: https://prestodb.io/

0xbadcafebee5y ago

> This is not a feature of SQL, this is a feature of the database

Yet they always seem tied together eh? Somehow the conventions are stuck together, and that then affects how our systems work.

> Postgres supports both Perl and Python extensions that run in the RDBMS process

> Very few people use them because running in the RDBMS process means that you can break the RDBMS process in really bad ways

Yes, it does sound bad. That's why I'd prefer an indirect method rather than having to wedge access through the RDBMS

> So if you consider other processes that communicate with the database and then show views of that over other protocols, that describes most of the backend apps in the world.

marcinzm5y ago

Life is about tradeoffs. Complexity, latency, cost and so on. Things in general are much harder to implement correctly (see Jepsen tests) than to talk about in broad terms.

valenterry5y ago

Not sure why you are downvoted, you made a lot of very valid points and I agree.

People get very comfortable very quickly, even tech savvy folks. Having to learn another language will scare many away, even though the effort might be the same - it's perceived harder.

lmarcos5y ago· 14 in thread

(I assume data for analytical purposes is not stored in their primary DB, which is fair to assume I believe)

jamesmishra5y ago

5.7 TB is small by database standards. I work at a much smaller company and deal with "proportionally" much more data.

arcticfox5y ago

5.7 TB for an OLTP database is small?! I must be living in a different world. Obviously I know you can go that big, but I thought the number of use-cases would be limited.

systems5y ago

It's huge by database standards. I worked in large multinationals and dealt with some of their largest databases.

diziet5y ago

Imagine there are 10m users. That's 600kb per user.
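The arithmetic behind that figure can be checked with a few lines of Python. The 10 million user count is the hypothetical from the comment above, not a number from the article, and the sketch shows both decimal and binary units because the thread does not say which one the 5.7 TB refers to.

```python
USERS = 10_000_000  # hypothetical user count taken from the comment above


def per_user(total_bytes: float, users: int) -> float:
    """Average number of bytes stored per user."""
    return total_bytes / users


decimal_kb = per_user(5.7 * 10**12, USERS) / 1000  # TB = 10^12 bytes, kB = 1000 bytes
binary_kib = per_user(5.7 * 2**40, USERS) / 1024   # TiB = 2^40 bytes, KiB = 1024 bytes
print(f"{decimal_kb:.0f} kB or {binary_kib:.0f} KiB per user")
```

Either way the result lands close to the 600kb quoted above (about 570 kB in decimal units, about 612 KiB in binary units).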

mbyio5y ago

And you have to account for indexes, temporary tables used for data analysis, etc. And most of it is probably not compressed. So with that perspective it isn't that much data at all.

mrweasel5y ago

That's really a good way of looking at it. I thought it sounded like a lot of data. Well, 600kb is a lot of textual data, but who knows what they have stuffed into the database.

I worked for an e-commerce site with a few million customers, even more orders, and data duplication all over the place, and still we were using perhaps 200GB of database storage.

sharadov5y ago

Not a lot, but there may be options to partition, but again, you can't comment unless you know the design.

hermanradtke5y ago

Agreed. I ran a 1 billion dollar GMV e-commerce company and our primary OLTP database was around 60 GB. Everything else was moved to an OLAP database.

philipwhiuk25y ago

Purely at a guess, people's images are stored in the app as blobs because it's "easier"

kevas5y ago

What was the strategy for moving things over? By age? Monitoring queries and determining what data isn’t being queried? Something else?

tommyzliOP5y ago

Nope.

wejick5y ago

I think it's because they use Django, which naturally leads to a monolithic architecture. We will hardly find a database this big in the microservice world; instead there will be several smaller databases.

Now, whether several smaller DBs are easier to manage than one big one is debatable. However, with a DB that huge, I would prefer having several smaller ones.

outworlder5y ago· 7 in thread

> As I mentioned earlier we run Postgres on i3.8xlarge instances in EC2, which come with about 7.6TB of NVMe storage.

Wait a second. You run your production database on ephemeral storage? Wow.

I see the replication setup and the S3 WAL archiving and whatnot but still... that's brave.

snissn5y ago

Hi! Can you share how you do HA on Postgres? Master/slave with monitoring and manual failover, or is that automatic? If so, is it reliable? What tooling do you use? Thanks!

tommyzliOP5y ago

We are living life on the edge to an extent, but we have 5 hot standbys across AZs and regular backups + WAL archives to S3.

May not be as durable as EBS, but it's enough for me to sleep soundly at night. And with a highly concurrent WAL-G download, it takes like an hour to catch up a new replica from scratch.

throwdbaaway5y ago

Fine, with enough replicas, you can sleep well at night. But how about the 3 years uptime without reboot? Can you really enjoy your morning coffee without thinking about it? :)

bcrosby955y ago

This was pretty common in AWS back in the late 00s. Performance usually sucked too much otherwise.

paulryanrogers5y ago

Even with prioritized IOPS I once had to resort to RAID0 and replicas to get needed performance under budget on EBS. Probably should have just bumped instance size and used local storage.

eropple5y ago

It's funny--we used to run Vertica on ephemeral nodes and actually found a performance improvement going to EBS, but that was pre-NVMe in AWS.

I wonder how big the delta was for CMB between EBS and ephemeral?

cconstantine5y ago

I've done a few postgres upgrades, the first using pg_upgrade, and the last doing effectively what you did (it was even 9.x -> 12).

I did the pg_upgrade style update a long time ago, so most details are fuzzy, but I remember setting up a string of replicas something like:

primary -> [read_replica, backups_replica]

read_replica -> [upgrade_replica]

upgrade_replica -> [read_replica_upgraded, backups_replica_upgraded]

ants_a5y ago

The new cluster can then be pg_upgraded and rsynced all at once.

paulryanrogers5y ago

Good questions. As logical replication matures it may someday be possible to replicate among versions.

sa465y ago

To be fair, our use case was probably close to pathological for pg_upgrade. We had lots of TOAST data and dozens to hundreds of indexes per table.

paulryanrogers5y ago

Distributed systems are hard. Multi master is particularly sticky, especially if the data doesn't have natural boundaries.

Once solved, though, horizontal scaling is nice, if more involved to maintain.

namibj5y ago

CockroachDB is pretty good at encapsulating the complexity of multi-master.

You'll have to accept that transactions can fail due to conflicts, so if they are interactive, you'll have to retry manually.

Edit: I'd like to hear criticism, instead of just seeing disapproval.

tommyzliOP5y ago

basically what paulryanrogers said.

We thought about migrating to Citus, but I don't have a good idea of how to shard our dataset efficiently.

skinkestek5y ago

What kind of data is this?

My best guesses are

- either it is SaaS, in which case it should make sense to shard by customer

- or it is something-to-consumer (social networking?), in which case I guess you'll have to take a step back and see if you can sacrifice one of your current assumptions

... but I feel I'm missing something since what I am saying feels a bit trivial.

mslot5y ago

Sharding a matching engine is indeed pretty hard, and requires redundancy and very deliberate data modelling choices.

That does seem like a fun exercise :).

    CREATE TABLE profiles (
        user_id bigint primary key,
        location geometry,
        profile jsonb
    );
    SELECT create_reference_table('profiles');
    
    CREATE TABLE users (
        user_id bigint primary key references profiles (user_id),
        name text,
        email text
    );
    SELECT create_distributed_table('users', 'user_id');
    
    -- replicate match_score function to all the nodes
    SELECT create_distributed_function('match_score(jsonb,jsonb)');
    
    -- look up profile of user 350, goes to 1 shard
    SELECT * FROM users u, profiles p WHERE u.user_id = p.user_id AND u.user_id = 350;
    
    -- find matches for user #240 within 5km, goes to 1 shard
    SELECT b.user_id, match_score(a.profile, b.profile) AS score
    FROM users u, profiles a, profiles b
    WHERE u.user_id = 240 AND u.user_id = a.user_id 
    AND match_score(a.profile,b.profile) > 0.9 AND st_distance(a.location,b.location) < 5000 
    ORDER BY score DESC LIMIT 10;

Where it gets a bit more interesting is if your dating site is opinionated / does not let you search, since you can then generate matches upfront in batches in parallel.

    CREATE TABLE match_candidates (
        user_id_a bigint references profiles (user_id),
        user_id_b bigint references profiles (user_id),
        score float,
        primary key (user_id_a, user_id_b)
    );
    SELECT create_distributed_table('match_candidates', 'user_id_a', colocate_with :='users');
    
    -- generate match candidates for all users in a distributed, parallel fashion
    -- will generate a match candidate in both directions, assuming score is commutative
    -- no top-level LIMIT: every qualifying pair is stored, not just 10 rows overall
    INSERT INTO match_candidates
    SELECT a.user_id, b.user_id, match_score(a.profile,b.profile) AS score
    FROM users u, profiles a, profiles b
    WHERE u.user_id = a.user_id
    AND match_score(a.profile,b.profile) > 0.9 AND st_distance(a.location,b.location) < 5000;

For interests/matches, it might make sense to have some redundancy in order to achieve reads that go to 1 shard as much as possible.

    CREATE TABLE interests (
        user_id_a bigint references profiles (user_id),
        user_id_b bigint references profiles (user_id),
        initiated_by_a bool,
        mutual bool,
        primary key (user_id_a, user_id_b)
    );
    SELECT create_distributed_table('interests', 'user_id_a', colocate_with :='users');
    
    -- 240 is interested in 350, insert into 2 shards (uses 2PC)
    BEGIN;
    INSERT INTO interests VALUES (240, 350, true, false);
    INSERT INTO interests VALUES (350, 240, false, false);
    END;
    
    -- people interested in #350, goes to 1 shard
    SELECT * FROM interests JOIN profiles ON (user_id_b = user_id) WHERE user_id_a = 350 AND NOT initiated_by_a;
    
    -- it's a match! update 2 shards (uses 2PC)
    BEGIN;
    UPDATE interests SET mutual = true WHERE user_id_a = 240 AND user_id_b = 350;
    UPDATE interests SET mutual = true WHERE user_id_a = 350 AND user_id_b = 240;
    END;
    
    -- people #240 is matched with, goes to 1 shard
    SELECT * FROM interests JOIN profiles ON (user_id_b = user_id) WHERE user_id_a = 240 AND mutual;
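On the application side, the redundant write pattern above could be sketched roughly as follows in Python. The names `InterestRow` and `interest_rows` are hypothetical and only the column names come from the `interests` table; the helper just builds the two mirrored rows that the two INSERT statements would write.

```python
from typing import List, NamedTuple


class InterestRow(NamedTuple):
    user_id_a: int
    user_id_b: int
    initiated_by_a: bool
    mutual: bool


def interest_rows(initiator_id: int, target_id: int) -> List[InterestRow]:
    """Build the two mirrored rows written when one user likes another."""
    if initiator_id == target_id:
        raise ValueError("a user cannot express interest in themselves")
    return [
        # Row keyed by the initiator, stored on the initiator's shard.
        InterestRow(initiator_id, target_id, initiated_by_a=True, mutual=False),
        # Mirror row keyed by the target, so "who is interested in me?"
        # can be answered from the target's shard alone.
        InterestRow(target_id, initiator_id, initiated_by_a=False, mutual=False),
    ]


for row in interest_rows(240, 350):
    print(tuple(row))
```

Each row would then be inserted inside one transaction, as in the SQL above, so that both shards see the interest or neither does.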

For data related to a specific match, you can perhaps use the smallest user ID as the distribution column to avoid the redundancy.

    CREATE TABLE messages (
        user_id_a bigint,
        user_id_b bigint,
        from_a bool,
        message_text text,
        message_time timestamptz default now(),
        message_id bigserial,
        primary key (user_id_a, user_id_b, message_id),
        foreign key (user_id_a, user_id_b) references interests (user_id_a, user_id_b) on delete cascade
    );
    SELECT create_distributed_table('messages', 'user_id_a', colocate_with :='interests');

    -- user 350 sends a message to 240, goes to 1 shard
    INSERT INTO messages VALUES (240, 350, false, 'hi #240!');
    
    -- user 240 sends a message to 350, goes to 1 shard
    INSERT INTO messages VALUES (240, 350, true, 'hi!');
    
    -- user 240 looks at chat with user 350, goes to 1 shard
    SELECT from_a, message_text, message_time
    FROM messages 
    WHERE user_id_a = 240 AND user_id_b = 350
    ORDER BY message_time DESC LIMIT 100;

(Citus engineer who enjoys distributed systems puzzles)
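The "smallest user ID as the distribution column" idea for `messages` can be sketched in Python as a small helper that turns a sender and recipient into the `(user_id_a, user_id_b, from_a)` values used by the table above. The name `message_key` is hypothetical; only the column names come from the schema in the comment.

```python
from typing import NamedTuple


class MessageKey(NamedTuple):
    user_id_a: int
    user_id_b: int
    from_a: bool


def message_key(sender_id: int, recipient_id: int) -> MessageKey:
    """Order the pair so the smaller ID is always user_id_a."""
    if sender_id == recipient_id:
        raise ValueError("sender and recipient must differ")
    user_id_a = min(sender_id, recipient_id)
    user_id_b = max(sender_id, recipient_id)
    # from_a records the direction, since the pair itself is now unordered.
    return MessageKey(user_id_a, user_id_b, from_a=(sender_id == user_id_a))


print(message_key(350, 240))  # user 350 writes to 240
print(message_key(240, 350))  # user 240 writes to 350
```

Both directions of a conversation map to the same `(240, 350)` pair, which is why every message in a chat lands on one shard without storing it twice.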

ramraj075y ago

RDS can be.. expensive? Like by a lot?

aeyes5y ago

If you want to do upgrades like this on RDS with minimal downtime you will end up doing the same process: Set up new servers, do logical replication, switch over.

cwyers5y ago

Only if they evaluated RDS and found it wanting. They don't even mention testing it.

tommyzliOP5y ago

It's not in the post, but I answered this in a separate thread. RDS doesn't let us provision as many IOPS as we need.

Apparently Aurora behaves differently, but I wasn't aware of that when we specced out the project.

tyingq5y ago

I suspect the RDS limitations are left there to push you to Aurora. They control that and would be better equipped to make the most of their infrastructure and margins with it.

temp6675y ago· 5 in thread

Very nice.

Going to try RDS Proxy next.

tommyzliOP5y ago

Thanks! We stuck with plain EC2. RDS has a limit of 80,000 provisioned IOPS and our read replicas on Postgres 9.6 would regularly hit near double that during peak

orf5y ago

That limit doesn’t apply to Aurora - did you consider that?

0xbadcafebee5y ago

So.......... caching?
