Details on the Oracle failure at JPMorgan Chase

(dbms2.com)

66 points · babar · 15y ago · 24 comments

babar (OP) · 15y ago

The article is short on details, and the speculation about transferring the user info to a NoSQL solution sounds naive, but it's interesting to get a peek into the internal operations of a large enterprise. I think I want to go validate some of our database backups right now.

etm117 · 15y ago

Very interesting, I wonder how many people read about that issue and then went and checked their backups. Heck, I actually verified the backups on my home network after reading about this yesterday.

georgefclay · 15y ago

I agree. Also his comment about "over engineering" making the system "more brittle" was odd.

For data that important, I would have mirrored the databases to "warm standby" servers. They could have been back up in minutes with no data loss. Sure, it would have doubled the cost, but how much money did they lose during the outage?

jasonwatkinspdx · 15y ago

You completely failed to read the article.

Otherwise you'd know that they had a fault that propagated to the hot spare. It's also utterly daft to think that a financial enterprise as large as JPM/Chase wouldn't already be running an HA setup. In this case it appears to be Oracle RAC.

I'm astounded how often I have to remind people that replication and backups are very different things, and that you need both.

I'm also depressed by how many utterly thoughtless comments are made here on Hacker News lately.

etm117 · 15y ago

The way I read it, the problem was corruption inside the database, and the warm backup was corrupted by the automatic mirroring before they noticed the problem. So by the time the issue was identified, both the PROD and failover instances were busted. To resolve it, it looks like they had to roll back to the last valid full DB backup from Sunday and then apply the log backups iteratively from Sunday onward to catch the DB up before bringing it back online.

At my shop we had a similar issue (but at the SAN level, not the DB level) where the corruption issue was data that exposed a bug in the system. The data was automatically mirrored to the warm standby machine. When PROD crashed, the standby was brought up and immediately crashed also. We had to rebuild from tape backups which was stupid-slow (trademarked term there ;-). All in all it was a horrible mess that was root-caused to a bug in vendor firmware. Eerily similar to the JPMorgan Chase issue in the OP.
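
The restore sequence described above (roll back to the last good full backup, then replay log backups in order) can be sketched roughly as follows. This is a toy illustration only, not how Oracle or JPMC actually performs recovery: the `LogRecord` type, the `restore` function, the account keys, and the idea of stopping replay just before a known-bad record are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    seq: int    # hypothetical monotonically increasing sequence number
    key: str
    value: int


def restore(full_backup: dict, backup_seq: int, logs: list,
            stop_before_seq: Optional[int] = None) -> dict:
    """Rebuild state from a full backup, then replay later log records in order."""
    state = dict(full_backup)  # never mutate the backup itself
    for rec in sorted(logs, key=lambda r: r.seq):
        if rec.seq <= backup_seq:
            continue  # already reflected in the full backup
        if stop_before_seq is not None and rec.seq >= stop_before_seq:
            break     # assumed: stop just before the first corrupting write
        state[rec.key] = rec.value
    return state


# Hypothetical data: the "Sunday" backup covers log records 1 and 2.
sunday_backup = {"acct_a": 100, "acct_b": 50}
logs = [
    LogRecord(1, "acct_a", 100),
    LogRecord(2, "acct_b", 50),
    LogRecord(3, "acct_a", 120),
    LogRecord(4, "acct_b", 30),
    LogRecord(5, "acct_a", -999999),  # the corrupting write
]

recovered = restore(sunday_backup, backup_seq=2, logs=logs, stop_before_seq=5)
replayed_everything = restore(sunday_backup, backup_seq=2, logs=logs)
print(recovered)
print(replayed_everything)
```

Replaying every record reproduces the corruption, which is why the sketch takes a stop point. The log has to be applied strictly in order, so the time to catch up grows with the volume of transactions since the backup.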

rbranson · 15y ago

Perhaps you don't understand how RAC works. A RAC cluster is cache-coherent with a shared disk system, in this case an EMC SAN. It's designed to be both scalable and fault tolerant. The replication would have been handled by the SAN itself, at the block level. There would be two completely independent (edit: DISK) cabinets that would replicate synchronously. Some software assumes synchronous replication and it's cheaper to just spend a ton of money on an expensive replicating SAN and Oracle RAC than it is to rebuild the software, so an async replication scenario is out of the question.
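
As a rough illustration of why synchronous block-level replication protects against hardware loss but not against a bad write, here is a minimal sketch. It is a toy model, not EMC's or Oracle's implementation: the `MirroredVolume` class, the block IDs, and the byte payloads are all hypothetical.

```python
import copy


class MirroredVolume:
    """Toy model of two disk cabinets kept in sync at the block level."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}

    def write(self, block_id: int, data: bytes) -> None:
        # Synchronous replication: the write lands on both copies
        # before it is considered complete, whatever the data contains.
        self.primary[block_id] = data
        self.mirror[block_id] = data


vol = MirroredVolume()
vol.write(0, b"good")

# A backup is a point-in-time copy that later writes cannot touch.
backup = copy.deepcopy(vol.primary)

# A corrupting write is replicated just as faithfully as a good one.
vol.write(0, b"corrupt")

print(vol.primary[0], vol.mirror[0], backup[0])
```

Both cabinets end up holding the corrupt block, while the earlier point-in-time copy still holds the good one.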

unohoo · 15y ago

Having worked at an enterprise software company and with several big clients (including banks), I find it surprising (and to some extent shocking) that JPMC didn't have a more efficient disaster recovery process in place.

I am not saying they didn't have one, just that disaster recovery planning should account for outages like this. Hypothetical fire drills etc. are needed at critical businesses such as banks.

My guess is that a bunch of people at JPMC will most likely be losing their jobs over this.

vl · 15y ago

I read it differently. These things happen: you can get data corruption replicated to the hot spare, i.e. a failure more catastrophic than this setup is able to handle.

They were able to identify the problem, successfully recover from backup, and successfully replay the missing transactions in a reasonable amount of time for a setup this large. In my book it's a success.

VladRussian · 15y ago

And a bunch of people will be getting jobs/contracts. Like anybody who's worked with big enterprises, I'm sure it will be a net positive effect :)

Btw, while everybody's salivating over "Oracle failure", was it really an Oracle failure, i.e. a failure of Oracle itself?

unohoo · 15y ago

Oh yeah, I totally agree. I'm pretty sure that if it's an Oracle issue, IBM is going to put its DB2 sales pitch to JPMC into overdrive.

jrockway · 15y ago

With the same experience you have, I am not shocked at all. People design and implement "processes" to "prevent" production issues from happening, but they are mostly feel-good sounding things on top of "let's cross our fingers and hope nothing bad happens".

This usually works, which is why people think it's an acceptable policy. But real planning involves things like software correctness, proper test procedures, ways of making a test environment that's exactly identical to production, and so on. This is hard (and slows down development... "tests, what a waste of time!"), so people instead say, "let's try really hard to not fuck something up".

Trying hard only gets you so far, as JPM learned.

known · 15y ago

"caused by corruption in an Oracle database."

Could happen to any database. Not just Oracle.

lvecsey · 15y ago

So do those 8 machines and the code on them represent the system before or after the WaMu merger, or something in between? I've heard that many of these banks, such as Citigroup, have something like 13 different databases or systems, many of which duplicate functionality.
