Do you think the results would have been similar if you were to use no-compression-ZFS instead of ext3 on a proper database hardware?
Basically I'm trying to figure out whether the low performance of the uncompressed dataset is specific to AWS/ext3. Thanks.
Using a 2-disk striped volume for PostgreSQL 9.2, I get an average of 2.5x compression (as reported by ZFS), and a 1.5 to 2x reduction in database restore times (single-threaded or 8 jobs in parallel).
Given this development box has relatively slow 7200 RPM disks, the tradeoff of more CPU time for less disk transfer makes sense.
Edit: My use case is an OLAP server. I can't state how the tradeoffs affect OLTP performance.
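For anyone wanting to try the same thing, the knobs involved are just standard ZFS dataset properties (the dataset name below is a placeholder, not one from my setup):

```shell
# Enable transparent compression on the dataset holding the Postgres data
# directory ("tank/pgdata" is an illustrative name).
zfs set compression=lzjb tank/pgdata

# After a restore, ask ZFS what it actually achieved; the 2.5x figure I
# quoted above is exactly what this property reports.
zfs get compressratio tank/pgdata
```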
That aside, I thought this was a wonderful article with non-intuitive findings. Very interesting, CirtusDB [edit: er, CitusDB]. :-)
As far as AWS goes, we have noticed ephemeral disks connected to the same instance can exhibit fairly large performance differences, and attempted to control for that in our tests by reusing the same disk for each test run.
You have no idea how busy the real server is (noisy neighbors, etc.), so it's impossible to have comparable results from benchmark to benchmark.
Simply rattling off a set of features and stating that Btrfs is "better" is dubious at best, and perhaps misleading. As the OP stated in his blog post, ZFS has a rich feature set -- which we find invaluable in our own Postgres stack -- features such as incremental snapshots, a real copy-on-write filesystem, etc.
I found 2 recent Phoronix benchmarks which compare Btrfs with Ext4 and Ext4 with ZFS respectively. You can't really combine them since the hardware used is different, but if you use Ext4 as a rough translation key, it seems ZFS on Linux (which is what the OP used) is slower than both Ext4 and Btrfs. Transparent compression speed would depend on the CPU and is comparable.
April 18, 2013 Ext4 vs ZFS http://www.phoronix.com/scan.php?page=news_item&px=MTM1N...
February 18, 2013 Btrfs (and others) vs Ext4 http://www.phoronix.com/scan.php?page=article&item=linux...
Unreliable mashup which gives some indication:
* fs-walk, 1000 files, 1 MB: ZFS 46.20, Ext4 72.50 vs 78.67, Btrfs 66.37
* fs-walk, 5000 files, 1 MB, 4 threads: ZFS 25.63 files/s, Ext4 79.73 vs 99.60, Btrfs 94.63
* fs-mark, 4000 files, 32 subdirs, 1 MB: ZFS 7.78, Ext4 74.07 vs 78.80, Btrfs 65.17
* dbench, 1 client: ZFS 27.29 MB/s, Ext4 167.29 MB/s vs 195.24, Btrfs 165.37
I'm also interested in a Btrfs benchmark vs ZFS on Illumos; that way you could determine which is the best or fastest system for this specific scenario (even though the OP used Linux).
Incremental snapshots are a nice feature for a PostgreSQL stack. What is the significant or, as you put it, 'real' difference between the CoW and snapshot functionality of Btrfs compared to ZFS? Are there things you cannot do with Btrfs in a PostgreSQL stack compared to ZFS?
ZFS is a fantastic file system, but I can't help wondering if part of the issue is the fact that the benchmarking was conducted on a virtual machine. ZFS is better suited for raw disks than virtual devices (again, just my anecdotal evidence; I've never run benchmarks myself).
...and so is Hans. :-) (Sorry, I couldn't resist :-))
No FUD intended, but I don't consider ZFS on Linux production ready. Wanting to use ZFS, I recently started regularly reading their GitHub issues.
There are deadlocks and un-importable pools in certain situations (hard-links being one: think rsync). I would not want production boxes in the same predicaments experienced by several bug reporters. Moreover, applying debug and hot-fix (hopefully) kernel patches and the associated downtime in production is a no-go for me.
Mind you, the project leads are very responsive and it's making great strides.
In addition, I believe the Linux implementation currently lacks the L2ARC (which can make ZFS really fly, caching to SSDs).
However, I would absolutely run ZFS on Illumos or Solaris, for the stability and the compression benefits mentioned in the article.
Hard to say if it's better than some sort of Linux-with-ZFS Frankenstein system. I'd love for Oracle to make ZFS more Linux-friendly though; it seems like a win for everybody, and there are tons of users who would love for it to happen.
I don't know if I'd call Reiserfs "maintained" and I couldn't recommend it to anyone. If it is maintained seriously, my recommendation would be to rename it.
/shiver/
It shouldn't surprise most people that enabling transparent compression gives these benefits. Why, you ask? Well, what is the largest bottleneck in a system? Disk I/O, by far. So all ZFS is doing is transferring workload from the subsystem you have the least of (disk I/O/latency) to one you likely have plenty of (CPU).
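Back-of-the-envelope, with made-up but plausible numbers (not figures from the article): once you're disk-bound, a sequential scan shrinks roughly in proportion to the compression ratio.

```shell
# Illustrative numbers only: a 32 GB table, 100 MB/s sequential disk
# throughput, and 2.5x compression. The compressed scan reads 2.5x
# fewer bytes off disk, trading that for CPU time spent decompressing.
awk 'BEGIN {
  size_mb = 32 * 1024; rate = 100; ratio = 2.5
  printf "uncompressed scan: %d s\n", size_mb / rate
  printf "compressed scan:   %d s\n", size_mb / (rate * ratio)
}'
```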
But it's downright misleading to start the vertical axis at anything other than 0.0 when comparing ratios; they start it at 0.2. In reality, LZJB is saving 50% of the space whereas gzip saves 70%, but a naive glance at the graph makes gzip look roughly 3 times smaller/better than LZJB.
Classic "How to Lie with Statistics" stuff.* I would have expected better from an "analytics" database.
* Not saying they intend to lie here but it's representative of the classic text https://en.wikipedia.org/wiki/How_to_Lie_with_Statistics
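To put numbers on it (the sizes here are illustrative, chosen only to match the 50%/70% savings above):

```shell
# Illustrative sizes consistent with the comment: 32 GB raw,
# ~16 GB after LZJB (50% saved), ~9.6 GB after gzip (70% saved).
awk 'BEGIN {
  raw = 32; lzjb = 16; gz = 9.6
  printf "lzjb saves %.0f%%\n", (1 - lzjb/raw) * 100
  printf "gzip saves %.0f%%\n", (1 - gz/raw)  * 100
  # The honest size ratio between the two compressed sets:
  printf "gzip/lzjb size ratio: %.2f\n", gz/lzjb
}'
```

The true gzip-to-LZJB ratio is 0.60, nowhere near the ~3x the truncated axis suggests.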
I agree that it's kind of counterintuitive.
The licensing problems only apply to distributing CDDL and GPL code that have been compiled into the same binary, not running a CDDL-licensed module in a GPL kernel - I think. My experience with ZFS (which is awesome, btw) comes from FreeBSD.
In practice, at least part of your working set gets served from memory, and compression doesn't help with the pages that are already in memory.
I agree that it sounds too good to be true though.
Consider the analogous (if simplified) case of logfile parsing, from my production syslog environment, with full query logging enabled:
# ls -lrt
...
-rw------- 1 root root 828096521 Apr 22 04:07 postgresql-query.log-20130421.gz
-rw------- 1 root root 8817070769 Apr 22 04:09 postgresql-query.log-20130422
# time zgrep -c duration postgresql-query.log-20130421.gz
19130676
real 0m43.818s
user 0m44.060s
sys 0m6.874s
# time grep -c duration postgresql-query.log-20130422
18634420
real 4m7.008s
user 0m9.826s
sys 0m3.843s
EDIT: I'm not sure why time(1) is reporting more "user" time than "real" time in the compressed case.

It actually ran faster double-spaced (Stacker) and had nearly 12 MB of available space... didn't have any problems with programs loading, surprisingly enough, which became more of an issue when moving onto a 486.
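For what it's worth, zgrep is essentially a two-process pipeline, which would explain the accounting: the decompressor and grep run concurrently, and time(1) sums CPU time across both. A minimal reproduction (the temp file path is made up):

```shell
# zgrep FILE is roughly: gzip -dc FILE | grep PATTERN
# Both processes run at once, so their "user" CPU times add up and can
# exceed wall-clock "real" time on a multi-core machine.
printf 'duration: 1 ms\nok\nduration: 2 ms\n' | gzip > /tmp/demo.log.gz
gzip -dc /tmp/demo.log.gz | grep -c duration   # counts the 2 matching lines
```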
Yeah, when your storage is that slow relative to how fast the CPU can run compression, you can get impressive gains in space and performance.
However, we're not trying to show how bad ext3 is; the point is that the end result is still stellar performance, compression or not.
Each of the seven queries we used in our benchmark required a sequential scan of the 32GB dataset. It's unlikely that the ARC had any impact on the results since the EC2 instance had only 7GiB of memory.