mixedbit · 13y ago

Which off-the-shelf RDBMS can handle queries over 3 billion rows?

13 comments · 5 top-level

meritt · 13y ago · 4 in thread

Counter-question: Which startup has an actual data table with over 3 billion rows?

We have just crossed 2 billion items in our datastore. While not 3 billion yet, I expect that to happen later this year.

Too bad Redshift can't handle JSON files: Converting everything will be annoying.

fujibee · 13y ago

Our idea is to convert the JSON continuously as it is loaded into Redshift. http://www.hapyrus.com/pages/flydata-for-redshift

nkohari · 13y ago

We do. http://adzerk.com/

badgar · 13y ago

When you log every mousedown because the founder misunderstands A/B testing, 3 billion rows is easy to come by. Besides - you're busy changing the world, so you should expect to use the same technology as Facebook and Google.

jacques_chester · 13y ago · 2 in thread

In 2007 I worked for a firm with a 4 billion row join table in PostgreSQL. Might've been 7 or 8 billion, I don't recall which. It ran on a quad-core server with 16 GB of RAM. Joins going through this table took about 2-3 seconds to complete.

mixedbit (OP) · 13y ago

But I suspect the join must have been over an indexed column, so it did not touch all 4 billion rows; otherwise 2-3 seconds would be hard to believe. The group-by query in the article must access all 3 billion rows, which makes a huge difference.
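
To make the distinction concrete, here is a minimal sketch using SQLite from the Python standard library as a stand-in (not the PostgreSQL setup described above). The `events` table, its columns, and the index name are hypothetical, and the row count is tiny. It only shows that an equality lookup on an indexed column is planned as an index search, while a group-by over the whole table is planned as a scan of every row.

```python
import sqlite3

# Hypothetical toy table; names and sizes are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i % 100, i) for i in range(10_000)],
)
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")


def plan(sql):
    """Return SQLite's query plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)  # column 3 holds the plan detail


lookup_sql = "SELECT SUM(amount) FROM events WHERE user_id = 42"
groupby_sql = "SELECT user_id, SUM(amount) FROM events GROUP BY user_id"

lookup_plan = plan(lookup_sql)    # index search: touches only matching rows
groupby_plan = plan(groupby_sql)  # scan: has to visit every row

lookup_total = conn.execute(lookup_sql).fetchone()[0]
group_totals = dict(conn.execute(groupby_sql).fetchall())

print(lookup_plan)
print(groupby_plan)
```

The exact plan wording differs between SQLite versions, and other databases report plans in their own formats, so treat the printed text as illustrative only.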

jacques_chester · 13y ago

All the columns were indexed.

I remember it well, because I was trying to explain why having tens of gigabytes of indexes wouldn't help them much if they only had 16 GB of RAM.

In terms of group-by performance, it depends a lot on the kind of data and how it's stored. For example, taking a sum on a columnar store is quite amenable to parallel solutions, and a lot of databases will do it that way.
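
As a rough sketch of why a columnar sum parallelizes well: the column is one contiguous run of values, so it can be cut into chunks, each chunk summed independently, and the partial sums merged at the end. The function names and the worker count below are made up for illustration, and this is not how any particular database implements it. Threads are used only to show the split/merge structure; in CPython they will not give a real speedup for this workload.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(column, n_chunks):
    """Cut a column (a list of values) into at most n_chunks contiguous slices."""
    size = max(1, -(-len(column) // n_chunks))  # ceiling division, at least 1
    return [column[i:i + size] for i in range(0, len(column), size)]


def parallel_sum(column, n_workers=4):
    """Sum each chunk independently, then merge the partial sums."""
    chunks = split_into_chunks(column, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)


# A hypothetical "amount" column stored on its own, as a columnar store would.
amounts = list(range(1_000_000))
total = parallel_sum(amounts)
print(total)
```

The same split/merge shape extends to a group-by: each worker builds per-group partial sums for its chunk, and the merge step adds the partials together group by group.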

taligent · 13y ago · 2 in thread

SAP HANA would be one, but it is basically in-memory, so very, very expensive.

mbesto · 13y ago

And it's not an RDBMS. It's basically the same technology as Redshift, but not cloud-based (yet).

res0nat0r · 13y ago

Actually HANA One is available in the AWS Marketplace: https://aws.amazon.com/marketplace/pp/B009KA3CRY/ref=mkt_ste...

EwanToo · 13y ago

I can't think of an off-the-shelf RDBMS which can't handle queries on 3 billion rows.

SQL Server can

Oracle can

Postgres can

Even MySQL can (!)

The limitations are almost always in the hardware, not the software.

If you're looking at column-based systems, you can look at Greenplum (does both row-based and column-based storage), InfiniDB (MySQL-based), and all sorts of expensive but very fast appliance options like Netezza, Teradata, etc.

jbverschoor · 13y ago

postgres?
