Google Reader API replacement, powered by Riak

(blog.superfeedr.com)

70 points · charlieok · 13y ago · 18 comments

16 comments · 4 top-level

abalone · 13y ago · 5 in thread

Interesting, I would have thought that the unidirectional, read-only nature of the publisher-subscriber relationship would have made this simple for a traditional SQL database with read replicas and a very basic partitioning scheme. You assign workers to monitor feeds for updates, they update the DB, and.. done.

Looks like they may have added some complexity with their feed parser implementation, what they refer to as "supernoders". Looks like they don't lock ownership of feeds during parsing, thus allowing concurrent supernoders to get into race conditions while parsing the same feed.

And so it turns into another NoSQL example of employing conflict resolution to fix things.

I wonder if they could just use a simple locking scheme to prevent more than 1 parser from parsing the same feed at the same time. This sounds simpler than conflict resolution, to me.
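
A minimal sketch of the locking idea described above, assuming all parsers run as threads in one process. `FeedLocks` and `try_parse` are hypothetical names, not anything from Superfeedr; parsers spread across machines would need the lock to live in a shared store instead.

```python
import threading
from collections import defaultdict


class FeedLocks:
    """One lock per feed URL (hypothetical helper, in-process only)."""

    def __init__(self):
        self._guard = threading.Lock()              # protects the lock table
        self._locks = defaultdict(threading.Lock)   # feed URL -> its lock

    def try_parse(self, feed_url, parse):
        """Run parse(feed_url) unless another parser already holds the feed.

        Returns True if this caller parsed the feed, False if it was skipped.
        """
        with self._guard:
            lock = self._locks[feed_url]
        # Non-blocking acquire: a second parser skips instead of queueing up.
        if not lock.acquire(blocking=False):
            return False
        try:
            parse(feed_url)
            return True
        finally:
            lock.release()
```

A parser that gets `False` back simply moves on to another feed, so no two parsers ever work on the same feed at once.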

bonzoesc · 13y ago

> I wonder if they could just use a simple locking scheme to prevent more than 1 parser from parsing the same feed at the same time. This sounds simpler than conflict resolution, to me.

You may want to check out Aphyr's "Call Me Maybe" series of posts about distributed databases: http://aphyr.com/tags/jepsen

The short version is that convergent conflict resolution seems intimidating but works better than locking and synchronization.
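
To make "convergent conflict resolution" concrete, here is a rough sketch of merging divergent copies (siblings) of a feed. The data layout (entry id mapped to an `(updated, payload)` tuple) and the name `merge_siblings` are assumptions for illustration; this is not Superfeedr's resolver or a Riak client API.

```python
def merge_siblings(siblings):
    """Merge divergent copies of a feed into a single value.

    Each sibling is a dict: entry id -> (updated_timestamp, payload).
    For every entry id the largest (timestamp, payload) tuple wins, so the
    result does not depend on the order in which siblings are merged.
    """
    merged = {}
    for sibling in siblings:
        for entry_id, version in sibling.items():
            current = merged.get(entry_id)
            # Ties on timestamp are broken by comparing payloads, which keeps
            # the choice deterministic on every node.
            if current is None or version > current:
                merged[entry_id] = version
    return merged
```

Because the merge takes a per-entry maximum, it is commutative, associative, and idempotent: any node that has seen the same siblings ends up with the same feed.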

abalone · 13y ago

Ok, I did. I actually read the whole series, including the postgres 2-phase commit I/O error case.

I don't see where it draws that conclusion at all.

The postgres post shows that even ACID databases have network error cases which can leave your client in an indeterminate state. Fair enough. However the solution for this is... to restart your client once the network's back up. All it needs to do is requery the DB to determine the truth.

Compare that to writing conflict resolution logic for all your data because there is no single source of truth. This is considerably more complicated.

The series actually ends up recommending "the right design for the right problem space." I am not making a general SQL vs. NoSQL argument here, but I think in this case they may have taken a more complicated approach than necessary.

harryh · 13y ago

That's a very strange interpretation of Aphyr's posts.

julien · 13y ago

Locking is a no-go, as we get 2 things: high-frequency update feeds (up to a couple dozen entries per second), which we need to update constantly. Locking them for even a fraction of a second could turn into a nightmare and a huge backlog if, for any reason, a single entry takes a little more time to be written.

abalone · 13y ago

For RSS feeds? Or are you talking about something else? RSS is not a streaming protocol. The .rss/.atom file would explode under the conditions you describe. Sounds like you are designing for a more complicated case?

bonzoesc · 13y ago · 3 in thread

For the listing/deleting problems, have you looked at using LevelDB and secondary indexes (2i) to make range queries cheaper?

Disclosure: I work at Basho, makers of the Riak database.

Ixiaus · 13y ago

We use Riak in production and LevelDB is a must IMHO (except for some edge cases); bitcask is nice, but 2i is nicer.

jethroalias97 · 13y ago

I have been using Riak's secondary indexes for my latest project and have generally found them a joy to use. However, I do have to question a bit the way they are architected.

Assuming you are using Riak's default configuration, each range query hits 1/3 of the cluster, which could get pretty hairy on large clusters that have lots of requests. Also, there is no pagination, so if an index has a million objects you'll have to be prepared to wait even if you only want the first part of the query.

You could solve this by putting a sort value in the key and using a range query, but this wouldn't work if you want the most recent items keyed with time, because the items could be unevenly spaced back in time. Also, Riak, like many databases based on Dynamo, thrives on fat data which one would think would favor lists. LevelDB is also supposedly slower than Bitcask, the default backend, but I'm not sure if this is still true.
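
A small sketch of the "sort value in the key" idea, using a sorted in-memory list in place of a real index. The key layout and the names `make_key` and `range_query` are hypothetical; zero-padding the timestamp makes lexicographic key order match numeric time order.

```python
import bisect


def make_key(feed_id, ts):
    """Build a key whose string order matches timestamp order."""
    return f"{feed_id}:{ts:012d}"


def range_query(sorted_keys, feed_id, start_ts, end_ts):
    """Return the keys for feed_id with start_ts <= timestamp <= end_ts."""
    lo = bisect.bisect_left(sorted_keys, make_key(feed_id, start_ts))
    hi = bisect.bisect_right(sorted_keys, make_key(feed_id, end_ts))
    return sorted_keys[lo:hi]
```

The caller has to choose `start_ts` and `end_ts` up front. A window that holds ten items for a busy feed may hold none for a quiet one, which is the uneven-spacing problem with "most recent N" queries.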

I've been trying to think of ways around these problems. A simple thought I had was to cache the response as pages in Riak. This introduces new problems, though, such as how often to reset the cache: too often and I may as well not have the cache, too infrequently and users get stale data. I would also have to handle this using worker threads, because I wouldn't want the odd 100th user to get a big latency hit. The database would also either have to be continually polled, wasting CPU, or potentially not have the data cached when needed.

Another solution I've been considering is to write a secondary index layer on top of Riak using a skiplist or btree to know where to add and remove data when it gets to be very large. This seems like a cool idea, but might be tricky to implement and do conflict resolutions on.

My last idea was the most ambitious, which was to implement a separate distributed database specifically for secondary indexes and range queries which would not be bound by Dynamo. The idea here is to have each node in charge of a segment of the key space (like Big Table) and then have it split and coalesce not only based on size, but also on frequency of reads and writes to handle the bottleneck problem.

I initially was going to have this paired with the Dynamo database (https://github.com/dbunker/Dynago) I was experimenting with using Go and LevelDB, but there is no reason it couldn't work with any Key-Value eventually-consistent hyper-reliable database to provide light-weight secondary indexes. Having it constantly check the core key-value database would mean it wouldn't have to be super reliable in its own right and so could be kept relatively simple.

But again, the simplest solution may ultimately be the way to go. I'm not sure; all of these seem to have pretty big trade-offs.

bonzoesc · 13y ago

> Assuming you are using Riak's default configuration, each range query hits 1/3 of the cluster, which could get pretty hairy on large clusters that have lots of requests. Also, there is no pagination, so if an index has a million objects you'll have to be prepared to wait even if you only want the first part of the query.
> You could solve this by putting a sort value in the key and using a range query, but this wouldn't work if you want the most recent items keyed with time, because the items could be unevenly spaced back in time.

Pagination is coming soon; it's in riak_kv master already, but in buyer-beware #yolo territory.

> LevelDB is also supposedly slower than Bitcask, the default backend, but I'm not sure if this is still true.

Bitcask is faster when all the keys fit in memory: it's designed to load any value with a single disk seek. LevelDB can't make that guarantee, but neither can Bitcask with too many keys for available memory.
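
A toy illustration of why that works, assuming a much-simplified Bitcask-style layout: values are appended to a log file, and an in-memory table maps each key to the offset of its latest value. `TinyBitcask` is a hypothetical name, and the real on-disk format carries more than this.

```python
import os


class TinyBitcask:
    """Append-only value log plus an in-memory key directory (simplified)."""

    def __init__(self, path):
        self._file = open(path, "a+b")   # writes always append to the end
        self._keydir = {}                # key -> (offset, length) of latest value

    def put(self, key, value):
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(value)
        self._file.flush()
        # Old values stay in the log; only the in-memory pointer moves.
        self._keydir[key] = (offset, len(value))

    def get(self, key):
        # One in-memory lookup, then one seek and one read on disk.
        offset, length = self._keydir[key]
        self._file.seek(offset)
        return self._file.read(length)

    def close(self):
        self._file.close()
```

The single-seek read only holds while `_keydir` fits in RAM, which is the limit mentioned above.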

Caching is one of the two hard problems in software engineering (along with "naming things" and "off-by-one errors"), so good luck :) If you're not opposed to running a separate service, Memcache is what I'd use.

ivank · 13y ago · 2 in thread

There is another ongoing backup of Google Reader's feed cache: http://www.archiveteam.org/index.php?title=Google_Reader and the data is landing at Internet Archive.

(If anyone has a dedicated server with a high transfer cap, we could really use it for temporary storage and uploading to IA. Email in profile.)

For anyone else doing an independent backup, you can get more than 1000 items by using ?r=n&n=1000 and following the continuation in the JSON response with a ?c= URL parameter. And keep in mind that Google doesn't canonicalize feed URLs for the same content, so you have to grab all of them.
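
A sketch of that continuation loop. The `r`, `n`, and `c` parameters come from the comment above; the JSON field names `items` and `continuation`, and the `fetch_json` callable that the caller supplies, are assumptions for illustration.

```python
from urllib.parse import urlencode


def fetch_all_items(base_url, fetch_json, page_size=1000):
    """Collect every item of a feed by following continuation tokens.

    fetch_json is a caller-supplied function (hypothetical) that takes a URL
    and returns the decoded JSON response as a dict.
    """
    items = []
    continuation = None
    while True:
        params = {"r": "n", "n": page_size}
        if continuation:
            params["c"] = continuation   # resume where the last page ended
        page = fetch_json(f"{base_url}?{urlencode(params)}")
        items.extend(page.get("items", []))
        continuation = page.get("continuation")   # assumed field name
        if not continuation:
            return items
```

Passing the fetcher in keeps the loop independent of any particular HTTP library and makes it easy to try against canned responses.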

aviv · 13y ago

Is the list of feed URLs you have collected thus far (without the cached feed content) publicly accessible?

ivank · 13y ago

Not yet, I don't have a good way to provide query ability on my postgres db. I also haven't yet imported a lot of files I have lying around.

You can email me for an rsync source that contains the work items we've generated. Right now this is about 68.2M URLs, mostly on the big blog platforms. This list should grow considerably.

JeffJenkins · 13y ago · 2 in thread

SuperFeedr is awesome and Julien is great to work with as a user of the service. I wish they had this feature when I was working on my multi-medium client (now defunct) a year and a half ago.

The only downside, and why I stopped using it, is that the pricing model is per-item, so if you have frequently updating feeds it can get very expensive. Although I never tried to use it, the pricing page does say they'll meet whatever it costs you to run your own feed system, since their cost should be lower than yours.

julien · 13y ago

The second part will stay, while we hope to change the 1st part (pricing per item) really soon into another scheme!

JeffJenkins · 13y ago

Awesome. If I needed access to feeds again I'd definitely use superfeedr.
