Cool idea. Have you considered applying it to a HN data set?
Since several of these sites are image hosts, I'm also curious to see how this will change as Reddit rolls out its own photo/video hosting, which I believe is still in beta on a limited set of subreddits.