Hey ahadrana, I haven't found anything about page ranks on the website. Are they included? Do you know whether it is possible to go through only the metadata of the crawl, say to get the page ranks for a list of pages, or do you have to go through the full crawl?
The PageRank and other metadata we compute are not part of the S3 corpus, but we do collect this information and will probably make it available in a separate S3 bucket in Hadoop SequenceFile format. Be aware that our PageRank will probably not correlate strongly with Google's PageRank number, since their calculation is going to be a lot more sophisticated than our version.
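For anyone wondering what a basic PageRank calculation looks like, here is a minimal sketch of the textbook power-iteration version in Python. This is an illustration only, not our actual implementation: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the toy link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank (illustrative sketch only).

    links: dict mapping a page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    # Hypothetical three-page graph, for demonstration only.
    toy_links = {"a": ["b"], "b": ["a"], "c": ["a"]}
    for page, rank in sorted(pagerank(toy_links).items()):
        print(page, round(rank, 4))
```

A version like this looks only at the link graph, which is one reason numbers from a simple calculation may differ noticeably from what a production search engine reports.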