It seems like if these large public datasets continue to come online, there will need to be some sort of semi-cooperative distributed data store to make them truly "accessible." Or the data provider will need to offer an access/query API, rather than just a big tank that you can copy if you dare.