Maybe robots.txt should be extended with license specifics:
not for commercial use, etc.
As it stands, I believe all content is covered under fair use, which to me means Common Crawl has a right to scrape everything, and it's up to the user of Common Crawl to sort out the details.
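
To make the robots.txt idea concrete, here's a rough sketch of what I have in mind. To be clear, the `License:` and `Usage:` fields are made up for illustration; they are not part of the robots.txt standard, and as far as I know no crawler reads them today. The helper name `parse_license_fields` is also just my own.

```python
# Hypothetical sketch: "License" and "Usage" are NOT real robots.txt
# directives. They only illustrate what a license extension could look like.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /
License: CC-BY-NC-4.0   # hypothetical field
Usage: non-commercial   # hypothetical field
"""


def parse_license_fields(robots_txt: str) -> dict[str, str]:
    """Collect the hypothetical License/Usage fields from robots.txt text."""
    wanted = {"license", "usage"}
    found: dict[str, str] = {}
    for raw_line in robots_txt.splitlines():
        # Strip trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key = key.strip().lower()
        if key in wanted:
            found[key] = value.strip()
    return found


if __name__ == "__main__":
    print(parse_license_fields(SAMPLE_ROBOTS_TXT))
```

A downstream user of a crawl could then filter pages on these fields, which fits the idea that the crawler collects everything and the user sorts out the details.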