Future of the Web: Group items tagged Common Crawl

Paul Merrell

Common Crawl Founder Gil Elbaz Speaks About New Relationship With Amazon, Semantic Web ...

  • The Common Crawl Foundation’s repository of openly and freely accessible web crawl data is about to go live as a Public Data Set on Amazon Web Services.
  • Elbaz’s goal in developing the repository: “You can’t access, let alone download, the Google or the Bing crawl data. So certainly we’re differentiated in being very open and transparent about what we’re crawling and actually making it available to developers,” he says. “You might ask why is it going to be revolutionary to allow many more engineers and researchers and developers and students access to this data, whereas historically you have to work for one of the big search engines…. The question is, the world has the largest-ever corpus of knowledge out there on the web, and is there more that one can do with it than Google and Microsoft and a handful of other search engines are already doing? And the answer is unquestionably yes.”
  • Common Crawl’s data is already stored on Amazon’s S3 service, but now Amazon will provide the storage space for free through the Public Data Set program. That not only relieves Common Crawl of the storage burden and costs of hosting its crawl of 5 billion web pages (some 50 or 60 terabytes), but should also make it easier for users to access the data and remove the bandwidth-related costs they might incur for downloads. Users won’t have to set up accounts, pay the bandwidth bills they incur, or deal with more complex authentication processes.