Group items tagged


Common Crawl Founder Gil Elbaz Speaks About New Relationship With Amazon, Semantic Web ...

semanticweb.com/ping-big-data-expertise_b26109

search web crawling Common Crawl

shared by Paul Merrell on 28 Jan 12

The Common Crawl Foundation’s repository of openly and freely accessible web crawl data is about to go live as a Public Data Set on Amazon Web Services.
...

Elbaz’s goal in developing the repository: “You can’t access, let alone download, the Google or the Bing crawl data. So certainly we’re differentiated in being very open and transparent about what we’re crawling and actually making it available to developers,” he says. “You might ask why is it going to be revolutionary to allow many more engineers and researchers and developers and students access to this data, whereas historically you have to work for one of the big search engines…. The question is, the world has the largest-ever corpus of knowledge out there on the web, and is there more that one can do with it than Google and Microsoft and a handful of other search engines are already doing? And the answer is unquestionably yes.”
...

Common Crawl’s data already is stored on Amazon’s S3 service, but now Amazon will be providing the storage space for free through the Public Data Set program. Not only does that remove from Common Crawl the storage burden and costs for hosting its crawl of 5 billion web pages – some 50 or 60 terabytes large – but it should make it easier for users to access the data, and remove the bandwidth-related costs they might incur for downloads. Users won’t have to deal with setting up accounts, being responsible for bandwidth bills incurred, and more complex authentication processes.
...

