Home/ Open Intelligence / Web 3X (Social + Mobile)/ Group items tagged metadata

Dan R.D.

Thoughts on Google Plus: The Magic Isn't Social, It's Semantic [28Jul11]

  • Sparks are a very simple taxonomy right now, but do have persistent URIs, which you can find by hovering over a Spark and looking in the left of the status bar at the bottom of your screen.
  • This warrants more dissecting and attention. Will they eventually use all or some of the hierarchy of Google Directory? Will they become hierarchical? Will the algorithm improve as we click on links that interest us? Can we add our own information? Are we creating new entities for Google as we search for and add Sparks to our items of interest – it seems that way. It’s not an ontology yet, but it’s a start. Lots of people creating persistent URIs for entities they’ve dreamed up – I hear that evil cackle again!
  • Google, by nature of its founding, is in a prime position to address the challenges that many enterprise technologists have when thinking about semantic data – how do we handle unstructured data? We have metadata: in schema, in taxonomies, in ontologies even. We have loads of content. With no metadata. How do we get them together? We can’t afford to hire a small army of indexers to apply the metadata to the content. The system metadata is insufficient and poor. We have a pretty good search tool, and have put some effort into data dictionaries, entity extraction and rules-based classification. We have tools that do latent semantic indexing and latent semantic analysis. Make sense of unstructured information? Sure, Google can do that. Hopefully they will not reduce efforts in these areas too much to focus on other projects. Many of us can execute a search and return nothing useful; crowdsourcing tagging in G+ may revitalize components of the search algorithm.
Marc-Alexandre Gagnon

New 5 Billion Page Web Index with Page Rank Now Available for Free from Common Crawl Fo...

  • A freely accessible index of 5 billion web pages, their page rank, their link graphs and other metadata, hosted on Amazon EC2, was announced today by the Common Crawl Foundation. "It is crucial [in] our information-based society that Web crawl data be open and accessible to anyone who desires to utilize it," writes Foundation director Lisa Green on the organization's blog.
  • The Foundation explains the scope of the project as follows: "Common Crawl is a Web Scale crawl, and as such, each version of our crawl contains billions of documents from the various sites that we are successfully able to crawl. This dataset can be tens of terabytes in size, making transfer of the crawl to interested third parties costly and impractical. In addition to this, performing data processing operations on a dataset this large requires parallel processing techniques, and a potentially large computer cluster. Luckily for us, Amazon's EC2/S3 cloud computing infrastructure provides us with both a theoretically unlimited storage capacity coupled with localized access to an elastic compute cloud."
  • The Foundation is an organization dedicated to leveraging the falling costs of crawling and storage for the benefit of "individuals, academic groups, small start-ups, big companies, governments and nonprofits." It's led by Gilad Elbaz, the forefather of Google AdSense and the CEO of data platform startup Factual. Joining Elbaz on the Foundation board are internet public domain champion Carl Malamud and semantic web serial entrepreneur Nova Spivack. Director Lisa Green came to the Foundation by way of Creative Commons.
  • The organization was formed three years ago, has only now started talking about itself publicly, and believes that free access to all this information could lead to "a new wave of innovation, education and research."
  • Open Web Advocate James Walker agrees: "An openly accessible archive of the web - that's not owned and controlled by Google - levels the playing field pretty significantly for research and innovation."
Dan R.D.

Seeker Friendly - the Future of Search [29Apr10]

  • We need ambient findability. We need smart ways of guiding people towards the content they’d like to see, with categorization and search playing complementary roles. Getting people to the content they want to see, using the search functionality your average newspaper website has on offer, is not exactly what I’d describe as fast or effortless. Full-text search can be a daunting experience. We need some sort of a sitemap that acts as a gateway to our content and is broader than our primary navigation. We need deep links to the topics that are currently on people’s minds and that are being talked about. How neat would it be if we could also browse by mood or by genre? We need quick links to topic pages about related persons, organizations, events and locations. We need links to terms on Wikipedia (e.g. using Apture) or the ability to look things up in a dictionary (like the one they have over at the New York Times). Related content should be referred to either using tags or, if you’re really hip, using relationships. Search behavior doesn’t always revolve around a big input box and a submit button. Faceted search needs facets: ways of splitting up search results into meaningful categories. Rich metadata and a well thought-out categorization scheme are a prerequisite. Online search should work similarly to asking a question to a flesh-and-blood reporter.
  • Can't find what you're looking for? Here is how web developers could make your search a lot less difficult.
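
The point above about faceted search, that it depends on metadata to split results into meaningful categories, can be illustrated with a minimal sketch. The result records, the field names ("section", "type") and both helper functions are hypothetical and invented for illustration; they are not taken from the bookmarked article or from any particular search product.

```python
from collections import Counter

# Hypothetical search results. Each record carries metadata fields
# ("section", "type") that can serve as facets; the data is made up.
results = [
    {"title": "Election night recap", "section": "Politics", "type": "article"},
    {"title": "Budget vote explained", "section": "Politics", "type": "video"},
    {"title": "Cup final report", "section": "Sports", "type": "article"},
]


def facet_counts(items, field):
    """Count how many results fall under each value of one metadata field."""
    return Counter(item[field] for item in items if field in item)


def filter_by(items, field, value):
    """Keep only the results whose metadata field equals the chosen facet value."""
    return [item for item in items if item.get(field) == value]


# Facet summary a search page could show next to the result list.
section_facets = facet_counts(results, "section")

# Narrow the results after the reader picks the "Politics" facet.
politics_only = filter_by(results, "section", "Politics")
```

Without the "section" and "type" fields there would be nothing to count or filter on, which is one way to read the claim that rich metadata is a prerequisite for facets.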