Extracting Structured Data from the Common Web Crawl
More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages. The Web Data Commons project extracts this data from several billion web pages and provides the extracted data for download. Web Data Commons thus enables you to use the data without needing to crawl the Web yourself.
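Web Data Commons distributes the extracted data as N-Quads (one subject, predicate, object, graph tuple per line). As a minimal sketch, the following parses a line whose terms are all IRIs; this toy regex is an illustration only, since real WDC data also contains literals and blank nodes that need a proper RDF library:

```python
import re

# Toy sketch: parse one N-Quads line whose four terms are all IRIs.
# Real Web Data Commons dumps also contain literals and blank nodes;
# use a full RDF library for those.
QUAD_RE = re.compile(r'<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+\.')

def parse_iri_quad(line: str):
    """Return (subject, predicate, object, graph) or None if not all-IRI."""
    m = QUAD_RE.match(line.strip())
    return m.groups() if m else None

line = ('<http://example.org/product/1> '
        '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
        '<http://schema.org/Product> '
        '<http://example.org/page.html> .')
s, p, o, g = parse_iri_quad(line)
print(o)  # http://schema.org/Product
```

The fourth term records which crawled page the triple was extracted from, which is what makes the dumps quads rather than plain triples.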
"Federated Knowledge Extraction Framework
FOX is a framework that integrates the Linked Data Cloud and makes use of the diversity of NLP algorithms to extract RDF triples of high accuracy out of NL. In its current version, it integrates and merges the results of Named Entity Recognition tools. Keyword Extraction and Relation Extraction tools will be merged soon."
Federated knOwledge eXtraction Framework
FOX is a framework that integrates the Linked Data Cloud and makes use of a diverse set of NLP algorithms to extract high-accuracy RDF triples from natural-language text. In its current version, it integrates and merges the results of Named Entity Recognition, Keyword Extraction and Relation Extraction tools.
AlchemyAPI provides content owners and web developers with a rich suite of content analysis and metadata annotation tools.
Expose the semantic richness hidden in any content, using named entity extraction, keyword extraction, sentiment analysis, document categorization, concept tagging, language detection, and structured content scraping. Use AlchemyAPI to enhance your website, blog, content management system, or semantic web application.
"The Sentikator is a computer linguistic engine designed to calculably recognize, analyze and quantify emotions and content in texts. It allows extracting sentiment out of news, analyst recommendations, social media data, transcripts, press releases, broker news, factsheets, weather forecasts and many other sources. Sentiment extraction is highly reliable and disseminated data is preprocessed so that implementation into existing or new financial applications is easily possible. The Sentikator gives valuable insights to emotions"
Wandora is a general-purpose information extraction, management, and publishing application based on Topic Maps and Java Swing. Wandora has a graphical user interface, layered presentation of knowledge, several data storage options, a large collection of data extraction, import and export options, an embedded server, and an open plug-in architecture. Wandora is a FOSS application released under the GNU GPL license.
OpenStructs is an education and distribution site dedicated to open source software for converting, managing, viewing and manipulating structured data. Structured data can represent any existing data structure, from the simplest attribute-value pair formats to fully specified relational database schemas. Material on the OpenStructs site ranges from individual tools to complete open semantic frameworks (OSF) with which to build comprehensive semantic instances.
All OpenStructs tools are premised on the canonical RDF (Resource Description Framework) data model. Thus, OpenStructs tools either convert existing data structures to RDF, extract structure from content as RDF, or manage and manipulate RDF. All OpenStructs tools and approaches are as compliant as possible with existing open standards from the W3C. The intent is to achieve maximum data and software interoperability.
OutWit Hub explores the depths of the Web for you, automatically collecting and organizing data and media from online sources. OutWit Hub breaks down Web pages into their different constituents. Navigating from page to page automatically, it extracts information elements and organizes them into usable collections.
Anything To Triples (any23) is a library, a web service and a command line tool that extracts structured data in RDF format from a variety of Web documents. Currently it supports the following input formats: RDF/XML, Turtle, Notation 3, RDFa, and the microformats Adr, Geo, hCalendar, hCard, hListing, hResume, hReview, License, XFN and Species.
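To illustrate the kind of embedded data any23 harvests, the stdlib sketch below collects RDFa-style `property`/`content` attribute pairs from an HTML snippet. This is a toy illustration, not any23 itself (which is Java): real RDFa processing also resolves prefixes, subjects, and nesting per the spec.

```python
from html.parser import HTMLParser

class RDFaPropertyCollector(HTMLParser):
    """Toy collector: record (property, content) attribute pairs.

    Illustrates the kind of embedded data any23 extracts; a real
    RDFa processor also resolves vocab prefixes and subjects.
    """
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if "property" in d and "content" in d:
            self.pairs.append((d["property"], d["content"]))

html = '''<div vocab="http://schema.org/" typeof="Product">
  <meta property="name" content="Widget">
  <meta property="price" content="9.99">
</div>'''

collector = RDFaPropertyCollector()
collector.feed(html)
print(collector.pairs)  # [('name', 'Widget'), ('price', '9.99')]
```

Tools like any23 perform this extraction across all the listed formats and normalize the result into RDF triples.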
"SchemaWeb is a directory of RDF schemas expressed in the RDFS, OWL and DAML+OIL schema languages.
SchemaWeb is a place for developers and designers working with RDF. It provides a comprehensive directory of RDF schemas to be browsed and searched by human agents and also an extensive set of web services to be used by software agents that wish to obtain real-time schema information whilst processing RDF data.
RDF Schemas are the critical layer of the Semantic Web. They provide the semantic linkage that 'intelligent' software needs to extract value giving information from the raw data defined by RDF triples. "
"Querying Wikipedia like a Semantic Database
DBpedia is a community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data."
"Extract Meaning from your Text.
The TextRazor API helps you extract and understand the Who, What, Why and How from your legal documents with unprecedented accuracy and speed."
If you own a business, you need to monitor your competitors' moves to stay ahead of the game. You also need to do market research to gather useful information that will help you determine your position in online business.
DBpedia is a project aiming to extract structured information from the information created as part of the Wikipedia project. This structured information is then made available on the World Wide Web. DBpedia allows users to query relationships and properties associated with Wikipedia resources, including links to other related datasets. DBpedia has been described by Tim Berners-Lee as one of the more famous parts of the Linked Data project.
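DBpedia exposes its data through a public SPARQL endpoint at dbpedia.org/sparql. As a sketch of how a query is sent over the SPARQL protocol, the stdlib snippet below only builds the HTTP GET request URL (executing it would require network access); the resource and property IRIs are real DBpedia identifiers:

```python
from urllib.parse import urlencode

# Sketch: encode a SPARQL query as a GET request against the public
# DBpedia endpoint. We only build the URL here; fetching it requires
# network access.
ENDPOINT = "https://dbpedia.org/sparql"

query = """
SELECT ?birthPlace WHERE {
  <http://dbpedia.org/resource/Tim_Berners-Lee>
      <http://dbpedia.org/ontology/birthPlace> ?birthPlace .
}
"""

url = ENDPOINT + "?" + urlencode({
    "query": query,
    "format": "application/sparql-results+json",
})
print(url[:40])
```

Requesting JSON results (`application/sparql-results+json`) makes the response easy to consume from any language with a JSON parser.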
Semantic Desktop with KDE
Nepomuk aims to provide the basis for handling all kinds of metadata on the KDE desktop in a generic fashion. This ranges from simple information such as tags or ratings, through metadata extracted from files, to metadata generated automatically by applications. RDF, the Resource Description Framework, provides the powerful basis to store and query all this data. The goal is to categorize all metadata using clean ontologies, making automated handling and enrichment of the data possible.
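The underlying model is simple: every piece of desktop metadata becomes a subject-predicate-object triple that can be queried uniformly. The toy sketch below shows that idea with plain Python; the property names (`hasTag`, `rating`) are illustrative placeholders, not actual Nepomuk ontology terms:

```python
# Toy sketch of desktop metadata as RDF-style triples, the model
# Nepomuk builds on. Property names (hasTag, rating) are illustrative
# placeholders, not real Nepomuk ontology terms.
triples = {
    ("file:///home/me/photo.jpg", "hasTag", "holiday"),
    ("file:///home/me/photo.jpg", "rating", "5"),
    ("file:///home/me/report.odt", "hasTag", "work"),
}

def objects(subject, predicate):
    """All objects for a given subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("file:///home/me/photo.jpg", "hasTag"))  # ['holiday']
```

Because tags, ratings, and extracted file metadata all share this one triple shape, a single query interface can answer questions across every kind of metadata on the desktop.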
Wikipedia users constantly revise Wikipedia articles, with updates happening almost every second. Hence, data stored in the official DBpedia endpoint can quickly become outdated, and Wikipedia articles need to be re-extracted. DBpedia-Live enables such a continuous synchronization between DBpedia and Wikipedia.