
(HBSN) Useful Webservices APIs: Group items tagged open source


François Dongier

Extracting Enterprise Vocabularies Using Linked Open Data | Semantic Web Dog Food - 0 views

  • A common vocabulary is vital to smooth business operation, yet codifying and maintaining an enterprise vocabulary is an arduous, manual task. We describe a process to automatically extract a domain specific vocabulary (terms and types) from unstructured data in the enterprise guided by term definitions in Linked Open Data (LOD). We validate our techniques by applying them to the IT (Information Technology) domain, taking 58 Gartner analyst reports and using two specific LOD sources -- DBpedia and Freebase.
    • François Dongier
       
      This IBM article is referenced by Juan Sequeda in a post to the Linking Open Data mailing list (public-lod@w3.org, Feb 4, 2010):

      Hi Matthias,

      We worked on something similar: entity type discovery using linked open data. Our task was: given a corpus of documents in the same domain, identify specific entity types in the documents. Our objective was to search for documents in a corpus by specific entities. For example: "find articles that are about RDBMS".

      Standard NER tools identify high-level types such as persons, organizations and places because they have been previously trained on general corpora. I assume tools like OpenCalais have been trained on news-like documents and Zemanta has been trained on blog-like documents. We were interested in identifying specific types such as "RDBMS" when the word "Oracle" would show up in the text.

      In order to do that, we followed several domain term extraction techniques. We used LOD, specifically DBpedia, Freebase and OpenCyc, to disambiguate terms and also retrieve the entities. Honestly, evaluation is pretty hard to do, but our current implementation was not that bad (75% precision and 55% recall). We built upon some work by IBM where they create a vocabulary from text using LOD [1].

      Let me see if I can clean up the code and publish it as a service.

      [1] http://data.semanticweb.org/conference/iswc/2009/paper/inuse/143/html

      Juan Sequeda (575) SEQ-UEDA www.juansequeda.com
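The idea in the post above (tagging "Oracle" as an "RDBMS", then searching documents by that type and scoring the result with precision and recall) can be illustrated with a very rough sketch. Everything here is an assumption made for illustration: the `TOY_LOD_TYPES` table is a hypothetical stand-in for lookups against DBpedia, Freebase or OpenCyc, the function names are invented, and the matching is plain token lookup rather than the term extraction and disambiguation techniques the post refers to.

```python
# Hypothetical stand-in for a Linked Open Data lookup: term -> entity types.
# A real system would query DBpedia/Freebase/OpenCyc and disambiguate terms.
TOY_LOD_TYPES = {
    "oracle": {"RDBMS", "Company"},
    "mysql": {"RDBMS"},
    "postgresql": {"RDBMS"},
    "hadoop": {"Framework"},
}


def discover_types(text):
    """Return {term: types} for every known term that appears in the text."""
    tokens = [t.strip('.,;:"()').lower() for t in text.split()]
    return {tok: TOY_LOD_TYPES[tok] for tok in tokens if tok in TOY_LOD_TYPES}


def search_by_type(corpus, wanted_type):
    """Return ids of documents mentioning at least one entity of wanted_type."""
    return [
        doc_id
        for doc_id, text in corpus.items()
        if any(wanted_type in types for types in discover_types(text).values())
    ]


def precision_recall(predicted, relevant):
    """Compute (precision, recall) of predicted ids against relevant ids."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    corpus = {
        "a": "We migrated from Oracle to PostgreSQL.",
        "b": "Hadoop cluster sizing notes.",
        "c": "Quarterly sales report.",
    }
    hits = search_by_type(corpus, "RDBMS")
    print(hits, precision_recall(hits, ["a"]))
```

A query such as "find articles that are about RDBMS" then becomes `search_by_type(corpus, "RDBMS")`, and `precision_recall` compares the returned ids with a hand-labelled set of relevant documents.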
François Dongier

Anything to Triples - 0 views

  • Anything To Triples (any23) is a library and web service that extracts structured data in RDF format from a variety of Web documents. Currently it supports the following input formats: RDF/XML, Turtle, Notation 3; RDFa; Microformats (Adr, Geo, hCalendar, hCard, hListing, hResume, hReview, License and XFN). Any23 is used in major Web of Data applications such as sindice.com and sig.ma. It is written in Java and licensed under the Apache License. Any23 can be used in various ways: as a library in Java applications that consume structured data from the Web; as a command-line tool for extracting and converting between the supported formats; and as a web service and API, which you can try at any23.org.
  • The original codebase comes from open-sourcing the "RDFizer" component of the Sindice search engine. The project is supported by DERI, NUI Galway, Web of Data - FBK and the OKKAM project (ICT-215032). Individual developers who have contributed to any23 include: Michele Catasta, Richard Cyganiak, Michele Mostarda, Davide Palmisano, Gabriele Renzi, Jürgen Umbrich.
François Dongier

Taking Search -- And Meaning -- Beyond English - Semantic Web - 0 views

  • Multi-lingual text analytics vendor Basis Technology Corp., which develops the Rosette linguistics platform
  • The company this week released Rosette 7, the latest version of its software, which is used in major web and enterprise search engines, from Google to Bing to Oracle software. The product supports 55 languages for language identification, and if you count different encodings that grows to over 100 languages and encoding pairs. For base linguistics for search engine enablement it supports 20 languages, depending on how you count them.
  • Another major feature in Rosette 7 is name matching and name translation, a problem the company has been working on for more than five years. This is the first time name translation and searching are integrated into the same core set of APIs as the rest of the Rosette platform.
  • The latest version also now supports Lucene-based applications, so any organization using the open source search toolkits can get the same advanced linguistic processing used by high end web and enterprise search engines.