Web harvesting solution
Ishta

Aduna - Aperture

  • Flexible content and metadata extraction framework
    Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, web sites, mail boxes) and the file formats (e.g. documents, images) occurring in these systems.

Ishta

Universal Feed Parser

  • Ishta
     
    Easy-to-use Python feed parser; open source and well tested (3000 unit tests).
Ishta

Aperture Framework


  • Aperture is a Java framework for extracting and querying full-text
    content and metadata from various information systems (e.g. file systems,
    web sites, mail boxes) and the file formats (e.g. documents, images)
    occurring in these systems.
  • Ishta
     
    Java framework for data extraction, crawling, and harvesting.
    It can process different data sources, extract metadata,
    and output RDF.
    Pluggable architecture, open source, RDF insertion.
Ishta

XPather :: Firefox Add-ons

  • XPather 1.3, by Viktor Zigo

    Feature-rich XPath generator, editor, inspector and simple extraction tool.

    http://xpath.alephzarro.com

  • Ishta
     
    Firefox extension for browsing and evaluating XPath expressions.
Ishta

Solvent - SIMILE

shared by Ishta on 28 Apr 07
  • Solvent is a Firefox extension that helps you write screen scrapers for Piggy Bank.

  • In short, screen scrapers allow you to turn a regular web page into a regular web page plus semantic data, and thus frees the data from the page/site that contains it.

    • Ishta
       
      This page has a nice definition of a screen scraper and of what information extraction for the Semantic Web is all about.
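
The idea quoted above, turning a regular web page into the page plus semantic data, can be sketched in a few lines of Python. This is only an illustration, not how Solvent or Piggy Bank work internally: the page markup, the `book` class name, and the `scrape_books` helper are all hypothetical, and the standard-library `xml.etree.ElementTree` used here supports only a limited XPath subset and requires well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML page. Real pages usually need an
# HTML-tolerant parser before this kind of query can be applied.
PAGE = """<html><body>
  <div class="book"><span class="title">Dune</span><span class="author">Frank Herbert</span></div>
  <div class="book"><span class="title">Neuromancer</span><span class="author">William Gibson</span></div>
</body></html>"""


def scrape_books(xhtml):
    """Return (subject, predicate, object) triples for each book on the page."""
    root = ET.fromstring(xhtml)
    triples = []
    # Limited XPath: find every <div class="book"> anywhere in the tree.
    for i, book in enumerate(root.findall(".//div[@class='book']")):
        subject = f"book{i}"  # made-up local identifier for the record
        for field in ("title", "author"):
            node = book.find(f"span[@class='{field}']")
            if node is not None and node.text:
                triples.append((subject, field, node.text.strip()))
    return triples


if __name__ == "__main__":
    for triple in scrape_books(PAGE):
        print(triple)
```

The triples are plain tuples here; a real scraper would more likely emit RDF so the data can be reused outside the page that contained it.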
Ishta

PyXPCOM - MDC

  • PyXPCOM allows for communication between Python and XPCOM, such that a Python application can access XPCOM objects, and XPCOM can access any Python class that implements an XPCOM interface. With PyXPCOM, a developer can talk to XPCOM or embed Gecko from a Python application. PyXPCOM is similar to JavaXPCOM (Java-XPCOM bridge) or XPConnect (JavaScript-XPCOM bridge).


    Python classes and interfaces: Mozilla defines many external interfaces available to embeddors and component developers. PyXPCOM provides access to these interfaces as Python interfaces. PyXPCOM also contains several classes that provide access to functions for initializing and shutting down XPCOM and Gecko from Python, as well as some XPCOM helper functions.
  • Ishta
     
    Is this about controlling Firefox from a Python script?
Ishta

XULRunner Hall of Fame - MDC

  • This page tracks existing XULRunner-based applications.
  • Ishta
     
    List of Mozilla XULRunner (Gecko-based) applications.
Ishta

Amara equivalents of Mike Kay's XSLT 2.0, XQuery examples ✏Copia

  • Since seeing Mike Kay's presentation at XTech 2005 I've been meaning to
    write up some Amara equivalents to the examples in the paper,
    "Comparing XSLT and XQuery". Here they are.



    This is not meant to be an advocacy piece, but rather a set of useful
    examples. I think the Amara examples tend to be easier to follow for
    typical programmers (although they also expose some things I'd like to
    improve), but with XSLT and XQuery you get cleaner declarative
    semantics, and cross-language support.

  • Ishta
     
    This page, by Uche, compares the Pythonic and XQuery approaches to XML processing.
Ishta

Sam Ruby: Bleach Alternatives

  • It occurs to me that I’ve seen these problems solved before, and with a better tool.  And I even have the important piece installed on my machine...


    I’d love to see all HTML processing in UFP become pluggable, and for a plug-in based on Mozilla to become a reality.  Many of the pieces seem to be in place.  After an apt-get install python2.4-gtk2, I find that I can import gtkmozembed from within Python.  It looks like more pieces to the puzzle are (or will) become available with GtkMozEdit.  But I don’t believe that fine grained access to the DOM from within Python is either necessary or even desirable.


    To my way of thinking, the ideal would be to run Mozilla in a headless mode.  I’d simply construct a MozEmbed object, stream in some data, that data would have some unobtrusive javascript or would use an evalInSandbox technique to make adjustments to the DOM tree, and finally either an HTMLSerializer or an XHTMLSerializer would be used to return back sanitized content.


    I’d much rather use DOM/XPath techniques than regular expressions.


    At this point, it occurs to me that a number of people who read this weblog have far more experience and/or better contacts than I do to help pull these pieces together.
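
To make the preference for DOM techniques over regular expressions concrete, here is a minimal sketch of tree-based sanitization. It is not the headless-Mozilla pipeline described above: it swaps in Python's standard-library `xml.etree.ElementTree` as a stand-in for a real browser DOM, so it only accepts well-formed XHTML. The `UNSAFE_TAGS` list and the `sanitize` and `_drop` helpers are assumptions made for illustration, and the rules are far from a complete security policy.

```python
import xml.etree.ElementTree as ET

# Illustrative blocklist only; a real sanitizer would use a vetted allowlist.
UNSAFE_TAGS = {"script", "iframe", "object", "embed"}


def _drop(parent, child):
    """Remove child from parent while keeping the text that follows it."""
    tail = child.tail or ""
    siblings = list(parent)
    idx = siblings.index(child)
    if idx == 0:
        parent.text = (parent.text or "") + tail
    else:
        prev = siblings[idx - 1]
        prev.tail = (prev.tail or "") + tail
    parent.remove(child)


def sanitize(xhtml):
    """Parse well-formed XHTML, strip unsafe nodes and attributes, reserialize."""
    root = ET.fromstring(xhtml)
    # Snapshot the tree first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in UNSAFE_TAGS:
                _drop(parent, child)
    for el in root.iter():
        for name in list(el.attrib):
            value = el.attrib[name].strip().lower()
            # Drop inline event handlers and javascript: URLs.
            if name.lower().startswith("on") or value.startswith("javascript:"):
                del el.attrib[name]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    dirty = ('<div><p onclick="x()">Hi<script>alert(1)</script> there</p>'
             '<a href="javascript:evil()">link</a></div>')
    print(sanitize(dirty))
```

Because the edits are made on a parsed tree and the result is reserialized, the output is well-formed by construction, which is hard to guarantee with regular expressions. The sketch does not handle an unsafe root element or malformed input.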
