Aperture is a Java framework for extracting and querying full-text
content and metadata from various information systems (e.g. file systems,
web sites, mail boxes) and the file formats (e.g. documents, images)
occurring in these systems.
Collection of resources describing web content harvesting issues: open-source projects, articles, recipes, etc.
Feature rich XPath generator, editor, inspector and simple extraction tool.
http://xpath.alephzarro.com
In short, screen scrapers allow you to turn a regular web page into a regular web page plus semantic data, and thus free the data from the page/site that contains it.
PyXPCOM allows for communication between Python and XPCOM, such that a Python application can access XPCOM objects, and XPCOM can access any Python class that implements an XPCOM interface. With PyXPCOM, a developer can talk to XPCOM or embed Gecko from a Python application. PyXPCOM is similar to JavaXPCOM (Java-XPCOM bridge) or XPConnect (JavaScript-XPCOM bridge).
Since seeing Mike Kay's presentation at XTech
2005 I've been meaning to write up some
Amara equivalents to the
examples in the paper, "Comparing XSLT and
XQuery".
Here they are.
This is not meant to be an advocacy piece, but rather a set of useful
examples. I think the Amara examples tend to be easier to follow for
typical programmers (although they also expose some things I'd like to
improve), but with XSLT and XQuery you get cleaner declarative
semantics, and cross-language support.
It occurs to me that I’ve seen these problems solved before, and with a better tool. And I even have the important piece installed on my machine...
I’d love to see all HTML processing in UFP become pluggable, and for a plug-in based on Mozilla to become a reality. Many of the pieces seem to be in place. After an apt-get install python2.4-gtk2, I find that I can import gtkmozembed from within Python. It looks like more pieces of the puzzle are (or will become) available with GtkMozEdit. But I don’t believe that fine-grained access to the DOM from within Python is either necessary or even desirable.
To my way of thinking, the ideal would be to run Mozilla in a headless mode. I’d simply construct a MozEmbed object and stream in some data; that data would either carry some unobtrusive JavaScript or be processed with an evalInSandbox technique to make adjustments to the DOM tree; finally, either an HTMLSerializer or an XHTMLSerializer would be used to return the sanitized content.
I’d much rather use DOM/XPath techniques than regular expressions.
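To illustrate the difference, here is a minimal sketch using only Python's standard library (xml.etree.ElementTree), not Mozilla or UFP. It assumes the input is already well-formed XHTML; the sample markup and the helper names extract_links and strip_elements are hypothetical and exist only for this example. A path expression such as .//a[@href] does not care about attribute order or quoting style, which is where regular expressions tend to break.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment used only for illustration.
SAMPLE = """<div>
  <p>See <a class="ext" href="http://example.org/a">first</a> and
  <a href='http://example.org/b' title="x">second</a>.</p>
  <script>alert("unwanted")</script>
</div>"""


def extract_links(xhtml):
    """Return (href, text) pairs for every <a> element that has an href."""
    root = ET.fromstring(xhtml)
    # ElementTree supports a limited XPath subset, including [@attr] predicates.
    return [(a.get("href"), a.text) for a in root.iterfind(".//a[@href]")]


def strip_elements(xhtml, tag):
    """Remove every element named `tag` and serialize the tree back to text.

    Text that directly follows a removed element (its tail) is dropped too.
    """
    root = ET.fromstring(xhtml)
    # Snapshot the element list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(extract_links(SAMPLE))
    print(strip_elements(SAMPLE, "script"))
```

Real-world HTML is rarely well-formed, so an actual pipeline would need a tolerant parser in front of this step, which is exactly the role the embedded browser would play.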
At this point, it occurs to me that a number of people who read this weblog have far more experience and/or better contacts than I do to help pull these pieces together.