It occurs to me that I’ve seen these problems solved before, and with a better tool. And I even have the important piece installed on my machine...
I’d love to see all HTML processing in UFP become pluggable, and for a plug-in based on Mozilla to become a reality. Many of the pieces seem to be in place. After an apt-get install python2.4-gtk2, I find that I can import gtkmozembed from within Python. It looks like more pieces of the puzzle are (or will become) available with GtkMozEdit. But I don’t believe that fine-grained access to the DOM from within Python is either necessary or even desirable.
To my way of thinking, the ideal would be to run Mozilla in a headless mode. I’d simply construct a MozEmbed object and stream in some data. That data would either carry some unobtrusive JavaScript or be handled with an evalInSandbox technique to make adjustments to the DOM tree. Finally, either an HTMLSerializer or an XHTMLSerializer would be used to return the sanitized content.
I’d much rather use DOM/XPath techniques than regular expressions.
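To make the contrast with regular expressions concrete, here is a minimal sketch of the parse, adjust, serialize shape described above. It is only a stand-in: it uses Python’s standard xml.etree.ElementTree rather than Mozilla, so it only accepts well-formed XHTML, and the list of unsafe tags and the sanitize function name are assumptions of mine for illustration, not a vetted policy.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy, for illustration only: elements to drop outright.
UNSAFE_TAGS = {"script", "iframe", "object", "embed"}


def _remove_keeping_tail(parent, child):
    """Remove child from parent, but keep the text that followed it."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def sanitize(xhtml_fragment):
    """Parse well-formed XHTML, prune unsafe nodes, and serialize it back."""
    root = ET.fromstring(xhtml_fragment)

    # ElementTree nodes have no parent pointer, so removal is done from
    # the parent's side while walking a snapshot of the tree.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in UNSAFE_TAGS:
                _remove_keeping_tail(parent, child)

    # Strip inline event handlers (onclick, onload, ...) from what is left.
    for element in root.iter():
        for name in list(element.attrib):
            if name.lower().startswith("on"):
                del element.attrib[name]

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    dirty = '<div><p onclick="evil()">Hi<script>alert(1)</script> there</p></div>'
    print(sanitize(dirty))
```

The point is that the tree walk operates on elements and attributes rather than on character patterns, so nesting, quoting, and attribute order stop being a concern. A real browser engine would additionally cope with the tag soup that this strict parser rejects.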
At this point, it occurs to me that a number of people who read this weblog have far more experience and/or better contacts than I do to help pull these pieces together.