Group items tagged


drupal.org/835184

shared by Jac Londe on 26 Mar 14

Feeds HTML Parser for Node Creation
...

I currently have a Drupal site set up and am using the Feeds module (http://drupal.org/project/feeds) to create CCK nodes from a few RSS feeds. This is a standard use case of that module, at least as far as I know.
...

The need has arisen to also create content from non-RSS/non-XML sources. I need a new Parser created for the Feeds module that would let a user populate CCK fields by parsing raw HTML content. My first thought is that the user should be allowed to define a regular expression for each field, with the field then being populated by the output of that regular expression applied to the raw HTML content. However, I am open to suggestions for different solutions that might be easier for the developer.
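The one-regular-expression-per-field idea can be sketched in a few lines. The following is only an illustration in Python, not a Feeds parser plugin: the field names, the patterns, the sample HTML, and the `extract_fields` helper are all hypothetical, and a real implementation would have to plug into the Feeds module's own parser interface.

```python
import re

# Hypothetical field-to-pattern mapping. The field names and patterns are
# illustrative only; they do not come from the Feeds module or from CCK.
FIELD_PATTERNS = {
    "title": r"<h1[^>]*>(.*?)</h1>",
    "price": r'<span class="price">\s*\$?([\d.]+)\s*</span>',
}

# Hypothetical raw HTML standing in for a fetched page.
SAMPLE_HTML = (
    '<html><body><h1 id="t">Blue Widget</h1>'
    '<span class="price"> $19.99 </span></body></html>'
)


def extract_fields(html, patterns):
    """Apply each field's regex to the raw HTML.

    The first capture group of the first match becomes the field value;
    a field whose pattern does not match is set to None.
    """
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
        fields[name] = match.group(1).strip() if match else None
    return fields


if __name__ == "__main__":
    print(extract_fields(SAMPLE_HTML, FIELD_PATTERNS))
```

Regular expressions are brittle against markup changes, which is one reason the poster's openness to other approaches (for example a real HTML parser) seems sensible.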
...


shared by Jac Londe on 26 Mar 14

Parsing HTML in Python
...

Up until now, I've avoided doing any HTML parsing in my RSS reader FeedMe.
...

import urllib2
from HTMLParser import HTMLParser

class MyFancyHTMLParser(HTMLParser):
    # Python 2 excerpt. has_attr(), make_absolute(), self.outfile and
    # self.encoding are used below but are defined outside this excerpt.

    def fetch_url(self, url):
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        link = response.geturl()    # final URL after any redirects
        html = response.read()
        response.close()
        self.feed(html)    # feed() starts the HTMLParser parsing

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            # attrs is a list of tuples, (attribute, value)
            srcindex = self.has_attr('src', attrs)
            if srcindex < 0:
                return    # img with no src tag? skip it
            src = attrs[srcindex][1]
            # Make relative URLs absolute
            src = self.make_absolute(src)
            attrs[srcindex] = (attrs[srcindex][0], src)

        print '<' + tag
        for attr in attrs:
            print ' ' + attr[0]
            if len(attr) > 1 and isinstance(attr[1], basestring):
                # make sure attr[1] doesn't have any embedded double-quotes
                # (the original replace('"', '\"') was a no-op)
                val = attr[1].replace('"', '&quot;')
                print '="' + val + '"'
        print '>'

    def handle_endtag(self, tag):
        self.outfile.write('</' + tag.encode(self.encoding) + '>\n')
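The excerpt calls `has_attr` and `make_absolute` without showing them. A minimal sketch of what such helpers might look like follows, written for Python 3 (where the module is `html.parser` rather than `HTMLParser`). The class name `ImgSrcCollector`, the `base_url` argument, and the `collect_img_sources` wrapper are assumptions made for illustration; they are not taken from the original post, and the author's own helpers may well differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect absolute URLs of <img src=...> attributes (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url   # assumed: the URL the HTML was fetched from
        self.sources = []

    @staticmethod
    def has_attr(name, attrs):
        """Return the index of attribute `name` in attrs, or -1 if absent."""
        for i, (attr_name, _value) in enumerate(attrs):
            if attr_name == name:
                return i
        return -1

    def make_absolute(self, url):
        """Resolve a possibly relative URL against the base URL."""
        return urljoin(self.base_url, url)

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        srcindex = self.has_attr("src", attrs)
        # Skip images with no src attribute or a valueless src
        if srcindex < 0 or attrs[srcindex][1] is None:
            return
        self.sources.append(self.make_absolute(attrs[srcindex][1]))


def collect_img_sources(html, base_url):
    """Parse html and return the absolute src of every <img> that has one."""
    parser = ImgSrcCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.sources
```

Collecting the URLs in a list instead of printing them keeps the sketch easy to check; rewriting and re-emitting the tags, as the excerpt does, would build on the same two helpers.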
...

