Scripters: Group items tagged parser

Jac Londe

Feeds HTML Parser for Node Creation | Drupal.org

  • Feeds HTML Parser for Node Creation
  • I currently have a Drupal site set up and am using the Feeds module (http://drupal.org/project/feeds) to create CCK nodes from a few RSS feeds. This is a standard use case of that module, at least as far as I know.
  • A need has arisen to also create content from non-RSS, non-XML sources. I need a new parser for the Feeds module that populates CCK fields by parsing raw HTML content. My first thought is to let the user define a regular expression for each field, and to populate that field with the output of the expression applied to the raw HTML. However, I am open to suggestions for other solutions that might be easier for the developer.
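
The one-regular-expression-per-field idea can be sketched in a few lines. This is only an illustration in Python 3, not Feeds module code (a real Feeds parser would be a PHP plugin); the field names, the patterns, and the `extract_fields` helper are all hypothetical.

```python
import re

# Hypothetical field-to-pattern mapping. The field names and patterns are
# illustrative only; they are not part of the Feeds module.
FIELD_PATTERNS = {
    "title": r"<h1[^>]*>(.*?)</h1>",
    "price": r'<span class="price">\s*([^<]+?)\s*</span>',
}


def extract_fields(html, patterns):
    """Return {field: first captured group, or None when nothing matches}."""
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
        result[field] = match.group(1).strip() if match else None
    return result


sample = ('<html><body><h1>Blue Widget</h1>'
          '<span class="price"> $4.99 </span></body></html>')
fields = extract_fields(sample, FIELD_PATTERNS)
print(fields)
```

Each field takes the first capture group of its pattern, and a field whose pattern does not match is left empty. Regular expressions tend to be brittle against real-world HTML, so a parser-based alternative may be one of the "different solutions" worth considering.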
Jac Londe

Parsing HTML in Python (Shallow Thoughts)

  • Parsing HTML in Python
  • Up until now, I've avoided doing any HTML parsing in my RSS reader FeedMe.
  • import urllib2
    from HTMLParser import HTMLParser

    # Python 2 excerpt. has_attr() and make_absolute() are helper methods, and
    # self.outfile / self.encoding are attributes, that this excerpt does not show.
    class MyFancyHTMLParser(HTMLParser):
        def fetch_url(self, url):
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            link = response.geturl()
            html = response.read()
            response.close()
            self.feed(html)   # feed() starts the HTMLParser parsing

        def handle_starttag(self, tag, attrs):
            if tag == 'img':
                # attrs is a list of tuples, (attribute, value)
                srcindex = self.has_attr('src', attrs)
                if srcindex < 0:
                    return   # img with no src tag? skip it
                src = attrs[srcindex][1]
                # Make relative URLs absolute
                src = self.make_absolute(src)
                attrs[srcindex] = (attrs[srcindex][0], src)

            print '<' + tag
            for attr in attrs:
                print ' ' + attr[0]
                # type(x) == 'str' is never true; test the type itself
                if len(attr) > 1 and isinstance(attr[1], basestring):
                    # make sure attr[1] doesn't have any embedded double-quotes
                    # ('\"' is just '"' in Python, so use the HTML entity)
                    val = attr[1].replace('"', '&quot;')
                    print '="' + val + '"'
            print '>'

        def handle_endtag(self, tag):
            self.outfile.write('</' + tag.encode(self.encoding) + '>\n')
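
The excerpt above is Python 2 and calls `has_attr` and `make_absolute` helpers that it does not show. As a rough, self-contained sketch of the same idea in Python 3, the version below rewrites relative `img` sources while re-emitting the markup. It assumes `urllib.parse.urljoin` is an acceptable stand-in for `make_absolute`; the class name `ImgAbsolutizer` and the `example.com` base URL are made up for illustration and do not come from the original post.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgAbsolutizer(HTMLParser):
    """Re-emit tags and text, resolving <img src> values against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.out = []   # output fragments, joined by the caller

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; value is None for
        # attributes written without a value.
        if tag == "img":
            attrs = [
                (name, urljoin(self.base_url, value))
                if name == "src" and value is not None
                else (name, value)
                for name, value in attrs
            ]
        parts = ["<" + tag]
        for name, value in attrs:
            if value is None:
                parts.append(" " + name)
            else:
                # escape() turns embedded quotes and ampersands into entities
                parts.append(' %s="%s"' % (name, escape(value, quote=True)))
        parts.append(">")
        self.out.append("".join(parts))

    def handle_endtag(self, tag):
        self.out.append("</" + tag + ">")

    def handle_data(self, data):
        # The parser has already decoded character references; re-encode them.
        self.out.append(escape(data, quote=False))


parser = ImgAbsolutizer("http://example.com/blog/")
parser.feed('<p>Hi <img src="pix/cat.jpg" alt="cat"></p>')
parser.close()
result = "".join(parser.out)
print(result)
```

Collecting fragments in a list instead of printing them avoids the newline that each `print` statement adds between pieces of a tag. Comments, doctype declarations, and self-closing syntax are not preserved by this sketch.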