
6.2. Parsing HTML Data

6.2.1. Picking information from an HTML page

  • A difficult problem
  • HTML defines page layout, not content, which is where XML has the advantage
  • Very useful because of the volume of data available
  • If the format of the page changes, your program breaks.

6.2.2. HTML

Token
One piece of information in an HTML-formatted page
  • A token is the data associated with a pair of HTML tags.

    <tagName> token </tagName>

  • Example token types:

    • URL or image reference
    • Textual information
  • HTML tags usually relate only to formatting

  • Must look at several tokens to determine context of the data

  • The start-tag/end-tag structure (<TABLE> ... </TABLE>) leads parsing code to use finite state machines and stacks.
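To make the stack idea concrete, here is a minimal sketch in Python 3 syntax. It assumes tokens are small dictionaries with 'type' and 'name' keys, and the function name tags_balanced is hypothetical. It also assumes every start-tag has an explicit end-tag, which real HTML does not require.

```python
def tags_balanced(tokens):
    """Return True if every StartTag is closed by a matching EndTag, properly nested."""
    stack = []
    for token in tokens:
        if token['type'] == 'StartTag':
            # Remember the open tag until its end-tag shows up
            stack.append(token['name'])
        elif token['type'] == 'EndTag':
            # The end-tag must close the most recently opened tag
            if not stack or stack.pop() != token['name']:
                return False
    # Anything left on the stack was opened but never closed
    return not stack


tokens = [
    {'type': 'StartTag', 'name': 'table'},
    {'type': 'StartTag', 'name': 'tr'},
    {'type': 'Characters', 'data': 'cell text'},
    {'type': 'EndTag', 'name': 'tr'},
    {'type': 'EndTag', 'name': 'table'},
]
print(tags_balanced(tokens))
```

The depth of the stack at any point tells the program how deeply nested the current token is, which is often the context needed to decide whether the data is interesting.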

6.2.3. Example Tokens

Here are some example tokens in the form of Python dictionary objects:

{'data': [], 'type': 'StartTag', 'name': u'html'}
{'data': [], 'type': 'StartTag', 'name': u'head'}
{'data': u'\n    ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'title'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'title'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'head'}
{'data': u'\n\n', 'type': 'SpaceCharacters'}
{'data': [(u'bgcolor', u'lightyellow')], 'type': 'StartTag', 'name': u'body'}
{'data': u' \n\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'table'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'tbody'}
{'data': [], 'type': 'StartTag', 'name': u'tr'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'td'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'h1'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': [], 'type': 'EndTag', 'name': u'h1'}

All of the above tokens came from the following few lines of HTML:

<HTML>
<HEAD>
<TITLE> Tim Bower </TITLE>
</HEAD>
<BODY BGCOLOR="lightyellow">
<TABLE> <TR>
<TD>
<H1>Tim Bower</H1>

6.2.4. Two main programming strategies

  1. The call-back approach (HTMLParser, shown in the textbook)

    • Define your own class that extends the HTMLParser class
    • Nice use of inheritance and polymorphism
    • Pass the HTML page to the parser and it calls functions from your class as needed to process the start-tags, data elements, end-tags and a few other miscellaneous tags.
  2. The document tree approach

    • Parser builds a tree (data structure object) based on the page contents
    • Iterate through the tree, or a list of tokens taken from the tree.

6.2.5. HTMLParser

Here is a simple example of programming with HTMLParser.

import sys
from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.title = ''
        self.readingtitle = 0

    def handle_starttag(self, tag, attrs):
        # Called for each start-tag; begin collecting text at <title>
        if tag == 'title':
            self.readingtitle = 1

    def handle_data(self, data):
        # Called for the text between tags
        if self.readingtitle:
            self.title += data

    def handle_endtag(self, tag):
        # Called for each end-tag; the title is complete at </title>
        if tag == 'title':
            print "*** %s ***" % self.title
            self.readingtitle = 0

if __name__ == '__main__':
    fd = open(sys.argv[1])
    tp = TitleParser()
    tp.feed(fd.read())
    fd.close()

  • Argh! HTMLParser is fragile and hard to debug.
    Traceback (most recent call last):
      File "C:\Users\tim\Documents\Classes\Net_Programming\Source_code\Topic 3 - Web\weatherParser.py", line 258, in <module>
        parser.feed(data)
      File "C:\Python25\lib\HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "C:\Python25\lib\HTMLParser.py", line 148, in goahead
        k = self.parse_starttag(i)
      File "C:\Python25\lib\HTMLParser.py", line 226, in parse_starttag
        endpos = self.check_for_whole_start_tag(i)
      File "C:\Python25\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
        self.error("malformed start tag")
      File "C:\Python25\lib\HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParseError: malformed start tag, at line 120, column 477

  • HTMLParser seems to lack robustness to handle malformed or complex web pages
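For readers on Python 3, where the module is named html.parser, the same call-back idea might be sketched as follows. The inline page string is a made-up stand-in for a file read from disk, and the sketch makes no claim about how this parser copes with the malformed page that produced the traceback above.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ''
        self.reading_title = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <TITLE> matches 'title'
        if tag == 'title':
            self.reading_title = True

    def handle_data(self, data):
        # Collect text only while inside <title> ... </title>
        if self.reading_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.reading_title = False


# Hypothetical page content, modeled on the example HTML lines in these notes
page = "<HTML><HEAD><TITLE> Tim Bower </TITLE></HEAD><BODY><H1>Tim Bower</H1>"
tp = TitleParser()
tp.feed(page)
tp.close()
print("*** %s ***" % tp.title.strip())
```

Storing the title on the object, rather than printing inside handle_endtag, lets the caller decide what to do with the result once parsing is finished.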

6.2.6. html5lib

  • Found on the Python Package Index [PYPI]

  • Project Web Page [HT5LIB]

  • Install setuptools then use easy_install to install html5lib (see Installation of Python Packages with Setuptools)

  • Advantages:

    • Robust, standards based parser
    • Filtering data after the page is parsed is easier to follow and debug than the call-back approach
  • Disadvantages:

    • Documentation of the API for traversing the tree is lacking or hard to follow.
    • Traversing the tree, rather than iterating through the tokens, is probably the most flexible approach, but it is harder for a beginner to learn.

6.2.6.1. html5lib Usage

  1. Build the tree:

import html5lib
from html5lib import treebuilders, treewalkers

p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
f = open("weather.html", "r")
dom_tree = p.parse(f.read())
f.close()

  2. Loop through tokens:

walker = treewalkers.getTreeWalker("dom")
stream = walker(dom_tree)

# Tags that are not interesting
passtags = [ u'a', u'h1', u'h2', u'h3', u'h4', u'em',
             u'strong', u'br', u'img', u'dl', u'dt', u'dd' ]

for token in stream:
    # Only tag tokens carry a 'name'; show those not listed in passtags
    if 'name' in token:
        if token['name'] in passtags:
            continue
        print token

6.2.6.2. html5lib Tokens

  • The stream of tokens can be looped over like a list

  • Each token is a dictionary

    • token['data']

      • String (Unicode)
      • Empty list
      • List of tuples for formatting attributes
    • token['type'] – (StartTag, EndTag, Characters, SpaceCharacters)

    • token['name'] – name of the start or end tag (table, tr, td, h1, br, ul, li, … )

  • See Example Tokens
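Since token['data'] can take three different shapes, code that reads it usually has to check which one it received. A small sketch in Python 3 syntax, using tokens copied from the Example Tokens section (the helper name describe_data is hypothetical):

```python
def describe_data(token):
    """Summarize token['data'] according to which of its three forms it takes."""
    data = token['data']
    if isinstance(data, str):
        # Characters and SpaceCharacters tokens carry a string
        return 'text: %r' % data
    if not data:
        # A tag without attributes carries an empty list
        return 'no attributes'
    # A tag with attributes carries a list of (name, value) tuples
    return 'attributes: ' + ', '.join('%s=%s' % (k, v) for k, v in data)


examples = [
    {'data': [], 'type': 'StartTag', 'name': 'html'},
    {'data': [('bgcolor', 'lightyellow')], 'type': 'StartTag', 'name': 'body'},
    {'data': 'Tim Bower', 'type': 'Characters'},
]
for token in examples:
    print(token['type'], '->', describe_data(token))
```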

6.2.6.3. html5lib Token Parsing

doingTitle = False
for token in stream:
    # Tag tokens have a 'name'; skip the uninteresting ones
    if 'name' in token:
        if token['name'] in passtags:
            continue
        else:
            tName = token['name']

    tType = token['type']
    if tType == 'StartTag':
        # <title> seen: start collecting text
        if tName == u'title':
            title = ''
            doingTitle = True
    elif tType == 'EndTag':
        # </title> seen: the title is complete
        if tName == u'title':
            print "*** %s ***\n" % title
            doingTitle = False
    elif tType == 'Characters':
        # Text token: keep it only while inside the title
        if doingTitle:
            title += token['data']
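The same state-flag technique can be wrapped in a function and tried on a plain list of token dictionaries, without html5lib installed. This is only a sketch in Python 3 syntax: the function name text_inside is hypothetical, and the token list is copied from the Example Tokens section.

```python
def text_inside(stream, tag_name):
    """Collect the Characters data found between <tag_name> and </tag_name>."""
    text = ''
    inside = False
    for token in stream:
        kind = token['type']
        name = token.get('name')  # None for text tokens
        if kind == 'StartTag' and name == tag_name:
            inside = True
        elif kind == 'EndTag' and name == tag_name:
            inside = False
        elif kind == 'Characters' and inside:
            text += token['data']
    return text


stream = [
    {'data': [], 'type': 'StartTag', 'name': 'head'},
    {'data': [], 'type': 'StartTag', 'name': 'title'},
    {'data': ' ', 'type': 'SpaceCharacters'},
    {'data': 'Tim Bower', 'type': 'Characters'},
    {'data': ' ', 'type': 'SpaceCharacters'},
    {'data': [], 'type': 'EndTag', 'name': 'title'},
    {'data': [], 'type': 'EndTag', 'name': 'head'},
]
print("*** %s ***" % text_inside(stream, 'title'))
```

Taking the tag name as a parameter means the same loop can pull out an h1 heading or a table cell, as long as the tag of interest does not nest inside itself.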

6.2.6.4. The DOM Tree Alternative

  • The DOM tree may be used directly.
  • Not documented with html5lib, but xml.dom is a standard Python module.
  • DOM trees are normally used with XML, but html5lib can make a DOM tree from HTML data.
  • Walk through the tree by examining the child nodes of each node. With knowledge of the page structure, you may be able to go almost directly to the desired information.
  • See chapter 8 and DOMtry.py
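As a rough illustration of walking a DOM tree, the sketch below uses xml.dom.minidom from the standard library, in Python 3 syntax. It assumes a small, well-formed page so that minidom can parse it directly; with real HTML the tree would come from html5lib instead. The helper child_elements is hypothetical, and this is not the DOMtry.py program mentioned above.

```python
from xml.dom import minidom

# Hypothetical well-formed page, modeled on the example HTML in these notes
page = ("<html><head><title>Tim Bower</title></head>"
        "<body><table><tr><td><h1>Tim Bower</h1></td></tr></table></body></html>")
doc = minidom.parseString(page)


def child_elements(node, name):
    """Return the direct child elements of node that have the given tag name."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE and child.tagName == name]


# Knowing the page structure, step down the tree: html -> head -> title
html = doc.documentElement
head = child_elements(html, 'head')[0]
title = child_elements(head, 'title')[0]
print(title.firstChild.data)

# Alternatively, search the whole tree for a tag by name
h1 = doc.getElementsByTagName('h1')[0]
print(h1.firstChild.data)
```

Stepping through child nodes is fast when the layout is known, while getElementsByTagName is more forgiving when the exact nesting is uncertain.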