
6.3. Parsing HTML Data


6.3.1. Picking information from an HTML page

  • A difficult problem
  • HTML defines page layout, not content; XML has the advantage here
  • Very useful because of the volume of data available
  • If the format of the page changes, your program breaks.

6.3.2. HTML

Token

one piece of information in an HTML formatted page

  • A token is the data associated with a pair of HTML tags.

    <tagName> token </tagName>

  • Example token types:

    • URL or image reference
    • Textual information
  • HTML tags usually relate only to formatting

  • Must look at several tokens to determine context of the data

  • The start-tag/end-tag structure ( <TABLE> ... </TABLE> ) leads parsing code to use finite state machines and stacks.
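
To illustrate why a stack fits the start-tag/end-tag structure, here is a minimal sketch. It assumes Python 3, where the standard parser lives in the html.parser module; the class name NestingChecker and its attributes are made up for this example. Each start-tag is pushed onto the stack and each end-tag must match the tag on top.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Track open tags on a stack and record mismatched end-tags."""

    # Void elements never have an end-tag, so they are not pushed.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, innermost last
        self.errors = []     # (expected, found) pairs for bad end-tags
        self.max_depth = 0   # deepest nesting seen so far

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)
            self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            expected = self.stack[-1] if self.stack else None
            self.errors.append((expected, tag))


checker = NestingChecker()
checker.feed("<TABLE><TR><TD><H1>Tim Bower</H1></TD></TR></TABLE>")
checker.close()
print(checker.stack, checker.errors, checker.max_depth)
```

The parser reports tag names in lower case, so <TABLE> and </table> match. A well-nested page leaves the stack empty at the end; anything left on the stack, or any entry in errors, points to an unbalanced tag.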

6.3.3. Example Tokens

For the following examples, consider these simple HTML lines:

<HTML>
<HEAD>
<TITLE> Tim Bower </TITLE>
</HEAD>
<BODY BGCOLOR="lightyellow">
<TABLE> <TR>
<TD>
<H1>Tim Bower</H1>

Here are some example tokens in the form of Python dictionary objects:

{'data': [], 'type': 'StartTag', 'name': u'html'}
{'data': [], 'type': 'StartTag', 'name': u'head'}
{'data': u'\n    ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'title'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'title'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'head'}
{'data': u'\n\n', 'type': 'SpaceCharacters'}
{'data': [(u'bgcolor', u'lightyellow')], 'type': 'StartTag', 'name': u'body'}
{'data': u' \n\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'table'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'tbody'}
{'data': [], 'type': 'StartTag', 'name': u'tr'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'td'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'h1'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': [], 'type': 'EndTag', 'name': u'h1'}

6.3.4. Three main programming strategies

  1. The call-back approach (HTMLParser shown in The Text Book)

    • Define your own class that extends the HTMLParser class
    • Nice use of inheritance and polymorphism
    • Pass the HTML page to the parser and it calls functions from your class as needed to process the start-tags, data elements, end-tags and a few other miscellaneous tags.
  2. Iterate through tokens

    • Parser builds a tree (data structure object) based on the page contents
    • Iterate through the tree, or a list of tokens taken from the tree.
  3. The document tree approach

    • As above, the parser builds a tree.
    • Use tree searching methods to find desired content. You will likely want to use a web parsing module for this, such as lxml or BeautifulSoup. The second edition of The Text Book suggests lxml. However, I recommend BeautifulSoup because it has better cross-platform support and is simpler to install. Note that the 3.1 version of BeautifulSoup did not work out, so make certain that you are using version 3.2 or later. The failed 3.1 version may also have prompted some to recommend lxml over BeautifulSoup.

6.3.5. HTMLParser

Here is a simple example of programming with HTMLParser.

import sys
from HTMLParser import HTMLParser  # Python 2 module name

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.title = ''
        self.readingtitle = False   # True while inside <title> ... </title>

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.readingtitle = True

    def handle_data(self, data):
        # Collect text only while inside the title element
        if self.readingtitle:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            print("*** %s ***" % self.title)
            self.readingtitle = False

if __name__ == '__main__':
    # Usage: pass the name of an HTML file on the command line
    fd = open(sys.argv[1])
    tp = TitleParser()
    tp.feed(fd.read())
    fd.close()

  • Argh! HTMLParser is fragile and hard to debug.

Traceback (most recent call last):
  File "C:\Users\tim\Documents\Classes\Net_Programming\Source_code\Topic 3 - Web\weatherParser.py", line 258, in <module>
    parser.feed(data)
  File "C:\Python25\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python25\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python25\lib\HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "C:\Python25\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "C:\Python25\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 120, column 477

  • HTMLParser does not seem robust enough to handle malformed or complex web pages.

6.3.6. html5lib

  • Found on the Python Package Index [PYPI]

  • Project Web Page [HT5LIB]

  • Install distribute, then use easy_install (or pip, if it is installed) to install html5lib (see Installation of Python Packages)

  • Advantages:

    • Robust, standards-based parser
    • Filtering data after the page is parsed is easier to follow and debug than the call-back approach
  • Disadvantages:

    • Documentation of the API for traversing the tree is lacking or hard to follow.
    • Traversing the tree, rather than iterating through the tokens, is probably the most flexible approach, but it is harder for a beginner to learn.

6.3.6.1. html5lib Usage

  1. Build the tree:

import html5lib
from html5lib import treebuilders, treewalkers

p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
f = open("weather.html", "r")
dom_tree = p.parse(f.read())
f.close()

  2. Loop through tokens:

walker = treewalkers.getTreeWalker("dom")
stream = walker(dom_tree)

# Tags that are not interesting for this page
passtags = [ u'a', u'h1', u'h2', u'h3', u'h4', u'em',
             u'strong', u'br', u'img', u'dl', u'dt', u'dd' ]

for token in stream:
    # Skip tokens that belong to the uninteresting tags
    if 'name' in token:
        if token['name'] in passtags:
            continue
    # Print every remaining token, including character data
    print(token)

6.3.6.2. html5lib Tokens

  • The stream of tokens can be iterated over like a list

  • Each token is a dictionary

    • token['data']

      • String (unicode encoding)
      • Empty list
      • List of tuples for formatting attributes
    • token['type'] (StartTag, EndTag, Characters, SpaceCharacters)

    • token['name']: name of the start or end tag (table, tr, td, h1, br, ul, li)

  • See Example Tokens
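
As a small illustration of the token layout described above, the sketch below works on a hand-written list of tokens copied from the Example Tokens section, so it runs without html5lib. The helper names start_tag_attributes and text_of are made up for this example; they only assume that attributes arrive as a list of (name, value) tuples, as shown in this section.

```python
# Tokens copied from the Example Tokens section (u'' prefixes dropped).
tokens = [
    {'data': [], 'type': 'StartTag', 'name': 'title'},
    {'data': ' ', 'type': 'SpaceCharacters'},
    {'data': 'Tim Bower', 'type': 'Characters'},
    {'data': ' ', 'type': 'SpaceCharacters'},
    {'data': [], 'type': 'EndTag', 'name': 'title'},
    {'data': [('bgcolor', 'lightyellow')], 'type': 'StartTag', 'name': 'body'},
]


def start_tag_attributes(token):
    """Return a StartTag token's attributes as a dict, or {} otherwise."""
    if token['type'] != 'StartTag':
        return {}
    # token['data'] is a (possibly empty) list of (name, value) tuples
    return dict(token['data'])


def text_of(tokens):
    """Join the data of all Characters tokens, skipping whitespace tokens."""
    return ''.join(t['data'] for t in tokens if t['type'] == 'Characters')


body_attrs = start_tag_attributes(tokens[-1])
page_text = text_of(tokens)
print(body_attrs)
print(page_text)
```

Checking token['type'] before touching token['data'] matters because the same key holds a string for character tokens and a list for tag tokens.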

6.3.6.3. html5lib Token Parsing

doingTitle = False   # True while between the title start and end tags
for token in stream:
    if 'name' in token:
        if token['name'] in passtags:
            continue
        else:
            tName = token['name']

    tType = token['type']
    if tType == 'StartTag':
        if tName == u'title':
            title = ''
            doingTitle = True
    elif tType == 'EndTag':
        if tName == u'title':
            print("*** %s ***\n" % title)
            doingTitle = False
    elif tType == 'Characters':
        # Accumulate text only while inside the title
        if doingTitle:
            title += token['data']

6.3.6.4. The DOM Tree Alternative

  • The DOM tree may be used directly.
  • Not documented with html5lib, but xml.dom is a standard Python module.
  • DOM trees are normally used with XML, but html5lib can make a DOM tree from HTML data.
  • Walk through the tree by examining the child nodes of each node. With knowledge of the page structure, you may be able to go almost directly to the desired information.
  • See chapter 8 and DOMtry.py
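
The walk described above can be sketched with the standard xml.dom.minidom module. This is only an illustration under stated assumptions: it parses a small, well-formed page from a string (the PAGE constant) instead of taking the DOM tree from html5lib, and the helpers node_text and child_elements are made up for this example. It is written for Python 3.

```python
from xml.dom import minidom

# A small, well-formed stand-in for a real page
PAGE = ("<html><head><title>Tim Bower</title></head>"
        "<body><table><tr><td><h1>Tim Bower</h1></td></tr></table>"
        "</body></html>")


def node_text(node):
    """Concatenate the text of all text nodes at or below the given node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return ''.join(node_text(child) for child in node.childNodes)


def child_elements(node, name):
    """Return the direct child elements of node with the given tag name."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == name]


dom_tree = minidom.parseString(PAGE)
html = dom_tree.documentElement

# Knowing the page structure, step straight down: html -> head -> title
head = child_elements(html, 'head')[0]
title = child_elements(head, 'title')[0]
title_text = node_text(title)

# getElementsByTagName searches the whole subtree instead
heading_text = node_text(dom_tree.getElementsByTagName('h1')[0])
print(title_text)
print(heading_text)
```

Stepping through childNodes is quick when the layout is known, while getElementsByTagName is more forgiving if the page gains extra wrapper elements.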

6.3.7. BeautifulSoup

For a more complete tutorial on BeautifulSoup, please refer to the project web page: http://www.crummy.com/software/BeautifulSoup/

The basic concept behind BeautifulSoup is that each tag in the HTML is a node of a tree. Many nodes, such as a table node, also have child nodes. BeautifulSoup provides methods to search the whole tree, or the part of the tree that begins at a particular node. When a desired node is found, its data may be retrieved or printed. In the following example, I print the title of my web page. Notice how the find method is used to locate the desired node in the tree.

# Parse some of my web page

import urllib2
import os.path
from BeautifulSoup import BeautifulSoup

def gethtml(url):
    "Return the html from either a file or the web"
    # For testing purposes, read the html from a local file
    # if it is already there
    filename = "testpage.html"
    if os.path.exists(filename):
        fileobject = open(filename, 'r')
        data = fileobject.readlines()
    else:
        req = urllib2.Request(url)
        fd = urllib2.urlopen(req)

        #read in the website and save it to a file
        data = fd.readlines()
        fd.close()
        fileobject = open(filename, 'w')
        for line in data:
            fileobject.write(line)

    fileobject.close()
    return data


def check_title(html):
    # The lines keep their newlines, so join them without a separator
    soup = BeautifulSoup(''.join(html))
    node = soup.find('title')
    print(node.text)

check_title(gethtml("http://www.sal.ksu.edu/faculty/tim/"))