A token is the data associated with a pair of HTML tags.
<tagName> token </tagName>
Example token types:
- URL or image reference
- Textual information
HTML tags usually relate only to formatting.
Several tokens must be examined to determine the context of the data.
The start-tag/end-tag structure leads parsing code to use finite state machines and stacks (for example, <TABLE> ... </TABLE>).
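To illustrate why paired tags lead naturally to a stack, here is a minimal Python 3 sketch that checks whether every start tag is closed by a matching end tag. The names TAG_RE and tags_balanced are made up for this illustration, and the check deliberately ignores things real HTML allows, such as tags with no end tag (<BR>) and optional end tags.

```python
import re

# Matches <tag ...> and </tag>; group 1 is "/" for an end tag, group 2 is the name.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def tags_balanced(html):
    """Return True if every start tag is closed by a matching end tag."""
    stack = []
    for match in TAG_RE.finditer(html):
        is_end = match.group(1) == "/"
        name = match.group(2).lower()
        if not is_end:
            stack.append(name)      # entering an element: remember it
        elif not stack or stack.pop() != name:
            return False            # end tag does not match the innermost start tag
    return not stack                # anything left on the stack was never closed

print(tags_balanced("<TABLE><TR><TD>x</TD></TR></TABLE>"))
print(tags_balanced("<TABLE><TR></TABLE>"))
```

The stack records which elements are currently open, which is the same context a parser needs in order to know that a piece of text sits inside a table cell or a title.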
Here are some example tokens in the form of Python dictionary objects:
{'data': [], 'type': 'StartTag', 'name': u'html'}
{'data': [], 'type': 'StartTag', 'name': u'head'}
{'data': u'\n ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'title'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'title'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'head'}
{'data': u'\n\n', 'type': 'SpaceCharacters'}
{'data': [(u'bgcolor', u'lightyellow')], 'type': 'StartTag', 'name': u'body'}
{'data': u' \n\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'table'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'tbody'}
{'data': [], 'type': 'StartTag', 'name': u'tr'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'td'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'h1'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': [], 'type': 'EndTag', 'name': u'h1'}
All of the above tokens came from just the following few lines of HTML:
<HTML>
<HEAD>
<TITLE> Tim Bower </TITLE>
</HEAD>
<BODY BGCOLOR="lightyellow">
<TABLE> <TR>
<TD>
<H1>Tim Bower</H1>
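The step from markup to token dictionaries can be approximated with nothing but the standard library. The sketch below uses Python 3's html.parser as a stand-in and records each event as a dictionary shaped like the tokens above; TokenCollector is a hypothetical name. It is only an approximation: the token list above also contains an implied tbody element and separate SpaceCharacters tokens around "Tim Bower", neither of which this sketch reproduces.

```python
from html.parser import HTMLParser  # Python 3 standard library

class TokenCollector(HTMLParser):
    """Record each parser event as a token dictionary."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, possibly empty
        self.tokens.append({'type': 'StartTag', 'name': tag, 'data': attrs})

    def handle_endtag(self, tag):
        self.tokens.append({'type': 'EndTag', 'name': tag, 'data': []})

    def handle_data(self, data):
        kind = 'SpaceCharacters' if data.isspace() else 'Characters'
        self.tokens.append({'type': kind, 'data': data})

collector = TokenCollector()
collector.feed('<BODY BGCOLOR="lightyellow">\n<H1>Tim Bower</H1></BODY>')
collector.close()
for token in collector.tokens:
    print(token)
```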
The call-back approach (HTMLParser shown in The Text Book)
- Define your own class that extends the HTMLParser class
- Nice use of inheritance and polymorphism
- Pass the HTML page to the parser and it calls functions from your class as needed to process the start-tags, data elements, end-tags and a few other miscellaneous tags.
The document tree approach
- Parser builds a tree (data structure object) based on the page contents
- Iterate through the tree, or a list of tokens taken from the tree.
Here is a simple example of programming with HTMLParser.
import sys
from HTMLParser import HTMLParser   # Python 2: module and class share a name

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.title = ''
        self.readingtitle = 0       # true while inside <title> ... </title>

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.readingtitle = 1

    def handle_data(self, data):
        # Collect text only while inside the title element
        if self.readingtitle:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            print "*** %s ***" % self.title
            self.readingtitle = 0

if __name__ == '__main__':
    # The HTML file to parse is named on the command line
    fd = open(sys.argv[1])
    tp = TitleParser()
    tp.feed(fd.read())
    fd.close()
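The call-back style also covers the other token type mentioned earlier, URL and image references, because the start-tag handler receives the tag's attributes. The following is a hedged sketch for Python 3, where the class lives in html.parser rather than the Python 2 HTMLParser module; LinkParser and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser  # Python 3 location of the HTMLParser class

class LinkParser(HTMLParser):
    """Collect URL and image references from start-tag attributes."""

    def __init__(self):
        super().__init__()
        self.links = []    # href values of <a> tags
        self.images = []   # src values of <img> tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) tuples
        if tag == 'a' and 'href' in attrs:
            self.links.append(attrs['href'])
        elif tag == 'img' and 'src' in attrs:
            self.images.append(attrs['src'])

lp = LinkParser()
lp.feed('<A HREF="http://example.com/">home</A> <IMG SRC="me.png">')
lp.close()
print(lp.links, lp.images)
```

The parser lowercases tag and attribute names before calling the handler, so the comparisons can use lowercase strings even when the page is written in uppercase.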
HTMLParser is not forgiving of malformed HTML. On a real-world page the parser can stop with an exception such as this one:
Traceback (most recent call last):
  File "C:\Users\tim\Documents\Classes\Net_Programming\Source_code\Topic 3 - Web\weatherParser.py", line 258, in <module>
    parser.feed(data)
  File "C:\Python25\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python25\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python25\lib\HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "C:\Python25\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "C:\Python25\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 120, column 477
html5lib is available from the Python Package Index (PyPI).
Install setuptools, then use easy_install to install html5lib (see Installation of Python Packages with Setuptools).
Advantages:
- Robust, standards-based parser
- Filtering data after the page is parsed is easier to follow and debug than the call-back approach
Disadvantages:
- Documentation of the API for traversing the tree is lacking or hard to follow.
- Traversing the tree, rather than iterating through the tokens, is probably the most flexible approach, but it is more complex for a beginner to become familiar with.
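To give a feel for what traversing the tree involves, here is a small Python 3 sketch that walks a DOM tree recursively and pulls out the headings. It assumes the tree is an xml.dom.minidom document, which is what html5lib's "dom" tree builder is understood to produce by default. To keep the example free of third-party packages, minidom.parseString on a well-formed snippet stands in for the html5lib parse, and the helper names text_of and headings are hypothetical.

```python
from xml.dom import minidom

def text_of(node):
    """Recursively collect the text beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return ''.join(text_of(child) for child in node.childNodes)

def headings(node, names=('h1', 'h2', 'h3')):
    """Depth-first walk yielding (tag name, text) for each heading element."""
    if node.nodeType == node.ELEMENT_NODE and node.tagName in names:
        yield node.tagName, text_of(node).strip()
    for child in node.childNodes:
        for found in headings(child, names):
            yield found

# Stand-in for the tree a parser would build from a real page
doc = minidom.parseString(
    '<html><body><table><tr><td><h1>Tim Bower</h1></td></tr></table></body></html>')
print(list(headings(doc)))
```

Recursion keeps track of the position in the document automatically, which is the flexibility referred to above. The cost is having to learn the node types and attributes of the tree API.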
import html5lib
from html5lib import treebuilders, treewalkers

# Parse the page into a DOM tree
p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
f = open("weather.html", "r")
dom_tree = p.parse(f.read())
f.close()

# Walk the tree as a stream of token dictionaries
walker = treewalkers.getTreeWalker("dom")
stream = walker(dom_tree)

# Tags whose tokens are skipped
passtags = [u'a', u'h1', u'h2', u'h3', u'h4', u'em',
            u'strong', u'br', u'img', u'dl', u'dt', u'dd']

for token in stream:
    # Don't show uninteresting stuff
    if 'name' in token:
        if token['name'] in passtags:
            continue
    print token
The stream of tokens can be iterated over like a list.
Each token is a dictionary.
token['data'] is one of:
- Unicode string (Characters and SpaceCharacters tokens)
- Empty list (a tag with no attributes)
- List of (name, value) tuples for formatting attributes
token['type'] is one of StartTag, EndTag, Characters, SpaceCharacters.
token['name'] is the tag name of a start or end tag (table, tr, td, h1, br, ul, li, ...).
See Example Tokens
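As a quick illustration of this layout, the Python 3 sketch below summarizes a few token dictionaries copied from the example tokens. The helper name describe is made up for the example, and the tokens are written by hand rather than produced by a parser.

```python
# A few tokens in the layout described above
tokens = [
    {'data': [], 'type': 'StartTag', 'name': 'title'},
    {'data': 'Tim Bower', 'type': 'Characters'},
    {'data': [], 'type': 'EndTag', 'name': 'title'},
    {'data': [('bgcolor', 'lightyellow')], 'type': 'StartTag', 'name': 'body'},
]

def describe(token):
    """Return a short one-line summary of a token dictionary."""
    kind = token['type']
    if kind == 'StartTag':
        attrs = dict(token['data'])   # [] or a list of (name, value) tuples
        return 'start %s %r' % (token['name'], attrs)
    if kind == 'EndTag':
        return 'end %s' % token['name']
    # Characters and SpaceCharacters hold a string and have no 'name' key
    return '%s %r' % (kind, token['data'])

for token in tokens:
    print(describe(token))
```

Checking token['type'] before touching token['name'] matters because text tokens carry no 'name' key at all.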
# Uses the stream and passtags defined in the previous example
doingTitle = False          # True while between the title start and end tags
for token in stream:
    if 'name' in token:
        if token['name'] in passtags:
            continue
        else:
            tName = token['name']
    tType = token['type']
    if tType == 'StartTag':
        if tName == u'title':
            title = ''
            doingTitle = True
    elif tType == 'EndTag':
        if tName == u'title':
            print "*** %s ***\n" % title
            doingTitle = False
    elif tType == 'Characters':
        # Text tokens have no name, so context comes from doingTitle
        if doingTitle:
            title += token['data']