Token: one piece of information in an HTML formatted page.
A token is the data associated with a pair of HTML tags:
<tagName> token </tagName>
Example token types:
- URL or image reference
- Textual information
HTML tags usually relate only to formatting.
You must look at several tokens to determine the context of the data.
The start-tag, end-tag structure (<TABLE> ... </TABLE>) leads parsing code to use finite state machines and stacks.
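As a rough illustration of why stacks show up in parsing code, the sketch below pushes each start-tag onto a stack and pops it at the matching end-tag, so the stack always describes where in the page the current text sits. It assumes Python 3, where the standard parser lives in html.parser. The class name NestingTracker and its attributes are invented for this example.

```python
from html.parser import HTMLParser


class NestingTracker(HTMLParser):
    """Track open tags on a stack and record the tag path to each text chunk."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.found = []   # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start-tag; this also discards any
        # inner tags that were never closed.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append(("/".join(self.stack), text))


tracker = NestingTracker()
tracker.feed("<TABLE><TR><TD><H1>Tim Bower</H1></TD></TR></TABLE>")
print(tracker.found)
```

The parser lower-cases tag names, so this prints [('table/tr/td/h1', 'Tim Bower')]. Tags that never get an end-tag, such as <BR>, stay on the stack until an enclosing tag closes, which is one reason real parsers need more state than this sketch keeps.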
For the following examples, consider these simple HTML lines:
<HTML>
<HEAD>
<TITLE> Tim Bower </TITLE>
</HEAD>
<BODY BGCOLOR="lightyellow">
<TABLE> <TR>
<TD>
<H1>Tim Bower</H1>
Here are some example tokens in the form of Python dictionary objects:
{'data': [], 'type': 'StartTag', 'name': u'html'}
{'data': [], 'type': 'StartTag', 'name': u'head'}
{'data': u'\n ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'title'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'title'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'head'}
{'data': u'\n\n', 'type': 'SpaceCharacters'}
{'data': [(u'bgcolor', u'lightyellow')], 'type': 'StartTag', 'name': u'body'}
{'data': u' \n\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'table'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'tbody'}
{'data': [], 'type': 'StartTag', 'name': u'tr'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'td'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'h1'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': [], 'type': 'EndTag', 'name': u'h1'}
The call-back approach (HTMLParser shown in The Text Book)
- Define your own class that extends the HTMLParser class
- Nice use of inheritance and polymorphism
- Pass the HTML page to the parser and it calls functions from your class as needed to process the start-tags, data elements, end-tags and a few other miscellaneous tags.
Iterate through tokens
- Parser builds a tree (data structure object) based on the page contents
- Iterate through the tree, or a list of tokens taken from the tree.
The document tree approach
- As above, the parser builds a tree.
- Use tree searching methods to find desired content. You will likely want to use a web parsing module for this, such as lxml or BeautifulSoup. The second edition of The Text Book suggests lxml. However, I recommend BeautifulSoup because it has better cross-platform support and is simpler to install. Note that version 3.1 of BeautifulSoup did not work out, so make certain that you are using version 3.2 or later. The failed 3.1 version may also have prompted some to recommend lxml over BeautifulSoup.
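To give a feel for tree searching without installing anything, here is a minimal sketch that uses xml.etree.ElementTree from the Python 3 standard library instead of lxml or BeautifulSoup. ElementTree is an XML parser, not an HTML parser, so this only works because the sample page has been rewritten with every tag closed. That closed-up page is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the sample page. The closing tags were added
# for this sketch so that an XML parser accepts it.
PAGE = """<html>
<head><title> Tim Bower </title></head>
<body bgcolor="lightyellow">
<table><tr><td><h1>Tim Bower</h1></td></tr></table>
</body>
</html>"""

root = ET.fromstring(PAGE)

# find() returns the first node matching a path; ".//" searches all descendants.
title = root.find("head/title").text.strip()
heading = root.find(".//h1").text
bgcolor = root.find("body").get("bgcolor")

print(title, heading, bgcolor)
```

Real web pages are rarely this tidy, which is why a forgiving HTML parser is the usual choice for building the tree.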
Here is a simple example of programming with HTMLParser.
import sys
from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.title = ''
        self.readingtitle = False

    def handle_starttag(self, tag, attrs):
        # Start collecting text when the <title> tag opens
        if tag == 'title':
            self.readingtitle = True

    def handle_data(self, data):
        if self.readingtitle:
            self.title += data

    def handle_endtag(self, tag):
        # Print the collected text when the </title> tag closes
        if tag == 'title':
            print "*** %s ***" % self.title
            self.readingtitle = False

if __name__ == '__main__':
    fd = open(sys.argv[1])
    tp = TitleParser()
    tp.feed(fd.read())
    fd.close()
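For readers on Python 3, the same call-back idea might look like the sketch below. The module is named html.parser there, and print is a function. Feeding a literal string instead of a file named on the command line is a simplification made for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.reading_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.reading_title = True

    def handle_data(self, data):
        # Called for every run of text; keep only the text inside <title>.
        if self.reading_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self.reading_title = False


parser = TitleParser()
parser.feed("<HTML><HEAD><TITLE> Tim Bower </TITLE></HEAD></HTML>")
print("*** %s ***" % parser.title.strip())
```

Here the title is stored on the parser object and printed by the caller, which makes the class easier to reuse than printing from inside handle_endtag.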
Traceback (most recent call last):
File "C:\Users\tim\Documents\Classes\Net_Programming\Source_code\
Topic 3 - Web\weatherParser.py", line 258, in <module>
parser.feed(data)
File "C:\Python25\lib\HTMLParser.py", line 108, in feed
self.goahead(0)
File "C:\Python25\lib\HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
File "C:\Python25\lib\HTMLParser.py", line 226, in parse_starttag
endpos = self.check_for_whole_start_tag(i)
File "C:\Python25\lib\HTMLParser.py", line 301,
in check_for_whole_start_tag
self.error("malformed start tag")
File "C:\Python25\lib\HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 120, column 477
html5lib is found on the Python Package Index (PyPI).
Install setuptools, then use easy_install to install html5lib (see Installation of Python Packages with Setuptools).
Advantages:
- Robust, standards based parser
- Filtering data after the page is parsed is easier to follow and debug than the call-back approach
Disadvantages:
- Documentation of the API for traversing the tree is lacking or hard to follow.
- Traversing the tree, rather than iterating through the tokens, is probably the most flexible approach, but it is more complex for a beginner to become familiar with.
import html5lib
from html5lib import treebuilders, treewalkers

# Parse the page into a DOM tree
p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
f = open("weather.html", "r")
dom_tree = p.parse(f.read())
f.close()

# Walk the tree as a stream of tokens
walker = treewalkers.getTreeWalker("dom")
stream = walker(dom_tree)

passtags = [u'a', u'h1', u'h2', u'h3', u'h4', u'em',
            u'strong', u'br', u'img', u'dl', u'dt', u'dd']

for token in stream:
    # Skip tokens for tags that are not of interest
    if 'name' in token:
        if token['name'] in passtags:
            continue
    print token
Stream of tokens is a list
Each token is a dictionary
token['data']
- String (unicode encoding)
- Empty list
- List of tuples for formatting attributes
token['type'] (StartTag, EndTag, Characters, SpaceCharacters)
token['name'] description of start and end tags (table, tr, td, h1, br, ul, li)
See Example Tokens
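The three shapes of token['data'] can be handled as in the following sketch. It works on hand-copied token dictionaries in the layout shown in the example tokens, so it runs on Python 3 without html5lib installed. The helper name describe is hypothetical, and other html5lib versions may lay out their tokens differently.

```python
# Three tokens copied from the example token list (u'' prefixes dropped).
tokens = [
    {'data': [('bgcolor', 'lightyellow')], 'type': 'StartTag', 'name': 'body'},
    {'data': 'Tim Bower', 'type': 'Characters'},
    {'data': [], 'type': 'EndTag', 'name': 'title'},
]


def describe(token):
    """Return a (type, name-or-text, attributes) summary of one token."""
    kind = token['type']
    if kind == 'StartTag':
        # The (possibly empty) list of (attribute, value) tuples becomes a dict.
        return (kind, token['name'], dict(token['data']))
    if kind == 'EndTag':
        return (kind, token['name'], {})
    # Characters and SpaceCharacters carry text in 'data' and have no 'name'.
    return (kind, token['data'], {})


for token in tokens:
    print(describe(token))
```

Checking token['type'] before touching token['name'] avoids a KeyError on text tokens, which have no 'name' entry.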
doingTitle = False
for token in stream:
    if 'name' in token:
        if token['name'] in passtags:
            continue
        else:
            tName = token['name']
    tType = token['type']
    if tType == 'StartTag':
        if tName == u'title':
            title = ''
            doingTitle = True
    elif tType == 'EndTag':
        if tName == u'title':
            print "*** %s ***\n" % title
            doingTitle = False
    elif tType == 'Characters':
        if doingTitle:
            title += token['data']
For a more complete tutorial on BeautifulSoup, please refer to the project web page: http://www.crummy.com/software/BeautifulSoup/
The basic concept behind BeautifulSoup is that each tag in the HTML is a node of a tree. Many nodes, such as a table node, will also have child nodes. BeautifulSoup contains methods to search the tree, or the part of the tree that begins at a particular node. When a desired node is found, the data of the node may be retrieved or printed. In the following example, I print the title of my web page. Notice how the find method is used to locate a desired node in the tree.
# Parse some of my web page
import urllib2
import os.path
from BeautifulSoup import BeautifulSoup

def gethtml(url):
    "Return the html from either a file or the web"
    # For testing purposes, just read the html from a file
    # if a saved copy is already there
    filename = "testpage.html"
    if os.path.exists(filename):
        fileobject = open(filename, 'r')
        data = fileobject.readlines()
        fileobject.close()
    else:
        req = urllib2.Request(url)
        fd = urllib2.urlopen(req)
        # read in the website and save it to a file
        data = fd.readlines()
        fd.close()
        fileobject = open(filename, 'w')
        for line in data:
            fileobject.write(line)
        fileobject.close()
    return data

def check_title(html):
    # Build the tree, then search it for the first title node
    soup = BeautifulSoup(' '.join(html))
    node = soup.find('title')
    print(node.text)

check_title(gethtml("http://www.sal.ksu.edu/faculty/tim/"))
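The read-from-file-or-download pattern in gethtml carries over to Python 3 roughly as sketched below, where urllib2 has become urllib.request. The function name get_html, the filename parameter, and the choice to decode the page as UTF-8 are assumptions made for this example, not part of the code above.

```python
import os.path
import urllib.request


def get_html(url, filename="testpage.html"):
    """Return the page text, preferring a saved copy over the network."""
    if os.path.exists(filename):
        # Saved copy found: no network access needed.
        with open(filename, "r", encoding="utf-8") as saved:
            return saved.read()
    # No saved copy yet: download once and keep it for later runs.
    # Assumes the page is UTF-8; undecodable bytes are replaced.
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8", errors="replace")
    with open(filename, "w", encoding="utf-8") as saved:
        saved.write(text)
    return text
```

Calling get_html("http://www.sal.ksu.edu/faculty/tim/") once saves the page, and later calls then read the saved copy, which keeps repeated test runs from hitting the web server.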