6.2. Retrieving Web Pages with HTTP

The Python standard library includes two modules, urllib and urllib2, that facilitate retrieving web pages.

urllib.urlencode([('a', 'z'), ('b', 'y')])

Encode data to be sent to a web server as part of an HTTP request. See examples below.

Parameters: data – A list of two-element tuples. Each element is normally a string; other values, such as the integer used in the examples below, are converted with str().
Return type: string

urllib2.Request(url)

Returns an object suitable for use with urlopen().

Parameters: url – A correctly formed URL, either for a simple web page or one with encoded data appended.
Return type: urllib2.Request object

urllib2.urlopen(req[, encoding])

Connect to a web server using HTTP to retrieve data.

Parameters:
  • req – urllib2.Request object
  • encoding – optional string of correctly encoded data; when supplied, it is sent to the server as a POST request
Return type: file-like object

6.2.1. Basic GET

Since we are only requesting a static page and do not send data to the server, we just use urllib2 to make a connection. The fd variable here is a file-like object backed by a socket, which we read with readlines() and then close(). After the readlines() call, the variable data contains a list of strings, one for each line of the web page.

import urllib2

page = "http://www.sal.ksu.edu/faculty/tim/"
req = urllib2.Request(page)
fd = urllib2.urlopen(req)
data = fd.readlines()
fd.close()
with open("index.html", "w") as out:
    for line in data:
        out.write(line)
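
The listing above assumes that the request succeeds. As a minimal sketch of the same GET with basic error handling, the hypothetical helper fetch_lines() below (not part of the standard library) returns the lines of a page, or None when the URL is malformed or the server cannot be reached. The try/except import is an assumption added so that the sketch also runs on Python 3, where urllib2 was split into urllib.request and urllib.error; the 10 second timeout is an arbitrary choice.

```python
import sys

try:
    from urllib2 import Request, urlopen, URLError   # Python 2
except ImportError:
    from urllib.request import Request, urlopen      # Python 3
    from urllib.error import URLError


def fetch_lines(url, timeout=10):
    """Hypothetical helper: return the page as a list of lines, or None."""
    try:
        req = Request(url)
        fd = urlopen(req, timeout=timeout)
    except (URLError, ValueError) as err:
        # URLError covers connection failures and HTTP error statuses;
        # ValueError is raised for a string that is not a usable URL.
        sys.stderr.write("Could not retrieve %s: %s\n" % (url, err))
        return None
    try:
        return fd.readlines()
    finally:
        # Close the connection even if reading fails part way through.
        fd.close()
```

A caller would then test the result before using it, for example by checking `if lines is not None:` before writing the lines to a file.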

6.2.2. Submitting with GET

Data may be manually encoded into the URL string, or generated with the urllib.urlencode() function. Once we have a string holding the correct URL, we can use urllib2.Request() and urllib2.urlopen() to retrieve the page as above.

>>> import urllib
>>> encoding = urllib.urlencode([
...         ('activity', 'water ski'),
...         ('lake', 'Milford'),
...         ('code', 52)
...         ])
>>> print encoding
activity=water+ski&lake=Milford&code=52

>>> url = "http://www.example.com" + '?' + encoding
>>> print url
http://www.example.com?activity=water+ski&lake=Milford&code=52
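
The concatenation shown above can be wrapped in a small function. The sketch below is one possible way to do it: build_url() is a hypothetical helper, not a library function, and the 'q' parameter is a made-up value chosen to show that urlencode() escapes characters such as & and = that would otherwise be mistaken for separators. The try/except import is an assumption added so that the sketch also runs on Python 3, where these functions live in urllib.parse.

```python
try:
    from urllib import urlencode                     # Python 2
    from urlparse import parse_qsl
except ImportError:
    from urllib.parse import urlencode, parse_qsl    # Python 3


def build_url(base, params):
    """Hypothetical helper: append params to base as a query string."""
    # Use '&' if the base URL already carries a query string.
    separator = '&' if '?' in base else '?'
    return base + separator + urlencode(params)


url = build_url("http://www.example.com", [
    ('activity', 'water ski'),
    ('lake', 'Milford'),
    ('code', 52),
])

# Characters with a special meaning in a query string are percent-encoded.
tricky = urlencode([('q', 'a&b=c')])

# parse_qsl() reverses the encoding, which is a handy sanity check.
decoded = parse_qsl(tricky)

print(url)
print(tricky)
print(decoded)
```

Here url holds the same string as in the interactive session above, and decoding tricky gives back the original ('q', 'a&b=c') pair.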

6.2.3. Submitting with POST

As before, we use the urllib.urlencode() function to encode data, which will be sent with a POST request. This time, rather than tacking the data onto the URL, we pass it as a second argument to the urllib2.urlopen() function.

import sys
import urllib
import urllib2

encoding = urllib.urlencode([
                    ('activity', 'water ski'),
                    ('lake', 'Milford'),
                    ('code', 52)
                    ])
url = "http://www.example.com"
req = urllib2.Request(url)
fd = urllib2.urlopen(req, encoding)
# Read the response in 1024-byte chunks until no data remains.
while True:
    data = fd.read(1024)
    if not data:
        break
    sys.stdout.write(data)
fd.close()
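
The encoded data can also be attached to the Request object itself rather than passed to urlopen(). The sketch below makes no network connection; it only builds two Request objects and asks each one which HTTP method it would use, to show that supplying data is what turns a GET into a POST. The try/except import and the .encode("ascii") call are assumptions added so that the sketch also runs on Python 3, where the request body must be bytes.

```python
try:
    from urllib import urlencode             # Python 2
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode       # Python 3
    from urllib.request import Request

url = "http://www.example.com"
body = urlencode([
    ('activity', 'water ski'),
    ('lake', 'Milford'),
    ('code', 52),
]).encode("ascii")

# Without data the request is a GET; with data it becomes a POST.
get_req = Request(url)
post_req = Request(url, body)

print(get_req.get_method())
print(post_req.get_method())
```

Either Request object could then be handed to urlopen() in the same way as in the listings above.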

6.2.4. Request Alternative

The modules from the standard library work well for relatively simple HTTP requests; however, some requests are not so simple. In these cases, an alternative module might simplify the programming task. Of course, modules that are not in the standard library need to be installed. See Installation of Python Packages.

A recommended alternative module for generating more complex requests to download web pages is the Requests module. See the Quick Start introduction to get a feel for how to use it.
