6.1. Retrieving Web Pages with HTTP

The HTTP protocol defines a specific format for the message a client sends to request information from a web server. A simple static page is retrieved with a GET request. Dynamic page requests that send only a small amount of data also use a GET request and embed the data in the URL. A zip code or a part number is an example of the type of data that might be embedded in a GET request. When a larger amount of data is sent to the server, such as when a form is filled out or a file is uploaded, a POST request is sent instead. Python includes two modules, urllib and urllib2, to facilitate retrieving web pages.

urllib.urlencode([('a', 'z'), ('b', 'y')])

Encode data to be sent to a web server as part of an HTTP request. See examples below.

Parameter: data – A list of two-element tuples. Each element of a tuple is normally a string; other values, such as numbers, are converted with str().
Return type: string
urllib2.Request(url)

Returns an object suitable for use with urlopen().

Parameter: url – A correctly formed URL for either a simple web page or with encoded data.
Return type: urllib2.Request object
urllib2.urlopen(req[, encoding])

Connect to a web server using HTTP and retrieve data.

Parameters:
  • req – urllib2.Request object
  • encoding – optional string of encoded data, as returned by urllib.urlencode(); when it is given, the request is sent as a POST
Return type: file-like object

6.1.1. HTTP Basics

With HTTP, the client sends a message requesting data, which may be a static page or a page that the server will dynamically generate. The server then sends data back, usually in the form of an HTML, XHTML or similar document. HTTP is a stateless, connectionless protocol. Both of these terms relate to the one request, one reply nature of HTTP.

Stateless
With most protocols, the client and server send several messages back and forth, so the server can keep track of the state of the overall conversation with each client. This is not the case with HTTP. Each client request stands on its own as a request for information. Web servers often host server-side applications, such as a store front, which treat the sequence of messages to and from each client as a session and thus track the state of the clients. However, we are just talking about the web server proper, which uses the HTTP protocol.
Connectionless
This has a very similar meaning to stateless. When you connect to an ssh, ftp or telnet server, you have an ongoing connection (session) to the server. With HTTP in its basic form, as soon as the request is received and the reply sent, the socket connection is closed. (HTTP/1.1 can keep a connection open for several requests, but each request is still handled on its own.) So if you are using a web-based application, such as web-mail to read your e-mail, then the overall session with the server-side application actually consists of many distinct socket connections.
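
The one request, one reply behaviour can be observed on a single machine. The following is only a minimal sketch under stated assumptions: it starts a throwaway HTTP/1.0 server from the Python standard library on the local host, and the names CountingHandler, fetch and counts are invented for this illustration. The imports are written so that the sketch should run on both Python 2 and Python 3.

```python
import socket
import threading

try:                                   # Python 3
    from http.server import BaseHTTPRequestHandler, HTTPServer
except ImportError:                    # Python 2, as used in this chapter
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

# Hypothetical bookkeeping for this demonstration only
counts = {"connections": 0, "requests": 0}


class CountingHandler(BaseHTTPRequestHandler):
    # protocol_version is left at its default, "HTTP/1.0", so the server
    # closes the socket after every reply.

    def setup(self):                   # runs once per accepted connection
        counts["connections"] += 1
        BaseHTTPRequestHandler.setup(self)

    def do_GET(self):                  # runs once per GET request
        counts["requests"] += 1
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass                           # keep the demonstration quiet


def fetch(port):
    """Send one GET request and read until the server closes the socket."""
    s = socket.create_connection(("127.0.0.1", port))
    s.sendall(b"GET / HTTP/1.0\r\n\r\n")
    chunks = []
    while True:
        data = s.recv(1024)
        if not data:                   # empty read: the server hung up
            break
        chunks.append(data)
    s.close()
    return b"".join(chunks)


server = HTTPServer(("127.0.0.1", 0), CountingHandler)  # port 0: any free port
worker = threading.Thread(target=server.serve_forever)
worker.daemon = True
worker.start()

port = server.server_address[1]
replies = [fetch(port), fetch(port)]   # two requests, two separate sockets

server.shutdown()
server.server_close()
print(counts)
```

Each call to fetch() opens a new socket, and the loop inside it ends only because the server closes that socket after its single reply. The server therefore sees as many connections as requests.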

HTTP was really designed for simple web page retrieval, not ongoing interactions with a server-side application. For this reason, some have questioned whether HTTP is really the protocol that should be used for such activity. However, it seems to work well: it is designed for the simplest case, yet it can be combined with other technologies for more complex applications.

6.1.2. Basic GET

Here is how to retrieve a simple web page using socket programming. Notice that we have to concern ourselves not only with the socket connection, but also with the syntax of the HTTP protocol.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('www.sal.ksu.edu', 80))
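# Each header line ends with CRLF; an empty line ends the request headers.
request = ("GET /faculty/tim/index.html HTTP/1.0\r\n"
           "From: tim@sal.ksu.edu\r\n"
           "User-Agent: Python\r\n"
           "\r\n")

s.sendall(request)  # unlike send(), sendall() always sends the whole string
fp = open("index.html", "wb")
while 1:
    data = s.recv(1024)
    if not len(data):  # an empty string means the server closed the connection
        break
    fp.write(data)  # the reply's status line and headers are saved as well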

s.close()
fp.close()
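
Because the socket version writes everything it receives, the saved file begins with the server's status line and headers rather than with the page itself. One possible way to separate the two is sketched below. The helper name split_response and the sample reply are made up for illustration; in practice the bytes would be the chunks collected from s.recv().

```python
def split_response(raw):
    """Split a raw HTTP reply (bytes) into (status_line, headers, body)."""
    # The first empty line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# A made-up reply, standing in for the data read from the socket
raw = (b"HTTP/1.0 200 OK\r\n"
       b"Content-Type: text/html\r\n"
       b"Content-Length: 13\r\n"
       b"\r\n"
       b"<html></html>")

status_line, headers, body = split_response(raw)
print(status_line)
print(headers["content-type"])
```

This sketch ignores details that a real client must handle, such as repeated header names, so it is meant only to show where the headers stop and the page begins.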

Now for the easy way to do the same. Since we are only requesting a static page and do not send data to the server, we just use urllib2 to make the connection. The fd variable here is a file-like object wrapping the socket, which we read from with readlines() and then close(). After the call to readlines(), our variable data contains a list of strings, one for each line of the web page. Unlike the socket version, the data no longer includes the HTTP headers.

import urllib2

page = "http://www.sal.ksu.edu/faculty/tim/"
req = urllib2.Request(page)
fd = urllib2.urlopen(req)
data = fd.readlines()
fd.close()
with open("index.html", "w") as out:
    for line in data:
        out.write(line)

6.1.3. Submitting with GET

A GET request with data embedded in the URL uses a question mark symbol (?) to separate the web address from the data in the URL. The encoded data is generated with the urllib.urlencode() function. Once we have a string holding the correct URL, we can use urllib2.Request() and urllib2.urlopen() to retrieve the page as above.

>>> import urllib
>>> encoding = urllib.urlencode([
        ('activity', 'water ski'),
        ('lake', 'Milford'),
        ('code', 52)
        ])
>>> print encoding
activity=water+ski&lake=Milford&code=52

>>> url = "http://www.example.com" + '?' + encoding
>>> print url
http://www.example.com?activity=water+ski&lake=Milford&code=52
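
The same steps can be wrapped in a small helper, and the result decoded again to check what a server-side script would see. This is a sketch rather than a definitive implementation: the function name build_get_url and the /search path are assumptions made for the example, and the imports are arranged so that it should run on both Python 2 and Python 3.

```python
try:                                   # Python 3
    from urllib.parse import urlencode, urlsplit, parse_qs
except ImportError:                    # Python 2, as used in this chapter
    from urllib import urlencode
    from urlparse import urlsplit, parse_qs


def build_get_url(base, fields):
    """Append url-encoded fields to base, using '?' or '&' as appropriate."""
    separator = '&' if '?' in base else '?'
    return base + separator + urlencode(fields)


# "/search" is a hypothetical path on the example host
url = build_get_url("http://www.example.com/search",
                    [('activity', 'water ski'),
                     ('lake', 'Milford'),
                     ('code', 52)])
print(url)

# Decode the query string again, as a server-side script would
fields = parse_qs(urlsplit(url).query)
print(fields['activity'])

# Characters with a special meaning in a URL are escaped by urlencode()
escaped = urlencode([('q', 'a&b=c')])
print(escaped)
```

Note that parse_qs() returns every value as a list of strings, because a field name may appear more than once in a query string.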

6.1.4. Submitting with POST

As before, we use the urllib.urlencode() function to encode the data that will be sent with a POST request. This time, rather than tacking the data onto the URL, we pass it as a second argument to the urllib2.urlopen() function.

import sys
import urllib
import urllib2

encoding = urllib.urlencode([
                    ('activity', 'water ski'),
                    ('lake', 'Milford'),
                    ('code', 52)
                    ])
url = "http://www.example.com"
req = urllib2.Request(url)
fd = urllib2.urlopen(req, encoding)
while 1:
    data = fd.read(1024)
    if not len(data):
        break
    sys.stdout.write(data)
fd.close()
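
Whether a request goes out as a GET or a POST depends only on whether data accompanies it. The sketch below checks this without contacting any server, by attaching the data to the Request object itself and asking for its method. Passing the data to Request rather than to urlopen() is an alternative chosen for this illustration, and the imports are arranged so that it should run on both Python 2 and Python 3.

```python
try:                                   # Python 3
    from urllib.parse import urlencode
    from urllib.request import Request
except ImportError:                    # Python 2, as used in this chapter
    from urllib import urlencode
    from urllib2 import Request

encoding = urlencode([('activity', 'water ski'), ('lake', 'Milford')])
body = encoding.encode('ascii')        # Python 3 expects the body as bytes

get_req = Request("http://www.example.com")         # no data attached
post_req = Request("http://www.example.com", body)  # data attached

get_method = get_req.get_method()
post_method = post_req.get_method()
print(get_method)
print(post_method)
```

Nothing is sent over the network here; get_method() only reports which HTTP method the library would use when the request is eventually opened.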