The HTTP protocol defines a specific format for the message a client sends to request information from a web server. A simple static page is retrieved with a GET request. Dynamic page requests that send only a small amount of data also use a GET request and embed the data in the URL. A zip code or a part number is an example of the kind of data that might be embedded in a GET request. When a larger amount of data is sent to the server, such as when a form is filled out or a file is uploaded, a POST request is sent. Python includes two modules, urllib and urllib2, to facilitate retrieving web pages.
urllib.urlencode(data)
Encode data to be sent to a web server as part of an HTTP request. See examples below.
| Parameter: | data – A list of two-element tuples. Each element of a tuple is a string, or a value such as a number that is converted to a string. |
|---|---|
| Return type: | string |
urllib2.Request(url)
Returns an object suitable for use with urlopen().
| Parameter: | url – A correctly formed URL, either for a simple web page or with encoded data appended. |
|---|---|
| Return type: | urllib2.Request object |
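urllib2.urlopen(url[, data])
Connect to a web server using HTTP to retrieve data.
| Parameters: | url – A URL string or a urllib2.Request object. |
|---|---|
| | data – Optional encoded data; when supplied, it is sent to the server with a POST request. |
| Return type: | file-like object |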
With HTTP, the client sends a message requesting data, which may be a static page or a page that the server generates dynamically. The server then sends data back, usually in the form of an HTML, XHTML or similar document. HTTP is a stateless, connectionless protocol. Both of these terms relate to the one-request, one-reply nature of HTTP: the server keeps no record of earlier requests, and each request and its reply stand alone.
HTTP was designed for simple web page retrieval, not for ongoing interaction with a server-side application. For this reason, some have questioned whether HTTP is really the right protocol for such activity. However, it seems to work well: it is designed for the simplest case, yet it can be combined with other technologies to support more complex applications.
Here is how to retrieve a simple web page using socket programming. Notice that we have to concern ourselves not only with the socket connection, but also with the syntax of the HTTP protocol.
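import socket
# Open a TCP connection to the web server's HTTP port.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('www.sal.ksu.edu', 80))
# Each HTTP header line ends with CRLF; an empty line ends the request.
request = ("GET /faculty/tim/index.html HTTP/1.0\r\n"
           "From: tim@sal.ksu.edu\r\n"
           "User-Agent: Python\r\n"
           "\r\n")
s.sendall(request)
# Everything the server sends, response headers included, is saved.
fp = open("index.html", "w")
while True:
    data = s.recv(1024)
    if not len(data):
        break
    fp.write(data)
s.close()
fp.close()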
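The socket version saves everything the server sends, so the file begins with the status line and response headers rather than with the HTML itself. The following is a minimal sketch, written for Python 3, of how the two parts could be separated. The `split_response` helper and the sample bytes are made up for illustration and are not part of the example above.

```python
def split_response(raw):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # An empty line (CRLF CRLF) separates the header section from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Header lines have the form "Name: value".
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Hypothetical response bytes, standing in for the data collected by s.recv().
sample = (b"HTTP/1.0 200 OK\r\n"
          b"Content-Type: text/html\r\n"
          b"Content-Length: 20\r\n"
          b"\r\n"
          b"<html>hello</html>\r\n")

status_line, headers, body = split_response(sample)
print(status_line)
print(headers["content-type"])
print(body)
```

Only the `body` part would normally be written to index.html.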
Now for the easy way to do the same thing. Since we are only requesting a static page and do not send data to the server, we just use urllib2 to make the connection. The fd variable here is a file-like object wrapping the socket, which we read with readlines() and then close(). After the call to readlines(), our variable data contains a list of strings, one for each line of the web page.
import urllib2
page = "http://www.sal.ksu.edu/faculty/tim/"
req = urllib2.Request(page)
fd = urllib2.urlopen(req)
data = fd.readlines()
fd.close()
with open("index.html", "w") as out:
    for line in data:
        out.write(line)
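The read-and-save step relies only on the readlines() and close() methods, so it can be tried without a network connection. The sketch below assumes Python 3, where the object returned by urlopen() yields bytes rather than strings. It uses io.BytesIO as a stand-in for that object, and the `save_lines` helper and the sample page are hypothetical.

```python
import io
import os
import tempfile


def save_lines(fd, path):
    """Read every line from a file-like object and write the lines to path."""
    try:
        lines = fd.readlines()   # one list item per line, line endings kept
    finally:
        fd.close()
    # Binary mode, because the lines are bytes in Python 3.
    with open(path, "wb") as out:
        out.writelines(lines)
    return lines


# Stand-in for the object urlopen() would return; the content is made up.
fake_response = io.BytesIO(b"<html>\n<body>Hello</body>\n</html>\n")
target = os.path.join(tempfile.mkdtemp(), "index.html")

lines = save_lines(fake_response, target)
print(len(lines))
print(lines[0])
```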
A GET request with data embedded in the URL uses a question mark symbol (?) to separate the web address from the data in the URL. The encoded data is generated with the urllib.urlencode() function. Once we have a string holding the correct URL, we can use urllib2.Request() and urllib2.urlopen() to retrieve the page as above.
>>> import urllib
>>> encoding = urllib.urlencode([
...     ('activity', 'water ski'),
...     ('lake', 'Milford'),
...     ('code', 52)
... ])
>>> print encoding
activity=water+ski&lake=Milford&code=52
>>> url = "http://www.example.com" + '?' + encoding
>>> print url
http://www.example.com?activity=water+ski&lake=Milford&code=52
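For readers on Python 3, where urlencode() lives in the urllib.parse module, the same URL can be built as sketched below. The sketch also decodes the query string again with urlsplit() and parse_qs() as a quick check that the data survives the round trip. The variable names are illustrative only.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

fields = [
    ("activity", "water ski"),
    ("lake", "Milford"),
    ("code", 52),            # non-string values are converted with str()
]

# Spaces become '+' and the pairs are joined with '&'.
query = urlencode(fields)
url = "http://www.example.com" + "?" + query
print(url)

# Decode the query part again; parse_qs maps each name to a list of values.
parts = urlsplit(url)
decoded = parse_qs(parts.query)
print(decoded["activity"])
```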
As before, we use the urllib.urlencode() function to encode the data that will be sent with the POST request. This time, rather than tacking the data onto the URL, we pass it as a second argument to the urllib2.urlopen() function.
import sys
import urllib
import urllib2
encoding = urllib.urlencode([
    ('activity', 'water ski'),
    ('lake', 'Milford'),
    ('code', 52)
])
url = "http://www.example.com"
req = urllib2.Request(url)
fd = urllib2.urlopen(req, encoding)
while True:
    data = fd.read(1024)
    if not len(data):
        break
    sys.stdout.write(data)
fd.close()
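Whether a GET or a POST is issued depends only on whether data accompanies the request. This can be seen without contacting a server. The sketch below assumes Python 3, where urllib2 became urllib.request and the data must be bytes. It attaches the data to the Request object, which is an alternative to passing it to urlopen(), and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "http://www.example.com"

# Python 3 requires the request data to be bytes, so encode the string.
payload = urlencode([
    ("activity", "water ski"),
    ("lake", "Milford"),
    ("code", 52),
]).encode("ascii")

get_req = Request(url)                  # no data: a GET request
post_req = Request(url, data=payload)   # data present: a POST request

print(get_req.get_method())
print(post_req.get_method())
print(post_req.data)
```

Passing either object to urlopen() would then perform the corresponding request.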