.. _urllib-howto:

***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_

.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.


Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

        A tutorial on *Basic Authentication*, with examples in Python.

**urllib.request** is a Python module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string
before the ``":"`` in the URL - for example ``"ftp"`` is the URL scheme of
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    with urllib.request.urlopen('http://python.org/') as response:
        html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary
location, you can do so via the :func:`shutil.copyfileobj` and
:func:`tempfile.NamedTemporaryFile` functions::

    import shutil
    import tempfile
    import urllib.request

    with urllib.request.urlopen('http://python.org/') as response:
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            shutil.copyfileobj(response, tmp_file)

    with open(tmp_file.name) as html:
        pass

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can, for example, call ``.read()`` on the
response::

    import urllib.request

    req = urllib.request.Request('http://www.voidspace.org.uk')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()
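``read()`` returns the body as bytes, not text. If you want a string you have
to decode it yourself. A minimal sketch, assuming the server declares its
character set in the ``Content-Type`` header (and falling back to UTF-8
otherwise)::

    import urllib.request

    with urllib.request.urlopen('http://www.python.org/') as response:
        raw = response.read()  # bytes
        # response.headers is an http.client.HTTPMessage;
        # get_content_charset() parses the Content-Type header.
        charset = response.headers.get_content_charset() or 'utf-8'
        text = raw.decode(charset)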
Note that urllib.request makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers". Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the :mod:`urllib.parse`
library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')  # data should be bytes
    req = urllib.request.Request(url, data)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

If you do not pass the ``data`` argument, urllib uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)  # The order may differ from below.  #doctest: +SKIP
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.
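Incidentally, you can check which method a ``Request`` will use by calling its
``get_method()`` method: it reports ``'POST'`` when a ``data`` argument was
supplied and ``'GET'`` otherwise. A minimal sketch (the URL and data are just
placeholders)::

    >>> import urllib.request
    >>> url = 'http://www.example.com/example.cgi'
    >>> urllib.request.Request(url, b'name=Somebody').get_method()
    'POST'
    >>> urllib.request.Request(url).get_method()
    'GET'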
Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python' }
    headers = {'User-Agent': user_agent}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')
    req = urllib.request.Request(url, data, headers)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()
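Headers can also be set on an existing ``Request`` with its ``add_header()``
method. A minimal sketch (the user agent string here is just an illustrative
placeholder)::

    import urllib.request

    req = urllib.request.Request('http://www.example.com/')
    # Equivalent to passing the header in the headers dictionary above.
    req.add_header('User-Agent', 'MyProgram/1.0')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()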
The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.


Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try: urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...     print(e.reason)      #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100--299 range indicate success, you will usually only see error
codes in the 400--599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }
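Since this dictionary is also available at runtime, you can use it to turn the
numeric code from an :exc:`HTTPError` into a readable message. A minimal
sketch (the exact wording of the messages may vary between Python versions)::

    import http.server

    # Look up the short and long messages for a status code.
    short_msg, long_msg = http.server.BaseHTTPRequestHandler.responses[404]
    print(short_msg)   # e.g. 'Not Found'
    print(long_msg)    # e.g. 'Nothing matches the given URI'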
When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response for
the page returned. This means that as well as the code attribute, it also has
read, geturl, and info methods, as returned by the ``urllib.response`` module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
      ...
      <title>Page Not Found</title>\n
      ...

Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        # everything is fine
        ...


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
    else:
        # everything is fine
        ...


info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods :meth:`info` and :meth:`geturl` and is defined in the module
:mod:`urllib.response`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.
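A minimal sketch of both methods (the headers you get back will of course
depend on the server)::

    import urllib.request

    with urllib.request.urlopen('http://www.python.org/') as response:
        print(response.geturl())        # the final URL, after any redirects
        headers = response.info()      # an http.client.HTTPMessage
        print(headers['Content-Type'])  # headers support dictionary access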
Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://jkorpela.fi/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call. ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
URLs in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.
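For example, here is a minimal sketch of an opener that handles cookies, built
with the standard ``HTTPCookieProcessor`` handler and an
:class:`http.cookiejar.CookieJar` (the URL is just a placeholder)::

    import urllib.request
    from http.cookiejar import CookieJar

    cookie_jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))

    # Use the opener directly; cookies set by the server are stored
    # in cookie_jar and sent back on subsequent requests.
    with opener.open('http://www.example.com/') as response:
        the_page = response.read()

    # Optionally install it, so that plain urlopen uses it too.
    urllib.request.install_opener(opener)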
Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works -- see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication. This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
realm="REALM"``.

e.g.

.. code-block:: none

    WWW-Authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use a
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL. This will be supplied
unless you provide an alternative combination for a specific realm. We indicate
this by providing ``None`` as the realm argument to the
``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to .add_password() will also match. ::

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
    environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.

``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. ``"example.com"`` or ``"example.com:8080"``
(the latter example includes a port number). The authority, if present, must
NOT contain the "userinfo" component - for example ``"joe:password@example.com"`` is
not correct.


Proxies
=======

**urllib** will auto-detect your proxy settings and use those. This is done
through the ``ProxyHandler``, which is part of the normal handler chain when a
proxy setting is detected. Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. In that case you can set up your own
``ProxyHandler`` with no proxies defined, using similar steps to setting up a
`Basic Authentication`_ handler::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)
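Conversely, you can force requests through a specific proxy by passing a
mapping of scheme to proxy URL. A minimal sketch (the proxy address here is a
hypothetical placeholder)::

    import urllib.request

    # Route all 'http' requests through the given proxy.
    proxy_support = urllib.request.ProxyHandler(
        {'http': 'http://proxy.example.com:3128'})
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)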
.. note::

    Currently ``urllib.request`` *does not* support fetching of ``https`` locations
    through a proxy. However, this can be enabled by extending urllib.request as
    shown in the recipe [#]_.

.. note::

    ``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
    the documentation on :func:`~urllib.request.getproxies`.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered. urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. Currently,
the socket timeout is not exposed at the http.client or urllib.request levels.
However, you can set the default timeout globally for all sockets using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)


-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] Google for example.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <https://code.activestate.com/recipes/456195/>`_.