:mod:`urllib` --- Open arbitrary resources by URL
=================================================

.. module:: urllib
   :synopsis: Open an arbitrary network resource by URL (requires sockets).

.. note::
    The :mod:`urllib` module has been split into parts and renamed in
    Python 3 to :mod:`urllib.request`, :mod:`urllib.parse`,
    and :mod:`urllib.error`. The :term:`2to3` tool will automatically adapt
    imports when converting your sources to Python 3.
    Also note that the :func:`urllib.request.urlopen` function in Python 3 is
    equivalent to :func:`urllib2.urlopen` and that :func:`urllib.urlopen` has
    been removed.

.. index::
   single: WWW
   single: World Wide Web
   single: URL

This module provides a high-level interface for fetching data across the World
Wide Web.  In particular, the :func:`urlopen` function is similar to the
built-in function :func:`open`, but accepts Uniform Resource Locators (URLs)
instead of filenames.  Some restrictions apply: it can only open URLs for
reading, and no seek operations are available.

.. seealso::

    The `Requests package <http://requests.readthedocs.org/>`_
    is recommended for a higher-level HTTP client interface.

.. warning:: When opening HTTPS URLs, this module does not attempt to validate
   the server certificate.  Use at your own risk!


High-level interface
--------------------

.. function:: urlopen(url[, data[, proxies[, context]]])

   Open a network object denoted by a URL for reading.  If the URL does not
   have a scheme identifier, or if it has :file:`file:` as its scheme
   identifier, this opens a local file (without :term:`universal newlines`);
   otherwise it opens a socket to a server somewhere on the network.  If the
   connection cannot be made, the :exc:`IOError` exception is raised.  If all
   goes well, a file-like object is returned.
   The returned object supports the following
   methods: :meth:`read`, :meth:`readline`, :meth:`readlines`, :meth:`fileno`,
   :meth:`close`, :meth:`info`, :meth:`getcode` and :meth:`geturl`.  It also
   has proper support for the :term:`iterator` protocol.  One caveat: the
   :meth:`read` method, if the size argument is omitted or negative, may not
   read until the end of the data stream; there is no good way to determine
   that the entire stream from a socket has been read in the general case.

   Except for the :meth:`info`, :meth:`getcode` and :meth:`geturl` methods,
   these methods have the same interface as for file objects; see section
   :ref:`bltin-file-objects` in this manual.  (It is not a built-in file object,
   however, so it can't be used at those few places where a true built-in file
   object is required.)

   .. index:: module: mimetools

   The :meth:`info` method returns an instance of the class
   :class:`mimetools.Message` containing meta-information associated with the
   URL.  When the method is HTTP, these headers are those returned by the server
   at the head of the retrieved HTML page (including Content-Length and
   Content-Type).  When the method is FTP, a Content-Length header will be
   present if (as is now usual) the server passed back a file length in response
   to the FTP retrieval request.  A Content-Type header will be present if the
   MIME type can be guessed.  When the method is local-file, returned headers
   will include a Date representing the file's last-modified time, a
   Content-Length giving file size, and a Content-Type containing a guess at the
   file's type.  See also the description of the :mod:`mimetools` module.

   The :meth:`geturl` method returns the real URL of the page.  In some cases, the
   HTTP server redirects a client to another URL.  The :func:`urlopen` function
   handles this transparently, but in some cases the caller needs to know which URL
   the client was redirected to.
   The :meth:`geturl` method can be used to get at
   this redirected URL.

   The :meth:`getcode` method returns the HTTP status code that was sent with the
   response, or ``None`` if the URL is not an HTTP URL.

   If the *url* uses the :file:`http:` scheme identifier, the optional *data*
   argument may be given to specify a ``POST`` request (normally the request type
   is ``GET``).  The *data* argument must be in standard
   :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
   function below.

   The :func:`urlopen` function works transparently with proxies which do not
   require authentication.  In a Unix or Windows environment, set the
   :envvar:`http_proxy` or :envvar:`ftp_proxy` environment variable to a URL that
   identifies the proxy server before starting the Python interpreter.  For example
   (the ``'%'`` is the command prompt)::

      % http_proxy="http://www.someproxy.com:3128"
      % export http_proxy
      % python
      ...

   The :envvar:`no_proxy` environment variable can be used to specify hosts which
   shouldn't be reached via proxy; if set, it should be a comma-separated list
   of hostname suffixes, optionally with ``:port`` appended, for example
   ``cern.ch,ncsa.uiuc.edu,some.host:8080``.

   In a Windows environment, if no proxy environment variables are set, proxy
   settings are obtained from the registry's Internet Settings section.

   .. index:: single: Internet Config

   In a Mac OS X environment, :func:`urlopen` will retrieve proxy information
   from the OS X System Configuration Framework, which can be managed with
   the Network System Preferences panel.

   Alternatively, the optional *proxies* argument may be used to explicitly specify
   proxies.  It must be a dictionary mapping scheme names to proxy URLs, where an
   empty dictionary causes no proxies to be used, and ``None`` (the default value)
   causes environmental proxy settings to be used as discussed above.
   For example::

      import urllib

      # some_url is a placeholder for the URL you want to open
      # Use http://www.someproxy.com:3128 for HTTP proxying
      proxies = {'http': 'http://www.someproxy.com:3128'}
      filehandle = urllib.urlopen(some_url, proxies=proxies)
      # Don't use any proxies
      filehandle = urllib.urlopen(some_url, proxies={})
      # Use proxies from environment - both versions are equivalent
      filehandle = urllib.urlopen(some_url, proxies=None)
      filehandle = urllib.urlopen(some_url)

   Proxies which require authentication for use are not currently supported;
   this is considered an implementation limitation.

   The *context* parameter may be set to a :class:`ssl.SSLContext` instance to
   configure the SSL settings that are used if :func:`urlopen` makes an HTTPS
   connection.

   .. versionchanged:: 2.3
      Added the *proxies* support.

   .. versionchanged:: 2.6
      Added :meth:`getcode` to returned object and support for the
      :envvar:`no_proxy` environment variable.

   .. versionchanged:: 2.7.9
      The *context* parameter was added.

   .. deprecated:: 2.6
      The :func:`urlopen` function has been removed in Python 3 in favor
      of :func:`urllib2.urlopen`.


.. function:: urlretrieve(url[, filename[, reporthook[, data]]])

   Copy a network object denoted by a URL to a local file, if necessary.  If the URL
   points to a local file, or a valid cached copy of the object exists, the object
   is not copied.  Return a tuple ``(filename, headers)`` where *filename* is the
   local file name under which the object can be found, and *headers* is whatever
   the :meth:`info` method of the object returned by :func:`urlopen` returned (for
   a remote object, possibly cached).  Exceptions are the same as for
   :func:`urlopen`.

   The second argument, if present, specifies the file location to copy to (if
   absent, the location will be a tempfile with a generated name).
   The third argument, if present, is a hook function that will be called once on
   establishment of the network connection and once after each block read
   thereafter.  The hook will be passed three arguments: a count of blocks
   transferred so far, a block size in bytes, and the total size of the file.  The
   total size may be ``-1`` on older FTP servers which do not return a file
   size in response to a retrieval request.

   If the *url* uses the :file:`http:` scheme identifier, the optional *data*
   argument may be given to specify a ``POST`` request (normally the request type
   is ``GET``).  The *data* argument must be in standard
   :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
   function below.

   .. versionchanged:: 2.5
      :func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that
      the amount of data available was less than the expected amount (which is the
      size reported by a *Content-Length* header).  This can occur, for example, when
      the download is interrupted.

      The *Content-Length* is treated as a lower bound: if there's more data to read,
      :func:`urlretrieve` reads more data, but if less data is available, it raises
      the exception.

      You can still retrieve the downloaded data in this case; it is stored in the
      :attr:`content` attribute of the exception instance.

      If no *Content-Length* header was supplied, :func:`urlretrieve` cannot check
      the size of the data it has downloaded, and just returns it.  In this case you
      just have to assume that the download was successful.


.. data:: _urlopener

   The public functions :func:`urlopen` and :func:`urlretrieve` create an instance
   of the :class:`FancyURLopener` class and use it to perform their requested
   actions.
   To override this functionality, programmers can create a subclass of
   :class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that
   class to the ``urllib._urlopener`` variable before calling the desired function.
   For example, applications may want to specify a different
   :mailheader:`User-Agent` header than :class:`URLopener` defines.  This can be
   accomplished with the following code::

      import urllib

      class AppURLopener(urllib.FancyURLopener):
          version = "App/1.7"

      urllib._urlopener = AppURLopener()


.. function:: urlcleanup()

   Clear the cache that may have been built up by previous calls to
   :func:`urlretrieve`.


Utility functions
-----------------

.. function:: quote(string[, safe])

   Replace special characters in *string* using the ``%xx`` escape.  Letters,
   digits, and the characters ``'_.-'`` are never quoted.  By default, this
   function is intended for quoting the path section of the URL.  The optional
   *safe* parameter specifies additional characters that should not be quoted;
   its default value is ``'/'``.

   Example: ``quote('/~connolly/')`` yields ``'/%7Econnolly/'``.


.. function:: quote_plus(string[, safe])

   Like :func:`quote`, but also replaces spaces by plus signs, as required for
   quoting HTML form values when building up a query string to go into a URL.
   Plus signs in the original string are escaped unless they are included in
   *safe*.  Unlike :func:`quote`, *safe* does not default to ``'/'``.


.. function:: unquote(string)

   Replace ``%xx`` escapes by their single-character equivalent.

   Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.


.. function:: unquote_plus(string)

   Like :func:`unquote`, but also replaces plus signs by spaces, as required for
   unquoting HTML form values.


.. function:: urlencode(query[, doseq])

   Convert a mapping object or a sequence of two-element tuples to a
   "percent-encoded" string, suitable to pass to :func:`urlopen` above as the
   optional *data* argument.  This is useful to pass a dictionary of form
   fields to a ``POST`` request.  The resulting string is a series of
   ``key=value`` pairs separated by ``'&'`` characters, where both *key* and
   *value* are quoted using :func:`quote_plus` above.  When a sequence of
   two-element tuples is used as the *query* argument, the first element of
   each tuple is a key and the second is a value.  The value element can itself
   be a sequence; in that case, if the optional parameter *doseq*
   evaluates to ``True``, individual ``key=value`` pairs separated by ``'&'`` are
   generated for each element of the value sequence for the key.  The order of
   parameters in the encoded string will match the order of parameter tuples in
   the sequence.  The :mod:`urlparse` module provides the functions
   :func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings
   into Python data structures.


.. function:: pathname2url(path)

   Convert the pathname *path* from the local syntax for a path to the form used in
   the path component of a URL.  This does not produce a complete URL.  The return
   value will already be quoted using the :func:`quote` function.


.. function:: url2pathname(path)

   Convert the path component *path* from a percent-encoded URL to the local syntax for a
   path.  This does not accept a complete URL.  This function uses :func:`unquote`
   to decode *path*.


.. function:: getproxies()

   This helper function returns a dictionary of scheme to proxy server URL
   mappings.
   It scans the environment for variables named ``<scheme>_proxy``, in a
   case-insensitive way, on all operating systems first.  When it cannot find
   any, it looks for proxy information in the Mac OS X System Configuration on
   Mac OS X and in the Windows Systems Registry on Windows.
   If both lowercase and uppercase environment variables exist (and disagree),
   lowercase is preferred.

   .. note::

      If the environment variable ``REQUEST_METHOD`` is set, which usually
      indicates your script is running in a CGI environment, the environment
      variable ``HTTP_PROXY`` (uppercase ``_PROXY``) will be ignored.  This is
      because that variable can be injected by a client using the "Proxy:"
      HTTP header.  If you need to use an HTTP proxy in a CGI environment,
      either use ``ProxyHandler`` explicitly, or make sure the variable name
      is in lowercase (or at least the ``_proxy`` suffix).

.. note::
    :mod:`urllib` also exposes certain utility functions, such as ``splittype``
    and ``splithost``, that parse a URL into its components.  However, it is
    recommended to use :mod:`urlparse` for parsing URLs rather than using these
    functions directly.  Python 3 does not expose these helper functions from
    the :mod:`urllib.parse` module.


URL Opener objects
------------------

.. class:: URLopener([proxies[, context[, **x509]]])

   Base class for opening and reading URLs.  Unless you need to support opening
   objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`,
   you probably want to use :class:`FancyURLopener`.

   By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header
   of ``urllib/VVV``, where *VVV* is the :mod:`urllib` version number.
   Applications can define their own :mailheader:`User-Agent` header by subclassing
   :class:`URLopener` or :class:`FancyURLopener` and setting the class attribute
   :attr:`version` to an appropriate string value in the subclass definition.

   The optional *proxies* parameter should be a dictionary mapping scheme names to
   proxy URLs, where an empty dictionary turns proxies off completely.  Its default
   value is ``None``, in which case environmental proxy settings will be used if
   present, as discussed in the definition of :func:`urlopen`, above.

   The *context* parameter may be a :class:`ssl.SSLContext` instance.  If given,
   it defines the SSL settings the opener uses to make HTTPS connections.

   Additional keyword parameters, collected in *x509*, may be used for
   authentication of the client when using the :file:`https:` scheme.  The keywords
   *key_file* and *cert_file* are supported to provide an SSL key and certificate;
   both are needed to support client authentication.

   :class:`URLopener` objects will raise an :exc:`IOError` exception if the server
   returns an error code.

   .. method:: open(fullurl[, data])

      Open *fullurl* using the appropriate protocol.  This method sets up cache and
      proxy information, then calls the appropriate open method with its input
      arguments.  If the scheme is not recognized, :meth:`open_unknown` is called.
      The *data* argument has the same meaning as the *data* argument of
      :func:`urlopen`.


   .. method:: open_unknown(fullurl[, data])

      Overridable interface to open unknown URL types.


   .. method:: retrieve(url[, filename[, reporthook[, data]]])

      Retrieves the contents of *url* and places it in *filename*.  The return value
      is a tuple consisting of a local filename and either a
      :class:`mimetools.Message` object containing the response headers (for remote
      URLs) or ``None`` (for local URLs).  The caller must then open and read the
      contents of *filename*.  If *filename* is not given and the URL refers to a
      local file, the input filename is returned.
      If the URL is non-local and
      *filename* is not given, the filename is the output of :func:`tempfile.mktemp`
      with a suffix that matches the suffix of the last path component of the input
      URL.  If *reporthook* is given, it must be a function accepting three numeric
      parameters.  It will be called after each chunk of data is read from the
      network.  *reporthook* is ignored for local URLs.

      If the *url* uses the :file:`http:` scheme identifier, the optional *data*
      argument may be given to specify a ``POST`` request (normally the request type
      is ``GET``).  The *data* argument must be in standard
      :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
      function above.


   .. attribute:: version

      Variable that specifies the user agent of the opener object.  To get
      :mod:`urllib` to tell servers that it is a particular user agent, set this in a
      subclass as a class variable or in the constructor before calling the base
      constructor.


.. class:: FancyURLopener(...)

   :class:`FancyURLopener` subclasses :class:`URLopener`, providing default handling
   for the following HTTP response codes: 301, 302, 303, 307 and 401.  For the 30x
   response codes listed above, the :mailheader:`Location` header is used to fetch
   the actual URL.  For 401 response codes (authentication required), basic HTTP
   authentication is performed.  For the 30x response codes, recursion is bounded
   by the value of the *maxtries* attribute, which defaults to 10.

   For all other response codes, the method :meth:`http_error_default` is called,
   which you can override in subclasses to handle the error appropriately.

   .. note::

      According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
      must not be automatically redirected without confirmation by the user.
      In reality, browsers do allow automatic redirection of these responses, changing
      the POST to a GET, and :mod:`urllib` reproduces this behaviour.

   The parameters to the constructor are the same as those for :class:`URLopener`.

   .. note::

      When performing basic authentication, a :class:`FancyURLopener` instance calls
      its :meth:`prompt_user_passwd` method.  The default implementation asks the
      users for the required information on the controlling terminal.  A subclass may
      override this method to support more appropriate behavior if needed.

   The :class:`FancyURLopener` class offers one additional method that should be
   overridden to provide the appropriate behavior:

   .. method:: prompt_user_passwd(host, realm)

      Return information needed to authenticate the user at the given host in the
      specified security realm.  The return value should be a tuple, ``(user,
      password)``, which can be used for basic authentication.

      The implementation prompts for this information on the terminal; an application
      should override this method to use an appropriate interaction model in the local
      environment.

.. exception:: ContentTooShortError(msg[, content])

   This exception is raised when the :func:`urlretrieve` function detects that the
   amount of the downloaded data is less than the expected amount (given by the
   *Content-Length* header).  The :attr:`content` attribute stores the downloaded
   (and supposedly truncated) data.

   .. versionadded:: 2.5


:mod:`urllib` Restrictions
--------------------------

.. index::
   pair: HTTP; protocol
   pair: FTP; protocol

* Currently, only the following protocols are supported: HTTP (versions 0.9 and
  1.0), FTP, and local files.

* The caching feature of :func:`urlretrieve` has been disabled until proper
  processing of Expiration time headers is implemented.

* There should be a function to query whether a particular URL is in the cache.

* For backward compatibility, if a URL appears to point to a local file but the
  file can't be opened, the URL is re-interpreted using the FTP protocol.  This
  can sometimes cause confusing error messages.

* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
  long delays while waiting for a network connection to be set up.  This means
  that it is difficult to build an interactive Web client using these functions
  without using threads.

  .. index::
     single: HTML
     pair: HTTP; protocol
     module: htmllib

* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
  returned by the server.  This may be binary data (such as an image), plain text
  or (for example) HTML.  The HTTP protocol provides type information in the reply
  header, which can be inspected by looking at the :mailheader:`Content-Type`
  header.  If the returned data is HTML, you can use the module :mod:`htmllib` to
  parse it.

  .. index:: single: FTP

* The code handling the FTP protocol cannot differentiate between a file and a
  directory.  This can lead to unexpected behavior when attempting to read a URL
  that points to a file that is not accessible.  If the URL ends in a ``/``, it is
  assumed to refer to a directory and will be handled accordingly.  But if an
  attempt to read a file leads to a 550 error (meaning the URL cannot be found or
  is not accessible, often for permission reasons), then the path is treated as a
  directory in order to handle the case when a directory is specified by a URL but
  the trailing ``/`` has been left off.  This can cause misleading results when
  you try to fetch a file whose read permissions make it inaccessible; the FTP
  code will try to read it, fail with a 550 error, and then perform a directory
  listing for the unreadable file.
  If fine-grained control is needed, consider
  using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing
  *_urlopener* to meet your needs.

* This module does not support the use of proxies which require authentication.
  This may be implemented in the future.

  .. index:: module: urlparse

* Although the :mod:`urllib` module contains (undocumented) routines to parse
  and unparse URL strings, the recommended interface for URL manipulation is in
  module :mod:`urlparse`.


.. _urllib-examples:

Examples
--------

Here is an example session that uses the ``GET`` method to retrieve a URL
containing parameters::

   >>> import urllib
   >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
   >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
   >>> print f.read()

The following example uses the ``POST`` method instead::

   >>> import urllib
   >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
   >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
   >>> print f.read()

The following example uses an explicitly specified HTTP proxy, overriding
environment settings::

   >>> import urllib
   >>> proxies = {'http': 'http://proxy.example.com:8080/'}
   >>> opener = urllib.FancyURLopener(proxies)
   >>> f = opener.open("http://www.python.org")
   >>> f.read()

The following example uses no proxies at all, overriding environment settings::

   >>> import urllib
   >>> opener = urllib.FancyURLopener({})
   >>> f = opener.open("http://www.python.org/")
   >>> f.read()
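
The quoting helpers described under "Utility functions" can be tried without
any network access.  The following is a minimal sketch rather than a canonical
recipe: the sample path, field value and query pairs are made up for
illustration, and the ``ImportError`` fallback assumes that under Python 3 the
same helpers are imported from :mod:`urllib.parse`.

```python
try:
    # Python 2, the version this page documents
    from urllib import quote, quote_plus, unquote_plus, urlencode
except ImportError:
    # Assumption: on Python 3 the same helpers live in urllib.parse
    from urllib.parse import quote, quote_plus, unquote_plus, urlencode

# quote() keeps '/' by default, so it suits the path part of a URL.
path = quote('/docs/my file.txt')        # '/docs/my%20file.txt'

# quote_plus() escapes '/' and '+' and turns spaces into '+',
# which is what HTML form values need.
field = quote_plus('a/b c+d')            # 'a%2Fb+c%2Bd'

# A sequence of tuples keeps its order; with doseq=True a sequence value
# is expanded into one key=value pair per element.
query = urlencode([('tag', ['x', 'y']), ('q', 'spam eggs')], doseq=True)

# unquote_plus() reverses quote_plus().
round_trip = unquote_plus(field)         # 'a/b c+d'
```

The resulting ``query`` string can be appended to a URL after ``?`` for a
``GET`` request or passed as the *data* argument of :func:`urlopen` for a
``POST`` request, as in the sessions above.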