:mod:`urlparse` --- Parse URLs into components
==============================================

.. module:: urlparse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

.. note::
   The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to Python 3.

**Source code:** :source:`Lib/urlparse.py`

--------------

This module defines a standard interface to break Uniform Resource Locator (URL)
strings up into components (addressing scheme, network location, path, etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators (:rfc:`1808`). It supports the following URL schemes: ``file``,
``ftp``, ``gopher``, ``hdl``, ``http``, ``https``, ``imap``, ``mailto``, ``mms``,
``news``, ``nntp``, ``prospero``, ``rsync``, ``rtsp``, ``rtspu``, ``sftp``,
``shttp``, ``sip``, ``sips``, ``snews``, ``svn``, ``svn+ssh``, ``telnet``,
``wais``.

.. versionadded:: 2.5
   Support for the ``sftp`` and ``sips`` schemes.

The :mod:`urlparse` module defines the following functions:


.. function:: urlparse(urlstring[, scheme[, allow_fragments]])

   Parse a URL into six components, returning a 6-tuple.  This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up
   into smaller parts (for example, the network location is a single string), and
   ``%`` escapes are not expanded. The delimiters as shown above are not part of
   the result, except for a leading slash in the *path* component, which is
   retained if present.  For example:

      >>> from urlparse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'


   Following the syntax specifications in :rfc:`1808`, :func:`urlparse`
   recognizes a netloc only if it is properly introduced by ``//``.  Otherwise
   the input is presumed to be a relative URL and thus to start with
   a path component.

      >>> from urlparse import urlparse
      >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('www.cwi.nl/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('help/Python.html')
      ParseResult(scheme='', netloc='', path='help/Python.html', params='',
                  query='', fragment='')

   If the *scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one.  The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   recognized; they are instead parsed as part of the preceding component, even
   if the URL's addressing scheme normally does support them.  The default value
   for this argument is :const:`True`.

   The return value is actually an instance of a subclass of :class:`tuple`.  This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | *scheme* parameter   |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionchanged:: 2.5
      Added attributes to return value.

   .. versionchanged:: 2.7
      Added IPv6 URL parsing capabilities.

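   The optional arguments and the convenience attributes described above can be
   sketched as follows.  This is an illustration only: the URLs and credentials
   are made up, and the ``try``/``except`` import is an assumption added so that
   the snippet also runs under Python 3, where the module is named
   :mod:`urllib.parse`.

   .. code-block:: python

      try:
          from urlparse import urlparse        # Python 2
      except ImportError:
          from urllib.parse import urlparse    # Python 3 name of the module

      # The second argument supplies a scheme only when the URL has none.
      defaulted = urlparse('//www.cwi.nl/%7Eguido/Python.html', 'http')
      print(defaulted.scheme)          # http

      # With allow_fragments false, '#top' stays in the preceding component.
      unsplit = urlparse('http://www.cwi.nl/doc/index.html#top',
                         allow_fragments=False)
      print(unsplit.path)              # /doc/index.html#top
      print(repr(unsplit.fragment))    # ''

      # The extra attributes are derived from the network location.
      # (Made-up credentials, for illustration only.)
      login = urlparse('http://guido:secret@WWW.CWI.NL:8080/index.html')
      print(login.username)            # guido
      print(login.password)            # secret
      print(login.hostname)            # www.cwi.nl
      print(login.port)                # 8080

   Note that :attr:`netloc` itself is left untouched; only :attr:`hostname` is
   converted to lower case.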

.. function:: parse_qs(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`).  Data are returned as a
   dictionary.  The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings.  A true
   value indicates that blanks should be retained as blank strings.  The default
   false value indicates that blank values are to be ignored and treated as if
   they were not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors.  If false (the default), errors are silently ignored.  If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such dictionaries into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.

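   A minimal sketch of the two flags, using a made-up query string.  The
   ``try``/``except`` import is an assumption added so that the snippet also
   runs under Python 3, where the function lives in :mod:`urllib.parse`.

   .. code-block:: python

      try:
          from urlparse import parse_qs        # Python 2
      except ImportError:
          from urllib.parse import parse_qs    # Python 3 name of the module

      query = 'lang=en&lang=fr&debug='

      # Repeated names are collected into one list; the blank value is dropped.
      default = parse_qs(query)
      print(default == {'lang': ['en', 'fr']})                    # True

      # keep_blank_values retains 'debug' with an empty string as its value.
      with_blanks = parse_qs(query, keep_blank_values=True)
      print(with_blanks == {'lang': ['en', 'fr'], 'debug': ['']})  # True

      # A field without '=' is a parsing error: ignored by default,
      # reported as ValueError when strict_parsing is true.
      lenient = parse_qs('lang')
      rejected = False
      try:
          parse_qs('lang', strict_parsing=True)
      except ValueError:
          rejected = True
      print(lenient == {} and rejected)                           # True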

.. function:: parse_qsl(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`).  Data are returned as a list of
   name, value pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings.  A true
   value indicates that blanks should be retained as blank strings.  The default
   false value indicates that blank values are to be ignored and treated as if
   they were not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors.  If false (the default), errors are silently ignored.  If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such lists of pairs into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.

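   The round trip through :func:`urllib.urlencode` mentioned above might look
   like the following sketch.  The query string is made up, and the fallback
   import is an assumption for Python 3, where both functions are found in
   :mod:`urllib.parse`.

   .. code-block:: python

      try:
          from urlparse import parse_qsl                  # Python 2
          from urllib import urlencode
      except ImportError:
          from urllib.parse import parse_qsl, urlencode   # Python 3

      # The order of the pairs and repeated names are both preserved.
      pairs = parse_qsl('name=Guido+van+Rossum&lang=en&lang=fr')
      print(pairs[0])      # ('name', 'Guido van Rossum')
      print(len(pairs))    # 3

      # Encoding the list of pairs again yields the original query string.
      rebuilt = urlencode(pairs)
      print(rebuilt)       # name=Guido+van+Rossum&lang=en&lang=fr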

.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by :func:`urlparse`. The *parts*
   argument can be any six-item iterable. This may result in a slightly
   different, but equivalent URL, if the URL that was parsed originally had
   unnecessary delimiters (for example, a ``?`` with an empty query; the RFC
   states that these are equivalent).

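   Both points can be sketched briefly: an unnecessary ``?`` is dropped on
   reassembly, and a plain list is accepted in place of a parse result.  The
   URLs are illustrative, and the fallback import is an assumption for
   Python 3.

   .. code-block:: python

      try:
          from urlparse import urlparse, urlunparse        # Python 2
      except ImportError:
          from urllib.parse import urlparse, urlunparse    # Python 3

      # The trailing '?' introduces an empty query, so it is not reproduced.
      parts = urlparse('http://www.cwi.nl/%7Eguido/Python.html?')
      print(urlunparse(parts))    # http://www.cwi.nl/%7Eguido/Python.html

      # Any six-item iterable works:
      # scheme, netloc, path, params, query, fragment.
      built = urlunparse(['http', 'www.cwi.nl', '/%7Eguido/Python.html',
                          '', 'lang=en', 'top'])
      print(built)    # http://www.cwi.nl/%7Eguido/Python.html?lang=en#top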

.. function:: urlsplit(urlstring[, scheme[, allow_fragments]])

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted.  A separate function is needed to
   separate the path segments and parameters.  This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).

   The return value is actually an instance of a subclass of :class:`tuple`.  This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | *scheme* parameter   |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionadded:: 2.2

   .. versionchanged:: 2.5
      Added attributes to return value.

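   The difference from :func:`urlparse` shows up only when the path carries
   parameters.  The following sketch uses a made-up URL with a ``;type=a``
   parameter; the fallback import is an assumption for Python 3.

   .. code-block:: python

      try:
          from urlparse import urlparse, urlsplit        # Python 2
      except ImportError:
          from urllib.parse import urlparse, urlsplit    # Python 3

      url = 'http://www.cwi.nl/%7Eguido/Python.html;type=a?lang=en'

      six = urlparse(url)     # 6-tuple: params split off the path
      five = urlsplit(url)    # 5-tuple: params left inside the path

      print(six.path)         # /%7Eguido/Python.html
      print(six.params)       # type=a
      print(five.path)        # /%7Eguido/Python.html;type=a
      print(five.query)       # lang=en
      print(len(five))        # 5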

.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
   URL as a string. The *parts* argument can be any five-item iterable. This may
   result in a slightly different, but equivalent URL, if the URL that was parsed
   originally had unnecessary delimiters (for example, a ``?`` with an empty
   query; the RFC states that these are equivalent).

   .. versionadded:: 2.2

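   Because any five-item iterable is accepted, one possible way to modify a URL
   is to copy the split result into a list, change items by index, and
   reassemble it.  This is a sketch with an illustrative URL, not the only
   approach; the fallback import is an assumption for Python 3.

   .. code-block:: python

      try:
          from urlparse import urlsplit, urlunsplit        # Python 2
      except ImportError:
          from urllib.parse import urlsplit, urlunsplit    # Python 3

      parts = list(urlsplit('http://www.cwi.nl/%7Eguido/Python.html#intro'))

      parts[0] = 'https'    # index 0: scheme
      parts[4] = ''         # index 4: fragment identifier

      secure = urlunsplit(parts)
      print(secure)         # https://www.cwi.nl/%7Eguido/Python.html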

.. function:: urljoin(base, url[, allow_fragments])

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*).  Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the path,
   to provide missing components in the relative URL.  For example:

      >>> from urlparse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result.  For example:

      .. doctest::

         >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
         ...         '//www.python.org/%7Eguido')
         'http://www.python.org/%7Eguido'

      If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
      :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.

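   The preprocessing suggested above could be sketched like this.  The helper
   name ``join_on_base_host`` is hypothetical and not part of the module, and
   the fallback import is an assumption for Python 3.

   .. code-block:: python

      try:
          from urlparse import urljoin, urlsplit, urlunsplit        # Python 2
      except ImportError:
          from urllib.parse import urljoin, urlsplit, urlunsplit    # Python 3


      def join_on_base_host(base, url):
          """Join *url* to *base*, ignoring any scheme or netloc in *url*.

          Hypothetical helper, shown for illustration only.
          """
          scheme, netloc, path, query, fragment = urlsplit(url)
          # Rebuild the URL with empty scheme and netloc parts.
          relative = urlunsplit(('', '', path, query, fragment))
          return urljoin(base, relative)


      base = 'http://www.cwi.nl/%7Eguido/Python.html'
      plain = urljoin(base, '//www.python.org/%7Eguido')
      pinned = join_on_base_host(base, '//www.python.org/%7Eguido')
      print(plain)     # http://www.python.org/%7Eguido
      print(pinned)    # http://www.cwi.nl/%7Eguido

   Whether discarding the foreign host is appropriate depends on the
   application; the sketch only shows the mechanics.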

.. function:: urldefrag(url)

   If *url* contains a fragment identifier, return a 2-tuple holding a modified
   version of *url* with no fragment identifier and the fragment identifier as a
   separate string.  If there is no fragment identifier in *url*, return *url*
   unmodified and an empty string.

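   A short sketch of both cases, with illustrative URLs; the fallback import is
   an assumption for Python 3.

   .. code-block:: python

      try:
          from urlparse import urldefrag        # Python 2
      except ImportError:
          from urllib.parse import urldefrag    # Python 3

      # A fragment is split off and returned separately.
      url, fragment = urldefrag('http://www.python.org/doc/#intro')
      print(url)               # http://www.python.org/doc/
      print(fragment)          # intro

      # Without a fragment, the URL comes back unchanged with an empty string.
      same, empty = urldefrag('http://www.python.org/doc/')
      print(same)              # http://www.python.org/doc/
      print(repr(empty))       # ''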

.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD 66). Any changes to the :mod:`urlparse`
      module should conform to this. Certain deviations may be observed, which
      are mostly for backward compatibility purposes and for certain de-facto
      parsing requirements as commonly observed in major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URL's
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type.  These subclasses add the attributes
described for those functions and provide an additional method:


.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixed point if passed back through the original
   parsing function:

      >>> import urlparse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urlparse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urlparse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'

   .. versionadded:: 2.5

The following classes provide the implementations of the parse results:


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results.

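Since the result classes are ordinary tuple subclasses with the constructors
shown above, an instance can also be built directly from its components.  The
following is a sketch with illustrative values; the fallback import is an
assumption for Python 3, where the classes live in :mod:`urllib.parse`.

.. code-block:: python

   try:
       from urlparse import ParseResult, urlparse        # Python 2
   except ImportError:
       from urllib.parse import ParseResult, urlparse    # Python 3

   # Components in order: scheme, netloc, path, params, query, fragment.
   built = ParseResult('http', 'www.python.org', '/doc/', '', '', '')
   print(built.geturl())     # http://www.python.org/doc/

   # It compares equal to the result of parsing the same URL ...
   parsed = urlparse('http://www.python.org/doc/')
   print(built == parsed)    # True

   # ... and unpacks like any other 6-tuple.
   scheme, netloc, path, params, query, fragment = built
   print(netloc)             # www.python.org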