:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>


11This module provides regular expression matching operations similar to
12those found in Perl. Both patterns and strings to be searched can be
13Unicode strings as well as 8-bit strings.
14
15Regular expressions use the backslash character (``'\'``) to indicate
16special forms or to allow special characters to be used without invoking
17their special meaning.  This collides with Python's usage of the same
18character for the same purpose in string literals; for example, to match
19a literal backslash, one might have to write ``'\\\\'`` as the pattern
20string, because the regular expression must be ``\\``, and each
21backslash must be expressed as ``\\`` inside a regular Python string
22literal.
23
24The solution is to use Python's raw string notation for regular expression
25patterns; backslashes are not handled in any special way in a string literal
26prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
27``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
28newline.  Usually patterns will be expressed in Python code using this raw
29string notation.
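
As a minimal sketch of the difference (the variable names are illustrative
only), the following compares a raw and a regular string literal and then
matches a single literal backslash with both notations:

```python
import re

# A raw literal keeps the backslash; a regular literal turns \n into a newline.
raw_len = len(r"\n")      # two characters: backslash and 'n'
plain_len = len("\n")     # one character: newline

# Both pattern strings below are the same two-character regex, which matches
# one literal backslash in the searched text.
text = "C:\\temp"         # the text C:\temp (a single backslash)
m_plain = re.search("\\\\", text)
m_raw = re.search(r"\\", text)
```

Both searches find the same backslash; the raw form simply needs half as many
backslashes in the source code.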
30
31It is important to note that most regular expression operations are available as
32module-level functions and :class:`RegexObject` methods.  The functions are
33shortcuts that don't require you to compile a regex object first, but miss some
34fine-tuning parameters.
35
36
.. _re-syntax:

Regular Expression Syntax
-------------------------
41
42A regular expression (or RE) specifies a set of strings that matches it; the
43functions in this module let you check if a particular string matches a given
44regular expression (or if a given regular expression matches a particular
45string, which comes down to the same thing).
46
Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*.  This holds unless *A* or *B* contain low-precedence
operations, boundary conditions between *A* and *B*, or numbered group
references.  Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here.  For details of the theory
and implementation of regular expressions, consult the Friedl book referenced
below, or almost any textbook about compiler construction.
56
57A brief explanation of the format of regular expressions follows.  For further
58information and a gentler presentation, consult the :ref:`regex-howto`.
59
Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves.  You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``.  (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)
66
67Some characters, like ``'|'`` or ``'('``, are special. Special
68characters either stand for classes of ordinary characters, or affect
69how the regular expressions around them are interpreted. Regular
70expression pattern strings may not contain null bytes, but can specify
71the null byte using the ``\number`` notation, e.g., ``'\x00'``.
72
Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
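
A small sketch of the nested repetition, assuming a trailing ``$`` is added
so that the whole string has to be consumed (the names are illustrative):

```python
import re

# (?:a{6})* repeats a non-capturing group of exactly six 'a' characters;
# the trailing $ requires the match to reach the end of the string.
six_multiple = re.compile(r"(?:a{6})*$")

matches_12 = six_multiple.match("a" * 12) is not None
matches_7 = six_multiple.match("a" * 7) is not None
```

Twelve characters are accepted because they form two groups of six, while
seven characters are rejected.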
78
79
80The special characters are:
81
82``'.'``
83   (Dot.)  In the default mode, this matches any character except a newline.  If
84   the :const:`DOTALL` flag has been specified, this matches any character
85   including a newline.
86
87``'^'``
88   (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
89   matches immediately after each newline.
90
91``'$'``
92   Matches the end of the string or just before the newline at the end of the
93   string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
94   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
95   only 'foo'.  More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
96   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
97   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
98   the newline, and one at the end of the string.
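
The behaviour of ``'$'`` described above can be checked with a short sketch
(the variable names are illustrative only):

```python
import re

text = "foo1\nfoo2\n"

# Default mode: $ matches only at the very end or before the final newline.
default_hit = re.search(r"foo.$", text).group()

# MULTILINE mode: $ also matches before every newline, so the first line wins.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

# A lone $ in 'foo\n' matches twice: before the newline and at the end.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]
```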
99
100``'*'``
101   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
102   many repetitions as are possible.  ``ab*`` will match 'a', 'ab', or 'a' followed
103   by any number of 'b's.
104
105``'+'``
106   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
107   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
108   match just 'a'.
109
110``'?'``
111   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
112   ``ab?`` will match either 'a' or 'ab'.
113
114``*?``, ``+?``, ``??``
115   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
116   as much text as possible.  Sometimes this behaviour isn't desired; if the RE
117   ``<.*>`` is matched against ``<a> b <c>``, it will match the entire
118   string, and not just ``<a>``.  Adding ``?`` after the qualifier makes it
119   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
120   characters as possible will be matched.  Using the RE ``<.*?>`` will match
121   only ``<a>``.
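
A minimal sketch contrasting the greedy and non-greedy forms on the string
used above (names are illustrative):

```python
import re

text = "<a> b <c>"

# Greedy: .* runs to the last '>' in the string.
greedy = re.match(r"<.*>", text).group()

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.match(r"<.*?>", text).group()
```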
122
123``{m}``
124   Specifies that exactly *m* copies of the previous RE should be matched; fewer
125   matches cause the entire RE not to match.  For example, ``a{6}`` will match
126   exactly six ``'a'`` characters, but not five.
127
128``{m,n}``
129   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
130   RE, attempting to match as many repetitions as possible.  For example,
131   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
132   lower bound of zero,  and omitting *n* specifies an infinite upper bound.  As an
133   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
134   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
135   modifier would be confused with the previously described form.
136
137``{m,n}?``
138   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
139   RE, attempting to match as *few* repetitions as possible.  This is the
140   non-greedy version of the previous qualifier.  For example, on the
141   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
142   while ``a{3,5}?`` will only match 3 characters.
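
The counted qualifiers can be compared in the same way; this sketch only
restates the examples from the text with illustrative variable names:

```python
import re

text = "aaaaaa"

# Greedy form takes as many repetitions as allowed, the lazy form as few.
greedy_len = len(re.match(r"a{3,5}", text).group())
lazy_len = len(re.match(r"a{3,5}?", text).group())

# With an open upper bound, a{4,}b needs at least four 'a' characters.
four_or_more = re.match(r"a{4,}b", "aaaab") is not None
too_few = re.match(r"a{4,}b", "aaab") is not None
```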
143
144``'\'``
145   Either escapes special characters (permitting you to match characters like
146   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
147   sequences are discussed below.
148
   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string.  However, if Python would
   recognize the resulting sequence, the backslash should be doubled.  This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.
156
157``[]``
158   Used to indicate a set of characters.  In a set:
159
160   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
161     ``'m'``, or ``'k'``.
162
   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit.  If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
     it will match a literal ``'-'``.
169
170   * Special characters lose their special meaning inside sets.  For example,
171     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
172     ``'*'``, or ``')'``.
173
   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`LOCALE` or :const:`UNICODE` mode is in force.
177
178   * Characters that are not within a range can be matched by :dfn:`complementing`
179     the set.  If the first character of the set is ``'^'``, all the characters
180     that are *not* in the set will be matched.  For example, ``[^5]`` will match
181     any character except ``'5'``, and ``[^^]`` will match any character except
182     ``'^'``.  ``^`` has no special meaning if it's not the first character in
183     the set.
184
   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set.  For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a parenthesis.
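
The set rules listed above can be tried out with :func:`findall`; the
following is a sketch with made-up sample strings and variable names:

```python
import re

# A hyphen that is escaped, or placed first or last, is a literal '-'.
hyphen_escaped = re.findall(r"[a\-z]", "a-z m")
hyphen_last = re.findall(r"[a-]", "a-z m")

# A leading ^ complements the set.
not_five = re.findall(r"[^5]", "1525")

# Special characters are literal inside a set.
literals = re.findall(r"[(+*)]", "(1+2)*3")

# A ']' is literal when escaped or when it is the first character of the set.
brackets_escaped = re.findall(r"[()[\]{}]", "f(x)[0]")
brackets_first = re.findall(r"[]()[{}]", "f(x)[0]")
```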
188
189``'|'``
190   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
191   will match either A or B.  An arbitrary number of REs can be separated by the
192   ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
193   the target string is scanned, REs separated by ``'|'`` are tried from left to
194   right. When one pattern completely matches, that branch is accepted. This means
195   that once ``A`` matches, ``B`` will not be tested further, even if it would
196   produce a longer overall match.  In other words, the ``'|'`` operator is never
197   greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
198   character class, as in ``[|]``.
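
A short sketch of the left-to-right rule for ``'|'`` (the names are
illustrative): the order of the alternatives decides which one is taken, not
the length of the match.

```python
import re

# 'a' is tried first and succeeds, so the longer 'ab' is never attempted.
first_branch = re.match(r"a|ab", "ab").group()

# Listing the longer alternative first changes the result.
reordered = re.match(r"ab|a", "ab").group()

# A literal '|' can be written inside a character class.
literal_bar = re.findall(r"[|]", "a|b")
```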
199
200``(...)``
201   Matches whatever regular expression is inside the parentheses, and indicates the
202   start and end of a group; the contents of a group can be retrieved after a match
203   has been performed, and can be matched later in the string with the ``\number``
204   special sequence, described below.  To match the literals ``'('`` or ``')'``,
205   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
206
207``(?...)``
208   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
209   otherwise).  The first character after the ``'?'`` determines what the meaning
210   and further syntax of the construct is. Extensions usually do not create a new
211   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
212   currently supported extensions.
213
214``(?iLmsux)``
215   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
216   ``'u'``, ``'x'``.)  The group matches the empty string; the letters
217   set the corresponding flags: :const:`re.I` (ignore case),
218   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
219   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
220   and :const:`re.X` (verbose), for the entire regular expression. (The
221   flags are described in :ref:`contents-of-module-re`.) This
222   is useful if you wish to include the flags as part of the regular
223   expression, instead of passing a *flag* argument to the
224   :func:`re.compile` function.
225
226   Note that the ``(?x)`` flag changes how the expression is parsed. It should be
227   used first in the expression string, or after one or more whitespace characters.
228   If there are non-whitespace characters before the flag, the results are
229   undefined.
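
As a sketch, an inline flag group placed at the start of the pattern behaves
like the corresponding *flags* argument (sample strings and names are
illustrative):

```python
import re

# (?i) at the start of the pattern is equivalent to passing re.IGNORECASE.
inline = re.match(r"(?i)spam", "SPAM")
explicit = re.match(r"spam", "SPAM", re.IGNORECASE)

# Several letters can be combined: i (ignore case) and s (dot matches newline).
combined = re.match(r"(?is)a.b", "A\nB")
```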
230
231``(?:...)``
232   A non-capturing version of regular parentheses.  Matches whatever regular
233   expression is inside the parentheses, but the substring matched by the group
234   *cannot* be retrieved after performing a match or referenced later in the
235   pattern.
236
237``(?P<name>...)``
238   Similar to regular parentheses, but the substring matched by the group is
239   accessible via the symbolic group name *name*.  Group names must be valid
240   Python identifiers, and each group name must be defined only once within a
241   regular expression.  A symbolic group is also a numbered group, just as if
242   the group were not named.
243
244   Named groups can be referenced in three contexts.  If the pattern is
245   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
246   single or double quotes):
247
   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object ``m``    | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the ``repl``    | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
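
The three contexts can be exercised together in a brief sketch; the sample
text and the replacement template are made up for illustration:

```python
import re

# Context 1: the pattern refers back to its own named group with (?P=quote).
pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = """say "hi" now"""
m = pattern.search(text)

# Context 2: the match object exposes the group by name or by number.
by_name = m.group("quote")
by_number = m.group(1)
end_of_quote = m.end("quote")

# Context 3: in a replacement string, \g<quote>, \g<1> and \1 are equivalent.
swapped_named = pattern.sub(r"<\g<quote>>", text)
swapped_numbered = pattern.sub(r"<\g<1>>", text)
swapped_short = pattern.sub(r"<\1>", text)
```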
261
262``(?P=name)``
263   A backreference to a named group; it matches whatever text was matched by the
264   earlier group named *name*.
265
266``(?#...)``
267   A comment; the contents of the parentheses are simply ignored.
268
269``(?=...)``
270   Matches if ``...`` matches next, but doesn't consume any of the string.  This is
271   called a lookahead assertion.  For example, ``Isaac (?=Asimov)`` will match
272   ``'Isaac '`` only if it's followed by ``'Asimov'``.
273
274``(?!...)``
275   Matches if ``...`` doesn't match next.  This is a negative lookahead assertion.
276   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
277   followed by ``'Asimov'``.
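
A sketch of both lookahead forms; note that the asserted text is checked but
never becomes part of the match (names are illustrative):

```python
import re

# Positive lookahead: 'Isaac ' matches only when 'Asimov' follows.
positive = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")
positive_miss = re.search(r"Isaac (?=Asimov)", "Isaac Newton")

# Negative lookahead: 'Isaac ' matches only when 'Asimov' does not follow.
negative = re.search(r"Isaac (?!Asimov)", "Isaac Newton")
negative_miss = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")
```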
278
279``(?<=...)``
280   Matches if the current position in the string is preceded by a match for ``...``
281   that ends at the current position.  This is called a :dfn:`positive lookbehind
282   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
283   lookbehind will back up 3 characters and check if the contained pattern matches.
284   The contained pattern must only match strings of some fixed length, meaning that
285   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Group
286   references are not supported even if they match strings of some fixed length.
287   Note that
288   patterns which start with positive lookbehind assertions will not match at the
289   beginning of the string being searched; you will most likely want to use the
290   :func:`search` function rather than the :func:`match` function:
291
      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'
296
297   This example looks for a word following a hyphen:
298
      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'
302
303``(?<!...)``
304   Matches if the current position in the string is not preceded by a match for
305   ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
306   positive lookbehind assertions, the contained pattern must only match strings of
307   some fixed length and shouldn't contain group references.
308   Patterns which start with negative lookbehind assertions may
309   match at the beginning of the string being searched.
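
A sketch of a negative lookbehind, using a made-up sample string: it keeps
words that are not immediately preceded by a hyphen, and it can succeed at
the very start of the string, where nothing precedes the match.

```python
import re

# 'egg' follows a hyphen and is skipped; 'spam' and 'ham' are kept.
words = re.findall(r"(?<!-)\b\w+", "spam-egg ham")

# Nothing precedes position 0, so the assertion holds there.
at_start = re.match(r"(?<!-)\w+", "spam")
```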
310
311``(?(id/name)yes-pattern|no-pattern)``
312   Will try to match with ``yes-pattern`` if the group with given *id* or *name*
313   exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
314   can be omitted. For example,  ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
315   matching pattern, which will match with ``'<user@host.com>'`` as well as
316   ``'user@host.com'``, but not with ``'<user@host.com'``.
317
318   .. versionadded:: 2.4
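
The conditional pattern from the text can be tried with :func:`match`, which
anchors the attempt at the start of the string (a sketch; the names are
illustrative):

```python
import re

# (?(1)>) demands a closing '>' only if group 1, the opening '<', took part.
pattern = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)")

bracketed = pattern.match("<user@host.com>")
bare = pattern.match("user@host.com")
unbalanced = pattern.match("<user@host.com")
```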
319
320The special sequences consist of ``'\'`` and a character from the list below.
321If the ordinary character is not on the list, then the resulting RE will match
322the second character.  For example, ``\$`` matches the character ``'$'``.
323
324``\number``
325   Matches the contents of the group of the same number.  Groups are numbered
326   starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
327   but not ``'thethe'`` (note the space after the group).  This special sequence
328   can only be used to match one of the first 99 groups.  If the first digit of
329   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
330   a group match, but as the character with octal value *number*. Inside the
331   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
332   characters.
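
A minimal sketch of a numbered backreference, using the strings from the
text (names are illustrative):

```python
import re

# \1 must repeat exactly what group 1 matched, after a single space.
repeated_words = re.match(r"(.+) \1", "the the")
repeated_digits = re.match(r"(.+) \1", "55 55")
no_space = re.match(r"(.+) \1", "thethe")
```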
333
334``\A``
335   Matches only at the start of the string.
336
337``\b``
338   Matches the empty string, but only at the beginning or end of a word.  A word is
339   defined as a sequence of alphanumeric or underscore characters, so the end of a
340   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
341   Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
342   a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
343   of the string, so the precise set of characters deemed to be alphanumeric
344   depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
345   For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
346   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
347   Inside a character range, ``\b`` represents the backspace character, for
348   compatibility with Python's string literals.
349
350``\B``
351   Matches the empty string, but only when it is *not* at the beginning or end of a
352   word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
353   but not ``'py'``, ``'py.'``, or ``'py!'``.
354   ``\B`` is just the opposite of ``\b``, so is also subject to the settings
355   of ``LOCALE`` and ``UNICODE``.
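
The ``\b`` and ``\B`` examples above can be verified with a short sketch
(list and variable names are illustrative):

```python
import re

word = re.compile(r"\bfoo\b")
hits = [s for s in ("foo", "foo.", "(foo)", "bar foo baz") if word.search(s)]
false_hits = [s for s in ("foobar", "foo3") if word.search(s)]

inside = re.compile(r"py\B")
inside_hits = [s for s in ("python", "py3", "py2") if inside.search(s)]
inside_false_hits = [s for s in ("py", "py.", "py!") if inside.search(s)]

# Inside a character set, \b stands for the backspace character instead.
backspace = re.search(r"[\b]", "a\bc")
```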
356
357``\d``
358   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
359   is equivalent to the set ``[0-9]``.  With :const:`UNICODE`, it will match
360   whatever is classified as a decimal digit in the Unicode character properties
361   database.
362
363``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``.  With :const:`UNICODE`, it
   will match any character that is not marked as a digit in the Unicode
   character properties database.
368
369``\s``
   When the :const:`UNICODE` flag is not specified, it matches any whitespace
   character; this is equivalent to the set ``[ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on matching of the space.
   If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
   plus whatever is classified as space in the Unicode character properties
   database.
376
377``\S``
   When the :const:`UNICODE` flag is not specified, matches any non-whitespace
   character; this is equivalent to the set ``[^ \t\n\r\f\v]``.  The
   :const:`LOCALE` flag has no extra effect on non-whitespace match.  If
   :const:`UNICODE` is set, then any character not marked as space in the
   Unicode character properties database is matched.
383
384
385``\w``
386   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
387   any alphanumeric character and the underscore; this is equivalent to the set
388   ``[a-zA-Z0-9_]``.  With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
389   whatever characters are defined as alphanumeric for the current locale.  If
390   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
391   is classified as alphanumeric in the Unicode character properties database.
392
393``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
   this will match any character that is not in ``[0-9_]`` and is not classified
   as alphanumeric in the Unicode character properties database.
400
401``\Z``
402   Matches only at the end of the string.
403
If both :const:`LOCALE` and :const:`UNICODE` flags are included for a
particular sequence, then the :const:`LOCALE` flag takes effect first, followed by
the :const:`UNICODE` flag.
407
408Most of the standard escapes supported by Python string literals are also
409accepted by the regular expression parser::
410
   \a      \b      \f      \n
   \r      \t      \v      \x
   \\
414
415(Note that ``\b`` is used to represent word boundaries, and means "backspace"
416only inside character classes.)
417
418Octal escapes are included in a limited form: If the first digit is a 0, or if
419there are three octal digits, it is considered an octal escape. Otherwise, it is
420a group reference.  As for string literals, octal escapes are always at most
421three digits in length.
422
423.. seealso::
424
425   Mastering Regular Expressions
426      Book on regular expressions by Jeffrey Friedl, published by O'Reilly.  The
427      second edition of the book no longer covers Python at all, but the first
428      edition covered writing good regular expression patterns in great detail.
429
430
431
.. _contents-of-module-re:

Module Contents
---------------
436
The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions.  Most non-trivial applications use the compiled form.
441
442
443.. function:: compile(pattern, flags=0)
444
445   Compile a regular expression pattern into a regular expression object, which
446   can be used for matching using its :func:`~RegexObject.match` and
447   :func:`~RegexObject.search` methods, described below.
448
449   The expression's behaviour can be modified by specifying a *flags* value.
450   Values can be any of the following variables, combined using bitwise OR (the
451   ``|`` operator).
452
   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)
461
462   but using :func:`re.compile` and saving the resulting regular expression
463   object for reuse is more efficient when the expression will be used several
464   times in a single program.
465
466   .. note::
467
468      The compiled versions of the most recent patterns passed to
469      :func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
470      programs that use only a few regular expressions at a time needn't worry
471      about compiling regular expressions.
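
A sketch of reusing one compiled object across several strings; the pattern
and the sample lines are made up for illustration, and both approaches give
the same results:

```python
import re

pattern = r"\d+"
prog = re.compile(pattern)   # compile once, reuse many times

lines = ["42 apples", "no digits", "7 pears"]

# Using the compiled object's match() method.
compiled_groups = [m.group() if m else None
                   for m in (prog.match(line) for line in lines)]

# Using the module-level function with the pattern string each time.
module_groups = [m.group() if m else None
                 for m in (re.match(pattern, line) for line in lines)]
```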
472
473
474.. data:: DEBUG
475
   Display debug information about the compiled expression.
477
478
479.. data:: I
480          IGNORECASE
481
482   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
483   lowercase letters, too.  This is not affected by the current locale.
484
485
486.. data:: L
487          LOCALE
488
489   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
490   current locale.
491
492
493.. data:: M
494          MULTILINE
495
496   When specified, the pattern character ``'^'`` matches at the beginning of the
497   string and at the beginning of each line (immediately following each newline);
498   and the pattern character ``'$'`` matches at the end of the string and at the
499   end of each line (immediately preceding each newline).  By default, ``'^'``
500   matches only at the beginning of the string, and ``'$'`` only at the end of the
501   string and immediately before the newline (if any) at the end of the string.
502
503
504.. data:: S
505          DOTALL
506
507   Make the ``'.'`` special character match any character at all, including a
508   newline; without this flag, ``'.'`` will match anything *except* a newline.
509
510
511.. data:: U
512          UNICODE
513
514   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
515   on the Unicode character properties database.
516
517   .. versionadded:: 2.0
518
519
520.. data:: X
521          VERBOSE
522
523   This flag allows you to write regular expressions that look nicer and are
524   more readable by allowing you to visually separate logical sections of the
525   pattern and add comments. Whitespace within the pattern is ignored, except
526   when in a character class or when preceded by an unescaped backslash.
527   When a line contains a ``#`` that is not in a character class and is not
528   preceded by an unescaped backslash, all characters from the leftmost such
529   ``#`` through the end of the line are ignored.
530
531   This means that the two following regular expression objects that match a
532   decimal number are functionally equal::
533
      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")
538
539
540.. function:: search(pattern, string, flags=0)
541
542   Scan through *string* looking for the first location where the regular expression
543   *pattern* produces a match, and return a corresponding :class:`MatchObject`
544   instance. Return ``None`` if no position in the string matches the pattern; note
545   that this is different from finding a zero-length match at some point in the
546   string.
547
548
549.. function:: match(pattern, string, flags=0)
550
551   If zero or more characters at the beginning of *string* match the regular
552   expression *pattern*, return a corresponding :class:`MatchObject` instance.
553   Return ``None`` if the string does not match the pattern; note that this is
554   different from a zero-length match.
555
556   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
557   at the beginning of the string and not at the beginning of each line.
558
559   If you want to locate a match anywhere in *string*, use :func:`search`
560   instead (see also :ref:`search-vs-match`).
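
The difference between :func:`match` and :func:`search`, including the
:const:`MULTILINE` case, is sketched here with a made-up two-line string:

```python
import re

text = "first\nsecond"

# match() only tries the very beginning of the string.
match_result = re.match(r"second", text)
search_result = re.search(r"second", text)

# Even with MULTILINE, match() does not try the start of the second line.
match_multiline = re.match(r"^second", text, re.MULTILINE)
search_multiline = re.search(r"^second", text, re.MULTILINE)
```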
561
562
563.. function:: split(pattern, string, maxsplit=0, flags=0)
564
   Split *string* by the occurrences of *pattern*.  If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list.  (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored.  This has been fixed in later releases.)
571
      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']
580
581   If there are capturing groups in the separator and it matches at the start of
582   the string, the result will start with an empty string.  The same holds for
583   the end of the string:
584
      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']
587
   That way, separator components are always found at the same relative
   indices within the result list (e.g., if there's one capturing group
   in the separator, the 1st, the 3rd and so forth).
591
592   Note that *split* will never split a string on an empty pattern match.
593   For example:
594
      >>> re.split('x*', 'foo')
      ['foo']
      >>> re.split("(?m)^$", "foo\n\nbar\n")
      ['foo\n\nbar\n']
599
600   .. versionchanged:: 2.7
601      Added the optional flags argument.
602
603
604.. function:: findall(pattern, string, flags=0)
605
606   Return all non-overlapping matches of *pattern* in *string*, as a list of
607   strings.  The *string* is scanned left-to-right, and matches are returned in
608   the order found.  If one or more groups are present in the pattern, return a
609   list of groups; this will be a list of tuples if the pattern has more than
610   one group.  Empty matches are included in the result unless they touch the
611   beginning of another match.
612
613   .. versionadded:: 1.5.2
614
615   .. versionchanged:: 2.4
616      Added the optional flags argument.
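
How the shape of the :func:`findall` result depends on the number of groups
is sketched below; the sample text and names are illustrative:

```python
import re

text = "set width=20 and height=10"

# No group: a list of whole matches.
no_group = re.findall(r"\w+=\d+", text)

# One group: a list of that group's text.
one_group = re.findall(r"\w+=(\d+)", text)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)=(\d+)", text)
```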
617
618
619.. function:: finditer(pattern, string, flags=0)
620
621   Return an :term:`iterator` yielding :class:`MatchObject` instances over all
622   non-overlapping matches for the RE *pattern* in *string*.  The *string* is
623   scanned left-to-right, and matches are returned in the order found.  Empty
624   matches are included in the result unless they touch the beginning of another
625   match.
626
627   .. versionadded:: 2.2
628
629   .. versionchanged:: 2.4
630      Added the optional flags argument.
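
Because :func:`finditer` yields match objects rather than strings, positions
are available as well. A small sketch with a made-up sample string:

```python
import re

text = "set width=20 and height=10"

# Collect the captured name together with the span of each whole match.
spans = [(m.group(1), m.span()) for m in re.finditer(r"(\w+)=\d+", text)]
```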
631
632
633.. function:: sub(pattern, repl, string, count=0, flags=0)
634
635   Return the string obtained by replacing the leftmost non-overlapping occurrences
636   of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
637   *string* is returned unchanged.  *repl* can be a string or a function; if it is
638   a string, any backslash escapes in it are processed.  That is, ``\n`` is
639   converted to a single newline character, ``\r`` is converted to a carriage return, and
640   so forth.  Unknown escapes such as ``\j`` are left alone.  Backreferences, such
641   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
642   For example:
643
      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'
648
649   If *repl* is a function, it is called for every non-overlapping occurrence of
650   *pattern*.  The function takes a single match object argument, and returns the
651   replacement string.  For example:
652
      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'
660
661   The pattern may be a string or an RE object.
662
663   The optional argument *count* is the maximum number of pattern occurrences to be
664   replaced; *count* must be a non-negative integer.  If omitted or zero, all
665   occurrences will be replaced. Empty matches for the pattern are replaced only
666   when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
667   ``'-a-b-c-'``.
668
669   In string-type *repl* arguments, in addition to the character escapes and
670   backreferences described above,
671   ``\g<name>`` will use the substring matched by the group named ``name``, as
672   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
673   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
674   in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
675   reference to group 20, not a reference to group 2 followed by the literal
676   character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
677   substring matched by the RE.
678
679   .. versionchanged:: 2.7
680      Added the optional flags argument.
681
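As a minimal sketch of the ``\g<name>`` and ``\g<number>`` replacement forms and of the *count* argument described for :func:`sub`, the following rewrites dates in a short string.  The sample text, the pattern, and the variable names are illustrative assumptions, not part of the module.  The code is written to run unchanged on Python 2.7 and Python 3.

```python
import re

# Hypothetical sample data: dates written as YYYY-MM-DD.
text = "released 2010-07-03, patched 2011-06-11"
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Named groups are referenced with \g<name> in the replacement string.
by_name = date.sub(r"\g<day>/\g<month>/\g<year>", text)

# \g<1>0 means group 1 followed by a literal "0"; r"\10" would instead
# refer to group 10, which this pattern does not define.
padded = re.sub(r"(\d)x", r"\g<1>0", "3x 7x")

# count limits the number of replacements, counted from the left.
first_only = date.sub("<date>", text, count=1)

print(by_name)
print(padded)
print(first_only)
```

Using a raw string for *repl* keeps the backslash in ``\g`` from being interpreted by the Python string literal before :func:`sub` sees it.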
682
683.. function:: subn(pattern, repl, string, count=0, flags=0)
684
685   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
686   number_of_subs_made)``.
687
688   .. versionchanged:: 2.7
689      Added the optional flags argument.
690
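A short sketch of the tuple returned by :func:`subn`, assuming a made-up input string in which runs of two or more whitespace characters are collapsed to a single space:

```python
import re

# Hypothetical input; the pattern matches runs of at least two whitespace characters.
new_string, number_of_subs_made = re.subn(r"\s{2,}", " ", "a  b   c d")

print(new_string)
print(number_of_subs_made)
```

The count is handy when the caller needs to know whether anything changed without comparing the two strings.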
691
692.. function:: escape(string)
693
694   Return *string* with all non-alphanumerics backslashed; this is useful if you
695   want to match an arbitrary literal string that may have regular expression
696   metacharacters in it.
697
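The following sketch illustrates why :func:`escape` matters when a literal string contains metacharacters.  The sample strings are assumptions chosen for illustration.  The escaped text itself is not printed, because which characters receive a backslash can differ between Python versions.

```python
import re

# Hypothetical literal containing the metacharacters "$", ".", "(" and ")".
literal = "price: $4.99 (approx.)"
haystack = "list price: $4.99 (approx.) today"

# Escaped, the literal is matched character for character.
escaped_match = re.search(re.escape(literal), haystack)

# Unescaped, "$" is read as an end-of-string anchor, so nothing matches.
unescaped_match = re.search(literal, haystack)

print(escaped_match.group(0) == literal)
print(unescaped_match is None)
```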
698
699.. function:: purge()
700
701   Clear the regular expression cache.
702
703
704.. exception:: error
705
706   Exception raised when a string passed to one of the functions here is not a
707   valid regular expression (for example, it might contain unmatched parentheses)
708   or when some other error occurs during compilation or matching.  It is never an
709   error if a string contains no match for a pattern.
710
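A minimal sketch of handling :exc:`error`, using a hypothetical helper named ``is_valid_pattern``.  It also shows that a failed search is reported by returning ``None``, not by raising an exception.

```python
import re

def is_valid_pattern(pattern):
    """Return True if *pattern* compiles, False if re.error is raised.

    Hypothetical helper for illustration; it is not part of the re module.
    """
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

balanced = is_valid_pattern(r"(\d+)")    # parentheses match
unbalanced = is_valid_pattern(r"(\d+")   # unmatched "("

# A pattern that simply finds nothing raises no exception.
no_match = re.search("x", "abc")

print(balanced)
print(unbalanced)
print(no_match is None)
```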
711
712.. _re-objects:
713
714Regular Expression Objects
715--------------------------
716
717.. class:: RegexObject
718
719   The :class:`RegexObject` class supports the following methods and attributes:
720
721   .. method:: RegexObject.search(string[, pos[, endpos]])
722
723      Scan through *string* looking for a location where this regular expression
724      produces a match, and return a corresponding :class:`MatchObject` instance.
725      Return ``None`` if no position in the string matches the pattern; note that this
726      is different from finding a zero-length match at some point in the string.
727
728      The optional second parameter *pos* gives an index in the string where the
729      search is to start; it defaults to ``0``.  This is not completely equivalent to
730      slicing the string; the ``'^'`` pattern character matches at the real beginning
731      of the string and at positions just after a newline, but not necessarily at the
732      index where the search is to start.
733
      The optional parameter *endpos* limits how far the string will be searched; it
      will be as if the string is *endpos* characters long, so only the characters
      from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
      than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
      expression object, ``rx.search(string, 0, 50)`` is equivalent to
      ``rx.search(string[:50], 0)``.
740
741      >>> pattern = re.compile("d")
742      >>> pattern.search("dog")     # Match at index 0
743      <_sre.SRE_Match object at ...>
744      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
745
746
747   .. method:: RegexObject.match(string[, pos[, endpos]])
748
749      If zero or more characters at the *beginning* of *string* match this regular
750      expression, return a corresponding :class:`MatchObject` instance.  Return
751      ``None`` if the string does not match the pattern; note that this is different
752      from a zero-length match.
753
754      The optional *pos* and *endpos* parameters have the same meaning as for the
755      :meth:`~RegexObject.search` method.
756
757      >>> pattern = re.compile("o")
758      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
759      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
760      <_sre.SRE_Match object at ...>
761
762      If you want to locate a match anywhere in *string*, use
763      :meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).
764
765
766   .. method:: RegexObject.split(string, maxsplit=0)
767
768      Identical to the :func:`split` function, using the compiled pattern.
769
770
771   .. method:: RegexObject.findall(string[, pos[, endpos]])
772
      Similar to the :func:`findall` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region in the same way as for :meth:`~RegexObject.match`.
776
777
778   .. method:: RegexObject.finditer(string[, pos[, endpos]])
779
      Similar to the :func:`finditer` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region in the same way as for :meth:`~RegexObject.match`.
783
784
785   .. method:: RegexObject.sub(repl, string, count=0)
786
787      Identical to the :func:`sub` function, using the compiled pattern.
788
789
790   .. method:: RegexObject.subn(repl, string, count=0)
791
792      Identical to the :func:`subn` function, using the compiled pattern.
793
794
795   .. attribute:: RegexObject.flags
796
797      The regex matching flags.  This is a combination of the flags given to
798      :func:`.compile` and any ``(?...)`` inline flags in the pattern.
799
800
801   .. attribute:: RegexObject.groups
802
803      The number of capturing groups in the pattern.
804
805
806   .. attribute:: RegexObject.groupindex
807
808      A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
809      numbers.  The dictionary is empty if no symbolic groups were used in the
810      pattern.
811
812
813   .. attribute:: RegexObject.pattern
814
815      The pattern string from which the RE object was compiled.
816
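The sketch below exercises the *pos* and *endpos* parameters and the attributes of a compiled pattern described in this section.  The pattern and the sample strings are assumptions made up for illustration.

```python
import re

# Hypothetical pattern: a word, optionally followed by one digit.
rx = re.compile(r"(?P<word>[a-z]+)(\d)?", re.IGNORECASE)

# endpos makes the search behave as if the string were endpos characters long.
limited = rx.search("abc1", 0, 3)
sliced = rx.search("abc1"[:3], 0)

# pos is not the same as slicing: '^' still refers to the real start of
# the string, so nothing can match when the search starts at index 1.
anchored = re.compile(r"^b")
with_pos = anchored.search("ab", 1)
with_slice = anchored.search("ab"[1:])

print(limited.group(0) == sliced.group(0))
print(with_pos is None)
print(with_slice.group(0))
print(rx.groups)                        # number of capturing groups
print(dict(rx.groupindex))              # group name -> group number
print(bool(rx.flags & re.IGNORECASE))   # flag given to compile()
```

Testing a flag with ``&`` is more robust than comparing ``rx.flags`` to a constant, since the attribute may combine several flags.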
817
818.. _match-objects:
819
820Match Objects
821-------------
822
823.. class:: MatchObject
824
   Match objects always have a boolean value of ``True``.
   Since :meth:`~RegexObject.match` and :meth:`~RegexObject.search` return ``None``
   when there is no match, you can test whether there was a match with a simple
   ``if`` statement::
829
830      match = re.search(pattern, string)
831      if match:
832          process(match)
833
834   Match objects support the following methods and attributes:
835
836
837   .. method:: MatchObject.expand(template)
838
839      Return the string obtained by doing backslash substitution on the template
840      string *template*, as done by the :meth:`~RegexObject.sub` method.  Escapes
841      such as ``\n`` are converted to the appropriate characters, and numeric
842      backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
843      ``\g<name>``) are replaced by the contents of the corresponding group.
844
845
846   .. method:: MatchObject.group([group1, ...])
847
848      Returns one or more subgroups of the match.  If there is a single argument, the
849      result is a single string; if there are multiple arguments, the result is a
850      tuple with one item per argument. Without arguments, *group1* defaults to zero
851      (the whole match is returned). If a *groupN* argument is zero, the corresponding
852      return value is the entire matching string; if it is in the inclusive range
853      [1..99], it is the string matching the corresponding parenthesized group.  If a
854      group number is negative or larger than the number of groups defined in the
855      pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
856      part of the pattern that did not match, the corresponding result is ``None``.
857      If a group is contained in a part of the pattern that matched multiple times,
858      the last match is returned.
859
860         >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
861         >>> m.group(0)       # The entire match
862         'Isaac Newton'
863         >>> m.group(1)       # The first parenthesized subgroup.
864         'Isaac'
865         >>> m.group(2)       # The second parenthesized subgroup.
866         'Newton'
867         >>> m.group(1, 2)    # Multiple arguments give us a tuple.
868         ('Isaac', 'Newton')
869
870      If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
871      arguments may also be strings identifying groups by their group name.  If a
872      string argument is not used as a group name in the pattern, an :exc:`IndexError`
873      exception is raised.
874
875      A moderately complicated example:
876
877         >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
878         >>> m.group('first_name')
879         'Malcolm'
880         >>> m.group('last_name')
881         'Reynolds'
882
883      Named groups can also be referred to by their index:
884
885         >>> m.group(1)
886         'Malcolm'
887         >>> m.group(2)
888         'Reynolds'
889
890      If a group matches multiple times, only the last match is accessible:
891
892         >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
893         >>> m.group(1)                        # Returns only the last match.
894         'c3'
895
896
897   .. method:: MatchObject.groups([default])
898
899      Return a tuple containing all the subgroups of the match, from 1 up to however
900      many groups are in the pattern.  The *default* argument is used for groups that
901      did not participate in the match; it defaults to ``None``.  (Incompatibility
902      note: in the original Python 1.5 release, if the tuple was one element long, a
903      string would be returned instead.  In later versions (from 1.5.1 on), a
904      singleton tuple is returned in such cases.)
905
906      For example:
907
908         >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
909         >>> m.groups()
910         ('24', '1632')
911
912      If we make the decimal place and everything after it optional, not all groups
913      might participate in the match.  These groups will default to ``None`` unless
914      the *default* argument is given:
915
916         >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
917         >>> m.groups()      # Second group defaults to None.
918         ('24', None)
919         >>> m.groups('0')   # Now, the second group defaults to '0'.
920         ('24', '0')
921
922
923   .. method:: MatchObject.groupdict([default])
924
925      Return a dictionary containing all the *named* subgroups of the match, keyed by
926      the subgroup name.  The *default* argument is used for groups that did not
927      participate in the match; it defaults to ``None``.  For example:
928
929         >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
930         >>> m.groupdict()
931         {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
932
933
934   .. method:: MatchObject.start([group])
935               MatchObject.end([group])
936
937      Return the indices of the start and end of the substring matched by *group*;
938      *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
939      *group* exists but did not contribute to the match.  For a match object *m*, and
940      a group *g* that did contribute to the match, the substring matched by group *g*
941      (equivalent to ``m.group(g)``) is ::
942
943         m.string[m.start(g):m.end(g)]
944
945      Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
946      null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
947      ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
948      2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
949
950      An example that will remove *remove_this* from email addresses:
951
952         >>> email = "tony@tiremove_thisger.net"
953         >>> m = re.search("remove_this", email)
954         >>> email[:m.start()] + email[m.end():]
955         'tony@tiger.net'
956
957
958   .. method:: MatchObject.span([group])
959
960      For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
961      m.end(group))``. Note that if *group* did not contribute to the match, this is
962      ``(-1, -1)``.  *group* defaults to zero, the entire match.
963
964
965   .. attribute:: MatchObject.pos
966
967      The value of *pos* which was passed to the :meth:`~RegexObject.search` or
968      :meth:`~RegexObject.match` method of the :class:`RegexObject`.  This is the
969      index into the string at which the RE engine started looking for a match.
970
971
972   .. attribute:: MatchObject.endpos
973
974      The value of *endpos* which was passed to the :meth:`~RegexObject.search` or
975      :meth:`~RegexObject.match` method of the :class:`RegexObject`.  This is the
976      index into the string beyond which the RE engine will not go.
977
978
979   .. attribute:: MatchObject.lastindex
980
981      The integer index of the last matched capturing group, or ``None`` if no group
982      was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
983      ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
984      the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
985      string.
986
987
988   .. attribute:: MatchObject.lastgroup
989
990      The name of the last matched capturing group, or ``None`` if the group didn't
991      have a name, or if no group was matched at all.
992
993
994   .. attribute:: MatchObject.re
995
996      The regular expression object whose :meth:`~RegexObject.match` or
997      :meth:`~RegexObject.search` method produced this :class:`MatchObject`
998      instance.
999
1000
1001   .. attribute:: MatchObject.string
1002
1003      The string passed to :meth:`~RegexObject.match` or
1004      :meth:`~RegexObject.search`.
1005
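As a sketch of how the position-related methods and attributes of a match object fit together, the following compares a match in which every group participates with one in which an optional group does not.  The pattern and the input strings are hypothetical.

```python
import re

# Hypothetical pattern: a key, "=", and an optional numeric value.
pattern = re.compile(r"(?P<key>\w+)=(?P<value>\d+)?")

full = pattern.search("set width=42 now")
# For a group that took part in the match, slicing m.string with
# start() and end() gives the same text as group().
key_text = full.string[full.start("key"):full.end("key")]
value_span = full.span("value")
summary = full.expand(r"\g<key> is \g<value>")

partial = pattern.search("set width= now")
# The "value" group exists but did not contribute to this match.
missing_span = partial.span("value")
defaults = partial.groupdict("n/a")

print(key_text)
print(value_span)
print(summary)
print(missing_span)
print(partial.lastindex)
print(partial.lastgroup)
```

Checking for ``-1`` (or for ``None`` from :meth:`~MatchObject.group`) before slicing avoids silently using a span that does not correspond to any matched text.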
1006
1007Examples
1008--------
1009
1010
1011Checking For a Pair
1012^^^^^^^^^^^^^^^^^^^
1013
1014In this example, we'll use the following helper function to display match
1015objects a little more gracefully:
1016
1017.. testcode::
1018
1019   def displaymatch(match):
1020       if match is None:
1021           return None
1022       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
1023
1024Suppose you are writing a poker program where a player's hand is represented as
1025a 5-character string with each character representing a card, "a" for ace, "k"
1026for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
1027representing the card with that value.
1028
1029To see if a given string is a valid hand, one could do the following:
1030
1031   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
1032   >>> displaymatch(valid.match("akt5q"))  # Valid.
1033   "<Match: 'akt5q', groups=()>"
1034   >>> displaymatch(valid.match("akt5e"))  # Invalid.
1035   >>> displaymatch(valid.match("akt"))    # Invalid.
1036   >>> displaymatch(valid.match("727ak"))  # Valid.
1037   "<Match: '727ak', groups=()>"
1038
That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as follows:
1041
1042   >>> pair = re.compile(r".*(.).*\1")
1043   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
1044   "<Match: '717', groups=('7',)>"
1045   >>> displaymatch(pair.match("718ak"))     # No pairs.
1046   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
1047   "<Match: '354aa', groups=('a',)>"
1048
1049To find out what card the pair consists of, one could use the
1050:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
1051manner:
1052
1053.. doctest::
1054
   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'
1067
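One way to avoid the :exc:`AttributeError` shown above is to test the result of :meth:`~RegexObject.match` before calling :meth:`~MatchObject.group`.  The sketch below wraps that check in a hypothetical helper named ``pair_card``; the name is an assumption for illustration only.

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_card(hand):
    """Return the card that forms a pair in *hand*, or None if there is no pair.

    Hypothetical helper for illustration.
    """
    m = pair.match(hand)
    if m is None:
        return None
    return m.group(1)

print(pair_card("717ak"))
print(pair_card("718ak"))
print(pair_card("354aa"))
```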
1068
1069Simulating scanf()
1070^^^^^^^^^^^^^^^^^^
1071
1072.. index:: single: scanf()
1073
1074Python does not currently have an equivalent to :c:func:`scanf`.  Regular
1075expressions are generally more powerful, though also more verbose, than
1076:c:func:`scanf` format strings.  The table below offers some more-or-less
1077equivalent mappings between :c:func:`scanf` format tokens and regular
1078expressions.
1079
1080+--------------------------------+---------------------------------------------+
1081| :c:func:`scanf` Token          | Regular Expression                          |
1082+================================+=============================================+
1083| ``%c``                         | ``.``                                       |
1084+--------------------------------+---------------------------------------------+
1085| ``%5c``                        | ``.{5}``                                    |
1086+--------------------------------+---------------------------------------------+
1087| ``%d``                         | ``[-+]?\d+``                                |
1088+--------------------------------+---------------------------------------------+
1089| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
1090+--------------------------------+---------------------------------------------+
1091| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
1092+--------------------------------+---------------------------------------------+
1093| ``%o``                         | ``[-+]?[0-7]+``                             |
1094+--------------------------------+---------------------------------------------+
1095| ``%s``                         | ``\S+``                                     |
1096+--------------------------------+---------------------------------------------+
1097| ``%u``                         | ``\d+``                                     |
1098+--------------------------------+---------------------------------------------+
1099| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
1100+--------------------------------+---------------------------------------------+
1101
1102To extract the filename and numbers from a string like ::
1103
1104   /usr/sbin/sendmail - 0 errors, 4 warnings
1105
1106you would use a :c:func:`scanf` format like ::
1107
1108   %s - %d errors, %d warnings
1109
1110The equivalent regular expression would be ::
1111
1112   (\S+) - (\d+) errors, (\d+) warnings
1113
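The regular expression above can be applied as in the following sketch.  The names ``REPORT`` and ``parse_report`` are hypothetical, introduced only for this illustration.

```python
import re

# Regular expression equivalent of the scanf() format
# "%s - %d errors, %d warnings".
REPORT = re.compile(r"(\S+) - (\d+) errors, (\d+) warnings")

def parse_report(line):
    """Return (filename, errors, warnings), or None if *line* does not match."""
    m = REPORT.match(line)
    if m is None:
        return None
    # Unlike scanf(), the groups are always strings; convert them explicitly.
    return m.group(1), int(m.group(2)), int(m.group(3))

print(parse_report("/usr/sbin/sendmail - 0 errors, 4 warnings"))
```

Returning ``None`` for a line that does not match leaves it to the caller to decide how to treat malformed input.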
1114
1115.. _search-vs-match:
1116
1117search() vs. match()
1118^^^^^^^^^^^^^^^^^^^^
1119
1120.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
1121
1122Python offers two different primitive operations based on regular expressions:
1123:func:`re.match` checks for a match only at the beginning of the string, while
1124:func:`re.search` checks for a match anywhere in the string (this is what Perl
1125does by default).
1126
1127For example::
1128
1129   >>> re.match("c", "abcdef")    # No match
1130   >>> re.search("c", "abcdef")   # Match
1131   <_sre.SRE_Match object at ...>
1132
Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::
1135
1136   >>> re.match("c", "abcdef")    # No match
1137   >>> re.search("^c", "abcdef")  # No match
1138   >>> re.search("^a", "abcdef")  # Match
1139   <_sre.SRE_Match object at ...>
1140
1141Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
1142beginning of the string, whereas using :func:`search` with a regular expression
1143beginning with ``'^'`` will match at the beginning of each line.
1144
1145   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
1146   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
1147   <_sre.SRE_Match object at ...>
1148
1149
1150Making a Phonebook
1151^^^^^^^^^^^^^^^^^^
1152
:func:`split` splits a string into a list delimited by the passed pattern.  The
function is invaluable for converting textual data into data structures that can be
easily read and modified by Python, as demonstrated in the following example that
creates a phonebook.

First, here is the input.  Normally it may come from a file; here we are using
triple-quoted string syntax:
1160
1161   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
1162   ...
1163   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
1164   ... Frank Burger: 925.541.7625 662 South Dogwood Way
1165   ...
1166   ...
1167   ... Heather Albrecht: 548.326.4584 919 Park Place"""
1168
1169The entries are separated by one or more newlines. Now we convert the string
1170into a list with each nonempty line having its own entry:
1171
1172.. doctest::
1173   :options: +NORMALIZE_WHITESPACE
1174
1175   >>> entries = re.split("\n+", text)
1176   >>> entries
1177   ['Ross McFluff: 834.345.1254 155 Elm Street',
1178   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
1179   'Frank Burger: 925.541.7625 662 South Dogwood Way',
1180   'Heather Albrecht: 548.326.4584 919 Park Place']
1181
1182Finally, split each entry into a list with first name, last name, telephone
1183number, and address.  We use the ``maxsplit`` parameter of :func:`split`
1184because the address has spaces, our splitting pattern, in it:
1185
1186.. doctest::
1187   :options: +NORMALIZE_WHITESPACE
1188
1189   >>> [re.split(":? ", entry, 3) for entry in entries]
1190   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
1191   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
1192   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
1193   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
1194
1195The ``:?`` pattern matches the colon after the last name, so that it does not
1196occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
1197house number from the street name:
1198
1199.. doctest::
1200   :options: +NORMALIZE_WHITESPACE
1201
1202   >>> [re.split(":? ", entry, 4) for entry in entries]
1203   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
1204   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
1205   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
1206   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
1207
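The split lists can then be stored in a lookup structure.  The following sketch assumes a dictionary keyed by ``(last name, first name)`` and uses a shortened copy of the input; both choices are illustrative, not prescribed by the example above.

```python
import re

# Shortened, hypothetical copy of the phonebook text.
text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which itself contains spaces, in one piece.
    first, last, phone, address = re.split(":? ", entry, maxsplit=3)
    phonebook[(last, first)] = (phone, address)

print(phonebook[("McFluff", "Ross")])
```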
1208
1209Text Munging
1210^^^^^^^^^^^^
1211
1212:func:`sub` replaces every occurrence of a pattern with a string or the
1213result of a function.  This example demonstrates using :func:`sub` with
1214a function to "munge" text, or randomize the order of all the characters
1215in each word of a sentence except for the first and last characters::
1216
   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
1226
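Because the shuffling is random, the output above differs from run to run.  If repeatable output is wanted, for instance in a test, one possible variation is to shuffle with a private, seeded :class:`random.Random` instance.  The ``munge`` function and its ``seed`` parameter are hypothetical names for this sketch.

```python
import random
import re

def munge(text, seed=0):
    """Shuffle the inner letters of each word, repeatably for a given seed.

    Hypothetical helper for illustration.
    """
    rng = random.Random(seed)   # private generator, independent of global state

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)

text = "Professor Abdolmalek, please report your absences promptly."
munged = munge(text)
print(munged)
```

The first and last character of every word are left in place; only the letters captured by the middle group are reordered.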
1227
1228Finding all Adverbs
1229^^^^^^^^^^^^^^^^^^^
1230
:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does.  For example, a writer who wanted to find all of
the adverbs in some text might use :func:`findall` in the following manner:
1235
1236   >>> text = "He was carefully disguised but captured quickly by police."
1237   >>> re.findall(r"\w+ly", text)
1238   ['carefully', 'quickly']
1239
1240
1241Finding all Adverbs and their Positions
1242^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1243
If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides instances of
:class:`MatchObject` instead of strings.  Continuing with the previous example,
a writer who wanted to find all of the adverbs *and their positions*
in some text would use :func:`finditer` in the following manner:
1249
1250   >>> text = "He was carefully disguised but captured quickly by police."
1251   >>> for m in re.finditer(r"\w+ly", text):
1252   ...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
1253   07-16: carefully
1254   40-47: quickly
1255
1256
1257Raw String Notation
1258^^^^^^^^^^^^^^^^^^^
1259
1260Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
1261every backslash (``'\'``) in a regular expression would have to be prefixed with
1262another one to escape it.  For example, the two following lines of code are
1263functionally identical:
1264
1265   >>> re.match(r"\W(.)\1\W", " ff ")
1266   <_sre.SRE_Match object at ...>
1267   >>> re.match("\\W(.)\\1\\W", " ff ")
1268   <_sre.SRE_Match object at ...>
1269
1270When one wants to match a literal backslash, it must be escaped in the regular
1271expression.  With raw string notation, this means ``r"\\"``.  Without raw string
1272notation, one must use ``"\\\\"``, making the following lines of code
1273functionally identical:
1274
1275   >>> re.match(r"\\", r"\\")
1276   <_sre.SRE_Match object at ...>
1277   >>> re.match("\\\\", r"\\")
1278   <_sre.SRE_Match object at ...>
1279
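The difference between the two notations can be checked directly, as in this short sketch.  The variable names are illustrative only.

```python
import re

raw = r"\n"      # two characters: a backslash followed by "n"
cooked = "\n"    # one character: a newline

# The raw and the doubled-backslash spellings build the same pattern string.
same_pattern = (r"\W(.)\1\W" == "\\W(.)\\1\\W")

# The pattern r"\\" matches exactly one literal backslash.
backslash = re.match(r"\\", "\\")

print(len(raw))
print(len(cooked))
print(same_pattern)
print(len(backslash.group(0)))
```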