:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>


This module provides regular expression matching operations similar to
those found in Perl. Both patterns and strings to be searched can be
Unicode strings as well as 8-bit strings.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning.  This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline.  Usually patterns will be expressed in Python code using this raw
string notation.

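
The difference is easy to check directly.  The following is a minimal sketch
(the variable names are illustrative only) comparing a raw and a regular
literal, and the two spellings of a pattern that matches one backslash::

   import re

   # A raw literal keeps the backslash; a regular literal turns \n into a newline.
   raw_len = len(r"\n")       # 2: a backslash followed by 'n'
   plain_len = len("\n")      # 1: a single newline character

   # Both literals denote the same two-character pattern, which matches
   # exactly one backslash in the searched string.
   same_pattern = (r"\\" == "\\\\")
   found = re.search(r"\\", "C:\\temp").group(0)
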
Most regular expression operations are available both as module-level
functions and as :class:`RegexObject` methods.  The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.

.. seealso::

   The third-party `regex <https://pypi.org/project/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*.  This holds unless *A* or *B* contain low-precedence
operations, boundary conditions between *A* and *B*, or numbered group
references.  Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here.  For details of the theory
and implementation of regular expressions, consult the Friedl book referenced
below, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows.  For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves.  You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``.  (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using the ``\number`` notation, e.g., ``'\x00'``.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


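
As a small illustration (a sketch with made-up variable names, not part of
the module), wrapping the inner repetition in a non-capturing group lets a
second qualifier apply to it, whereas writing the two qualifiers back to back
is expected to raise :exc:`re.error`::

   import re

   multiple_of_six = re.compile(r"(?:a{6})*$")
   ok = multiple_of_six.match("a" * 12) is not None     # 12 is a multiple of 6
   bad = multiple_of_six.match("a" * 7) is not None     # 7 is not

   try:
       re.compile(r"a{6}*")          # directly nested repetition
       nested_allowed = True
   except re.error:
       nested_allowed = False
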
The special characters are:

``'.'``
   (Dot.)  In the default mode, this matches any character except a newline.  If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``'^'``
   (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``'$'``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'.  More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.

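
   The cases above can be reproduced with a short sketch (the variable names
   are illustrative)::

      import re

      text = 'foo1\nfoo2\n'
      default_hit = re.search(r'foo.$', text).group(0)            # 'foo2'
      multiline_hit = re.search(r'foo.$', text, re.M).group(0)    # 'foo1'

      # A lone $ matches twice here: just before the final newline and at the end.
      positions = [m.start() for m in re.finditer(r'$', 'foo\n')]
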
``'*'``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible.  ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

``'+'``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

``'?'``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible.  Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``<a> b <c>``, it will match the entire
   string, and not just ``<a>``.  Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched.  Using the RE ``<.*?>`` will match
   only ``<a>``.

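
   A minimal sketch of the difference, using the same input as above (the
   variable names are illustrative)::

      import re

      text = '<a> b <c>'
      greedy = re.match(r'<.*>', text).group(0)     # the entire string
      lazy = re.match(r'<.*?>', text).group(0)      # only '<a>'
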
``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match.  For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible.  For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound.  As an
   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible.  This is the
   non-greedy version of the previous qualifier.  For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

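
   A brief sketch of the bounded forms (the variable names are illustrative)::

      import re

      greedy = re.match(r'a{3,5}', 'aaaaaa').group(0)      # takes 5 characters
      lazy = re.match(r'a{3,5}?', 'aaaaaa').group(0)       # stops after 3
      too_short = re.match(r'a{4,}b', 'aaab')              # None: only three 'a's
      open_ended = re.match(r'a{4,}b', 'a' * 1000 + 'b')   # matches the whole string
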
``'\'``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string.  However, if Python would
   recognize the resulting sequence, the backslash should be doubled.  This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters.  In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit.  If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
     it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets.  For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`LOCALE` or :const:`UNICODE` mode is in force.

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set.  If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched.  For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``.  ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set.  For example, ``[()[\]{}]`` and
     ``[]()[{}]`` will both match a parenthesis.

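
   The set rules above can be spot-checked with a short sketch (the sample
   strings and variable names are arbitrary)::

      import re

      hex_digits = re.findall(r'[0-9A-Fa-f]', 'x1fZ')      # range members only
      literal_dash = re.findall(r'[a\-z]', 'a-m-z')        # 'a', '-' or 'z', not a range
      specials = re.findall(r'[(+*)]', '(1+2)*3')          # metacharacters taken literally
      not_five = re.findall(r'[^5]', '1525')               # complemented set
      brackets = re.findall(r'[]()[{}]', 'a[0](b){c}')     # leading ']' is a literal
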
``'|'``
   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
   will match either A or B.  An arbitrary number of REs can be separated by the
   ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once ``A`` matches, ``B`` will not be tested further, even if it would
   produce a longer overall match.  In other words, the ``'|'`` operator is never
   greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

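
   A short sketch of the left-to-right behaviour (the variable names are
   illustrative)::

      import re

      first_branch = re.match(r'a|ab', 'ab').group(0)    # 'a', although 'ab' is longer
      reordered = re.match(r'ab|a', 'ab').group(0)       # 'ab': the longer branch comes first
      literal_bar = re.findall(r'\||[|]', 'x|y')         # both spellings match a literal '|'
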
``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below.  To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise).  The first character after the ``'?'`` determines the meaning
   and further syntax of the construct. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?iLmsux)``
   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
   ``'u'``, ``'x'``.)  The group matches the empty string; the letters
   set the corresponding flags: :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flags* argument to the
   :func:`re.compile` function.

   Note that the ``(?x)`` flag changes how the expression is parsed. It should be
   used first in the expression string, or after one or more whitespace characters.
   If there are non-whitespace characters before the flag, the results are
   undefined.

``(?:...)``
   A non-capturing version of regular parentheses.  Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*.  Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression.  A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts.  If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object ``m``    | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the ``repl``    | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

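
   The three contexts in the table can be exercised as follows.  This is only
   a sketch; the sample sentence and variable names are made up::

      import re

      pattern = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')

      m = pattern.search('say "hi" now')
      whole = m.group(0)                 # the quoted text, quotes included
      by_name = m.group('quote')         # the opening quote character
      by_number = m.group(1)             # the same group, referenced by number
      end_of_quote = m.end('quote')      # index just past the opening quote

      # In a replacement string, \g<quote> inserts the text of the named group.
      swapped = pattern.sub(r'<\g<quote>>', 'say "hi" now')
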
``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string.  This is
   called a lookahead assertion.  For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next.  This is a negative lookahead assertion.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

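
   A minimal sketch of both lookahead forms (the variable names are
   illustrative)::

      import re

      positive = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
      matched_text = positive.group(0)     # 'Isaac ': the lookahead text is not consumed

      negative = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')         # no match
      other = re.search(r'Isaac (?!Asimov)', 'Isaac Newton').group(0)   # 'Isaac '
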
``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position.  This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Group
   references are not supported even if they match strings of some fixed length.
   Note that patterns which start with positive lookbehind assertions will not
   match at the beginning of the string being searched; you will most likely
   want to use the :func:`search` function rather than the :func:`match`
   function:

      >>> import re
      >>> m = re.search(r'(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length and shouldn't contain group references.
   Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

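
   As a rough sketch (the sample text and variable names are invented), the
   following finds ``def`` only where it is not directly preceded by ``abc``::

      import re

      # Start offsets of every 'def' that does not follow 'abc'.
      hits = [m.start() for m in re.finditer(r'(?<!abc)def', 'abcdef xdef def')]

      # Unlike the positive form, this can match at the very start of the string.
      at_start = re.match(r'(?<!abc)def', 'def') is not None
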
``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with the given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional
   and can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor
   email matching pattern, which will match with ``'<user@host.com>'`` as well as
   ``'user@host.com'``, but not with ``'<user@host.com'``.

   .. versionadded:: 2.4

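
   The email example above can be tried with :func:`match` as in this sketch
   (the variable names are illustrative)::

      import re

      pattern = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

      bracketed = pattern.match('<user@host.com>').group(0)   # both brackets present
      bare = pattern.match('user@host.com').group(0)          # no brackets at all
      unbalanced = pattern.match('<user@host.com')            # opening bracket only
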
The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not on the list, then the resulting RE will match
the second character.  For example, ``\$`` matches the character ``'$'``.

``\number``
   Matches the contents of the group of the same number.  Groups are numbered
   starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group).  This special sequence
   can only be used to match one of the first 99 groups.  If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word.  A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
   a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
   of the string, so the precise set of characters deemed to be alphanumeric
   depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
   For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end of a
   word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
   but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so is also subject to the settings
   of ``LOCALE`` and ``UNICODE``.

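
   The examples given for ``\b`` and ``\B`` can be checked with a small sketch
   (the list and variable names are illustrative)::

      import re

      word = re.compile(r'\bfoo\b')
      candidates = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']
      whole_word = [s for s in candidates if word.search(s)]

      # \B requires that 'py' is followed by another word character.
      inside = [s for s in ['python', 'py3', 'py2', 'py', 'py.', 'py!']
                if re.search(r'py\B', s)]
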
``\d``
   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
   is equivalent to the set ``[0-9]``.  With :const:`UNICODE`, it will match
   whatever is classified as a decimal digit in the Unicode character properties
   database.

``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``.  With :const:`UNICODE`, it
   will match any character not marked as a digit in the Unicode character
   properties database.

``\s``
   When the :const:`UNICODE` flag is not specified, matches any whitespace
   character; this is equivalent to the set ``[ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on whitespace matching.
   If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
   plus whatever is classified as space in the Unicode character properties
   database.

``\S``
   When the :const:`UNICODE` flag is not specified, matches any non-whitespace
   character; this is equivalent to the set ``[^ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on non-whitespace matching.  If
   :const:`UNICODE` is set, then any character not marked as space in the
   Unicode character properties database is matched.


``\w``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any alphanumeric character and the underscore; this is equivalent to the set
   ``[a-zA-Z0-9_]``.  With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
   whatever characters are defined as alphanumeric for the current locale.  If
   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
   is classified as alphanumeric in the Unicode character properties database.

``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
   this will match any character that is neither in ``[0-9_]`` nor classified as
   alphanumeric in the Unicode character properties database.

``\Z``
   Matches only at the end of the string.

If both :const:`LOCALE` and :const:`UNICODE` flags are included for a
particular sequence, then the :const:`LOCALE` flag takes effect first, followed
by the :const:`UNICODE` flag.

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \v      \x
   \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

Octal escapes are included in a limited form: If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference.  As for string literals, octal escapes are always at most
three digits in length.

.. seealso::

   Mastering Regular Expressions
      Book on regular expressions by Jeffrey Friedl, published by O'Reilly.  The
      second edition of the book no longer covers Python at all, but the first
      edition covered writing good regular expression patterns in great detail.



.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions.  Most non-trivial applications use the compiled form.


.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a regular expression object, which
   can be used for matching using its :func:`~RegexObject.match` and
   :func:`~RegexObject.search` methods, described below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


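
   As a rough sketch of combining flags and reusing a compiled object (the
   pattern, sample text and variable names are made up for illustration)::

      import re

      # IGNORECASE and MULTILINE combined with bitwise OR.
      header = re.compile(r'^subject:\s*(.*)$', re.IGNORECASE | re.MULTILINE)

      text = 'From: a@example.com\nSubject: hello\nSUBJECT: again\n'
      subjects = header.findall(text)      # one entry per matching line
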
.. data:: DEBUG

   Display debug information about the compiled expression.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
   lowercase letters, too.  This is not affected by the current locale.  To
   get this effect on non-ASCII Unicode characters such as ``ü`` and ``Ü``,
   add the :const:`UNICODE` flag.


.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
   current locale.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline).  By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.


.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.


.. data:: U
          UNICODE

   Make the ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   sequences dependent on the Unicode character properties database. Also
   enables non-ASCII matching for :const:`IGNORECASE`.

   .. versionadded:: 2.0


.. data:: X
          VERBOSE

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class, or when preceded by an unescaped backslash,
   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")


.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :class:`MatchObject`
   instance. Return ``None`` if no position in the string matches the pattern; note
   that this is different from finding a zero-length match at some point in the
   string.


.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :class:`MatchObject` instance.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).


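
   A small sketch of these points (the sample text and variable names are
   illustrative)::

      import re

      text = 'first line\nsecond line'

      # match() is anchored at the start of the string, even with MULTILINE.
      at_start = re.match(r'second', text, re.MULTILINE)

      # search() with ^ and MULTILINE finds the start of the second line.
      anywhere = re.search(r'^second', text, re.MULTILINE)

      # A zero-length match is still a match object, not None.
      empty = re.match(r'x*', 'abc')
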
.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*.  If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list.  (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored.  This has been fixed in later releases.)

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string.  The same holds for
   the end of the string:

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list (e.g., if there's one capturing group
   in the separator, at indices 1, 3 and so forth).

   Note that :func:`split` will never split a string on an empty pattern match.
   For example:

      >>> re.split('x*', 'foo')
      ['foo']
      >>> re.split("(?m)^$", "foo\n\nbar\n")
      ['foo\n\nbar\n']

   .. versionchanged:: 2.7
      Added the optional flags argument.



.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings.  The *string* is scanned left-to-right, and matches are returned in
   the order found.  If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group.  Empty matches are included in the result.

   .. note::

      Due to a limitation of the current implementation, the character
      following an empty match is not included in the next match, so
      ``findall(r'^|\w+', 'two words')`` returns ``['', 'wo', 'words']``
      (note the missing "t").  This is changed in Python 3.7.

   .. versionadded:: 1.5.2

   .. versionchanged:: 2.4
      Added the optional flags argument.


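
   How the shape of the result depends on the number of groups can be seen in
   this sketch (the sample text and variable names are illustrative)::

      import re

      text = 'width=10, height=20'
      no_groups = re.findall(r'\w+=\d+', text)         # list of whole matches
      one_group = re.findall(r'\w+=(\d+)', text)       # list of the single group
      two_groups = re.findall(r'(\w+)=(\d+)', text)    # list of tuples
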
.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :class:`MatchObject` instances over all
   non-overlapping matches for the RE *pattern* in *string*.  The *string* is
   scanned left-to-right, and matches are returned in the order found.  Empty
   matches are included in the result.  See also the note about :func:`findall`.

   .. versionadded:: 2.2

   .. versionchanged:: 2.4
      Added the optional flags argument.


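
   Because each item is a match object, positions are available as well as
   text.  A minimal sketch (the sample string and variable name are
   illustrative)::

      import re

      # Collect (text, start, end) for every run of digits.
      spans = [(m.group(0), m.start(), m.end())
               for m in re.finditer(r'\d+', 'a1 bb22 ccc333')]
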
649.. function:: sub(pattern, repl, string, count=0, flags=0)
650
651   Return the string obtained by replacing the leftmost non-overlapping occurrences
652   of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
653   *string* is returned unchanged.  *repl* can be a string or a function; if it is
654   a string, any backslash escapes in it are processed.  That is, ``\n`` is
655   converted to a single newline character, ``\r`` is converted to a carriage return, and
656   so forth.  Unknown escapes such as ``\j`` are left alone.  Backreferences, such
657   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
658   For example:
659
660      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
661      ...        r'static PyObject*\npy_\1(void)\n{',
662      ...        'def myfunc():')
663      'static PyObject*\npy_myfunc(void)\n{'
664
665   If *repl* is a function, it is called for every non-overlapping occurrence of
666   *pattern*.  The function takes a single match object argument, and returns the
667   replacement string.  For example:
668
669      >>> def dashrepl(matchobj):
670      ...     if matchobj.group(0) == '-': return ' '
671      ...     else: return '-'
672      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
673      'pro--gram files'
674      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
675      'Baked Beans & Spam'
676
677   The pattern may be a string or an RE object.
678
679   The optional argument *count* is the maximum number of pattern occurrences to be
680   replaced; *count* must be a non-negative integer.  If omitted or zero, all
681   occurrences will be replaced. Empty matches for the pattern are replaced only
682   when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
683   ``'-a-b-c-'``.
684
685   In string-type *repl* arguments, in addition to the character escapes and
686   backreferences described above,
687   ``\g<name>`` will use the substring matched by the group named ``name``, as
688   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
689   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
690   in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
691   reference to group 20, not a reference to group 2 followed by the literal
692   character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
693   substring matched by the RE.
694
695   .. versionchanged:: 2.7
696      Added the optional flags argument.
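
The ``\g<name>`` and ``\g<number>`` forms and the *count* argument described
above can be sketched as follows.  The date string, the pattern, and the variable
names are assumptions chosen for illustration:

```python
import re

# Named groups can be referenced with \g<name> in the replacement.
iso_date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
swapped = iso_date.sub(r"\g<day>/\g<month>/\g<year>", "2024-01-15")

# \g<2>0 means "group 2, then a literal 0"; \20 would mean group 20.
padded = re.sub(r"(\w)(\d)", r"\g<2>0", "a7")

# count limits how many occurrences are replaced.
limited = re.sub(r"\d", "#", "a1b2c3", count=2)

print(swapped)  # 15/01/2024
print(padded)   # 70
print(limited)  # a#b#c3
```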
697
698
699.. function:: subn(pattern, repl, string, count=0, flags=0)
700
701   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
702   number_of_subs_made)``.
703
704   .. versionchanged:: 2.7
705      Added the optional flags argument.
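
The second element of the tuple is handy when the caller needs to know whether
anything changed.  A minimal sketch with an assumed input string:

```python
import re

# Collapse each run of whitespace into a single space and count the replacements.
collapsed, replacements = re.subn(r"\s+", " ", "a  b   c")

print(collapsed)     # a b c
print(replacements)  # 2
```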
706
707
708.. function:: escape(pattern)
709
   Escape all the characters in *pattern* except ASCII letters and numbers.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it.  For example::

      >>> print re.escape('python.exe')
      python\.exe

      >>> import string
      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print '[%s]+' % re.escape(legal_chars)
      [abcdefghijklmnopqrstuvwxyz0123456789\!\#\$\%\&\'\*\+\-\.\^\_\`\|\~\:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print '|'.join(map(re.escape, sorted(operators, reverse=True)))
      \/|\-|\+|\*\*|\*
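
The practical effect of escaping shows up when the literal text is used as a
pattern.  This sketch compares an escaped and an unescaped search; the two sample
strings are assumptions made up for the example:

```python
import re

# Hypothetical user-supplied text containing regex metacharacters.
literal = "1+1=2 (really?)"
haystack = "proof: 1+1=2 (really?)"

# Unescaped, "+", "(", ")" and "?" are interpreted, so the literal text is not found.
unescaped = re.search(literal, haystack)

# Escaped, every character matches itself.
escaped = re.search(re.escape(literal), haystack)

print(unescaped)         # None
print(escaped.group(0))  # 1+1=2 (really?)
```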
724
725
726.. function:: purge()
727
728   Clear the regular expression cache.
729
730
731.. exception:: error
732
733   Exception raised when a string passed to one of the functions here is not a
734   valid regular expression (for example, it might contain unmatched parentheses)
735   or when some other error occurs during compilation or matching.  It is never an
736   error if a string contains no match for a pattern.
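
The distinction between an invalid pattern and a pattern that merely fails to
match can be sketched like this.  ``try_compile`` is a hypothetical helper
written for this example, not a function of the module:

```python
import re

def try_compile(pattern):
    """Return a compiled pattern, or None if *pattern* is not a valid RE."""
    try:
        return re.compile(pattern)
    except re.error:
        return None

valid = try_compile(r"(\d+)")
invalid = try_compile(r"(\d+")  # unbalanced parenthesis raises re.error

# A valid pattern that finds nothing is not an error; the result is just None.
no_match = valid.search("no digits here")

print(invalid)   # None
print(no_match)  # None
```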
737
738
739.. _re-objects:
740
741Regular Expression Objects
742--------------------------
743
744.. class:: RegexObject
745
746   The :class:`RegexObject` class supports the following methods and attributes:
747
748   .. method:: RegexObject.search(string[, pos[, endpos]])
749
750      Scan through *string* looking for a location where this regular expression
751      produces a match, and return a corresponding :class:`MatchObject` instance.
752      Return ``None`` if no position in the string matches the pattern; note that this
753      is different from finding a zero-length match at some point in the string.
754
755      The optional second parameter *pos* gives an index in the string where the
756      search is to start; it defaults to ``0``.  This is not completely equivalent to
757      slicing the string; the ``'^'`` pattern character matches at the real beginning
758      of the string and at positions just after a newline, but not necessarily at the
759      index where the search is to start.
760
      The optional parameter *endpos* limits how far the string will be searched; it
      will be as if the string is *endpos* characters long, so only the characters
      from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
      than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
      expression object, ``rx.search(string, 0, 50)`` is equivalent to
      ``rx.search(string[:50], 0)``.
767
768      >>> pattern = re.compile("d")
769      >>> pattern.search("dog")     # Match at index 0
770      <_sre.SRE_Match object at ...>
771      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
772
773
774   .. method:: RegexObject.match(string[, pos[, endpos]])
775
776      If zero or more characters at the *beginning* of *string* match this regular
777      expression, return a corresponding :class:`MatchObject` instance.  Return
778      ``None`` if the string does not match the pattern; note that this is different
779      from a zero-length match.
780
781      The optional *pos* and *endpos* parameters have the same meaning as for the
782      :meth:`~RegexObject.search` method.
783
784      >>> pattern = re.compile("o")
785      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
786      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
787      <_sre.SRE_Match object at ...>
788
789      If you want to locate a match anywhere in *string*, use
790      :meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).
791
792
793   .. method:: RegexObject.split(string, maxsplit=0)
794
795      Identical to the :func:`split` function, using the compiled pattern.
796
797
798   .. method:: RegexObject.findall(string[, pos[, endpos]])
799
      Similar to the :func:`findall` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region, as for :meth:`~RegexObject.match`.
803
804
805   .. method:: RegexObject.finditer(string[, pos[, endpos]])
806
      Similar to the :func:`finditer` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region, as for :meth:`~RegexObject.match`.
810
811
812   .. method:: RegexObject.sub(repl, string, count=0)
813
814      Identical to the :func:`sub` function, using the compiled pattern.
815
816
817   .. method:: RegexObject.subn(repl, string, count=0)
818
819      Identical to the :func:`subn` function, using the compiled pattern.
820
821
822   .. attribute:: RegexObject.flags
823
824      The regex matching flags.  This is a combination of the flags given to
825      :func:`.compile` and any ``(?...)`` inline flags in the pattern.
826
827
828   .. attribute:: RegexObject.groups
829
830      The number of capturing groups in the pattern.
831
832
833   .. attribute:: RegexObject.groupindex
834
835      A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
836      numbers.  The dictionary is empty if no symbolic groups were used in the
837      pattern.
838
839
840   .. attribute:: RegexObject.pattern
841
842      The pattern string from which the RE object was compiled.
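
The difference between passing *pos* and slicing the string, the effect of
*endpos*, and the attributes of a compiled pattern can be sketched as follows.
The patterns and sample strings are assumptions chosen for illustration:

```python
import re

rx = re.compile(r"^(?P<word>b\w*)", re.IGNORECASE)

# pos moves the starting point, but '^' still refers to the real start of the
# string, so nothing matches here ...
with_pos = rx.search("a bee", 2)

# ... whereas slicing creates a new string whose start is at the "b".
with_slice = rx.search("a bee"[2:])

# endpos makes the string behave as if it were truncated at that index.
digits = re.compile(r"\d+")
truncated = digits.search("abc12345", 0, 5)

print(with_pos)                  # None
print(with_slice.group("word"))  # bee
print(truncated.group(0))        # 12
print(rx.groups)                 # 1
print(rx.groupindex["word"])     # 1
print(rx.pattern)                # ^(?P<word>b\w*)
```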
843
844
845.. _match-objects:
846
847Match Objects
848-------------
849
850.. class:: MatchObject
851
   Match objects always have a boolean value of ``True``.
   Since :meth:`~RegexObject.match` and :meth:`~RegexObject.search` return ``None``
   when there is no match, you can test whether there was a match with a simple
   ``if`` statement::
856
857      match = re.search(pattern, string)
858      if match:
859          process(match)
860
861   Match objects support the following methods and attributes:
862
863
864   .. method:: MatchObject.expand(template)
865
866      Return the string obtained by doing backslash substitution on the template
867      string *template*, as done by the :meth:`~RegexObject.sub` method.  Escapes
868      such as ``\n`` are converted to the appropriate characters, and numeric
869      backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
870      ``\g<name>``) are replaced by the contents of the corresponding group.
871
872
873   .. method:: MatchObject.group([group1, ...])
874
      Return one or more subgroups of the match.  If there is a single argument, the
      result is a single string; if there are multiple arguments, the result is a
      tuple with one item per argument. Without arguments, *group1* defaults to zero
      (the whole match is returned). If a *groupN* argument is zero, the corresponding
      return value is the entire matching string; if it is in the inclusive range
      [1..99], it is the string matching the corresponding parenthesized group.  If a
      group number is negative or larger than the number of groups defined in the
      pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
      part of the pattern that did not match, the corresponding result is ``None``.
      If a group is contained in a part of the pattern that matched multiple times,
      the last match is returned.
886
887         >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
888         >>> m.group(0)       # The entire match
889         'Isaac Newton'
890         >>> m.group(1)       # The first parenthesized subgroup.
891         'Isaac'
892         >>> m.group(2)       # The second parenthesized subgroup.
893         'Newton'
894         >>> m.group(1, 2)    # Multiple arguments give us a tuple.
895         ('Isaac', 'Newton')
896
897      If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
898      arguments may also be strings identifying groups by their group name.  If a
899      string argument is not used as a group name in the pattern, an :exc:`IndexError`
900      exception is raised.
901
902      A moderately complicated example:
903
904         >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
905         >>> m.group('first_name')
906         'Malcolm'
907         >>> m.group('last_name')
908         'Reynolds'
909
910      Named groups can also be referred to by their index:
911
912         >>> m.group(1)
913         'Malcolm'
914         >>> m.group(2)
915         'Reynolds'
916
917      If a group matches multiple times, only the last match is accessible:
918
919         >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
920         >>> m.group(1)                        # Returns only the last match.
921         'c3'
922
923
924   .. method:: MatchObject.groups([default])
925
926      Return a tuple containing all the subgroups of the match, from 1 up to however
927      many groups are in the pattern.  The *default* argument is used for groups that
928      did not participate in the match; it defaults to ``None``.  (Incompatibility
929      note: in the original Python 1.5 release, if the tuple was one element long, a
930      string would be returned instead.  In later versions (from 1.5.1 on), a
931      singleton tuple is returned in such cases.)
932
933      For example:
934
935         >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
936         >>> m.groups()
937         ('24', '1632')
938
939      If we make the decimal place and everything after it optional, not all groups
940      might participate in the match.  These groups will default to ``None`` unless
941      the *default* argument is given:
942
943         >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
944         >>> m.groups()      # Second group defaults to None.
945         ('24', None)
946         >>> m.groups('0')   # Now, the second group defaults to '0'.
947         ('24', '0')
948
949
950   .. method:: MatchObject.groupdict([default])
951
952      Return a dictionary containing all the *named* subgroups of the match, keyed by
953      the subgroup name.  The *default* argument is used for groups that did not
954      participate in the match; it defaults to ``None``.  For example:
955
956         >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
957         >>> m.groupdict()
958         {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
959
960
961   .. method:: MatchObject.start([group])
962               MatchObject.end([group])
963
964      Return the indices of the start and end of the substring matched by *group*;
965      *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
966      *group* exists but did not contribute to the match.  For a match object *m*, and
967      a group *g* that did contribute to the match, the substring matched by group *g*
968      (equivalent to ``m.group(g)``) is ::
969
970         m.string[m.start(g):m.end(g)]
971
972      Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
973      null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
974      ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
975      2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
976
977      An example that will remove *remove_this* from email addresses:
978
979         >>> email = "tony@tiremove_thisger.net"
980         >>> m = re.search("remove_this", email)
981         >>> email[:m.start()] + email[m.end():]
982         'tony@tiger.net'
983
984
985   .. method:: MatchObject.span([group])
986
987      For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
988      m.end(group))``. Note that if *group* did not contribute to the match, this is
989      ``(-1, -1)``.  *group* defaults to zero, the entire match.
990
991
992   .. attribute:: MatchObject.pos
993
994      The value of *pos* which was passed to the :meth:`~RegexObject.search` or
995      :meth:`~RegexObject.match` method of the :class:`RegexObject`.  This is the
996      index into the string at which the RE engine started looking for a match.
997
998
999   .. attribute:: MatchObject.endpos
1000
1001      The value of *endpos* which was passed to the :meth:`~RegexObject.search` or
1002      :meth:`~RegexObject.match` method of the :class:`RegexObject`.  This is the
1003      index into the string beyond which the RE engine will not go.
1004
1005
1006   .. attribute:: MatchObject.lastindex
1007
1008      The integer index of the last matched capturing group, or ``None`` if no group
1009      was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
1010      ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
1011      the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
1012      string.
1013
1014
1015   .. attribute:: MatchObject.lastgroup
1016
1017      The name of the last matched capturing group, or ``None`` if the group didn't
1018      have a name, or if no group was matched at all.
1019
1020
1021   .. attribute:: MatchObject.re
1022
1023      The regular expression object whose :meth:`~RegexObject.match` or
1024      :meth:`~RegexObject.search` method produced this :class:`MatchObject`
1025      instance.
1026
1027
1028   .. attribute:: MatchObject.string
1029
1030      The string passed to :meth:`~RegexObject.match` or
1031      :meth:`~RegexObject.search`.
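
Several of the methods and attributes above have no example of their own.  The
sketch below exercises ``expand()``, ``span()``, ``lastindex``, ``lastgroup``,
``pos``, ``endpos``, ``re`` and ``string`` on one pattern; the pattern and the
input strings are assumptions made up for this illustration:

```python
import re

rx = re.compile(r"(?P<key>\w+)=(?P<value>\d+)?")

full = rx.search("retries=3")
partial = rx.search("retries=")  # the optional "value" group does not participate

expanded = full.expand(r"\g<value> <- \1")  # '3 <- retries'
key_span = full.span("key")                 # (0, 7)
whole_span = full.span()                    # (0, 9)

# The last group that matched is "value" for the full match, "key" otherwise.
full_last = (full.lastindex, full.lastgroup)           # (2, 'value')
partial_last = (partial.lastindex, partial.lastgroup)  # (1, 'key')

# A group that exists but did not contribute reports (-1, -1).
missing_span = partial.span("value")

# The match remembers where it came from.
search_window = (full.pos, full.endpos)  # (0, 9)
same_pattern = full.re is rx             # True
source = full.string                     # 'retries=3'
```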
1032
1033
1034Examples
1035--------
1036
1037
1038Checking For a Pair
1039^^^^^^^^^^^^^^^^^^^
1040
1041In this example, we'll use the following helper function to display match
1042objects a little more gracefully:
1043
1044.. testcode::
1045
1046   def displaymatch(match):
1047       if match is None:
1048           return None
1049       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
1050
1051Suppose you are writing a poker program where a player's hand is represented as
1052a 5-character string with each character representing a card, "a" for ace, "k"
1053for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
1054representing the card with that value.
1055
1056To see if a given string is a valid hand, one could do the following:
1057
1058   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
1059   >>> displaymatch(valid.match("akt5q"))  # Valid.
1060   "<Match: 'akt5q', groups=()>"
1061   >>> displaymatch(valid.match("akt5e"))  # Invalid.
1062   >>> displaymatch(valid.match("akt"))    # Invalid.
1063   >>> displaymatch(valid.match("727ak"))  # Valid.
1064   "<Match: '727ak', groups=()>"
1065
That last hand, ``"727ak"``, contained a pair, that is, two cards of the same value.
To match this with a regular expression, one could use backreferences as follows:
1068
1069   >>> pair = re.compile(r".*(.).*\1")
1070   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
1071   "<Match: '717', groups=('7',)>"
1072   >>> displaymatch(pair.match("718ak"))     # No pairs.
1073   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
1074   "<Match: '354aa', groups=('a',)>"
1075
1076To find out what card the pair consists of, one could use the
1077:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
1078manner:
1079
1080.. doctest::
1081
1082   >>> pair.match("717ak").group(1)
1083   '7'
1084
   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'
1091
1092   >>> pair.match("354aa").group(1)
1093   'a'
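
One way to avoid the :exc:`AttributeError` shown above is to test the result
before calling ``group()``.  The helper ``paired_card`` is a hypothetical name
introduced for this sketch:

```python
import re

pair = re.compile(r".*(.).*\1")

def paired_card(hand):
    """Return the card that forms a pair in *hand*, or None if there is no pair."""
    m = pair.match(hand)
    if m is None:
        return None
    return m.group(1)

print(paired_card("717ak"))  # 7
print(paired_card("718ak"))  # None
print(paired_card("354aa"))  # a
```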
1094
1095
1096Simulating scanf()
1097^^^^^^^^^^^^^^^^^^
1098
1099.. index:: single: scanf()
1100
1101Python does not currently have an equivalent to :c:func:`scanf`.  Regular
1102expressions are generally more powerful, though also more verbose, than
1103:c:func:`scanf` format strings.  The table below offers some more-or-less
1104equivalent mappings between :c:func:`scanf` format tokens and regular
1105expressions.
1106
1107+--------------------------------+---------------------------------------------+
1108| :c:func:`scanf` Token          | Regular Expression                          |
1109+================================+=============================================+
1110| ``%c``                         | ``.``                                       |
1111+--------------------------------+---------------------------------------------+
1112| ``%5c``                        | ``.{5}``                                    |
1113+--------------------------------+---------------------------------------------+
1114| ``%d``                         | ``[-+]?\d+``                                |
1115+--------------------------------+---------------------------------------------+
1116| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
1117+--------------------------------+---------------------------------------------+
1118| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
1119+--------------------------------+---------------------------------------------+
1120| ``%o``                         | ``[-+]?[0-7]+``                             |
1121+--------------------------------+---------------------------------------------+
1122| ``%s``                         | ``\S+``                                     |
1123+--------------------------------+---------------------------------------------+
1124| ``%u``                         | ``\d+``                                     |
1125+--------------------------------+---------------------------------------------+
1126| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
1127+--------------------------------+---------------------------------------------+
1128
1129To extract the filename and numbers from a string like ::
1130
1131   /usr/sbin/sendmail - 0 errors, 4 warnings
1132
1133you would use a :c:func:`scanf` format like ::
1134
1135   %s - %d errors, %d warnings
1136
1137The equivalent regular expression would be ::
1138
1139   (\S+) - (\d+) errors, (\d+) warnings
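
Applied to the sample line, the expression could be used as in the sketch below.
Unlike :c:func:`scanf`, the groups are returned as strings, so the numeric fields
are converted explicitly; the variable names are assumptions:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"

m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Groups are strings; convert the numeric ones by hand.
filename = m.group(1)
n_errors = int(m.group(2))
n_warnings = int(m.group(3))

print(filename)    # /usr/sbin/sendmail
print(n_errors)    # 0
print(n_warnings)  # 4
```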
1140
1141
1142.. _search-vs-match:
1143
1144search() vs. match()
1145^^^^^^^^^^^^^^^^^^^^
1146
1147.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
1148
1149Python offers two different primitive operations based on regular expressions:
1150:func:`re.match` checks for a match only at the beginning of the string, while
1151:func:`re.search` checks for a match anywhere in the string (this is what Perl
1152does by default).
1153
1154For example::
1155
1156   >>> re.match("c", "abcdef")    # No match
1157   >>> re.search("c", "abcdef")   # Match
1158   <_sre.SRE_Match object at ...>
1159
Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::
1162
1163   >>> re.match("c", "abcdef")    # No match
1164   >>> re.search("^c", "abcdef")  # No match
1165   >>> re.search("^a", "abcdef")  # Match
1166   <_sre.SRE_Match object at ...>
1167
1168Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
1169beginning of the string, whereas using :func:`search` with a regular expression
1170beginning with ``'^'`` will match at the beginning of each line.
1171
1172   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
1173   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
1174   <_sre.SRE_Match object at ...>
1175
1176
1177Making a Phonebook
1178^^^^^^^^^^^^^^^^^^
1179
:func:`split` splits a string into a list, using the passed pattern as the
delimiter.  This is invaluable for converting textual data into data structures
that can be easily read and modified by Python, as demonstrated in the following
example that creates a phonebook.
1184
First, here is the input.  Normally it may come from a file; here we are using
triple-quoted string syntax:
1187
1188   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
1189   ...
1190   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
1191   ... Frank Burger: 925.541.7625 662 South Dogwood Way
1192   ...
1193   ...
1194   ... Heather Albrecht: 548.326.4584 919 Park Place"""
1195
1196The entries are separated by one or more newlines. Now we convert the string
1197into a list with each nonempty line having its own entry:
1198
1199.. doctest::
1200   :options: +NORMALIZE_WHITESPACE
1201
1202   >>> entries = re.split("\n+", text)
1203   >>> entries
1204   ['Ross McFluff: 834.345.1254 155 Elm Street',
1205   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
1206   'Frank Burger: 925.541.7625 662 South Dogwood Way',
1207   'Heather Albrecht: 548.326.4584 919 Park Place']
1208
1209Finally, split each entry into a list with first name, last name, telephone
1210number, and address.  We use the ``maxsplit`` parameter of :func:`split`
1211because the address has spaces, our splitting pattern, in it:
1212
1213.. doctest::
1214   :options: +NORMALIZE_WHITESPACE
1215
1216   >>> [re.split(":? ", entry, 3) for entry in entries]
1217   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
1218   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
1219   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
1220   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
1221
1222The ``:?`` pattern matches the colon after the last name, so that it does not
1223occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
1224house number from the street name:
1225
1226.. doctest::
1227   :options: +NORMALIZE_WHITESPACE
1228
1229   >>> [re.split(":? ", entry, 4) for entry in entries]
1230   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
1231   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
1232   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
1233   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
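
To go one step further and obtain a structure that supports lookups, the split
fields can be stored in a dictionary.  This is only a sketch: keying by
``(last, first)`` and the ``"phone"``/``"address"`` field names are assumptions,
not something the example above prescribes:

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(":? ", entry, 3)
    phonebook[(last, first)] = {"phone": phone, "address": address}

print(phonebook[("Burger", "Frank")]["phone"])  # 925.541.7625
```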
1234
1235
1236Text Munging
1237^^^^^^^^^^^^
1238
:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.  This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
1253
1254
1255Finding all Adverbs
1256^^^^^^^^^^^^^^^^^^^
1257
1258:func:`findall` matches *all* occurrences of a pattern, not just the first
1259one as :func:`search` does.  For example, if a writer wanted to
1260find all of the adverbs in some text, they might use :func:`findall` in
1261the following manner:
1262
1263   >>> text = "He was carefully disguised but captured quickly by police."
1264   >>> re.findall(r"\w+ly", text)
1265   ['carefully', 'quickly']
1266
1267
1268Finding all Adverbs and their Positions
1269^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1270
1271If one wants more information about all matches of a pattern than the matched
1272text, :func:`finditer` is useful as it provides instances of
1273:class:`MatchObject` instead of strings.  Continuing with the previous example,
1274if a writer wanted to find all of the adverbs *and their positions*
1275in some text, they would use :func:`finditer` in the following manner:
1276
1277   >>> text = "He was carefully disguised but captured quickly by police."
1278   >>> for m in re.finditer(r"\w+ly", text):
1279   ...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
1280   07-16: carefully
1281   40-47: quickly
1282
1283
1284Raw String Notation
1285^^^^^^^^^^^^^^^^^^^
1286
1287Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
1288every backslash (``'\'``) in a regular expression would have to be prefixed with
1289another one to escape it.  For example, the two following lines of code are
1290functionally identical:
1291
1292   >>> re.match(r"\W(.)\1\W", " ff ")
1293   <_sre.SRE_Match object at ...>
1294   >>> re.match("\\W(.)\\1\\W", " ff ")
1295   <_sre.SRE_Match object at ...>
1296
1297When one wants to match a literal backslash, it must be escaped in the regular
1298expression.  With raw string notation, this means ``r"\\"``.  Without raw string
1299notation, one must use ``"\\\\"``, making the following lines of code
1300functionally identical:
1301
1302   >>> re.match(r"\\", r"\\")
1303   <_sre.SRE_Match object at ...>
1304   >>> re.match("\\\\", r"\\")
1305   <_sre.SRE_Match object at ...>
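
The equivalences described in this section can be checked directly on the string
literals themselves.  A minimal sketch, with variable names and the sample path
chosen only for illustration:

```python
import re

# A raw string keeps the backslash; a regular string interprets the escape.
raw_len = len(r"\n")    # 2: a backslash followed by "n"
cooked_len = len("\n")  # 1: a single newline character

# Both spellings below produce the same two-character pattern string.
same_pattern = (r"\\" == "\\\\")

# That pattern matches one literal backslash at the start of the subject.
m = re.match(r"\\", "\\path")

print(raw_len)       # 2
print(cooked_len)    # 1
print(same_pattern)  # True
```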
1306