:mod:`!re` --- Regular expression operations
============================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re/`

--------------
13
14This module provides regular expression matching operations similar to
15those found in Perl.
16
17Both patterns and strings to be searched can be Unicode strings (:class:`str`)
18as well as 8-bit strings (:class:`bytes`).
19However, Unicode strings and 8-bit strings cannot be mixed:
20that is, you cannot match a Unicode string with a bytes pattern or
21vice-versa; similarly, when asking for a substitution, the replacement
22string must be of the same type as both the pattern and the search string.
23
Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning.  This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.  Also note that any escape sequence that is invalid in a Python
string literal now generates a :exc:`SyntaxWarning`, which will become a
:exc:`SyntaxError` in a future version.  This happens even if the sequence
is a valid escape sequence for a regular expression.
35
36The solution is to use Python's raw string notation for regular expression
37patterns; backslashes are not handled in any special way in a string literal
38prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
39``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
40newline.  Usually patterns will be expressed in Python code using this raw
41string notation.
42
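As a minimal sketch of the point above (the variable names are illustrative
only), the following compares an ordinary and a raw string literal, then
matches a single literal backslash:

.. code-block:: python

   import re

   # A raw literal keeps the backslash; an ordinary literal turns \n into a newline.
   raw = r"\n"
   cooked = "\n"
   raw_len, cooked_len = len(raw), len(cooked)

   # Both spellings produce the same two-character pattern.
   pattern_plain = "\\\\"
   pattern_raw = r"\\"

   # The pattern matches exactly one literal backslash in the subject string.
   match = re.search(pattern_raw, "C:\\temp")

Here ``raw_len`` is 2, ``cooked_len`` is 1, and ``match.start()`` is 2.
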
Most regular expression operations are available both as module-level
functions and as methods on
:ref:`compiled regular expressions <re-objects>`.  The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.
48
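A short sketch of the difference, assuming a made-up subject string: the
module-level function needs no compile step, while the compiled pattern's
``search()`` method also accepts a starting position.

.. code-block:: python

   import re

   text = "spam eggs spam"

   # Module-level shortcut: no explicit compile step.
   first = re.search(r"spam", text)

   # Compiled pattern: search() additionally accepts pos/endpos arguments.
   pattern = re.compile(r"spam")
   second = pattern.search(text, 1)  # start scanning at index 1

   starts = (first.start(), second.start())
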
.. seealso::

   The third-party :pypi:`regex` module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.
54
55
.. _re-syntax:

Regular Expression Syntax
-------------------------
60
61A regular expression (or RE) specifies a set of strings that matches it; the
62functions in this module let you check if a particular string matches a given
63regular expression (or if a given regular expression matches a particular
64string, which comes down to the same thing).
65
Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*.  This holds unless *A* or *B* contain low-precedence
operations, boundary conditions between *A* and *B*, or numbered group
references.  Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here.  For details of the theory
and implementation of regular expressions, consult the Friedl book [Frie09]_,
or almost any textbook about compiler construction.
75
76A brief explanation of the format of regular expressions follows.  For further
77information and a gentler presentation, consult the :ref:`regex-howto`.
78
Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves.  You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``.  (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)
85
86Some characters, like ``'|'`` or ``'('``, are special. Special
87characters either stand for classes of ordinary characters, or affect
88how the regular expressions around them are interpreted.
89
Repetition operators or quantifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
95
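The nesting rule can be checked with a small sketch (names are illustrative):
grouping the inner repetition is accepted, while stacking two quantifiers
directly is rejected when the pattern is compiled.

.. code-block:: python

   import re

   # Grouping the inner repetition makes the outer one legal.
   six_as = re.compile(r"(?:a{6})*")
   ok_12 = six_as.fullmatch("a" * 12) is not None
   ok_7 = six_as.fullmatch("a" * 7) is not None

   # Stacking two quantifiers directly is rejected at compile time.
   try:
       re.compile(r"a{6}*")
       nested_rejected = False
   except re.error:
       nested_rejected = True
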
96
97The special characters are:
98
99.. index:: single: . (dot); in regular expressions
100
101``.``
102   (Dot.)  In the default mode, this matches any character except a newline.  If
103   the :const:`DOTALL` flag has been specified, this matches any character
104   including a newline.  ``(?s:.)`` matches any character regardless of flags.
105
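   A brief sketch of the three cases described above, using a one-character
   subject string that consists only of a newline:

   .. code-block:: python

      import re

      plain = re.match(r".", "\n")              # default mode
      dotall = re.match(r".", "\n", re.DOTALL)  # DOTALL flag set
      scoped = re.match(r"(?s:.)", "\n")        # inline flag, no module flag

   Only ``plain`` is ``None``; the other two calls match the newline.
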
106.. index:: single: ^ (caret); in regular expressions
107
108``^``
109   (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
110   matches immediately after each newline.
111
112.. index:: single: $ (dollar); in regular expressions
113
114``$``
115   Matches the end of the string or just before the newline at the end of the
116   string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
117   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
118   only 'foo'.  More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
119   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
120   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
121   the newline, and one at the end of the string.
122
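   The examples in this entry can be reproduced with the following sketch
   (variable names are illustrative):

   .. code-block:: python

      import re

      text = "foo1\nfoo2\n"

      # Default mode: $ matches only at the very end or before a final newline.
      default_hit = re.search(r"foo.$", text).group()

      # MULTILINE: $ also matches before every newline, so the first line wins.
      multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

      # A bare $ finds two empty matches in a string ending with a newline.
      dollar_spans = [m.span() for m in re.finditer(r"$", "foo\n")]
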
123.. index:: single: * (asterisk); in regular expressions
124
125``*``
126   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
127   many repetitions as are possible.  ``ab*`` will match 'a', 'ab', or 'a' followed
128   by any number of 'b's.
129
130.. index:: single: + (plus); in regular expressions
131
132``+``
133   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
134   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
135   match just 'a'.
136
137.. index:: single: ? (question mark); in regular expressions
138
139``?``
140   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
141   ``ab?`` will match either 'a' or 'ab'.
142
143.. index::
144   single: *?; in regular expressions
145   single: +?; in regular expressions
146   single: ??; in regular expressions
147
148``*?``, ``+?``, ``??``
149   The ``'*'``, ``'+'``, and ``'?'`` quantifiers are all :dfn:`greedy`; they match
150   as much text as possible.  Sometimes this behaviour isn't desired; if the RE
151   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
152   string, and not just ``'<a>'``.  Adding ``?`` after the quantifier makes it
153   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
154   characters as possible will be matched.  Using the RE ``<.*?>`` will match
155   only ``'<a>'``.
156
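   A minimal sketch of the greedy and non-greedy behaviour described above:

   .. code-block:: python

      import re

      text = "<a> b <c>"
      greedy = re.match(r"<.*>", text).group()   # consumes as much as possible
      lazy = re.match(r"<.*?>", text).group()    # stops at the first '>'
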
157.. index::
158   single: *+; in regular expressions
159   single: ++; in regular expressions
160   single: ?+; in regular expressions
161
``*+``, ``++``, ``?+``
  Like the ``'*'``, ``'+'``, and ``'?'`` quantifiers, those where ``'+'`` is
  appended also match as many times as possible.
  However, unlike the true greedy quantifiers, these do not allow
  back-tracking when the expression following them fails to match.
  These are known as :dfn:`possessive` quantifiers.
  For example, ``a*a`` will match ``'aaaa'`` because the ``a*`` will match
  all 4 ``'a'``\ s, but, when the final ``'a'`` is encountered, the
  expression is backtracked so that in the end the ``a*`` ends up matching
  3 ``'a'``\ s total, and the fourth ``'a'`` is matched by the final ``'a'``.
  However, when ``a*+a`` is used to match ``'aaaa'``, the ``a*+`` will
  match all 4 ``'a'``\ s, but when the final ``'a'`` fails to find any more
  characters to match, the expression cannot be backtracked and will thus
  fail to match.
  ``x*+``, ``x++`` and ``x?+`` are equivalent to ``(?>x*)``, ``(?>x+)``
  and ``(?>x?)`` respectively.

  .. versionadded:: 3.11
180
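  The following sketch contrasts the greedy and possessive forms on
  ``'aaaa'``.  Because possessive quantifiers and atomic groups need
  Python 3.11 or later, the check is guarded by a version test; the ``None``
  placeholder for older interpreters is a choice made for this example.

  .. code-block:: python

     import re
     import sys

     # Greedy: a* gives back one 'a' so the trailing 'a' can match.
     greedy_ok = re.fullmatch(r"a*a", "aaaa") is not None

     # Possessive and atomic forms need Python 3.11+; None means "not checked".
     possessive_ok = atomic_ok = None
     if sys.version_info >= (3, 11):
         possessive_ok = re.fullmatch(r"a*+a", "aaaa") is not None
         atomic_ok = re.fullmatch(r"(?>a*)a", "aaaa") is not None
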
181.. index::
182   single: {} (curly brackets); in regular expressions
183
184``{m}``
185   Specifies that exactly *m* copies of the previous RE should be matched; fewer
186   matches cause the entire RE not to match.  For example, ``a{6}`` will match
187   exactly six ``'a'`` characters, but not five.
188
189``{m,n}``
190   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
191   RE, attempting to match as many repetitions as possible.  For example,
192   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
193   lower bound of zero,  and omitting *n* specifies an infinite upper bound.  As an
194   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
195   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
196   modifier would be confused with the previously described form.
197
198``{m,n}?``
199   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
200   RE, attempting to match as *few* repetitions as possible.  This is the
201   non-greedy version of the previous quantifier.  For example, on the
202   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
203   while ``a{3,5}?`` will only match 3 characters.
204
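   The bounded quantifiers above can be tried with this small sketch
   (variable names are illustrative):

   .. code-block:: python

      import re

      greedy = re.match(r"a{3,5}", "aaaaaa").group()   # takes 5 characters
      lazy = re.match(r"a{3,5}?", "aaaaaa").group()    # takes only 3

      # An omitted upper bound means "no limit".
      many = re.fullmatch(r"a{4,}b", "a" * 1000 + "b") is not None
      too_few = re.fullmatch(r"a{4,}b", "aaab") is not None
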
``{m,n}+``
   Causes the resulting RE to match from *m* to *n* repetitions of the
   preceding RE, attempting to match as many repetitions as possible
   *without* establishing any backtracking points.
   This is the possessive version of the quantifier above.
   For example, on the 6-character string ``'aaaaaa'``, ``a{3,5}+aa``
   attempts to match 5 ``'a'`` characters, then, requiring 2 more ``'a'``\ s,
   will need more characters than available and thus fail, while
   ``a{3,5}aa`` will match with ``a{3,5}`` matching 5, then 4 ``'a'``\ s
   by backtracking and then the final 2 ``'a'``\ s are matched by the final
   ``aa`` in the pattern.
   ``x{m,n}+`` is equivalent to ``(?>x{m,n})``.

   .. versionadded:: 3.11
219
220.. index:: single: \ (backslash); in regular expressions
221
222``\``
223   Either escapes special characters (permitting you to match characters like
224   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
225   sequences are discussed below.
226
227   If you're not using a raw string to express the pattern, remember that Python
228   also uses the backslash as an escape sequence in string literals; if the escape
229   sequence isn't recognized by Python's parser, the backslash and subsequent
230   character are included in the resulting string.  However, if Python would
231   recognize the resulting sequence, the backslash should be repeated twice.  This
232   is complicated and hard to understand, so it's highly recommended that you use
233   raw strings for all but the simplest expressions.
234
235.. index::
236   single: [] (square brackets); in regular expressions
237
238``[]``
239   Used to indicate a set of characters.  In a set:
240
241   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
242     ``'m'``, or ``'k'``.
243
244   .. index:: single: - (minus); in regular expressions
245
   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit.  If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.
252
253   * Special characters lose their special meaning inside sets.  For example,
254     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
255     ``'*'``, or ``')'``.
256
257   .. index:: single: \ (backslash); in regular expressions
258
259   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
260     inside a set, although the characters they match depend on the flags_ used.
261
262   .. index:: single: ^ (caret); in regular expressions
263
264   * Characters that are not within a range can be matched by :dfn:`complementing`
265     the set.  If the first character of the set is ``'^'``, all the characters
266     that are *not* in the set will be matched.  For example, ``[^5]`` will match
267     any character except ``'5'``, and ``[^^]`` will match any character except
268     ``'^'``.  ``^`` has no special meaning if it's not the first character in
269     the set.
270
271   * To match a literal ``']'`` inside a set, precede it with a backslash, or
272     place it at the beginning of the set.  For example, both ``[()[\]{}]`` and
273     ``[]()[{}]`` will match a right bracket, as well as left bracket, braces,
274     and parentheses.
275
276   .. .. index:: single: --; in regular expressions
277   .. .. index:: single: &&; in regular expressions
278   .. .. index:: single: ~~; in regular expressions
279   .. .. index:: single: ||; in regular expressions
280
   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future.  This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``.  To
     avoid a warning, escape them with a backslash.
288
289   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/
290
291   .. versionchanged:: 3.7
292      :exc:`FutureWarning` is raised if a character set contains constructs
293      that will change semantically in the future.
294
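   The set rules listed above can be exercised with the following sketch;
   the subject strings are made up for illustration.

   .. code-block:: python

      import re

      # Characters listed individually, and ranges.
      listed = re.findall(r"[amk]", "make")
      minutes = re.fullmatch(r"[0-5][0-9]", "59") is not None
      bad_minutes = re.fullmatch(r"[0-5][0-9]", "60") is not None

      # Special characters are literal inside a set.
      literal_specials = re.findall(r"[(+*)]", "(a+b)*")

      # A leading ^ complements the set.
      not_five = re.findall(r"[^5]", "1525")

      # ']' is literal when escaped or placed first.
      brackets_a = re.findall(r"[()[\]{}]", "f(x)[0]{}")
      brackets_b = re.findall(r"[]()[{}]", "f(x)[0]{}")
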
295.. index:: single: | (vertical bar); in regular expressions
296
297``|``
298   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
299   will match either *A* or *B*.  An arbitrary number of REs can be separated by the
300   ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
301   the target string is scanned, REs separated by ``'|'`` are tried from left to
302   right. When one pattern completely matches, that branch is accepted. This means
303   that once *A* matches, *B* will not be tested further, even if it would
304   produce a longer overall match.  In other words, the ``'|'`` operator is never
305   greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
306   character class, as in ``[|]``.
307
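   A short sketch of the left-to-right behaviour of ``'|'`` (variable names
   are illustrative):

   .. code-block:: python

      import re

      # 'a' is tried first and succeeds, so the longer 'ab' branch is not used.
      first_wins = re.match(r"a|ab", "ab").group()

      # Putting the longer alternative first changes the result.
      longer_first = re.match(r"ab|a", "ab").group()

      # A literal vertical bar: escaped, or inside a character class.
      bars = re.findall(r"\||[|]", "a|b|c")
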
308.. index::
309   single: () (parentheses); in regular expressions
310
311``(...)``
312   Matches whatever regular expression is inside the parentheses, and indicates the
313   start and end of a group; the contents of a group can be retrieved after a match
314   has been performed, and can be matched later in the string with the ``\number``
315   special sequence, described below.  To match the literals ``'('`` or ``')'``,
316   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.
317
318.. index:: single: (?; in regular expressions
319
320``(?...)``
321   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
322   otherwise).  The first character after the ``'?'`` determines what the meaning
323   and further syntax of the construct is. Extensions usually do not create a new
324   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
325   currently supported extensions.
326
327``(?aiLmsux)``
328   (One or more letters from the set
329   ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``.)
330   The group matches the empty string;
331   the letters set the corresponding flags for the entire regular expression:
332
333   * :const:`re.A` (ASCII-only matching)
334   * :const:`re.I` (ignore case)
335   * :const:`re.L` (locale dependent)
336   * :const:`re.M` (multi-line)
337   * :const:`re.S` (dot matches all)
338   * :const:`re.U` (Unicode matching)
339   * :const:`re.X` (verbose)
340
341   (The flags are described in :ref:`contents-of-module-re`.)
342   This is useful if you wish to include the flags as part of the
343   regular expression, instead of passing a *flag* argument to the
344   :func:`re.compile` function.
345   Flags should be used first in the expression string.
346
347   .. versionchanged:: 3.11
348      This construction can only be used at the start of the expression.
349
350.. index:: single: (?:; in regular expressions
351
352``(?:...)``
353   A non-capturing version of regular parentheses.  Matches whatever regular
354   expression is inside the parentheses, but the substring matched by the group
355   *cannot* be retrieved after performing a match or referenced later in the
356   pattern.
357
``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set
   ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``,
   optionally followed by ``'-'`` followed by
   one or more letters from the set ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags for the part of the
   expression inside the group:
364
365   * :const:`re.A` (ASCII-only matching)
366   * :const:`re.I` (ignore case)
367   * :const:`re.L` (locale dependent)
368   * :const:`re.M` (multi-line)
369   * :const:`re.S` (dot matches all)
370   * :const:`re.U` (Unicode matching)
371   * :const:`re.X` (verbose)
372
373   (The flags are described in :ref:`contents-of-module-re`.)
374
375   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
376   as inline flags, so they can't be combined or follow ``'-'``.  Instead,
377   when one of them appears in an inline group, it overrides the matching mode
378   in the enclosing group.  In Unicode patterns ``(?a:...)`` switches to
379   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
380   (default).  In bytes patterns ``(?L:...)`` switches to locale dependent
381   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
382   This override is only in effect for the narrow inline group, and the
383   original matching mode is restored outside of the group.
384
385   .. versionadded:: 3.6
386
387   .. versionchanged:: 3.7
388      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
389
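   As an illustration of scoped flags (a sketch with made-up test strings),
   a flag can be switched on for one group only, or switched off locally
   inside a pattern compiled with that flag:

   .. code-block:: python

      import re

      candidates = ("ab", "Ab", "aB", "AB")

      # Case-insensitivity applies only inside the group.
      scoped = re.compile(r"(?i:a)b")
      hits = [s for s in candidates if scoped.fullmatch(s)]

      # A flag can also be removed locally with '-'.
      partly_strict = re.compile(r"a(?-i:b)", re.IGNORECASE)
      strict_hits = [s for s in candidates if partly_strict.fullmatch(s)]

   In both cases only the ``'a'`` is matched case-insensitively.
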
``(?>...)``
   Attempts to match ``...`` as if it were a separate regular expression, and
   if successful, continues to match the rest of the pattern following it.
   If the subsequent pattern fails to match, the stack can only be unwound
   to a point *before* the ``(?>...)`` because once exited, the expression,
   known as an :dfn:`atomic group`, has thrown away all stack points within
   itself.
   Thus, ``(?>.*).`` would never match anything because first the ``.*``
   would match all characters possible, then, having nothing left to match,
   the final ``.`` would fail to match.
   Since there are no stack points saved in the atomic group, and there is
   no stack point before it, the entire expression would thus fail to match.
402
403   .. versionadded:: 3.11
404
405.. index:: single: (?P<; in regular expressions
406
407``(?P<name>...)``
408   Similar to regular parentheses, but the substring matched by the group is
409   accessible via the symbolic group name *name*.  Group names must be valid
410   Python identifiers, and in :class:`bytes` patterns they can only contain
411   bytes in the ASCII range.  Each group name must be defined only once within
412   a regular expression.  A symbolic group is also a numbered group, just as if
413   the group were not named.
414
   Named groups can be referenced in three contexts.  If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
432
433   .. versionchanged:: 3.12
434      In :class:`bytes` patterns, group *name* can only contain bytes
435      in the ASCII range (``b'\x00'``-``b'\x7f'``).
436
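   The three reference contexts can be sketched as follows; the subject
   string and the replacement text are assumptions made for this example.

   .. code-block:: python

      import re

      subject = """say "hi" or 'bye'"""
      pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")

      # Match object: by name, by number, and positional helpers.
      m = pattern.search(subject)
      whole = m.group(0)
      quote_char = m.group("quote")
      same_group = m.group(1)
      quote_end = m.end("quote")

      # Replacement string: \g<quote> refers to the named group.
      swapped = pattern.sub(r"\g<quote>...\g<quote>", subject)
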
437.. index:: single: (?P=; in regular expressions
438
439``(?P=name)``
440   A backreference to a named group; it matches whatever text was matched by the
441   earlier group named *name*.
442
443.. index:: single: (?#; in regular expressions
444
445``(?#...)``
446   A comment; the contents of the parentheses are simply ignored.
447
448.. index:: single: (?=; in regular expressions
449
450``(?=...)``
451   Matches if ``...`` matches next, but doesn't consume any of the string.  This is
452   called a :dfn:`lookahead assertion`.  For example, ``Isaac (?=Asimov)`` will match
453   ``'Isaac '`` only if it's followed by ``'Asimov'``.
454
455.. index:: single: (?!; in regular expressions
456
457``(?!...)``
458   Matches if ``...`` doesn't match next.  This is a :dfn:`negative lookahead assertion`.
459   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
460   followed by ``'Asimov'``.
461
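   Both lookahead forms can be tried with this sketch; note that the
   asserted text is not part of the match.

   .. code-block:: python

      import re

      positive = re.compile(r"Isaac (?=Asimov)")
      negative = re.compile(r"Isaac (?!Asimov)")

      pos_hit = positive.search("Isaac Asimov")
      pos_miss = positive.search("Isaac Newton")
      neg_hit = negative.search("Isaac Newton")
      neg_miss = negative.search("Isaac Asimov")

   ``pos_hit.group()`` is ``'Isaac '``: the lookahead checks for ``'Asimov'``
   without consuming it.
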
462.. index:: single: (?<=; in regular expressions
463
``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position.  This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'
485
486   .. versionchanged:: 3.5
487      Added support for group references of fixed length.
488
489.. index:: single: (?<!; in regular expressions
490
491``(?<!...)``
492   Matches if the current position in the string is not preceded by a match for
493   ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
494   positive lookbehind assertions, the contained pattern must only match strings of
495   some fixed length.  Patterns which start with negative lookbehind assertions may
496   match at the beginning of the string being searched.
497
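   A small sketch of a negative lookbehind, with an invented subject string:
   it finds numbers that are not preceded by a dollar sign, and shows that
   such a pattern can match at the start of the string.

   .. code-block:: python

      import re

      # The \d in the set keeps the match from starting inside '$100'.
      unpriced = re.findall(r"(?<![$\d])\d+", "cost $100 or 200")

      # Nothing precedes position 0, so the assertion succeeds there.
      at_start = re.match(r"(?<!-)\w+", "spam-egg").group()
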
498.. _re-conditional-expression:
499.. index:: single: (?(; in regular expressions
500
501``(?(id/name)yes-pattern|no-pattern)``
502   Will try to match with ``yes-pattern`` if the group with given *id* or
503   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
504   optional and can be omitted. For example,
505   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
506   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
507   not with ``'<user@host.com'`` nor ``'user@host.com>'``.
508
509   .. versionchanged:: 3.12
510      Group *id* can only contain ASCII digits.
511      In :class:`bytes` patterns, group *name* can only contain bytes
512      in the ASCII range (``b'\x00'``-``b'\x7f'``).
513
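   The four strings mentioned above can be checked with this sketch
   (the dictionary is only a convenient way to collect the results):

   .. code-block:: python

      import re

      email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

      results = {
          s: email.match(s) is not None
          for s in ("<user@host.com>", "user@host.com",
                    "<user@host.com", "user@host.com>")
      }
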
514
515.. _re-special-sequences:
516
517The special sequences consist of ``'\'`` and a character from the list below.
518If the ordinary character is not an ASCII digit or an ASCII letter, then the
519resulting RE will match the second character.  For example, ``\$`` matches the
520character ``'$'``.
521
522.. index:: single: \ (backslash); in regular expressions
523
524``\number``
525   Matches the contents of the group of the same number.  Groups are numbered
526   starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
527   but not ``'thethe'`` (note the space after the group).  This special sequence
528   can only be used to match one of the first 99 groups.  If the first digit of
529   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
530   a group match, but as the character with octal value *number*. Inside the
531   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
532   characters.
533
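   A brief sketch of both readings of ``\number``: a group reference, and
   an octal character escape when three octal digits are given.

   .. code-block:: python

      import re

      # Backreference: the text after the space must repeat group 1.
      repeat = re.compile(r"(.+) \1")
      doubled = [s for s in ("the the", "55 55", "thethe") if repeat.fullmatch(s)]

      # Three octal digits denote a character: octal 101 is 'A'.
      octal = re.fullmatch(r"\101", "A") is not None
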
534.. index:: single: \A; in regular expressions
535
536``\A``
537   Matches only at the start of the string.
538
539.. index:: single: \b; in regular expressions
540
541``\b``
542   Matches the empty string, but only at the beginning or end of a word.
543   A word is defined as a sequence of word characters.
544   Note that formally, ``\b`` is defined as the boundary
545   between a ``\w`` and a ``\W`` character (or vice versa),
546   or between ``\w`` and the beginning or end of the string.
547   This means that ``r'\bat\b'`` matches ``'at'``, ``'at.'``, ``'(at)'``,
548   and ``'as at ay'`` but not ``'attempt'`` or ``'atlas'``.
549
550   The default word characters in Unicode (str) patterns
551   are Unicode alphanumerics and the underscore,
552   but this can be changed by using the :py:const:`~re.ASCII` flag.
553   Word boundaries are determined by the current locale
554   if the :py:const:`~re.LOCALE` flag is used.
555
556   .. note::
557
558      Inside a character range, ``\b`` represents the backspace character,
559      for compatibility with Python's string literals.
560
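   The word-boundary examples above can be verified with this sketch:

   .. code-block:: python

      import re

      word_at = re.compile(r"\bat\b")
      matches = {
          s: bool(word_at.search(s))
          for s in ("at", "at.", "(at)", "as at ay", "attempt", "atlas")
      }

      # Inside a character class, \b is the backspace character instead.
      backspace = re.fullmatch(r"[\b]", "\x08") is not None
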
561.. index:: single: \B; in regular expressions
562
563``\B``
564   Matches the empty string,
565   but only when it is *not* at the beginning or end of a word.
566   This means that ``r'at\B'`` matches ``'athens'``, ``'atom'``,
567   ``'attorney'``, but not ``'at'``, ``'at.'``, or ``'at!'``.
568   ``\B`` is the opposite of ``\b``,
569   so word characters in Unicode (str) patterns
570   are Unicode alphanumerics or the underscore,
571   although this can be changed by using the :py:const:`~re.ASCII` flag.
572   Word boundaries are determined by the current locale
573   if the :py:const:`~re.LOCALE` flag is used.
574
   .. note::

      ``\B`` does not match an empty string, which differs from
      RE implementations in other programming languages such as Perl.
      This behavior is kept for compatibility reasons.
580
581.. index:: single: \d; in regular expressions
582
583``\d``
584   For Unicode (str) patterns:
585      Matches any Unicode decimal digit
586      (that is, any character in Unicode character category `[Nd]`__).
587      This includes ``[0-9]``, and also many other digit characters.
588
589      Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.
590
591      __ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153
592
593   For 8-bit (bytes) patterns:
594      Matches any decimal digit in the ASCII character set;
595      this is equivalent to ``[0-9]``.
596
597.. index:: single: \D; in regular expressions
598
599``\D``
600   Matches any character which is not a decimal digit.
601   This is the opposite of ``\d``.
602
603   Matches ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used.
604
605.. index:: single: \s; in regular expressions
606
607``\s``
608   For Unicode (str) patterns:
609      Matches Unicode whitespace characters (as defined by :py:meth:`str.isspace`).
610      This includes ``[ \t\n\r\f\v]``, and also many other characters, for example the
611      non-breaking spaces mandated by typography rules in many languages.
612
613      Matches ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
614
615   For 8-bit (bytes) patterns:
616      Matches characters considered whitespace in the ASCII character set;
617      this is equivalent to ``[ \t\n\r\f\v]``.
618
619.. index:: single: \S; in regular expressions
620
621``\S``
622   Matches any character which is not a whitespace character. This is
623   the opposite of ``\s``.
624
625   Matches ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
626
627.. index:: single: \w; in regular expressions
628
629``\w``
630   For Unicode (str) patterns:
631      Matches Unicode word characters;
632      this includes all Unicode alphanumeric characters
633      (as defined by :py:meth:`str.isalnum`),
634      as well as the underscore (``_``).
635
636      Matches ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
637
638   For 8-bit (bytes) patterns:
639      Matches characters considered alphanumeric in the ASCII character set;
640      this is equivalent to ``[a-zA-Z0-9_]``.
641      If the :py:const:`~re.LOCALE` flag is used,
642      matches characters considered alphanumeric in the current locale and the underscore.
643
644.. index:: single: \W; in regular expressions
645
646``\W``
647   Matches any character which is not a word character.
648   This is the opposite of ``\w``.
649   By default, matches non-underscore (``_``) characters
650   for which :py:meth:`str.isalnum` returns ``False``.
651
652   Matches ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
653
654   If the :py:const:`~re.LOCALE` flag is used,
655   matches characters which are neither alphanumeric in the current locale
656   nor the underscore.
657
658.. index:: single: \Z; in regular expressions
659
660``\Z``
661   Matches only at the end of the string.
662
663.. index::
664   single: \a; in regular expressions
665   single: \b; in regular expressions
666   single: \f; in regular expressions
667   single: \n; in regular expressions
668   single: \N; in regular expressions
669   single: \r; in regular expressions
670   single: \t; in regular expressions
671   single: \u; in regular expressions
672   single: \U; in regular expressions
673   single: \v; in regular expressions
674   single: \x; in regular expressions
675   single: \\; in regular expressions
676
Most of the :ref:`escape sequences <escape-sequences>` supported by Python
string literals are also accepted by the regular expression parser::

   \a      \b      \f      \n
   \N      \r      \t      \u
   \U      \v      \x      \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)
686
687``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are
688only recognized in Unicode (str) patterns.
689In bytes patterns they are errors.
690Unknown escapes of ASCII letters are reserved
691for future use and treated as errors.
692
693Octal escapes are included in a limited form.  If the first digit is a 0, or if
694there are three octal digits, it is considered an octal escape. Otherwise, it is
695a group reference.  As for string literals, octal escapes are always at most
696three digits in length.
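
The rules above for octal escapes, group references, ``\b`` and unknown
escapes can be sketched as follows.  The patterns are chosen only for
illustration, and ``re.error`` is used here simply because that exception
name is available in every version::

   import re

   # Three octal digits: an octal escape (0o141 is "a").
   octal = re.fullmatch(r"\141", "a")
   # A leading 0 also makes it an octal escape (here, the NUL character).
   nul = re.fullmatch(r"\0", "\x00")
   # Otherwise the digits are a group reference.
   backref = re.fullmatch(r"(a)\1", "aa")
   # \b means "backspace" only inside a character class.
   backspace = re.fullmatch(r"[\b]", "\x08")

   # An unknown escape made of a backslash and an ASCII letter is rejected.
   try:
       re.compile(r"\q")
   except re.error as exc:
       unknown_escape_error = exc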
697
698.. versionchanged:: 3.3
699   The ``'\u'`` and ``'\U'`` escape sequences have been added.
700
701.. versionchanged:: 3.6
   Unknown escapes consisting of ``'\'`` and an ASCII letter are now errors.
703
704.. versionchanged:: 3.8
705   The :samp:`'\\N\\{{name}\\}'` escape sequence has been added. As in string literals,
706   it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).
707
708
709.. _contents-of-module-re:
710
711Module Contents
712---------------
713
The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions.  Most non-trivial applications use the compiled form.
718
719
720Flags
721^^^^^
722
723.. versionchanged:: 3.6
724   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
725   :class:`enum.IntFlag`.
726
727
728.. class:: RegexFlag
729
730   An :class:`enum.IntFlag` class containing the regex options listed below.
731
732   .. versionadded:: 3.11 - added to ``__all__``
733
734.. data:: A
735          ASCII
736
737   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
738   perform ASCII-only matching instead of full Unicode matching.  This is only
739   meaningful for Unicode (str) patterns, and is ignored for bytes patterns.
740
741   Corresponds to the inline flag ``(?a)``.
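
   For instance, ``\d`` in a str pattern normally matches any Unicode decimal
   digit.  A minimal sketch, assuming a sample string that mixes ASCII digits
   with Arabic-Indic digits::

      import re

      text = "42 \u0664\u0662"    # ASCII "42", then Arabic-Indic digits four, two

      all_digits = re.findall(r"\d+", text)               # both numbers
      ascii_digits = re.findall(r"\d+", text, re.ASCII)   # ['42']
      inline_digits = re.findall(r"(?a)\d+", text)        # same as re.ASCII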
742
743   .. note::
744
745      The :py:const:`~re.U` flag still exists for backward compatibility,
746      but is redundant in Python 3 since
747      matches are Unicode by default for ``str`` patterns,
748      and Unicode matching isn't allowed for bytes patterns.
749      :py:const:`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant.
750
751
752.. data:: DEBUG
753
   Display debug information about the compiled expression.
755
756   No corresponding inline flag.
757
758
759.. data:: I
760          IGNORECASE
761
762   Perform case-insensitive matching;
   expressions like ``[A-Z]`` will also match lowercase letters.
764   Full Unicode matching (such as ``Ü`` matching ``ü``)
765   also works unless the :py:const:`~re.ASCII` flag
766   is used to disable non-ASCII matches.
767   The current locale does not change the effect of this flag
768   unless the :py:const:`~re.LOCALE` flag is also used.
769
770   Corresponds to the inline flag ``(?i)``.
771
772   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
773   combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
774   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
775   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
776   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
777   If the :py:const:`~re.ASCII` flag is used, only letters 'a' to 'z'
778   and 'A' to 'Z' are matched.
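
   The behaviour described above can be checked with a short sketch (the
   variable names are chosen for this example only)::

      import re

      # The four non-ASCII letters listed above.
      extras = "\u0130\u0131\u017f\u212a"

      # All four match [a-z] when matching case-insensitively in Unicode mode.
      unicode_hits = re.findall(r"[a-z]", extras, re.IGNORECASE)
      # None of them match once re.ASCII is added.
      ascii_hits = re.findall(r"[a-z]", extras, re.IGNORECASE | re.ASCII)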
779
780.. data:: L
781          LOCALE
782
783   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
784   dependent on the current locale.
785   This flag can be used only with bytes patterns.
786
787   Corresponds to the inline flag ``(?L)``.
788
789   .. warning::
790
791      This flag is discouraged; consider Unicode matching instead.
792      The locale mechanism is very unreliable
793      as it only handles one "culture" at a time
794      and only works with 8-bit locales.
795      Unicode matching is enabled by default for Unicode (str) patterns
796      and it is able to handle different locales and languages.
797
798   .. versionchanged:: 3.6
799      :py:const:`~re.LOCALE` can be used only with bytes patterns
800      and is not compatible with :py:const:`~re.ASCII`.
801
802   .. versionchanged:: 3.7
803      Compiled regular expression objects with the :py:const:`~re.LOCALE` flag
804      no longer depend on the locale at compile time.
805      Only the locale at matching time affects the result of matching.
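
   Both restrictions are checked when the pattern is compiled.  The sketch
   below assumes that the failure is reported as a :exc:`ValueError`, as
   current CPython does; ``compile_error`` is a hypothetical helper written
   for this example::

      import re

      def compile_error(pattern, flags):
          """Return the ValueError raised by re.compile(), or None."""
          try:
              re.compile(pattern, flags)
          except ValueError as exc:
              return exc
          return None

      str_error = compile_error(r"\w+", re.LOCALE)                 # str pattern
      mixed_error = compile_error(rb"\w+", re.LOCALE | re.ASCII)   # incompatible
      bytes_ok = compile_error(rb"\w+", re.LOCALE)                 # None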
806
807
808.. data:: M
809          MULTILINE
810
811   When specified, the pattern character ``'^'`` matches at the beginning of the
812   string and at the beginning of each line (immediately following each newline);
813   and the pattern character ``'$'`` matches at the end of the string and at the
814   end of each line (immediately preceding each newline).  By default, ``'^'``
815   matches only at the beginning of the string, and ``'$'`` only at the end of the
816   string and immediately before the newline (if any) at the end of the string.
817
818   Corresponds to the inline flag ``(?m)``.
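
   The difference can be sketched as follows (the sample text is made up
   for this example)::

      import re

      text = "one\ntwo\n"

      first_only = re.findall(r"^\w+", text)                  # ['one']
      every_line = re.findall(r"^\w+", text, re.MULTILINE)    # ['one', 'two']
      # By default, '$' matches at the end and just before a final newline.
      last_only = re.findall(r"\w+$", text)                   # ['two']
      each_end = re.findall(r"\w+$", text, re.MULTILINE)      # ['one', 'two']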
819
820.. data:: NOFLAG
821
   Indicates that no flag is applied; the value is ``0``.  This flag may be used
   as a default value for a function keyword argument or as a base value that
   will be conditionally ORed with other flags.  Example of use as a default
   value::

      def myfunc(pattern, text, flag=re.NOFLAG):
          return re.match(pattern, text, flag)
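
   The other use, a base value that is conditionally ORed with other flags,
   might look like the following sketch.  ``build_flags`` and its parameters
   are hypothetical names, and ``re.NOFLAG`` requires Python 3.11 or later::

      import re

      def build_flags(ignore_case=False, multiline=False):
          """Combine optional flags, starting from "no flags"."""
          flags = re.NOFLAG
          if ignore_case:
              flags |= re.IGNORECASE
          if multiline:
              flags |= re.MULTILINE
          return flags

      default_flags = build_flags()                       # equal to 0
      both_flags = build_flags(True, multiline=True)      # IGNORECASE | MULTILINE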
829
830   .. versionadded:: 3.11
831
832.. data:: S
833          DOTALL
834
835   Make the ``'.'`` special character match any character at all, including a
836   newline; without this flag, ``'.'`` will match anything *except* a newline.
837
838   Corresponds to the inline flag ``(?s)``.
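
   A minimal sketch of the difference (the sample text is illustrative)::

      import re

      text = "first\nsecond"

      one_line = re.match(r".+", text).group()                 # 'first'
      everything = re.match(r".+", text, re.DOTALL).group()    # 'first\nsecond'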
839
840
841.. data:: U
842          UNICODE
843
844   In Python 3, Unicode characters are matched by default
845   for ``str`` patterns.
846   This flag is therefore redundant with **no effect**
847   and is only kept for backward compatibility.
848
849   See :py:const:`~re.ASCII` to restrict matching to ASCII characters instead.
850
851.. data:: X
852          VERBOSE
853
854   .. index:: single: # (hash); in regular expressions
855
856   This flag allows you to write regular expressions that look nicer and are
857   more readable by allowing you to visually separate logical sections of the
858   pattern and add comments. Whitespace within the pattern is ignored, except
859   when in a character class, or when preceded by an unescaped backslash,
860   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``. For example, ``(? :``
861   and ``* ?`` are not allowed.
862   When a line contains a ``#`` that is not in a character class and is not
863   preceded by an unescaped backslash, all characters from the leftmost such
864   ``#`` through the end of the line are ignored.
865
866   This means that the two following regular expression objects that match a
867   decimal number are functionally equal::
868
869      a = re.compile(r"""\d +  # the integral part
870                         \.    # the decimal point
871                         \d *  # some fractional digits""", re.X)
872      b = re.compile(r"\d+\.\d*")
873
874   Corresponds to the inline flag ``(?x)``.
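
   The exceptions for escaped whitespace and for ``#`` inside a character
   class can be seen in a sketch like this one (the pattern and the sample
   input are assumptions made for the example)::

      import re

      assignment = re.compile(r"""
          (\w+)       # the key
          \ =\        # an equals sign with an escaped, literal space on each side
          ([#]\d+)    # a hash sign is literal inside a character class
      """, re.VERBOSE)

      match = assignment.fullmatch("id = #42")
      groups = match.groups()         # ('id', '#42')
      # Without the literal spaces in the input, the pattern does not match.
      compact = assignment.fullmatch("id=#42")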
875
876
877Functions
878^^^^^^^^^
879
880.. function:: compile(pattern, flags=0)
881
882   Compile a regular expression pattern into a :ref:`regular expression object
883   <re-objects>`, which can be used for matching using its
884   :func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
885   below.
886
887   The expression's behaviour can be modified by specifying a *flags* value.
888   Values can be any of the `flags`_ variables, combined using bitwise OR
889   (the ``|`` operator).
890
891   The sequence ::
892
893      prog = re.compile(pattern)
894      result = prog.match(string)
895
896   is equivalent to ::
897
898      result = re.match(pattern, string)
899
900   but using :func:`re.compile` and saving the resulting regular expression
901   object for reuse is more efficient when the expression will be used several
902   times in a single program.
903
904   .. note::
905
906      The compiled versions of the most recent patterns passed to
907      :func:`re.compile` and the module-level matching functions are cached, so
908      programs that use only a few regular expressions at a time needn't worry
909      about compiling regular expressions.
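
   A short sketch of compiling once and reusing the result, with two flags
   combined by ``|`` (the pattern and the sample message are illustrative)::

      import re

      header = re.compile(r"^subject:\s*(.*)$", re.IGNORECASE | re.MULTILINE)

      message = "From: a@example.org\nSubject: Hello\nSUBJECT: Again\n"
      subjects = header.findall(message)          # ['Hello', 'Again']
      first = header.search(message).group(1)     # 'Hello'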
910
911
912.. function:: search(pattern, string, flags=0)
913
914   Scan through *string* looking for the first location where the regular expression
915   *pattern* produces a match, and return a corresponding :class:`~re.Match`. Return
916   ``None`` if no position in the string matches the pattern; note that this is
917   different from finding a zero-length match at some point in the string.
918
919   The expression's behaviour can be modified by specifying a *flags* value.
920   Values can be any of the `flags`_ variables, combined using bitwise OR
921   (the ``|`` operator).
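
   The distinction between a zero-length match and no match at all can be
   sketched like this (patterns chosen only for illustration)::

      import re

      empty = re.search(r"x*", "abc")      # zero-length match at index 0
      missing = re.search(r"x+", "abc")    # None: no position matches

      empty_span = empty.span()            # (0, 0)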
922
923
924.. function:: match(pattern, string, flags=0)
925
926   If zero or more characters at the beginning of *string* match the regular
927   expression *pattern*, return a corresponding :class:`~re.Match`.  Return
928   ``None`` if the string does not match the pattern; note that this is
929   different from a zero-length match.
930
931   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
932   at the beginning of the string and not at the beginning of each line.
933
934   If you want to locate a match anywhere in *string*, use :func:`search`
935   instead (see also :ref:`search-vs-match`).
936
937   The expression's behaviour can be modified by specifying a *flags* value.
938   Values can be any of the `flags`_ variables, combined using bitwise OR
939   (the ``|`` operator).
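
   As a sketch of the :const:`MULTILINE` caveat above (the sample text is
   made up)::

      import re

      text = "alpha\nbeta"

      at_start = re.match(r"^beta", text, re.MULTILINE)     # None
      anywhere = re.search(r"^beta", text, re.MULTILINE)    # match at index 6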
940
941
942.. function:: fullmatch(pattern, string, flags=0)
943
944   If the whole *string* matches the regular expression *pattern*, return a
945   corresponding :class:`~re.Match`.  Return ``None`` if the string does not match
946   the pattern; note that this is different from a zero-length match.
947
948   The expression's behaviour can be modified by specifying a *flags* value.
949   Values can be any of the `flags`_ variables, combined using bitwise OR
950   (the ``|`` operator).
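
   A minimal comparison with :func:`match`, using an illustrative input::

      import re

      prefix_only = re.match(r"\d+", "2024-05")     # matches the prefix '2024'
      whole = re.fullmatch(r"\d+", "2024-05")       # None: '-05' is left over
      digits = re.fullmatch(r"\d+", "2024")         # matches the whole string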
951
952   .. versionadded:: 3.4
953
954
955.. function:: split(pattern, string, maxsplit=0, flags=0)
956
957   Split *string* by the occurrences of *pattern*.  If capturing parentheses are
958   used in *pattern*, then the text of all groups in the pattern are also returned
959   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
960   splits occur, and the remainder of the string is returned as the final element
961   of the list. ::
962
963      >>> re.split(r'\W+', 'Words, words, words.')
964      ['Words', 'words', 'words', '']
965      >>> re.split(r'(\W+)', 'Words, words, words.')
966      ['Words', ', ', 'words', ', ', 'words', '.', '']
967      >>> re.split(r'\W+', 'Words, words, words.', maxsplit=1)
968      ['Words', 'words, words.']
969      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
970      ['0', '3', '9']
971
972   If there are capturing groups in the separator and it matches at the start of
973   the string, the result will start with an empty string.  The same holds for
974   the end of the string::
975
976      >>> re.split(r'(\W+)', '...words, words...')
977      ['', '...', 'words', ', ', 'words', '...', '']
978
979   That way, separator components are always found at the same relative
980   indices within the result list.
981
982   Empty matches for the pattern split the string only when not adjacent
983   to a previous empty match.
984
985   .. code:: pycon
986
987      >>> re.split(r'\b', 'Words, words, words.')
988      ['', 'Words', ', ', 'words', ', ', 'words', '.']
989      >>> re.split(r'\W*', '...words...')
990      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
991      >>> re.split(r'(\W*)', '...words...')
992      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
993
994   The expression's behaviour can be modified by specifying a *flags* value.
995   Values can be any of the `flags`_ variables, combined using bitwise OR
996   (the ``|`` operator).
997
998   .. versionchanged:: 3.1
999      Added the optional flags argument.
1000
1001   .. versionchanged:: 3.7
      Added support for splitting on a pattern that could match an empty string.
1003
1004   .. deprecated:: 3.13
1005      Passing *maxsplit* and *flags* as positional arguments is deprecated.
1006      In future Python versions they will be
1007      :ref:`keyword-only parameters <keyword-only_parameter>`.
1008
1009
1010.. function:: findall(pattern, string, flags=0)
1011
1012   Return all non-overlapping matches of *pattern* in *string*, as a list of
1013   strings or tuples.  The *string* is scanned left-to-right, and matches
1014   are returned in the order found.  Empty matches are included in the result.
1015
1016   The result depends on the number of capturing groups in the pattern.
1017   If there are no groups, return a list of strings matching the whole
1018   pattern.  If there is exactly one group, return a list of strings
1019   matching that group.  If multiple groups are present, return a list
1020   of tuples of strings matching the groups.  Non-capturing groups do not
1021   affect the form of the result.
1022
1023      >>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
1024      ['foot', 'fell', 'fastest']
1025      >>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
1026      [('width', '20'), ('height', '10')]
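
   The single-group and non-capturing cases are not shown above; a sketch
   reusing the same sample text::

      import re

      text = "set width=20 and height=10"

      # Exactly one capturing group: a list of that group's strings.
      names = re.findall(r"(\w+)=\d+", text)        # ['width', 'height']
      # A non-capturing group does not change the form of the result.
      pairs = re.findall(r"(?:\w+)=\d+", text)      # ['width=20', 'height=10']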
1027
1028   The expression's behaviour can be modified by specifying a *flags* value.
1029   Values can be any of the `flags`_ variables, combined using bitwise OR
1030   (the ``|`` operator).
1031
1032   .. versionchanged:: 3.7
1033      Non-empty matches can now start just after a previous empty match.
1034
1035
1036.. function:: finditer(pattern, string, flags=0)
1037
1038   Return an :term:`iterator` yielding :class:`~re.Match` objects over
1039   all non-overlapping matches for the RE *pattern* in *string*.  The *string*
1040   is scanned left-to-right, and matches are returned in the order found.  Empty
1041   matches are included in the result.
1042
1043   The expression's behaviour can be modified by specifying a *flags* value.
1044   Values can be any of the `flags`_ variables, combined using bitwise OR
1045   (the ``|`` operator).
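
   Because each item is a :class:`~re.Match`, positions are available as
   well as text.  A short sketch with an invented input string::

      import re

      text = "a1bb22ccc333"

      # Collect the start index and matched text of every run of digits.
      found = [(m.start(), m.group()) for m in re.finditer(r"\d+", text)]
      # [(1, '1'), (4, '22'), (9, '333')]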
1046
1047   .. versionchanged:: 3.7
1048      Non-empty matches can now start just after a previous empty match.
1049
1050
1051.. function:: sub(pattern, repl, string, count=0, flags=0)
1052
1053   Return the string obtained by replacing the leftmost non-overlapping occurrences
1054   of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
1055   *string* is returned unchanged.  *repl* can be a string or a function; if it is
1056   a string, any backslash escapes in it are processed.  That is, ``\n`` is
1057   converted to a single newline character, ``\r`` is converted to a carriage return, and
1058   so forth.  Unknown escapes of ASCII letters are reserved for future use and
1059   treated as errors.  Other unknown escapes such as ``\&`` are left alone.
1060   Backreferences, such
1061   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
1062   For example::
1063
1064      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
1065      ...        r'static PyObject*\npy_\1(void)\n{',
1066      ...        'def myfunc():')
1067      'static PyObject*\npy_myfunc(void)\n{'
1068
1069   If *repl* is a function, it is called for every non-overlapping occurrence of
1070   *pattern*.  The function takes a single :class:`~re.Match` argument, and returns
1071   the replacement string.  For example::
1072
1073      >>> def dashrepl(matchobj):
1074      ...     if matchobj.group(0) == '-': return ' '
1075      ...     else: return '-'
1076      ...
1077      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
1078      'pro--gram files'
1079      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
1080      'Baked Beans & Spam'
1081
1082   The pattern may be a string or a :class:`~re.Pattern`.
1083
1084   The optional argument *count* is the maximum number of pattern occurrences to be
1085   replaced; *count* must be a non-negative integer.  If omitted or zero, all
1086   occurrences will be replaced. Empty matches for the pattern are replaced only
1087   when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
1088   ``'-a-b--d-'``.
1089
1090   .. index:: single: \g; in regular expressions
1091
1092   In string-type *repl* arguments, in addition to the character escapes and
1093   backreferences described above,
1094   ``\g<name>`` will use the substring matched by the group named ``name``, as
1095   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
1096   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
1097   in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
1098   reference to group 20, not a reference to group 2 followed by the literal
1099   character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
1100   substring matched by the RE.
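
   These forms can be sketched as follows (the patterns and inputs are only
   illustrative; ``re.error`` is used because that name exists in every
   version)::

      import re

      # Named groups in the replacement.
      swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)",
                       r"\g<last>, \g<first>", "Isaac Newton")
      # 'Newton, Isaac'

      # \g<1>0 is group 1 followed by a literal "0" ...
      padded = re.sub(r"(\d)", r"\g<1>0", "a1")         # 'a10'
      # ... whereas \10 refers to group 10, which this pattern does not have.
      try:
          re.sub(r"(\d)", r"\10", "a1")
      except re.error as exc:
          bad_reference = exc

      # \g<0> inserts the entire match.
      wrapped = re.sub(r"\d+", r"<\g<0>>", "a12b")      # 'a<12>b'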
1101
1102   The expression's behaviour can be modified by specifying a *flags* value.
1103   Values can be any of the `flags`_ variables, combined using bitwise OR
1104   (the ``|`` operator).
1105
1106   .. versionchanged:: 3.1
1107      Added the optional flags argument.
1108
1109   .. versionchanged:: 3.5
1110      Unmatched groups are replaced with an empty string.
1111
1112   .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      are now errors.
1115
1116   .. versionchanged:: 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
      are now errors.
1119      Empty matches for the pattern are replaced when adjacent to a previous
1120      non-empty match.
1121
1122   .. versionchanged:: 3.12
1123      Group *id* can only contain ASCII digits.
1124      In :class:`bytes` replacement strings, group *name* can only contain bytes
1125      in the ASCII range (``b'\x00'``-``b'\x7f'``).
1126
1127   .. deprecated:: 3.13
1128      Passing *count* and *flags* as positional arguments is deprecated.
1129      In future Python versions they will be
1130      :ref:`keyword-only parameters <keyword-only_parameter>`.
1131
1132
1133.. function:: subn(pattern, repl, string, count=0, flags=0)
1134
1135   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
1136   number_of_subs_made)``.
1137
1138   The expression's behaviour can be modified by specifying a *flags* value.
1139   Values can be any of the `flags`_ variables, combined using bitwise OR
1140   (the ``|`` operator).
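
   A brief sketch, with and without *count* (inputs are illustrative)::

      import re

      all_replaced = re.subn(r"\d", "#", "a1b2c3")             # ('a#b#c#', 3)
      two_replaced = re.subn(r"\d", "#", "a1b2c3", count=2)    # ('a#b#c3', 2)
      none_replaced = re.subn(r"\d", "#", "abc")               # ('abc', 0)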
1141
1142
1143.. function:: escape(pattern)
1144
1145   Escape special characters in *pattern*.
1146   This is useful if you want to match an arbitrary literal string that may
1147   have regular expression metacharacters in it.  For example::
1148
1149      >>> print(re.escape('https://www.python.org'))
1150      https://www\.python\.org
1151
1152      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
1153      >>> print('[%s]+' % re.escape(legal_chars))
1154      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+
1155
1156      >>> operators = ['+', '-', '*', '/', '**']
1157      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
1158      /|\-|\+|\*\*|\*
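
   A typical use is searching for text that is not under your control, such
   as user input.  A minimal sketch, where ``user_text`` stands in for such
   input::

      import re

      user_text = "1+1=2 (approx.)"
      haystack = "claim: 1+1=2 (approx.) holds"

      # Escaped, the text is matched literally.
      found = re.search(re.escape(user_text), haystack)
      # Unescaped, "+", "(", ")" and "." are treated as metacharacters,
      # and the pattern no longer matches the text it came from.
      unescaped = re.search(user_text, haystack)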
1159
   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; in a replacement string, only backslashes should be
   escaped.  For example::
1162
1163      >>> digits_re = r'\d+'
1164      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
1165      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
1166      /usr/sbin/sendmail - \d+ errors, \d+ warnings
1167
1168   .. versionchanged:: 3.3
1169      The ``'_'`` character is no longer escaped.
1170
1171   .. versionchanged:: 3.7
1172      Only characters that can have special meaning in a regular expression
1173      are escaped. As a result, ``'!'``, ``'"'``, ``'%'``, ``"'"``, ``','``,
1174      ``'/'``, ``':'``, ``';'``, ``'<'``, ``'='``, ``'>'``, ``'@'``, and
1175      ``"`"`` are no longer escaped.
1176
1177
1178.. function:: purge()
1179
1180   Clear the regular expression cache.
1181
1182
1183Exceptions
1184^^^^^^^^^^
1185
1186.. exception:: PatternError(msg, pattern=None, pos=None)
1187
1188   Exception raised when a string passed to one of the functions here is not a
1189   valid regular expression (for example, it might contain unmatched parentheses)
1190   or when some other error occurs during compilation or matching.  It is never an
1191   error if a string contains no match for a pattern.  The ``PatternError`` instance has
1192   the following additional attributes:
1193
1194   .. attribute:: msg
1195
1196      The unformatted error message.
1197
1198   .. attribute:: pattern
1199
1200      The regular expression pattern.
1201
1202   .. attribute:: pos
1203
1204      The index in *pattern* where compilation failed (may be ``None``).
1205
1206   .. attribute:: lineno
1207
1208      The line corresponding to *pos* (may be ``None``).
1209
1210   .. attribute:: colno
1211
1212      The column corresponding to *pos* (may be ``None``).
1213
1214   .. versionchanged:: 3.5
1215      Added additional attributes.
1216
1217   .. versionchanged:: 3.13
1218      ``PatternError`` was originally named ``error``; the latter is kept as an alias for
1219      backward compatibility.
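
   A sketch of inspecting these attributes.  It catches the exception under
   the alias ``re.error`` so that it also runs on versions before 3.13, and
   the invalid pattern is made up for the example::

      import re

      try:
          re.compile(r"(unclosed")
      except re.error as exc:
          # Collect the documented attributes for inspection.
          details = {
              "msg": exc.msg,
              "pattern": exc.pattern,
              "pos": exc.pos,
              "lineno": exc.lineno,
              "colno": exc.colno,
          }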
1220
1221.. _re-objects:
1222
1223Regular Expression Objects
1224--------------------------
1225
1226.. class:: Pattern
1227
1228   Compiled regular expression object returned by :func:`re.compile`.
1229
1230   .. versionchanged:: 3.9
1231      :py:class:`re.Pattern` supports ``[]`` to indicate a Unicode (str) or bytes pattern.
1232      See :ref:`types-genericalias`.
1233
1234.. method:: Pattern.search(string[, pos[, endpos]])
1235
1236   Scan through *string* looking for the first location where this regular
1237   expression produces a match, and return a corresponding :class:`~re.Match`.
1238   Return ``None`` if no position in the string matches the pattern; note that
1239   this is different from finding a zero-length match at some point in the string.
1240
1241   The optional second parameter *pos* gives an index in the string where the
1242   search is to start; it defaults to ``0``.  This is not completely equivalent to
1243   slicing the string; the ``'^'`` pattern character matches at the real beginning
1244   of the string and at positions just after a newline, but not necessarily at the
1245   index where the search is to start.
1246
1247   The optional parameter *endpos* limits how far the string will be searched; it
1248   will be as if the string is *endpos* characters long, so only the characters
1249   from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
1250   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
1251   expression object, ``rx.search(string, 0, 50)`` is equivalent to
1252   ``rx.search(string[:50], 0)``. ::
1253
1254      >>> pattern = re.compile("d")
1255      >>> pattern.search("dog")     # Match at index 0
1256      <re.Match object; span=(0, 1), match='d'>
1257      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
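
   The difference between *pos* and slicing, and the *endpos* equivalence
   described above, can be sketched like this (sample strings are
   illustrative)::

      import re

      rx = re.compile(r"^b")
      sliced = rx.search("ab"[1:])     # matches: "b" begins the sliced string
      offset = rx.search("ab", 1)      # None: index 1 is not the real beginning

      digits = re.compile(r"\d+")
      text = "0123456789"
      limited = digits.search(text, 2, 5).group()     # '234'
      same = digits.search(text[:5], 2).group()       # '234'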
1258
1259
1260.. method:: Pattern.match(string[, pos[, endpos]])
1261
1262   If zero or more characters at the *beginning* of *string* match this regular
1263   expression, return a corresponding :class:`~re.Match`. Return ``None`` if the
1264   string does not match the pattern; note that this is different from a
1265   zero-length match.
1266
1267   The optional *pos* and *endpos* parameters have the same meaning as for the
1268   :meth:`~Pattern.search` method. ::
1269
1270      >>> pattern = re.compile("o")
1271      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
1272      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
1273      <re.Match object; span=(1, 2), match='o'>
1274
1275   If you want to locate a match anywhere in *string*, use
1276   :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).
1277
1278
1279.. method:: Pattern.fullmatch(string[, pos[, endpos]])
1280
1281   If the whole *string* matches this regular expression, return a corresponding
1282   :class:`~re.Match`.  Return ``None`` if the string does not match the pattern;
1283   note that this is different from a zero-length match.
1284
1285   The optional *pos* and *endpos* parameters have the same meaning as for the
1286   :meth:`~Pattern.search` method. ::
1287
1288      >>> pattern = re.compile("o[gh]")
1289      >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.fullmatch("ogre")     # No match as the pattern does not match the whole string.
1291      >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
1292      <re.Match object; span=(1, 3), match='og'>
1293
1294   .. versionadded:: 3.4
1295
1296
1297.. method:: Pattern.split(string, maxsplit=0)
1298
1299   Identical to the :func:`split` function, using the compiled pattern.
1300
1301
1302.. method:: Pattern.findall(string[, pos[, endpos]])
1303
   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region in the same way as for :meth:`~Pattern.search`.
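
   A small sketch of the *pos* and *endpos* parameters, which work the same
   way for :meth:`~Pattern.finditer` (the input is illustrative)::

      import re

      digit = re.compile(r"\d")
      text = "1a2b3c"

      from_two = digit.findall(text, 2)           # ['2', '3']
      up_to_three = digit.findall(text, 0, 3)     # ['1', '2']
      spans = [m.span() for m in digit.finditer(text, 2)]    # [(2, 3), (4, 5)]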
1307
1308
1309.. method:: Pattern.finditer(string[, pos[, endpos]])
1310
   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region in the same way as for :meth:`~Pattern.search`.
1314
1315
1316.. method:: Pattern.sub(repl, string, count=0)
1317
1318   Identical to the :func:`sub` function, using the compiled pattern.
1319
1320
1321.. method:: Pattern.subn(repl, string, count=0)
1322
1323   Identical to the :func:`subn` function, using the compiled pattern.
1324
1325
1326.. attribute:: Pattern.flags
1327
1328   The regex matching flags.  This is a combination of the flags given to
1329   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
1330   flags such as :py:const:`~re.UNICODE` if the pattern is a Unicode string.
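
   The three sources of flags can be seen in a sketch such as the following
   (patterns chosen only for illustration)::

      import re

      plain = re.compile("a")
      combined = re.compile("(?i)a", re.MULTILINE)
      raw = re.compile(b"a")

      has_unicode = bool(plain.flags & re.UNICODE)              # True: implicit
      has_ignorecase = bool(combined.flags & re.IGNORECASE)     # True: inline
      has_multiline = bool(combined.flags & re.MULTILINE)       # True: argument
      bytes_unicode = bool(raw.flags & re.UNICODE)              # False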
1331
1332
1333.. attribute:: Pattern.groups
1334
1335   The number of capturing groups in the pattern.
1336
1337
1338.. attribute:: Pattern.groupindex
1339
1340   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
1341   numbers.  The dictionary is empty if no symbolic groups were used in the
1342   pattern.
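
   A short sketch showing this attribute next to :attr:`~Pattern.groups`
   (the pattern is made up for the example)::

      import re

      date = re.compile(r"(?P<year>\d{4})-(\d{2})")

      group_count = date.groups                            # 2
      names = dict(date.groupindex)                        # {'year': 1}
      no_names = dict(re.compile(r"(\d+)").groupindex)     # {}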
1343
1344
1345.. attribute:: Pattern.pattern
1346
1347   The pattern string from which the pattern object was compiled.
1348
1349
1350.. versionchanged:: 3.7
   Added support for :func:`copy.copy` and :func:`copy.deepcopy`.  Compiled
   regular expression objects are considered atomic.
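
In practice, "atomic" means that copying returns the same object rather
than a duplicate, as this sketch (run against CPython) suggests::

   import copy
   import re

   pattern = re.compile(r"\d+")

   shallow_is_same = copy.copy(pattern) is pattern
   deep_is_same = copy.deepcopy(pattern) is pattern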
1353
1354
1355.. _match-objects:
1356
1357Match Objects
1358-------------
1359
1360Match objects always have a boolean value of ``True``.
1361Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
1362when there is no match, you can test whether there was a match with a simple
1363``if`` statement::
1364
1365   match = re.search(pattern, string)
1366   if match:
1367       process(match)
1368
1369.. class:: Match
1370
1371   Match object returned by successful ``match``\ es and ``search``\ es.
1372
1373   .. versionchanged:: 3.9
1374      :py:class:`re.Match` supports ``[]`` to indicate a Unicode (str) or bytes match.
1375      See :ref:`types-genericalias`.
1376
1377.. method:: Match.expand(template)
1378
1379   Return the string obtained by doing backslash substitution on the template
1380   string *template*, as done by the :meth:`~Pattern.sub` method.
1381   Escapes such as ``\n`` are converted to the appropriate characters,
1382   and numeric backreferences (``\1``, ``\2``) and named backreferences
1383   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
1384   corresponding group. The backreference ``\g<0>`` will be
1385   replaced by the entire match.
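
   A minimal sketch combining a named backreference, a numeric backreference
   and a character escape (the template is illustrative)::

      import re

      m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

      expanded = m.expand(r"\g<last>, \1\n")    # 'Newton, Isaac\n'
      whole = m.expand(r"[\g<0>]")              # '[Isaac Newton]'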
1386
1387   .. versionchanged:: 3.5
1388      Unmatched groups are replaced with an empty string.
1389
1390.. method:: Match.group([group1, ...])
1391
1392   Returns one or more subgroups of the match.  If there is a single argument, the
1393   result is a single string; if there are multiple arguments, the result is a
1394   tuple with one item per argument. Without arguments, *group1* defaults to zero
1395   (the whole match is returned). If a *groupN* argument is zero, the corresponding
1396   return value is the entire matching string; if it is in the inclusive range
1397   [1..99], it is the string matching the corresponding parenthesized group.  If a
1398   group number is negative or larger than the number of groups defined in the
1399   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
1400   part of the pattern that did not match, the corresponding result is ``None``.
1401   If a group is contained in a part of the pattern that matched multiple times,
1402   the last match is returned. ::
1403
1404      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
1405      >>> m.group(0)       # The entire match
1406      'Isaac Newton'
1407      >>> m.group(1)       # The first parenthesized subgroup.
1408      'Isaac'
1409      >>> m.group(2)       # The second parenthesized subgroup.
1410      'Newton'
1411      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
1412      ('Isaac', 'Newton')
1413
1414   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
1415   arguments may also be strings identifying groups by their group name.  If a
1416   string argument is not used as a group name in the pattern, an :exc:`IndexError`
1417   exception is raised.
1418
1419   A moderately complicated example::
1420
1421      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
1422      >>> m.group('first_name')
1423      'Malcolm'
1424      >>> m.group('last_name')
1425      'Reynolds'
1426
1427   Named groups can also be referred to by their index::
1428
1429      >>> m.group(1)
1430      'Malcolm'
1431      >>> m.group(2)
1432      'Reynolds'
1433
1434   If a group matches multiple times, only the last match is accessible::
1435
1436      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
1437      >>> m.group(1)                        # Returns only the last match.
1438      'c3'
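
   The ``None`` and :exc:`IndexError` cases described above can be sketched
   as follows (the pattern is chosen only for illustration)::

      import re

      m = re.match(r"(a)|(b)", "a")

      unmatched = m.group(2)      # None: group 2 did not take part in the match
      both = m.group(1, 2)        # ('a', None)

      # Group 3 and the name "missing" are not defined in the pattern.
      errors = []
      for bad_group in (3, "missing"):
          try:
              m.group(bad_group)
          except IndexError as exc:
              errors.append(exc)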
1439
1440
1441.. method:: Match.__getitem__(g)
1442
1443   This is identical to ``m.group(g)``.  This allows easier access to
1444   an individual group from a match::
1445
1446      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
1447      >>> m[0]       # The entire match
1448      'Isaac Newton'
1449      >>> m[1]       # The first parenthesized subgroup.
1450      'Isaac'
1451      >>> m[2]       # The second parenthesized subgroup.
1452      'Newton'
1453
1454   Named groups are supported as well::
1455
1456      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Isaac Newton")
1457      >>> m['first_name']
1458      'Isaac'
1459      >>> m['last_name']
1460      'Newton'
1461
1462   .. versionadded:: 3.6
1463
1464
1465.. method:: Match.groups(default=None)
1466
1467   Return a tuple containing all the subgroups of the match, from 1 up to however
1468   many groups are in the pattern.  The *default* argument is used for groups that
1469   did not participate in the match; it defaults to ``None``.
1470
1471   For example::
1472
1473      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
1474      >>> m.groups()
1475      ('24', '1632')
1476
1477   If we make the decimal place and everything after it optional, not all groups
1478   might participate in the match.  These groups will default to ``None`` unless
1479   the *default* argument is given::
1480
1481      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
1482      >>> m.groups()      # Second group defaults to None.
1483      ('24', None)
1484      >>> m.groups('0')   # Now, the second group defaults to '0'.
1485      ('24', '0')
1486
1487
1488.. method:: Match.groupdict(default=None)
1489
1490   Return a dictionary containing all the *named* subgroups of the match, keyed by
1491   the subgroup name.  The *default* argument is used for groups that did not
1492   participate in the match; it defaults to ``None``.  For example::
1493
1494      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
1495      >>> m.groupdict()
1496      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
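
   A further sketch, using a hypothetical pattern with an optional named
   group, suggests how *default* fills in groups that did not participate::

      import re

      # "frac" is optional, so it takes no part in the match for "24".
      m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")
      without_default = m.groupdict()      # {'whole': '24', 'frac': None}
      with_default = m.groupdict('0')      # {'whole': '24', 'frac': '0'}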
1497
1498
1499.. method:: Match.start([group])
1500            Match.end([group])
1501
1502   Return the indices of the start and end of the substring matched by *group*;
1503   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
1504   *group* exists but did not contribute to the match.  For a match object *m*, and
1505   a group *g* that did contribute to the match, the substring matched by group *g*
1506   (equivalent to ``m.group(g)``) is ::
1507
1508      m.string[m.start(g):m.end(g)]
1509
1510   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
1511   null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
1512   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
1513   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
1514
1515   An example that will remove *remove_this* from email addresses::
1516
1517      >>> email = "tony@tiremove_thisger.net"
1518      >>> m = re.search("remove_this", email)
1519      >>> email[:m.start()] + email[m.end():]
1520      'tony@tiger.net'
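
   The following sketch (the second pattern is an illustrative choice, not
   taken from any particular program) checks the slicing identity above and
   shows the ``-1`` result for a group that exists but did not contribute::

      import re

      m = re.search(r"b(c?)", "cba")
      whole = (m.start(0), m.end(0))       # (1, 2)
      empty = (m.start(1), m.end(1))       # (2, 2): group 1 matched ''
      same = m.string[m.start(1):m.end(1)] == m.group(1)   # True

      # In an alternation only one branch participates in the match.
      alt = re.match(r"(a)|(b)", "a")
      unused = (alt.start(2), alt.end(2))  # (-1, -1)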
1521
1522
1523.. method:: Match.span([group])
1524
1525   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
1526   that if *group* did not contribute to the match, this is ``(-1, -1)``.
1527   *group* defaults to zero, the entire match.
1528
1529
1530.. attribute:: Match.pos
1531
1532   The value of *pos* which was passed to the :meth:`~Pattern.search` or
1533   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
1534   the index into the string at which the RE engine started looking for a match.
1535
1536
1537.. attribute:: Match.endpos
1538
1539   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
1540   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
1541   the index into the string beyond which the RE engine will not go.
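
   As an illustration (the string and the bounds are arbitrary choices), the
   sketch below passes *pos* and *endpos* to :meth:`~Pattern.search` and reads
   them back from the match object::

      import re

      pattern = re.compile(r"o")
      m = pattern.search("dog food", 2, 6)   # Search only within "g fo".
      bounds = (m.pos, m.endpos)             # (2, 6)
      found_at = m.start()                   # 5, an index into the whole string

      # Without explicit bounds, the whole string is searched.
      d = pattern.search("dog food")
      defaults = (d.pos, d.endpos)           # (0, 8)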
1542
1543
1544.. attribute:: Match.lastindex
1545
1546   The integer index of the last matched capturing group, or ``None`` if no group
1547   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
1548   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
1549   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
1550   string.
1551
1552
1553.. attribute:: Match.lastgroup
1554
1555   The name of the last matched capturing group, or ``None`` if the group didn't
1556   have a name, or if no group was matched at all.
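
   A brief sketch of :attr:`~Match.lastindex` and :attr:`~Match.lastgroup`
   together, with hypothetical group names ``num`` and ``word``::

      import re

      # Nested groups: the outer group closes last, so lastindex is 1.
      nested = re.match(r"((a)(b))", "ab").lastindex      # 1
      flat = re.match(r"(a)(b)", "ab").lastindex          # 2

      m = re.match(r"(?P<num>\d+)|(?P<word>[a-z]+)", "abc")
      last = (m.lastindex, m.lastgroup)                   # (2, 'word')

      # A pattern without capturing groups reports None for both.
      plain = re.match(r"ab", "ab")
      nothing = (plain.lastindex, plain.lastgroup)        # (None, None)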
1557
1558
1559.. attribute:: Match.re
1560
1561   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
1562   :meth:`~Pattern.search` method produced this match instance.
1563
1564
1565.. attribute:: Match.string
1566
1567   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.
1568
1569
1570.. versionchanged:: 3.7
1571   Added support of :func:`copy.copy` and :func:`copy.deepcopy`.  Match objects
1572   are considered atomic.
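
As a small illustration of what "atomic" suggests in practice (a sketch; the
pattern and string are arbitrary), copying a match object hands back the
same object rather than a duplicate::

   import copy
   import re

   m = re.match(r"(\w+)", "Isaac")
   shallow_is_same = copy.copy(m) is m       # True
   deep_is_same = copy.deepcopy(m) is m      # True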
1573
1574
1575.. _re-examples:
1576
1577Regular Expression Examples
1578---------------------------
1579
1580
1581Checking for a Pair
1582^^^^^^^^^^^^^^^^^^^
1583
1584In this example, we'll use the following helper function to display match
1585objects a little more gracefully::
1586
1587   def displaymatch(match):
1588       if match is None:
1589           return None
1590       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
1591
1592Suppose you are writing a poker program where a player's hand is represented as
1593a 5-character string with each character representing a card, "a" for ace, "k"
1594for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
1595representing the card with that value.
1596
1597To see if a given string is a valid hand, one could do the following::
1598
1599   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
1600   >>> displaymatch(valid.match("akt5q"))  # Valid.
1601   "<Match: 'akt5q', groups=()>"
1602   >>> displaymatch(valid.match("akt5e"))  # Invalid.
1603   >>> displaymatch(valid.match("akt"))    # Invalid.
1604   >>> displaymatch(valid.match("727ak"))  # Valid.
1605   "<Match: '727ak', groups=()>"
1606
That last hand, ``"727ak"``, contains a pair, that is, two cards of the same
value.  To match this with a regular expression, one could use a backreference::
1609
1610   >>> pair = re.compile(r".*(.).*\1")
1611   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
1612   "<Match: '717', groups=('7',)>"
1613   >>> displaymatch(pair.match("718ak"))     # No pairs.
1614   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
1615   "<Match: '354aa', groups=('a',)>"
1616
1617To find out what card the pair consists of, one could use the
1618:meth:`~Match.group` method of the match object in the following manner::
1619
1620   >>> pair = re.compile(r".*(.).*\1")
1621   >>> pair.match("717ak").group(1)
1622   '7'
1623
   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'
1630
1631   >>> pair.match("354aa").group(1)
1632   'a'
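
To avoid the :exc:`AttributeError` shown above, test the result before calling
:meth:`~Match.group`.  A minimal sketch, where ``pair_card`` is a hypothetical
helper name::

   import re

   pair = re.compile(r".*(.).*\1")

   def pair_card(hand):
       """Return the card that forms a pair in *hand*, or None if there is none."""
       m = pair.match(hand)
       return m.group(1) if m is not None else None

   results = [pair_card(hand) for hand in ("717ak", "718ak", "354aa")]
   # results == ['7', None, 'a']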
1633
1634
1635Simulating scanf()
1636^^^^^^^^^^^^^^^^^^
1637
1638.. index:: single: scanf (C function)
1639
1640Python does not currently have an equivalent to :c:func:`!scanf`.  Regular
1641expressions are generally more powerful, though also more verbose, than
1642:c:func:`!scanf` format strings.  The table below offers some more-or-less
1643equivalent mappings between :c:func:`!scanf` format tokens and regular
1644expressions.
1645
1646+--------------------------------+---------------------------------------------+
1647| :c:func:`!scanf` Token         | Regular Expression                          |
1648+================================+=============================================+
1649| ``%c``                         | ``.``                                       |
1650+--------------------------------+---------------------------------------------+
1651| ``%5c``                        | ``.{5}``                                    |
1652+--------------------------------+---------------------------------------------+
1653| ``%d``                         | ``[-+]?\d+``                                |
1654+--------------------------------+---------------------------------------------+
1655| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
1656+--------------------------------+---------------------------------------------+
1657| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
1658+--------------------------------+---------------------------------------------+
1659| ``%o``                         | ``[-+]?[0-7]+``                             |
1660+--------------------------------+---------------------------------------------+
1661| ``%s``                         | ``\S+``                                     |
1662+--------------------------------+---------------------------------------------+
1663| ``%u``                         | ``\d+``                                     |
1664+--------------------------------+---------------------------------------------+
1665| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
1666+--------------------------------+---------------------------------------------+
1667
1668To extract the filename and numbers from a string like ::
1669
1670   /usr/sbin/sendmail - 0 errors, 4 warnings
1671
1672you would use a :c:func:`!scanf` format like ::
1673
1674   %s - %d errors, %d warnings
1675
1676The equivalent regular expression would be ::
1677
1678   (\S+) - (\d+) errors, (\d+) warnings
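
In Python, the expression above might be applied as follows.  This is a sketch:
the variable names are illustrative, and the :func:`int` conversions stand in
for the conversion that ``%d`` performs in C::

   import re

   line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
   m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

   # Groups are always strings; convert the counts explicitly.
   filename = m.group(1)
   n_errors = int(m.group(2))
   n_warnings = int(m.group(3))
   # filename == '/usr/sbin/sendmail', n_errors == 0, n_warnings == 4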
1679
1680
1681.. _search-vs-match:
1682
1683search() vs. match()
1684^^^^^^^^^^^^^^^^^^^^
1685
1686.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
1687
1688Python offers different primitive operations based on regular expressions:
1689
1690+ :func:`re.match` checks for a match only at the beginning of the string
1691+ :func:`re.search` checks for a match anywhere in the string
1692  (this is what Perl does by default)
+ :func:`re.fullmatch` checks whether the entire string is a match
1694
1695
1696For example::
1697
1698   >>> re.match("c", "abcdef")    # No match
1699   >>> re.search("c", "abcdef")   # Match
1700   <re.Match object; span=(2, 3), match='c'>
1701   >>> re.fullmatch("p.*n", "python") # Match
1702   <re.Match object; span=(0, 6), match='python'>
1703   >>> re.fullmatch("r.*n", "python") # No match
1704
Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::
1707
1708   >>> re.match("c", "abcdef")    # No match
1709   >>> re.search("^c", "abcdef")  # No match
1710   >>> re.search("^a", "abcdef")  # Match
1711   <re.Match object; span=(0, 1), match='a'>
1712
1713Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
1714beginning of the string, whereas using :func:`search` with a regular expression
1715beginning with ``'^'`` will match at the beginning of each line. ::
1716
1717   >>> re.match("X", "A\nB\nX", re.MULTILINE)  # No match
1718   >>> re.search("^X", "A\nB\nX", re.MULTILINE)  # Match
1719   <re.Match object; span=(4, 5), match='X'>
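
The three functions can be compared side by side on the same input.  A short
sketch, with an arbitrarily chosen pattern and strings::

   import re

   digits = r"\d+"
   at_start = re.match(digits, "123abc")        # Matches '123' at the start.
   anywhere = re.search(digits, "abc123")       # Matches '123' at index 3.
   not_at_start = re.match(digits, "abc123")    # None: no digits at index 0.
   partial = re.fullmatch(digits, "123abc")     # None: 'abc' is left over.
   entire = re.fullmatch(digits, "123")         # Matches the entire string.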
1720
1721
1722Making a Phonebook
1723^^^^^^^^^^^^^^^^^^
1724
:func:`split` splits a string into a list, using the passed pattern as the
delimiter.  The function is invaluable for converting textual data into data
structures that Python can easily read and modify, as the following phonebook
example demonstrates.
1729
First, here is the input.  Normally it would come from a file; here we use
triple-quoted string syntax:
1732
1733.. doctest::
1734
1735   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
1736   ...
1737   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
1738   ... Frank Burger: 925.541.7625 662 South Dogwood Way
1739   ...
1740   ...
1741   ... Heather Albrecht: 548.326.4584 919 Park Place"""
1742
1743The entries are separated by one or more newlines. Now we convert the string
1744into a list with each nonempty line having its own entry:
1745
1746.. doctest::
1747   :options: +NORMALIZE_WHITESPACE
1748
1749   >>> entries = re.split("\n+", text)
1750   >>> entries
1751   ['Ross McFluff: 834.345.1254 155 Elm Street',
1752   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
1753   'Frank Burger: 925.541.7625 662 South Dogwood Way',
1754   'Heather Albrecht: 548.326.4584 919 Park Place']
1755
1756Finally, split each entry into a list with first name, last name, telephone
1757number, and address.  We use the ``maxsplit`` parameter of :func:`split`
1758because the address has spaces, our splitting pattern, in it:
1759
1760.. doctest::
1761   :options: +NORMALIZE_WHITESPACE
1762
1763   >>> [re.split(":? ", entry, maxsplit=3) for entry in entries]
1764   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
1765   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
1766   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
1767   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
1768
1769The ``:?`` pattern matches the colon after the last name, so that it does not
1770occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
1771house number from the street name:
1772
1773.. doctest::
1774   :options: +NORMALIZE_WHITESPACE
1775
1776   >>> [re.split(":? ", entry, maxsplit=4) for entry in entries]
1777   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
1778   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
1779   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
1780   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
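
From here, the lists can be turned into a lookup table.  The sketch below is
one possible continuation; the dictionary layout (full name mapped to a
``(phone, address)`` tuple) is an assumption made for illustration, and only
two of the entries are repeated to keep it short::

   import re

   entries = [
       "Ross McFluff: 834.345.1254 155 Elm Street",
       "Heather Albrecht: 548.326.4584 919 Park Place",
   ]

   phonebook = {}
   for entry in entries:
       # maxsplit=3 keeps the address, which contains spaces, in one piece.
       first, last, phone, address = re.split(":? ", entry, maxsplit=3)
       phonebook[f"{first} {last}"] = (phone, address)

   # phonebook["Ross McFluff"] == ('834.345.1254', '155 Elm Street')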
1781
1782
1783Text Munging
1784^^^^^^^^^^^^
1785
1786:func:`sub` replaces every occurrence of a pattern with a string or the
1787result of a function.  This example demonstrates using :func:`sub` with
1788a function to "munge" text, or randomize the order of all the characters
1789in each word of a sentence except for the first and last characters::
1790
   >>> import random
   >>> def repl(m):
1792   ...     inner_word = list(m.group(2))
1793   ...     random.shuffle(inner_word)
1794   ...     return m.group(1) + "".join(inner_word) + m.group(3)
1795   ...
1796   >>> text = "Professor Abdolmalek, please report your absences promptly."
1797   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
1798   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
1799   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
1800   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
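
Because the shuffle is random, the output differs between runs.  For a
repeatable result, one option is to shuffle with a seeded
:class:`random.Random` instance.  In this sketch the seed ``0`` and the helper
name ``munge`` are arbitrary choices::

   import random
   import re

   def munge(text, seed=0):
       rng = random.Random(seed)      # Private generator, reproducible.

       def repl(m):
           inner_word = list(m.group(2))
           rng.shuffle(inner_word)
           return m.group(1) + "".join(inner_word) + m.group(3)

       return re.sub(r"(\w)(\w+)(\w)", repl, text)

   text = "Professor Abdolmalek, please report your absences promptly."
   munged = munge(text)
   # The same seed gives the same result on the same Python version.
   repeatable = munged == munge(text)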
1801
1802
1803Finding all Adverbs
1804^^^^^^^^^^^^^^^^^^^
1805
1806:func:`findall` matches *all* occurrences of a pattern, not just the first
1807one as :func:`search` does.  For example, if a writer wanted to
1808find all of the adverbs in some text, they might use :func:`findall` in
1809the following manner::
1810
1811   >>> text = "He was carefully disguised but captured quickly by police."
1812   >>> re.findall(r"\w+ly\b", text)
1813   ['carefully', 'quickly']
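
If the pattern contains a capturing group, :func:`findall` returns the text of
the group instead of the whole match.  A sketch that reuses the sentence above
to list the stem of each adverb (the added group is an illustrative variation)::

   import re

   text = "He was carefully disguised but captured quickly by police."
   adverbs = re.findall(r"\w+ly\b", text)      # ['carefully', 'quickly']
   stems = re.findall(r"(\w+)ly\b", text)      # ['careful', 'quick']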
1814
1815
1816Finding all Adverbs and their Positions
1817^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1818
1819If one wants more information about all matches of a pattern than the matched
1820text, :func:`finditer` is useful as it provides :class:`~re.Match` objects
1821instead of strings.  Continuing with the previous example, if a writer wanted
1822to find all of the adverbs *and their positions* in some text, they would use
1823:func:`finditer` in the following manner::
1824
1825   >>> text = "He was carefully disguised but captured quickly by police."
1826   >>> for m in re.finditer(r"\w+ly\b", text):
1827   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
1828   07-16: carefully
1829   40-47: quickly
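
The same information can be collected into a list for later use rather than
printed.  A sketch (the tuple layout is an illustrative choice)::

   import re

   text = "He was carefully disguised but captured quickly by police."
   positions = [
       (m.start(), m.end(), m.group(0))
       for m in re.finditer(r"\w+ly\b", text)
   ]
   # positions == [(7, 16, 'carefully'), (40, 47, 'quickly')]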
1830
1831
1832Raw String Notation
1833^^^^^^^^^^^^^^^^^^^
1834
1835Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
1836every backslash (``'\'``) in a regular expression would have to be prefixed with
1837another one to escape it.  For example, the two following lines of code are
1838functionally identical::
1839
1840   >>> re.match(r"\W(.)\1\W", " ff ")
1841   <re.Match object; span=(0, 4), match=' ff '>
1842   >>> re.match("\\W(.)\\1\\W", " ff ")
1843   <re.Match object; span=(0, 4), match=' ff '>
1844
1845When one wants to match a literal backslash, it must be escaped in the regular
1846expression.  With raw string notation, this means ``r"\\"``.  Without raw string
1847notation, one must use ``"\\\\"``, making the following lines of code
1848functionally identical::
1849
1850   >>> re.match(r"\\", r"\\")
1851   <re.Match object; span=(0, 1), match='\\'>
1852   >>> re.match("\\\\", r"\\")
1853   <re.Match object; span=(0, 1), match='\\'>
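
The difference is visible in the string values themselves.  A brief sketch
comparing raw and ordinary literals::

   raw_newline = r"\n"                    # Two characters: backslash, 'n'.
   newline = "\n"                         # One character: a newline.
   lengths = (len(raw_newline), len(newline))    # (2, 1)

   # Both spellings below produce the same two-character pattern string.
   same_pattern = r"\\" == "\\\\"         # True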
1854
1855
1856Writing a Tokenizer
1857^^^^^^^^^^^^^^^^^^^
1858
1859A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
1860analyzes a string to categorize groups of characters.  This is a useful first
1861step in writing a compiler or interpreter.
1862
1863The text categories are specified with regular expressions.  The technique is
1864to combine those into a single master regular expression and to loop over
1865successive matches::
1866
1867    from typing import NamedTuple
1868    import re
1869
1870    class Token(NamedTuple):
1871        type: str
1872        value: str
1873        line: int
1874        column: int
1875
1876    def tokenize(code):
1877        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
1878        token_specification = [
1879            ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
1880            ('ASSIGN',   r':='),           # Assignment operator
1881            ('END',      r';'),            # Statement terminator
1882            ('ID',       r'[A-Za-z]+'),    # Identifiers
1883            ('OP',       r'[+\-*/]'),      # Arithmetic operators
1884            ('NEWLINE',  r'\n'),           # Line endings
1885            ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
1886            ('MISMATCH', r'.'),            # Any other character
1887        ]
1888        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
1889        line_num = 1
1890        line_start = 0
1891        for mo in re.finditer(tok_regex, code):
1892            kind = mo.lastgroup
1893            value = mo.group()
1894            column = mo.start() - line_start
1895            if kind == 'NUMBER':
1896                value = float(value) if '.' in value else int(value)
1897            elif kind == 'ID' and value in keywords:
1898                kind = value
1899            elif kind == 'NEWLINE':
1900                line_start = mo.end()
1901                line_num += 1
1902                continue
1903            elif kind == 'SKIP':
1904                continue
1905            elif kind == 'MISMATCH':
1906                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
1907            yield Token(kind, value, line_num, column)
1908
1909    statements = '''
1910        IF quantity THEN
1911            total := total + price * quantity;
1912            tax := price * 0.05;
1913        ENDIF;
1914    '''
1915
1916    for token in tokenize(statements):
1917        print(token)
1918
1919The tokenizer produces the following output::
1920
1921    Token(type='IF', value='IF', line=2, column=4)
1922    Token(type='ID', value='quantity', line=2, column=7)
1923    Token(type='THEN', value='THEN', line=2, column=16)
1924    Token(type='ID', value='total', line=3, column=8)
1925    Token(type='ASSIGN', value=':=', line=3, column=14)
1926    Token(type='ID', value='total', line=3, column=17)
1927    Token(type='OP', value='+', line=3, column=23)
1928    Token(type='ID', value='price', line=3, column=25)
1929    Token(type='OP', value='*', line=3, column=31)
1930    Token(type='ID', value='quantity', line=3, column=33)
1931    Token(type='END', value=';', line=3, column=41)
1932    Token(type='ID', value='tax', line=4, column=8)
1933    Token(type='ASSIGN', value=':=', line=4, column=12)
1934    Token(type='ID', value='price', line=4, column=15)
1935    Token(type='OP', value='*', line=4, column=21)
1936    Token(type='NUMBER', value=0.05, line=4, column=23)
1937    Token(type='END', value=';', line=4, column=27)
1938    Token(type='ENDIF', value='ENDIF', line=5, column=4)
1939    Token(type='END', value=';', line=5, column=9)
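
The core of the technique, one named group per category joined with ``|`` and
:attr:`~Match.lastgroup` to tell which alternative matched, can be seen in a
smaller sketch.  The three categories and the input string here are
hypothetical and much simpler than the example above::

    import re

    token_specification = [
        ('NUMBER', r'\d+'),       # Integer
        ('OP',     r'[+\-*/]'),   # Arithmetic operators
        ('SKIP',   r'\s+'),       # Whitespace, discarded below
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)

    tokens = [
        (mo.lastgroup, mo.group())
        for mo in re.finditer(tok_regex, '12 + 34*5')
        if mo.lastgroup != 'SKIP'
    ]
    # [('NUMBER', '12'), ('OP', '+'), ('NUMBER', '34'), ('OP', '*'), ('NUMBER', '5')]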
1940
1941
1942.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
1943   Media, 2009. The third edition of the book no longer covers Python at all,
1944   but the first edition covered writing good regular expression patterns in
1945   great detail.
1946