:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re.py`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning.  This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal. Also, please note that any invalid escape sequences in Python's
usage of the backslash in string literals now generate a :exc:`DeprecationWarning`
and in the future this will become a :exc:`SyntaxError`. This behaviour
will happen even if it is a valid escape sequence for a regular expression.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline.  Usually patterns will be expressed in Python code using this raw
string notation.
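
The difference between the two notations can be checked directly.  The
following is a minimal sketch; ``sample_path`` is a hypothetical value chosen
only for illustration:

.. code-block:: python

   import re

   # A regular literal turns "\n" into a single newline character, while
   # the raw literal keeps the backslash and the letter "n".
   newline_length = len("\n")   # 1
   raw_length = len(r"\n")      # 2

   # Both spellings produce the same two-character pattern, which matches
   # one literal backslash.
   plain = re.compile('\\\\')
   raw = re.compile(r'\\')
   same_pattern = plain.pattern == raw.pattern   # True

   sample_path = r'C:\temp'     # hypothetical input
   found = raw.search(sample_path).group(0)      # a single backslash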

It is important to note that most regular expression operations are available as
module-level functions and methods on
:ref:`compiled regular expressions <re-objects>`.  The functions are shortcuts
that don't require you to compile a regex object first, but miss some
fine-tuning parameters.

.. seealso::

   The third-party `regex <https://pypi.org/project/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*.  This holds unless *A* or *B* contains low-precedence
operations, there are boundary conditions between *A* and *B*, or either has
numbered group references.  Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here.  For details of
the theory and implementation of regular expressions, consult the Friedl book
[Frie09]_, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows.  For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves.  You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``.  (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


The special characters are:

.. index:: single: . (dot); in regular expressions

``.``
   (Dot.)  In the default mode, this matches any character except a newline.  If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

.. index:: single: ^ (caret); in regular expressions

``^``
   (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

.. index:: single: $ (dollar); in regular expressions

``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'.  More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.
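
   The cases described in this entry can be reproduced with a short sketch
   (the sample strings are the ones used above):

   .. code-block:: python

      import re

      text = 'foo1\nfoo2\n'

      # Default mode: $ matches only at the very end of the string or just
      # before a trailing newline, so the last line is found.
      default_match = re.search(r'foo.$', text).group(0)                  # 'foo2'

      # MULTILINE mode: $ also matches before every newline, so the first
      # line is found.
      multiline_match = re.search(r'foo.$', text, re.MULTILINE).group(0)  # 'foo1'

      # A lone $ matches twice in 'foo\n': before the newline and at the end.
      positions = [m.start() for m in re.finditer(r'$', 'foo\n')]         # [3, 4]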

.. index:: single: * (asterisk); in regular expressions

``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible.  ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

.. index:: single: + (plus); in regular expressions

``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

.. index:: single: ? (question mark); in regular expressions

``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

.. index::
   single: *?; in regular expressions
   single: +?; in regular expressions
   single: ??; in regular expressions

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible.  Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
   string, and not just ``'<a>'``.  Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched.  Using the RE ``<.*?>`` will match
   only ``'<a>'``.
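
   A minimal sketch of the difference, using the sample string from this
   entry:

   .. code-block:: python

      import re

      text = '<a> b <c>'

      # Greedy: .* runs as far as the last '>' in the string.
      greedy = re.match(r'<.*>', text).group(0)    # '<a> b <c>'

      # Non-greedy: .*? stops at the first '>' that lets the match succeed.
      lazy = re.match(r'<.*?>', text).group(0)     # '<a>'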

.. index::
   single: {} (curly brackets); in regular expressions

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match.  For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible.  For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
   lower bound of zero,  and omitting *n* specifies an infinite upper bound.  As an
   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible.  This is the
   non-greedy version of the previous qualifier.  For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.
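
   The three counted forms can be compared in one short sketch; the variable
   names are illustrative only:

   .. code-block:: python

      import re

      six = 'aaaaaa'

      # Greedy and non-greedy bounded repetition on the same input.
      greedy = re.match(r'a{3,5}', six).group(0)     # 'aaaaa'
      lazy = re.match(r'a{3,5}?', six).group(0)      # 'aaa'

      # {m} demands exactly m copies; five 'a' characters are not enough.
      exact = re.fullmatch(r'a{6}', six) is not None           # True
      too_few = re.fullmatch(r'a{6}', 'aaaaa') is not None     # False

      # {4,} has no upper bound but still needs at least four copies.
      long_run = re.fullmatch(r'a{4,}b', 'a' * 1000 + 'b') is not None   # True
      short_run = re.fullmatch(r'a{4,}b', 'aaab') is not None            # False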

.. index:: single: \ (backslash); in regular expressions

``\``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string.  However, if Python would
   recognize the resulting sequence, the backslash should be repeated twice.  This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

.. index::
   single: [] (square brackets); in regular expressions

``[]``
   Used to indicate a set of characters.  In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   .. index:: single: - (minus); in regular expressions

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit.  If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets.  For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   .. index:: single: \ (backslash); in regular expressions

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`ASCII` or :const:`LOCALE` mode is in force.

   .. index:: single: ^ (caret); in regular expressions

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set.  If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched.  For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``.  ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set.  For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` match any single parenthesis, square bracket, or brace.

   .. .. index:: single: --; in regular expressions
   .. .. index:: single: &&; in regular expressions
   .. .. index:: single: ~~; in regular expressions
   .. .. index:: single: ||; in regular expressions

   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future.  This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``.  To
     avoid a warning, escape them with a backslash.

   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/

   .. versionchanged:: 3.7
      :exc:`FutureWarning` is raised if a character set contains constructs
      that will change semantically in the future.
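
   The set rules listed in this entry can be exercised with a short sketch.
   The input strings are hypothetical samples, not part of the module:

   .. code-block:: python

      import re

      # Individually listed characters.
      listed = re.findall(r'[amk]', 'make')                    # ['m', 'a', 'k']

      # Ranges: two-digit numbers from 00 to 59.
      in_range = re.fullmatch(r'[0-5][0-9]', '59') is not None     # True
      out_of_range = re.fullmatch(r'[0-5][0-9]', '60') is not None # False

      # An escaped '-' is a literal hyphen rather than a range operator.
      hyphen = re.findall(r'[a\-z]', 'a-z')                    # ['a', '-', 'z']

      # Special characters are literal inside a set.
      specials = re.findall(r'[(+*)]', '(1+2)*3')              # ['(', '+', ')', '*']

      # A leading '^' complements the set.
      only_fives = re.sub(r'[^5]', '', '15a5')                 # '55'
      not_caret = re.findall(r'[^^]', 'a^b')                   # ['a', 'b']

      # Two equivalent ways to include a literal ']'.
      escaped = re.findall(r'[()[\]{}]', 'f(x)[0]{}')
      leading = re.findall(r'[]()[{}]', 'f(x)[0]{}')
      # both give ['(', ')', '[', ']', '{', '}']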

.. index:: single: | (vertical bar); in regular expressions

``|``
   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
   will match either *A* or *B*.  An arbitrary number of REs can be separated by the
   ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once *A* matches, *B* will not be tested further, even if it would
   produce a longer overall match.  In other words, the ``'|'`` operator is never
   greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.
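
   A small sketch of the left-to-right rule; the patterns are chosen only to
   make the ordering effect visible:

   .. code-block:: python

      import re

      # The first alternative that matches wins, even though the second
      # would produce a longer match.
      first_wins = re.match(r'a|ab', 'ab').group(0)    # 'a'

      # Listing the longer alternative first changes the result.
      reordered = re.match(r'ab|a', 'ab').group(0)     # 'ab'

      # Two ways to match a literal '|'.
      escaped = re.split(r'\|', 'x|y')                 # ['x', 'y']
      in_class = re.split(r'[|]', 'x|y')               # ['x', 'y']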

.. index::
   single: () (parentheses); in regular expressions

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below.  To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

.. index:: single: (?; in regular expressions

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise).  The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.)  The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
   for the entire regular expression.
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flags* argument to the
   :func:`re.compile` function.  Flags should be used first in the
   expression string.

.. index:: single: (?:; in regular expressions

``(?:...)``
   A non-capturing version of regular parentheses.  Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
   one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags:
   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
   and :const:`re.X` (verbose), for that part of the expression.
   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
   as inline flags, so they can't be combined or follow ``'-'``.  Instead,
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group.  In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default).  In bytes patterns ``(?L:...)`` switches to locale-dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.

   .. versionadded:: 3.6

   .. versionchanged:: 3.7
      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
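
   The scoping of inline flags can be sketched as follows; the sample words
   are hypothetical, and ``'\u00e9'`` is the letter "e" with an acute accent:

   .. code-block:: python

      import re

      # (?i:...) ignores case only inside the group.
      m1 = re.fullmatch(r'(?i:hello) World', 'HELLO World')       # matches
      m2 = re.fullmatch(r'(?i:hello) World', 'HELLO world')       # None

      # (?-i:...) switches a globally set flag off inside the group.
      m3 = re.fullmatch(r'(?i)hello (?-i:World)', 'HELLO World')  # matches
      m4 = re.fullmatch(r'(?i)hello (?-i:World)', 'hello world')  # None

      # (?a:...) makes \w ASCII-only inside the group of a str pattern.
      m5 = re.fullmatch(r'\w(?a:\w)', '\u00e9e')                  # matches
      m6 = re.fullmatch(r'\w(?a:\w)', '\u00e9\u00e9')             # None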

.. index:: single: (?P<; in regular expressions

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*.  Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression.  A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts.  If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
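
   The three contexts in the table can be tried with the same pattern.  This
   is a minimal sketch; the input ``text`` is a hypothetical sample:

   .. code-block:: python

      import re

      pattern = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')
      text = 'say "hi"'

      # Context 2: reading the group from a match object.
      m = pattern.search(text)
      whole = m.group(0)             # '"hi"'
      by_name = m.group('quote')     # '"'
      by_number = m.group(1)         # '"' (a named group is also numbered)
      quote_end = m.end('quote')     # 5

      # Context 3: the three equivalent references in a replacement string.
      # Each call replaces the quoted text with just its opening quote.
      sub_name = pattern.sub(r'\g<quote>', text)    # 'say "'
      sub_gnum = pattern.sub(r'\g<1>', text)        # 'say "'
      sub_num = pattern.sub(r'\1', text)            # 'say "'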

.. index:: single: (?P=; in regular expressions

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

.. index:: single: (?#; in regular expressions

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

.. index:: single: (?=; in regular expressions

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string.  This is
   called a :dfn:`lookahead assertion`.  For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

.. index:: single: (?!; in regular expressions

``(?!...)``
   Matches if ``...`` doesn't match next.  This is a :dfn:`negative lookahead assertion`.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.
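
   The positive and negative lookahead forms can be compared side by side.
   A minimal sketch; ``'Isaac Newton'`` is a hypothetical counterexample:

   .. code-block:: python

      import re

      # Positive lookahead: 'Asimov' must follow, but is not consumed.
      pos_hit = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
      pos_miss = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')   # None
      consumed = pos_hit.group(0)                                 # 'Isaac '

      # Negative lookahead: 'Asimov' must not follow.
      neg_hit = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
      neg_miss = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')   # None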

.. index:: single: (?<=; in regular expressions

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position.  This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

   .. versionchanged:: 3.5
      Added support for group references of fixed length.

.. index:: single: (?<!; in regular expressions

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length.  Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.
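
   As a counterpart to the hyphen example in the previous entry, the sketch
   below finds words that do *not* follow a hyphen.  The sample string is
   hypothetical:

   .. code-block:: python

      import re

      # \b keeps the match at the start of a word; (?<!-) rejects words
      # whose preceding character is a hyphen.
      words = re.findall(r'(?<!-)\b\w+', 'spam-egg ham')    # ['spam', 'ham']

      # Unlike a positive lookbehind, the pattern can match at position 0,
      # where nothing precedes the word.
      at_start = re.match(r'(?<!-)\w+', 'spam').group(0)    # 'spam'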

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.
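
   The four cases named above can be checked with a short sketch; the
   variable names are illustrative only:

   .. code-block:: python

      import re

      email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

      # Group 1 matched '<', so the yes-pattern '>' is required.
      bracketed = email.match('<user@host.com>')
      # Group 1 did not take part, so the no-pattern '$' is required.
      bare = email.match('user@host.com')

      # Unbalanced brackets fail in both directions.
      missing_close = email.match('<user@host.com')    # None
      stray_close = email.match('user@host.com>')      # None

      address = bracketed.group(2)                     # 'user@host.com'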


The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character.  For example, ``\$`` matches the
character ``'$'``.

.. index:: single: \ (backslash); in regular expressions

``\number``
   Matches the contents of the group of the same number.  Groups are numbered
   starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group).  This special sequence
   can only be used to match one of the first 99 groups.  If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.
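
   A minimal sketch of both readings of ``\number``; the sentence used for
   the repeated-word search is a hypothetical sample:

   .. code-block:: python

      import re

      # Group reference: the text after the space must repeat group 1.
      repeated = re.fullmatch(r'(.+) \1', 'the the') is not None    # True
      digits = re.fullmatch(r'(.+) \1', '55 55') is not None        # True
      no_space = re.fullmatch(r'(.+) \1', 'thethe') is not None     # False

      # A typical use: finding an accidentally doubled word.
      doubled = re.search(r'\b(\w+) \1\b', 'Paris in the the spring').group(1)
      # doubled == 'the'

      # Octal reading: three octal digits, or a leading 0, name a character.
      octal_a = re.fullmatch(r'\101', 'A') is not None      # True (0o101 == 65)
      octal_zero = re.fullmatch(r'\060', '0') is not None   # True (0o060 == 48)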

.. index:: single: \A; in regular expressions

``\A``
   Matches only at the start of the string.

.. index:: single: \b; in regular expressions

``\b``
   Matches the empty string, but only at the beginning or end of a word.
   A word is defined as a sequence of word characters.  Note that formally,
   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
   (or vice versa), or between ``\w`` and the beginning/end of the string.
   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.

   By default Unicode alphanumerics are the ones used in Unicode patterns, but
   this can be changed by using the :const:`ASCII` flag.  Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.
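
   The examples in this entry can be verified with a short sketch; the
   ``candidates`` list simply collects the strings quoted above:

   .. code-block:: python

      import re

      word = re.compile(r'\bfoo\b')
      candidates = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']

      # Keep only the strings in which 'foo' appears as a whole word.
      hits = [s for s in candidates if word.search(s)]
      # hits == ['foo', 'foo.', '(foo)', 'bar foo baz']

      # Inside a character class, \b is the backspace character (0x08).
      backspace = re.fullmatch(r'[\b]', '\x08') is not None    # True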

.. index:: single: \B; in regular expressions

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end
   of a word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
   patterns are Unicode alphanumerics or the underscore, although this can
   be changed by using the :const:`ASCII` flag.  Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.

.. index:: single: \d; in regular expressions

``\d``
   For Unicode (str) patterns:
      Matches any Unicode decimal digit (that is, any character in
      Unicode character category [Nd]).  This includes ``[0-9]``, and
      also many other digit characters.  If the :const:`ASCII` flag is
      used only ``[0-9]`` is matched.

   For 8-bit (bytes) patterns:
      Matches any decimal digit; this is equivalent to ``[0-9]``.

.. index:: single: \D; in regular expressions

``\D``
   Matches any character which is not a decimal digit. This is
   the opposite of ``\d``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^0-9]``.

.. index:: single: \s; in regular expressions

``\s``
   For Unicode (str) patterns:
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
      languages). If the :const:`ASCII` flag is used, only
      ``[ \t\n\r\f\v]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
      this is equivalent to ``[ \t\n\r\f\v]``.

.. index:: single: \S; in regular expressions

``\S``
   Matches any character which is not a whitespace character. This is
   the opposite of ``\s``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^ \t\n\r\f\v]``.

.. index:: single: \w; in regular expressions

``\w``
   For Unicode (str) patterns:
      Matches Unicode word characters; this includes most characters
      that can be part of a word in any language, as well as numbers and
      the underscore. If the :const:`ASCII` flag is used, only
      ``[a-zA-Z0-9_]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
      this is equivalent to ``[a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
      used, matches characters considered alphanumeric in the current locale
      and the underscore.
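
   The effect of the matching mode on ``\w`` can be sketched as follows.
   The sample text is hypothetical; its accented letters are written as
   escapes so that the source stays ASCII-only:

   .. code-block:: python

      import re

      text = 'na\u00efve caf\u00e9_1'

      # Default Unicode matching: accented letters are word characters.
      unicode_words = re.findall(r'\w+', text)
      # ['na\u00efve', 'caf\u00e9_1']

      # re.ASCII: accented letters split the words.
      ascii_words = re.findall(r'\w+', text, re.ASCII)
      # ['na', 've', 'caf', '_1']

      # bytes patterns use the ASCII definition as well.
      bytes_words = re.findall(rb'\w+', text.encode('utf-8'))
      # [b'na', b've', b'caf', b'_1']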

.. index:: single: \W; in regular expressions

``\W``
   Matches any character which is not a word character. This is
   the opposite of ``\w``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
   used, matches characters which are neither alphanumeric in the current locale
   nor the underscore.

.. index:: single: \Z; in regular expressions

``\Z``
   Matches only at the end of the string.

.. index::
   single: \a; in regular expressions
   single: \b; in regular expressions
   single: \f; in regular expressions
   single: \n; in regular expressions
   single: \N; in regular expressions
   single: \r; in regular expressions
   single: \t; in regular expressions
   single: \u; in regular expressions
   single: \U; in regular expressions
   single: \v; in regular expressions
   single: \x; in regular expressions
   single: \\; in regular expressions

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \N      \r      \t      \u
   \U      \v      \x      \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
patterns.  In bytes patterns they are errors.  Unknown escapes of ASCII
letters are reserved for future use and treated as errors.

Octal escapes are included in a limited form.  If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference.  As for string literals, octal escapes are always at most
three digits in length.

.. versionchanged:: 3.3
   The ``'\u'`` and ``'\U'`` escape sequences have been added.

.. versionchanged:: 3.6
   Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.

.. versionchanged:: 3.8
   The ``'\N{name}'`` escape sequence has been added. As in string literals,
   it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).


.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions.  Most non-trivial applications use the compiled
form.

.. versionchanged:: 3.6
   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
   :class:`enum.IntFlag`.

.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a :ref:`regular expression object
   <re-objects>`, which can be used for matching using its
   :func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
   below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.compile` and the module-level matching functions are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.
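
   A minimal sketch of reusing a compiled pattern and of combining flags
   with ``|``.  The ``lines`` list and the patterns are hypothetical
   samples:

   .. code-block:: python

      import re

      lines = ['id 42', 'none', '7 apples']

      # Compile once, then reuse the pattern object for every input.
      prog = re.compile(r'\d+')
      first_numbers = [m.group(0) if m else None
                       for m in map(prog.search, lines)]
      # first_numbers == ['42', None, '7']

      # The module-level function gives the same result for each call.
      same = first_numbers == [m.group(0) if m else None
                               for m in (re.search(r'\d+', s) for s in lines)]

      # Several flags are combined with bitwise OR.
      starts = re.compile(r'^id', re.IGNORECASE | re.MULTILINE)
      found = starts.findall('ID 1\nid 2\nno id')    # ['ID', 'id']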


.. data:: A
          ASCII

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching.  This is only
   meaningful for Unicode patterns, and is ignored for byte patterns.
   Corresponds to the inline flag ``(?a)``.

   Note that for backward compatibility, the :const:`re.U` flag still
   exists (as well as its synonym :const:`re.UNICODE` and its embedded
   counterpart ``(?u)``), but these are redundant in Python 3 since
   matches are Unicode by default for strings (and Unicode matching
   isn't allowed for bytes).


.. data:: DEBUG

   Display debug information about the compiled expression.
   No corresponding inline flag.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
   match lowercase letters.  Full Unicode matching (such as ``Ü`` matching
   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
   non-ASCII matches.  The current locale does not change the effect of this
   flag unless the :const:`re.LOCALE` flag is also used.
   Corresponds to the inline flag ``(?i)``.

   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
   combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
   and 'A' to 'Z' are matched.
679
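The interaction between :const:`IGNORECASE` and :const:`ASCII` described above can be sketched as follows. This is only an illustration; the variable names are made up, and the characters are written as escapes so the source stays ASCII-only.

```python
import re

# Full Unicode case-insensitive matching is the default for str patterns:
# U+00DC (capital U with diaeresis) matches U+00FC (its lowercase form).
unicode_hit = re.fullmatch("\u00dc", "\u00fc", re.IGNORECASE)

# With re.ASCII, only ASCII letters are matched case-insensitively.
ascii_miss = re.fullmatch("\u00dc", "\u00fc", re.IGNORECASE | re.ASCII)

# U+212A (Kelvin sign) is one of the four extra letters matched by [a-z].
kelvin_hit = re.fullmatch(r"[a-z]", "\u212a", re.IGNORECASE)
kelvin_miss = re.fullmatch(r"[a-z]", "\u212a", re.IGNORECASE | re.ASCII)
```
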
680.. data:: L
681          LOCALE
682
683   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
684   dependent on the current locale.  This flag can be used only with bytes
   patterns.  The use of this flag is discouraged because the locale mechanism
   is very unreliable: it handles only one "culture" at a time and works only
   with 8-bit locales.  Unicode matching is already enabled by default
688   in Python 3 for Unicode (str) patterns, and it is able to handle different
689   locales/languages.
690   Corresponds to the inline flag ``(?L)``.
691
692   .. versionchanged:: 3.6
693      :const:`re.LOCALE` can be used only with bytes patterns and is
694      not compatible with :const:`re.ASCII`.
695
696   .. versionchanged:: 3.7
697      Compiled regular expression objects with the :const:`re.LOCALE` flag no
698      longer depend on the locale at compile time.  Only the locale at
699      matching time affects the result of matching.
700
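The restrictions on :const:`re.LOCALE` noted above (bytes patterns only, and not together with :const:`re.ASCII`) surface as :exc:`ValueError` at compile time. The sketch below assumes Python 3.6 or later; ``compile_error`` is a hypothetical helper written for this example, not part of the module.

```python
import re

def compile_error(pattern, flags):
    """Return the ValueError message raised by re.compile(), or None."""
    try:
        re.compile(pattern, flags)
    except ValueError as exc:
        return str(exc)
    return None

# LOCALE with a str pattern is rejected.
str_pattern_error = compile_error(r"\w+", re.LOCALE)

# LOCALE combined with ASCII is rejected even for a bytes pattern.
mixed_flags_error = compile_error(rb"\w+", re.LOCALE | re.ASCII)

# LOCALE with a bytes pattern on its own compiles without error.
bytes_pattern_error = compile_error(rb"\w+", re.LOCALE)
```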
701
702.. data:: M
703          MULTILINE
704
705   When specified, the pattern character ``'^'`` matches at the beginning of the
706   string and at the beginning of each line (immediately following each newline);
707   and the pattern character ``'$'`` matches at the end of the string and at the
708   end of each line (immediately preceding each newline).  By default, ``'^'``
709   matches only at the beginning of the string, and ``'$'`` only at the end of the
710   string and immediately before the newline (if any) at the end of the string.
711   Corresponds to the inline flag ``(?m)``.
712
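A small sketch of the difference :const:`MULTILINE` makes to ``'^'`` and ``'$'``, using an invented two-line string:

```python
import re

# Hypothetical two-line input ending with a newline.
text = "one two\nthree four\n"

# Without the flag, '^' matches only at the very start of the string ...
first_default = re.findall(r"^\w+", text)
# ... and '$' only at the end (or just before a trailing newline).
last_default = re.findall(r"\w+$", text)

# With the flag, both anchors also apply to every line.
first_multiline = re.findall(r"^\w+", text, re.MULTILINE)
last_multiline = re.findall(r"\w+$", text, re.MULTILINE)
```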
713
714.. data:: S
715          DOTALL
716
717   Make the ``'.'`` special character match any character at all, including a
718   newline; without this flag, ``'.'`` will match anything *except* a newline.
719   Corresponds to the inline flag ``(?s)``.
720
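For instance, the effect of :const:`DOTALL` can be seen on a string that contains a newline (the sample string is made up for this sketch):

```python
import re

text = "first\nsecond"

# Without DOTALL, '.' stops at the newline.
default_match = re.match(r".+", text).group()

# With DOTALL, '.' also consumes the newline and everything after it.
dotall_match = re.match(r".+", text, re.DOTALL).group()
```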
721
722.. data:: X
723          VERBOSE
724
725   .. index:: single: # (hash); in regular expressions
726
727   This flag allows you to write regular expressions that look nicer and are
728   more readable by allowing you to visually separate logical sections of the
729   pattern and add comments. Whitespace within the pattern is ignored, except
730   when in a character class, or when preceded by an unescaped backslash,
731   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
732   When a line contains a ``#`` that is not in a character class and is not
733   preceded by an unescaped backslash, all characters from the leftmost such
734   ``#`` through the end of the line are ignored.
735
736   This means that the two following regular expression objects that match a
737   decimal number are functionally equal::
738
739      a = re.compile(r"""\d +  # the integral part
740                         \.    # the decimal point
741                         \d *  # some fractional digits""", re.X)
742      b = re.compile(r"\d+\.\d*")
743
744   Corresponds to the inline flag ``(?x)``.
745
746
747.. function:: search(pattern, string, flags=0)
748
749   Scan through *string* looking for the first location where the regular expression
750   *pattern* produces a match, and return a corresponding :ref:`match object
751   <match-objects>`.  Return ``None`` if no position in the string matches the
752   pattern; note that this is different from finding a zero-length match at some
753   point in the string.
754
755
756.. function:: match(pattern, string, flags=0)
757
758   If zero or more characters at the beginning of *string* match the regular
759   expression *pattern*, return a corresponding :ref:`match object
760   <match-objects>`.  Return ``None`` if the string does not match the pattern;
761   note that this is different from a zero-length match.
762
763   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
764   at the beginning of the string and not at the beginning of each line.
765
766   If you want to locate a match anywhere in *string*, use :func:`search`
767   instead (see also :ref:`search-vs-match`).
768
769
770.. function:: fullmatch(pattern, string, flags=0)
771
772   If the whole *string* matches the regular expression *pattern*, return a
773   corresponding :ref:`match object <match-objects>`.  Return ``None`` if the
774   string does not match the pattern; note that this is different from a
775   zero-length match.
776
777   .. versionadded:: 3.4
778
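The three functions above differ only in where the match must start and end. The following sketch, with an invented sample string, puts them side by side and also shows why a zero-length match is not the same as ``None``:

```python
import re

text = "abc123"

search_hit = re.search(r"\d+", text)       # scans forward and finds "123"
match_miss = re.match(r"\d+", text)        # None: no digits at index 0
prefix_hit = re.match(r"[a-z]+", text)     # "abc" at the start is enough
full_miss = re.fullmatch(r"[a-z]+", text)  # None: "123" is left over

# A zero-length match is still a match object, and it is truthy.
empty_hit = re.match(r"\d*", text)
```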
779
780.. function:: split(pattern, string, maxsplit=0, flags=0)
781
782   Split *string* by the occurrences of *pattern*.  If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
784   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
785   splits occur, and the remainder of the string is returned as the final element
786   of the list. ::
787
788      >>> re.split(r'\W+', 'Words, words, words.')
789      ['Words', 'words', 'words', '']
790      >>> re.split(r'(\W+)', 'Words, words, words.')
791      ['Words', ', ', 'words', ', ', 'words', '.', '']
792      >>> re.split(r'\W+', 'Words, words, words.', 1)
793      ['Words', 'words, words.']
794      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
795      ['0', '3', '9']
796
797   If there are capturing groups in the separator and it matches at the start of
798   the string, the result will start with an empty string.  The same holds for
799   the end of the string::
800
801      >>> re.split(r'(\W+)', '...words, words...')
802      ['', '...', 'words', ', ', 'words', '...', '']
803
804   That way, separator components are always found at the same relative
805   indices within the result list.
806
807   Empty matches for the pattern split the string only when not adjacent
   to a previous empty match::
809
810      >>> re.split(r'\b', 'Words, words, words.')
811      ['', 'Words', ', ', 'words', ', ', 'words', '.']
812      >>> re.split(r'\W*', '...words...')
813      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
814      >>> re.split(r'(\W*)', '...words...')
815      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
816
817   .. versionchanged:: 3.1
818      Added the optional flags argument.
819
820   .. versionchanged:: 3.7
      Added support for splitting on a pattern that could match an empty string.
822
823
824.. function:: findall(pattern, string, flags=0)
825
826   Return all non-overlapping matches of *pattern* in *string*, as a list of
827   strings or tuples.  The *string* is scanned left-to-right, and matches
828   are returned in the order found.  Empty matches are included in the result.
829
830   The result depends on the number of capturing groups in the pattern.
831   If there are no groups, return a list of strings matching the whole
832   pattern.  If there is exactly one group, return a list of strings
833   matching that group.  If multiple groups are present, return a list
834   of tuples of strings matching the groups.  Non-capturing groups do not
   affect the form of the result::
836
837      >>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
838      ['foot', 'fell', 'fastest']
839      >>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
840      [('width', '20'), ('height', '10')]
841
842   .. versionchanged:: 3.7
843      Non-empty matches can now start just after a previous empty match.
844
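The examples for :func:`findall` cover the no-group and multiple-group cases. As a supplementary sketch, the single-group and non-capturing cases look like this (the variable names are illustrative only):

```python
import re

text = "set width=20 and height=10"

# No groups: each result is the text of the whole match.
whole = re.findall(r"\w+=\d+", text)

# Exactly one capturing group: each result is that group's text.
names = re.findall(r"(\w+)=\d+", text)

# A non-capturing group (?:...) does not change the shape of the result.
values = re.findall(r"(?:\w+)=(\d+)", text)
```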
845
846.. function:: finditer(pattern, string, flags=0)
847
848   Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
849   all non-overlapping matches for the RE *pattern* in *string*.  The *string*
850   is scanned left-to-right, and matches are returned in the order found.  Empty
851   matches are included in the result.
852
853   .. versionchanged:: 3.7
854      Non-empty matches can now start just after a previous empty match.
855
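Because :func:`finditer` yields match objects rather than strings, positions are available alongside the matched text. A minimal sketch with a made-up input string:

```python
import re

text = "a1bb22ccc333"

# Collect the matched text and its (start, end) span for every run of digits.
found = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
```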
856
857.. function:: sub(pattern, repl, string, count=0, flags=0)
858
859   Return the string obtained by replacing the leftmost non-overlapping occurrences
860   of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
861   *string* is returned unchanged.  *repl* can be a string or a function; if it is
862   a string, any backslash escapes in it are processed.  That is, ``\n`` is
863   converted to a single newline character, ``\r`` is converted to a carriage return, and
864   so forth.  Unknown escapes of ASCII letters are reserved for future use and
865   treated as errors.  Other unknown escapes such as ``\&`` are left alone.
   Backreferences, such as ``\6``, are replaced with the substring matched by
   group 6 in the pattern.
868   For example::
869
870      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
871      ...        r'static PyObject*\npy_\1(void)\n{',
872      ...        'def myfunc():')
873      'static PyObject*\npy_myfunc(void)\n{'
874
875   If *repl* is a function, it is called for every non-overlapping occurrence of
876   *pattern*.  The function takes a single :ref:`match object <match-objects>`
877   argument, and returns the replacement string.  For example::
878
879      >>> def dashrepl(matchobj):
880      ...     if matchobj.group(0) == '-': return ' '
881      ...     else: return '-'
882      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
883      'pro--gram files'
884      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
885      'Baked Beans & Spam'
886
887   The pattern may be a string or a :ref:`pattern object <re-objects>`.
888
889   The optional argument *count* is the maximum number of pattern occurrences to be
890   replaced; *count* must be a non-negative integer.  If omitted or zero, all
891   occurrences will be replaced. Empty matches for the pattern are replaced only
892   when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
893   ``'-a-b--d-'``.
894
895   .. index:: single: \g; in regular expressions
896
897   In string-type *repl* arguments, in addition to the character escapes and
898   backreferences described above,
899   ``\g<name>`` will use the substring matched by the group named ``name``, as
900   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
901   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
902   in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
903   reference to group 20, not a reference to group 2 followed by the literal
904   character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
905   substring matched by the RE.
906
907   .. versionchanged:: 3.1
908      Added the optional flags argument.
909
910   .. versionchanged:: 3.5
911      Unmatched groups are replaced with an empty string.
912
913   .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      are now errors.
916
917   .. versionchanged:: 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
      are now errors.
920
921   .. versionchanged:: 3.7
922      Empty matches for the pattern are replaced when adjacent to a previous
923      non-empty match.
924
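The ``\g<name>`` and ``\g<number>`` forms and the *count* argument of :func:`sub` can be sketched as follows. The patterns and input strings are invented for illustration.

```python
import re

# Named groups can be referenced with \g<name> in the replacement.
swapped = re.sub(r"(?P<year>\d{4})-(?P<month>\d{2})",
                 r"\g<month>/\g<year>",
                 "2024-05 and 2023-11")

# \g<1>0 means group 1 followed by a literal "0"; \10 would mean group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")

# count limits how many of the leftmost occurrences are replaced.
limited = re.sub(r"\d", "#", "a1b2c3", count=2)
```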
925
926.. function:: subn(pattern, repl, string, count=0, flags=0)
927
928   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
929   number_of_subs_made)``.
930
931   .. versionchanged:: 3.1
932      Added the optional flags argument.
933
934   .. versionchanged:: 3.5
935      Unmatched groups are replaced with an empty string.
936
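A short sketch of :func:`subn`, collapsing runs of whitespace in a made-up string and reporting how many replacements were made:

```python
import re

# The result is a (new_string, number_of_subs_made) tuple.
new_text, n_subs = re.subn(r"\s+", " ", "a  b \t c")
```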
937
938.. function:: escape(pattern)
939
940   Escape special characters in *pattern*.
941   This is useful if you want to match an arbitrary literal string that may
942   have regular expression metacharacters in it.  For example::
943
944      >>> print(re.escape('https://www.python.org'))
945      https://www\.python\.org
946
      >>> import string
      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
948      >>> print('[%s]+' % re.escape(legal_chars))
949      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+
950
951      >>> operators = ['+', '-', '*', '/', '**']
952      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
953      /|\-|\+|\*\*|\*
954
   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; only backslashes should be escaped.  For example::
957
958      >>> digits_re = r'\d+'
959      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
960      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
961      /usr/sbin/sendmail - \d+ errors, \d+ warnings
962
963   .. versionchanged:: 3.3
964      The ``'_'`` character is no longer escaped.
965
966   .. versionchanged:: 3.7
967      Only characters that can have special meaning in a regular expression
968      are escaped. As a result, ``'!'``, ``'"'``, ``'%'``, ``"'"``, ``','``,
969      ``'/'``, ``':'``, ``';'``, ``'<'``, ``'='``, ``'>'``, ``'@'``, and
970      ``"`"`` are no longer escaped.
971
972
973.. function:: purge()
974
975   Clear the regular expression cache.
976
977
978.. exception:: error(msg, pattern=None, pos=None)
979
980   Exception raised when a string passed to one of the functions here is not a
981   valid regular expression (for example, it might contain unmatched parentheses)
982   or when some other error occurs during compilation or matching.  It is never an
983   error if a string contains no match for a pattern.  The error instance has
984   the following additional attributes:
985
986   .. attribute:: msg
987
988      The unformatted error message.
989
990   .. attribute:: pattern
991
992      The regular expression pattern.
993
994   .. attribute:: pos
995
996      The index in *pattern* where compilation failed (may be ``None``).
997
998   .. attribute:: lineno
999
1000      The line corresponding to *pos* (may be ``None``).
1001
1002   .. attribute:: colno
1003
1004      The column corresponding to *pos* (may be ``None``).
1005
1006   .. versionchanged:: 3.5
1007      Added additional attributes.
1008
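The attributes listed above can be inspected after catching the exception. The sketch below uses a deliberately invalid pattern chosen for this example. The exact wording of ``msg`` is not guaranteed, so the code only stores it.

```python
import re

caught = None
try:
    re.compile(r"(unclosed")   # deliberately invalid: missing ")"
except re.error as exc:
    caught = exc

# Gather the documented attributes for inspection or logging.
details = (caught.msg, caught.pattern, caught.pos, caught.lineno, caught.colno)
```
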
1009.. _re-objects:
1010
1011Regular Expression Objects
1012--------------------------
1013
1014Compiled regular expression objects support the following methods and
1015attributes:
1016
1017.. method:: Pattern.search(string[, pos[, endpos]])
1018
1019   Scan through *string* looking for the first location where this regular
1020   expression produces a match, and return a corresponding :ref:`match object
1021   <match-objects>`.  Return ``None`` if no position in the string matches the
1022   pattern; note that this is different from finding a zero-length match at some
1023   point in the string.
1024
1025   The optional second parameter *pos* gives an index in the string where the
1026   search is to start; it defaults to ``0``.  This is not completely equivalent to
1027   slicing the string; the ``'^'`` pattern character matches at the real beginning
1028   of the string and at positions just after a newline, but not necessarily at the
1029   index where the search is to start.
1030
1031   The optional parameter *endpos* limits how far the string will be searched; it
1032   will be as if the string is *endpos* characters long, so only the characters
1033   from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
1034   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
1035   expression object, ``rx.search(string, 0, 50)`` is equivalent to
1036   ``rx.search(string[:50], 0)``. ::
1037
1038      >>> pattern = re.compile("d")
1039      >>> pattern.search("dog")     # Match at index 0
1040      <re.Match object; span=(0, 1), match='d'>
1041      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
1042
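The difference between passing *pos* and slicing the string shows up as soon as the pattern contains ``'^'``. The following sketch uses invented names (``rx``, ``digits``) and contrasts it with *endpos*, which does behave like truncation:

```python
import re

rx = re.compile(r"^b")
text = "ab"

# pos=1 starts the search at "b", but "^" still refers to the real
# beginning of the string, so nothing matches.
with_pos = rx.search(text, 1)

# Slicing creates a new string whose beginning is "b", so "^" matches.
with_slice = rx.search(text[1:])

# endpos, in contrast, behaves like truncating the string.
digits = re.compile(r"\d+")
with_endpos = digits.search("12345", 0, 3)
with_truncation = digits.search("12345"[:3], 0)
```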
1043
1044.. method:: Pattern.match(string[, pos[, endpos]])
1045
1046   If zero or more characters at the *beginning* of *string* match this regular
1047   expression, return a corresponding :ref:`match object <match-objects>`.
1048   Return ``None`` if the string does not match the pattern; note that this is
1049   different from a zero-length match.
1050
1051   The optional *pos* and *endpos* parameters have the same meaning as for the
1052   :meth:`~Pattern.search` method. ::
1053
1054      >>> pattern = re.compile("o")
1055      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
1056      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
1057      <re.Match object; span=(1, 2), match='o'>
1058
1059   If you want to locate a match anywhere in *string*, use
1060   :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).
1061
1062
1063.. method:: Pattern.fullmatch(string[, pos[, endpos]])
1064
1065   If the whole *string* matches this regular expression, return a corresponding
1066   :ref:`match object <match-objects>`.  Return ``None`` if the string does not
1067   match the pattern; note that this is different from a zero-length match.
1068
1069   The optional *pos* and *endpos* parameters have the same meaning as for the
1070   :meth:`~Pattern.search` method. ::
1071
1072      >>> pattern = re.compile("o[gh]")
1073      >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
1074      >>> pattern.fullmatch("ogre")     # No match as not the full string matches.
1075      >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
1076      <re.Match object; span=(1, 3), match='og'>
1077
1078   .. versionadded:: 3.4
1079
1080
1081.. method:: Pattern.split(string, maxsplit=0)
1082
1083   Identical to the :func:`split` function, using the compiled pattern.
1084
1085
1086.. method:: Pattern.findall(string[, pos[, endpos]])
1087
1088   Similar to the :func:`findall` function, using the compiled pattern, but
1089   also accepts optional *pos* and *endpos* parameters that limit the search
   region in the same way as :meth:`~Pattern.search`.
1091
1092
1093.. method:: Pattern.finditer(string[, pos[, endpos]])
1094
1095   Similar to the :func:`finditer` function, using the compiled pattern, but
1096   also accepts optional *pos* and *endpos* parameters that limit the search
   region in the same way as :meth:`~Pattern.search`.
1098
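A minimal sketch of the *pos* and *endpos* parameters on :meth:`~Pattern.findall` and :meth:`~Pattern.finditer`, using a made-up digit string:

```python
import re

digit = re.compile(r"\d")
text = "0123456789"

# Only indices 2, 3 and 4 are searched.
limited = digit.findall(text, 2, 5)

# finditer() accepts the same bounds; spans still refer to the whole string.
spans = [m.span() for m in digit.finditer(text, 2, 5)]
```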
1099
1100.. method:: Pattern.sub(repl, string, count=0)
1101
1102   Identical to the :func:`sub` function, using the compiled pattern.
1103
1104
1105.. method:: Pattern.subn(repl, string, count=0)
1106
1107   Identical to the :func:`subn` function, using the compiled pattern.
1108
1109
1110.. attribute:: Pattern.flags
1111
1112   The regex matching flags.  This is a combination of the flags given to
1113   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
1114   flags such as :data:`UNICODE` if the pattern is a Unicode string.
1115
1116
1117.. attribute:: Pattern.groups
1118
1119   The number of capturing groups in the pattern.
1120
1121
1122.. attribute:: Pattern.groupindex
1123
1124   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
1125   numbers.  The dictionary is empty if no symbolic groups were used in the
1126   pattern.
1127
1128
1129.. attribute:: Pattern.pattern
1130
1131   The pattern string from which the pattern object was compiled.
1132
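The attributes above can be read directly from any compiled pattern. In this sketch the pattern and the variable names are illustrative only:

```python
import re

rx = re.compile(r"(?P<key>\w+)=(?P<value>\d+)", re.IGNORECASE)

n_groups = rx.groups                  # number of capturing groups
name_to_index = dict(rx.groupindex)   # copy of the name -> number mapping
source = rx.pattern                   # the original pattern string

# flags combines the explicit flag with the implicit UNICODE flag
# that a str pattern receives.
has_ignorecase = bool(rx.flags & re.IGNORECASE)
has_unicode = bool(rx.flags & re.UNICODE)
```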
1133
1134.. versionchanged:: 3.7
   Added support for :func:`copy.copy` and :func:`copy.deepcopy`.  Compiled
1136   regular expression objects are considered atomic.
1137
1138
1139.. _match-objects:
1140
1141Match Objects
1142-------------
1143
1144Match objects always have a boolean value of ``True``.
1145Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
1146when there is no match, you can test whether there was a match with a simple
1147``if`` statement::
1148
1149   match = re.search(pattern, string)
1150   if match:
1151       process(match)
1152
1153Match objects support the following methods and attributes:
1154
1155
1156.. method:: Match.expand(template)
1157
1158   Return the string obtained by doing backslash substitution on the template
1159   string *template*, as done by the :meth:`~Pattern.sub` method.
1160   Escapes such as ``\n`` are converted to the appropriate characters,
1161   and numeric backreferences (``\1``, ``\2``) and named backreferences
1162   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
1163   corresponding group.
1164
1165   .. versionchanged:: 3.5
1166      Unmatched groups are replaced with an empty string.
1167
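As an illustration of :meth:`Match.expand`, the sketch below fills two invented templates from a single match, mixing named and numeric backreferences with an ordinary escape:

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Named and numeric backreferences can be mixed in one template.
reordered = m.expand(r"\g<last>, \1")

# Escapes such as \n in the template become the actual character.
two_lines = m.expand(r"\1\n\2")
```
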
1168.. method:: Match.group([group1, ...])
1169
   Return one or more subgroups of the match.  If there is a single argument, the
1171   result is a single string; if there are multiple arguments, the result is a
1172   tuple with one item per argument. Without arguments, *group1* defaults to zero
1173   (the whole match is returned). If a *groupN* argument is zero, the corresponding
1174   return value is the entire matching string; if it is in the inclusive range
1175   [1..99], it is the string matching the corresponding parenthesized group.  If a
1176   group number is negative or larger than the number of groups defined in the
1177   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
1178   part of the pattern that did not match, the corresponding result is ``None``.
1179   If a group is contained in a part of the pattern that matched multiple times,
1180   the last match is returned. ::
1181
1182      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
1183      >>> m.group(0)       # The entire match
1184      'Isaac Newton'
1185      >>> m.group(1)       # The first parenthesized subgroup.
1186      'Isaac'
1187      >>> m.group(2)       # The second parenthesized subgroup.
1188      'Newton'
1189      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
1190      ('Isaac', 'Newton')
1191
1192   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
1193   arguments may also be strings identifying groups by their group name.  If a
1194   string argument is not used as a group name in the pattern, an :exc:`IndexError`
1195   exception is raised.
1196
1197   A moderately complicated example::
1198
1199      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
1200      >>> m.group('first_name')
1201      'Malcolm'
1202      >>> m.group('last_name')
1203      'Reynolds'
1204
1205   Named groups can also be referred to by their index::
1206
1207      >>> m.group(1)
1208      'Malcolm'
1209      >>> m.group(2)
1210      'Reynolds'
1211
1212   If a group matches multiple times, only the last match is accessible::
1213
1214      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
1215      >>> m.group(1)                        # Returns only the last match.
1216      'c3'
1217
1218
1219.. method:: Match.__getitem__(g)
1220
1221   This is identical to ``m.group(g)``.  This allows easier access to
1222   an individual group from a match::
1223
1224      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
1225      >>> m[0]       # The entire match
1226      'Isaac Newton'
1227      >>> m[1]       # The first parenthesized subgroup.
1228      'Isaac'
1229      >>> m[2]       # The second parenthesized subgroup.
1230      'Newton'
1231
1232   .. versionadded:: 3.6
1233
1234
1235.. method:: Match.groups(default=None)
1236
1237   Return a tuple containing all the subgroups of the match, from 1 up to however
1238   many groups are in the pattern.  The *default* argument is used for groups that
1239   did not participate in the match; it defaults to ``None``.
1240
1241   For example::
1242
1243      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
1244      >>> m.groups()
1245      ('24', '1632')
1246
1247   If we make the decimal place and everything after it optional, not all groups
1248   might participate in the match.  These groups will default to ``None`` unless
1249   the *default* argument is given::
1250
1251      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
1252      >>> m.groups()      # Second group defaults to None.
1253      ('24', None)
1254      >>> m.groups('0')   # Now, the second group defaults to '0'.
1255      ('24', '0')
1256
1257
1258.. method:: Match.groupdict(default=None)
1259
1260   Return a dictionary containing all the *named* subgroups of the match, keyed by
1261   the subgroup name.  The *default* argument is used for groups that did not
1262   participate in the match; it defaults to ``None``.  For example::
1263
1264      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
1265      >>> m.groupdict()
1266      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
1267
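The *default* argument of :meth:`Match.groupdict` matters when a named group is optional. A sketch with hypothetical group names ``whole`` and ``frac``:

```python
import re

rx = re.compile(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?")
m = rx.match("24")

# The optional group did not participate, so it maps to None ...
plain = m.groupdict()

# ... unless a default is supplied.
filled = m.groupdict("0")
```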
1268
1269.. method:: Match.start([group])
1270            Match.end([group])
1271
1272   Return the indices of the start and end of the substring matched by *group*;
1273   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
1274   *group* exists but did not contribute to the match.  For a match object *m*, and
1275   a group *g* that did contribute to the match, the substring matched by group *g*
1276   (equivalent to ``m.group(g)``) is ::
1277
1278      m.string[m.start(g):m.end(g)]
1279
1280   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
1281   null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
1282   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
1283   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
1284
1285   An example that will remove *remove_this* from email addresses::
1286
1287      >>> email = "tony@tiremove_thisger.net"
1288      >>> m = re.search("remove_this", email)
1289      >>> email[:m.start()] + email[m.end():]
1290      'tony@tiger.net'
1291
1292
1293.. method:: Match.span([group])
1294
1295   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
1296   that if *group* did not contribute to the match, this is ``(-1, -1)``.
1297   *group* defaults to zero, the entire match.
1298
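For example, with an alternation only one of the two groups takes part in a match, and :meth:`Match.span` reports ``(-1, -1)`` for the other. The pattern and input below are invented for this sketch:

```python
import re

m = re.search(r"(a)|(b)", "cab")

whole_span = m.span()     # span of the entire match
first_span = m.span(1)    # group 1 matched "a"
second_span = m.span(2)   # group 2 did not contribute to the match
```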
1299
1300.. attribute:: Match.pos
1301
1302   The value of *pos* which was passed to the :meth:`~Pattern.search` or
1303   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
1304   the index into the string at which the RE engine started looking for a match.
1305
1306
1307.. attribute:: Match.endpos
1308
1309   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
1310   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
1311   the index into the string beyond which the RE engine will not go.
1312
1313
1314.. attribute:: Match.lastindex
1315
1316   The integer index of the last matched capturing group, or ``None`` if no group
1317   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
1318   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
1319   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
1320   string.
1321
1322
1323.. attribute:: Match.lastgroup
1324
1325   The name of the last matched capturing group, or ``None`` if the group didn't
1326   have a name, or if no group was matched at all.
1327
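One common use of :attr:`Match.lastgroup` is to find out which named alternative matched. The sketch below uses a made-up two-alternative token pattern and also repeats the :attr:`Match.lastindex` cases described above:

```python
import re

token = re.compile(r"(?P<word>[a-z]+)|(?P<num>\d+)")

# lastgroup names the alternative that produced each match.
kinds = [(m.lastgroup, m.group()) for m in token.finditer("abc 42 de")]

# lastindex is the index of the last matched capturing group.
nested = re.match(r"((a)(b))", "ab").lastindex   # outer group closes last
flat = re.match(r"(a)(b)", "ab").lastindex
no_groups = re.match(r"ab", "ab").lastindex      # no groups at all
unnamed = re.match(r"(a)", "a").lastgroup        # group has no name
```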
1328
1329.. attribute:: Match.re
1330
1331   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
1332   :meth:`~Pattern.search` method produced this match instance.
1333
1334
1335.. attribute:: Match.string
1336
1337   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.
1338
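The bookkeeping attributes :attr:`Match.pos`, :attr:`Match.endpos`, :attr:`Match.re` and :attr:`Match.string` simply record how the match was produced. A brief sketch with invented values:

```python
import re

rx = re.compile(r"\d+")
text = "ab123cd"

# Search only the region text[1:5].
m = rx.search(text, 1, 5)

found = m.group()
bounds = (m.pos, m.endpos)
same_pattern = m.re is rx
same_string = m.string == text
```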
1339
1340.. versionchanged:: 3.7
   Added support for :func:`copy.copy` and :func:`copy.deepcopy`.  Match objects
1342   are considered atomic.
1343
1344
1345.. _re-examples:
1346
1347Regular Expression Examples
1348---------------------------
1349
1350
1351Checking for a Pair
1352^^^^^^^^^^^^^^^^^^^
1353
1354In this example, we'll use the following helper function to display match
1355objects a little more gracefully::
1356
1357   def displaymatch(match):
1358       if match is None:
1359           return None
1360       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
1361
1362Suppose you are writing a poker program where a player's hand is represented as
1363a 5-character string with each character representing a card, "a" for ace, "k"
1364for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
1365representing the card with that value.
1366
1367To see if a given string is a valid hand, one could do the following::
1368
1369   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
1370   >>> displaymatch(valid.match("akt5q"))  # Valid.
1371   "<Match: 'akt5q', groups=()>"
1372   >>> displaymatch(valid.match("akt5e"))  # Invalid.
1373   >>> displaymatch(valid.match("akt"))    # Invalid.
1374   >>> displaymatch(valid.match("727ak"))  # Valid.
1375   "<Match: '727ak', groups=()>"
1376
That last hand, ``"727ak"``, contained a pair, that is, two cards of the same value.
To match this with a regular expression, one could use backreferences as follows::
1379
1380   >>> pair = re.compile(r".*(.).*\1")
1381   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
1382   "<Match: '717', groups=('7',)>"
1383   >>> displaymatch(pair.match("718ak"))     # No pairs.
1384   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
1385   "<Match: '354aa', groups=('a',)>"
1386
1387To find out what card the pair consists of, one could use the
1388:meth:`~Match.group` method of the match object in the following manner::
1389
1390   >>> pair = re.compile(r".*(.).*\1")
1391   >>> pair.match("717ak").group(1)
1392   '7'
1393
   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
1399   AttributeError: 'NoneType' object has no attribute 'group'
1400
1401   >>> pair.match("354aa").group(1)
1402   'a'
1403
1404
1405Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`.  Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings.  The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings
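As a minimal sketch of how that expression might be applied, the snippet below
runs it against the sample line with :func:`re.search`.  The variable names are
illustrative only.  Unlike :c:func:`scanf`, the captured groups are returned as
strings, so the two counts are converted with :class:`int` explicitly:

```python
import re

# Regular expression equivalent of the scanf format "%s - %d errors, %d warnings".
pattern = re.compile(r"(\S+) - (\d+) errors, (\d+) warnings")

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = pattern.search(line)

# Groups are always strings; convert the numeric fields by hand.
filename = m.group(1)
n_errors = int(m.group(2))
n_warnings = int(m.group(3))

print(filename, n_errors, n_warnings)
```

In real code the result of :func:`re.search` should be checked for ``None``
before the groups are read, since a line that does not fit the format produces
no match object.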


.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <re.Match object; span=(0, 1), match='a'>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line. ::

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <re.Match object; span=(4, 5), match='X'>
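The following sketch, which goes slightly beyond the examples above, contrasts
the anchoring behaviours side by side.  It assumes two facts about the module
that the text above does not state: ``\A`` matches only at the start of the
string regardless of :const:`MULTILINE`, and :func:`re.fullmatch` requires the
pattern to cover the whole string.  The variable names are illustrative:

```python
import re

lines = "A\nB\nX"

# '^' follows line boundaries in MULTILINE mode, so search() finds the last line.
caret_hit = re.search(r"^X", lines, re.MULTILINE)

# '\A' always means "start of the string", so this behaves like match().
anchored_hit = re.search(r"\AX", lines, re.MULTILINE)

# match() only anchors the start; fullmatch() anchors both ends.
prefix_hit = re.match("abc", "abcdef")
whole_hit = re.fullmatch("abc", "abcdef")

print(caret_hit.span())   # (4, 5)
print(anchored_hit)       # None
print(prefix_hit.span())  # (0, 3)
print(whole_hit)          # None
```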


Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern.  The
function is invaluable for converting textual data into data structures that can
be easily read and modified by Python, as demonstrated in the following example
that creates a phonebook.

First, here is the input.  Normally it may come from a file; here we are using
triple-quoted string syntax:

.. doctest::

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address.  We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, maxsplit=3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, maxsplit=4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
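To carry the example one step further towards a data structure, the split
fields can be paired with names.  The sketch below is one possible approach,
not part of the original example.  The ``FIELDS`` tuple and its field names are
hypothetical, and the input is assumed to follow the "First Last: phone
address" layout shown above:

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

# Hypothetical field names chosen for this sketch; they are not part of re.
FIELDS = ("first", "last", "phone", "address")

# Split into entries, split each entry into four fields, then name the fields.
phonebook = [
    dict(zip(FIELDS, re.split(":? ", entry, maxsplit=3)))
    for entry in re.split(r"\n+", text)
]

for person in phonebook:
    print(person["last"], person["phone"])
```

Each entry becomes a dictionary, so a field can be looked up by name instead of
by its position in the list.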


Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.  This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   ...
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
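Because the shuffle is random, the output differs on every call.  Where
repeatable output is wanted (in a test, for instance), a seeded private
generator is one option.  The following is a sketch only: the ``munge``
wrapper and its ``seed`` parameter are hypothetical names introduced here, and
no particular shuffled output is assumed:

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner letters of every word of three or more characters."""
    rng = random.Random(seed)  # private generator; 'seed' is a hypothetical parameter

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)

text = "Professor Abdolmalek, please report your absences promptly."
munged = munge(text, seed=0)

# The same seed reproduces the same shuffle.
print(munged == munge(text, seed=0))

# Every word keeps its first and last characters and its set of letters.
for before, after in zip(re.findall(r"\w+", text), re.findall(r"\w+", munged)):
    assert before[0] == after[0] and before[-1] == after[-1]
    assert sorted(before) == sorted(after)
```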


Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does.  For example, if a writer wanted to
find all of the adverbs in some text, they might use :func:`findall` in
the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly\b", text)
   ['carefully', 'quickly']


Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings.  Continuing with the previous example, if
a writer wanted to find all of the adverbs *and their positions* in
some text, they would use :func:`finditer` in the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly\b", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly


Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it.  For example, the two following lines of code are
functionally identical::

   >>> re.match(r"\W(.)\1\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression.  With raw string notation, this means ``r"\\"``.  Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical::

   >>> re.match(r"\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>
   >>> re.match("\\\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>


Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
analyzes a string to categorize groups of characters.  This is a useful first
step in writing a compiler or interpreter.

The text categories are specified with regular expressions.  The technique is
to combine those into a single master regular expression and to loop over
successive matches::

    from typing import NamedTuple
    import re

    class Token(NamedTuple):
        type: str
        value: str
        line: int
        column: int

    def tokenize(code):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
            ('ASSIGN',   r':='),           # Assignment operator
            ('END',      r';'),            # Statement terminator
            ('ID',       r'[A-Za-z]+'),    # Identifiers
            ('OP',       r'[+\-*/]'),      # Arithmetic operators
            ('NEWLINE',  r'\n'),           # Line endings
            ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
            ('MISMATCH', r'.'),            # Any other character
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        line_num = 1
        line_start = 0
        for mo in re.finditer(tok_regex, code):
            kind = mo.lastgroup
            value = mo.group()
            column = mo.start() - line_start
            if kind == 'NUMBER':
                value = float(value) if '.' in value else int(value)
            elif kind == 'ID' and value in keywords:
                kind = value
            elif kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1
                continue
            elif kind == 'SKIP':
                continue
            elif kind == 'MISMATCH':
                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
            yield Token(kind, value, line_num, column)

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)

The tokenizer produces the following output::

    Token(type='IF', value='IF', line=2, column=4)
    Token(type='ID', value='quantity', line=2, column=7)
    Token(type='THEN', value='THEN', line=2, column=16)
    Token(type='ID', value='total', line=3, column=8)
    Token(type='ASSIGN', value=':=', line=3, column=14)
    Token(type='ID', value='total', line=3, column=17)
    Token(type='OP', value='+', line=3, column=23)
    Token(type='ID', value='price', line=3, column=25)
    Token(type='OP', value='*', line=3, column=31)
    Token(type='ID', value='quantity', line=3, column=33)
    Token(type='END', value=';', line=3, column=41)
    Token(type='ID', value='tax', line=4, column=8)
    Token(type='ASSIGN', value=':=', line=4, column=12)
    Token(type='ID', value='price', line=4, column=15)
    Token(type='OP', value='*', line=4, column=21)
    Token(type='NUMBER', value=0.05, line=4, column=23)
    Token(type='END', value=';', line=4, column=27)
    Token(type='ENDIF', value='ENDIF', line=5, column=4)
    Token(type='END', value=';', line=5, column=9)


.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
   Media, 2009. The third edition of the book no longer covers Python at all,
   but the first edition covered writing good regular expression patterns in
   great detail.
