:mod:`!re` --- Regular expression operations
============================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re/`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a bytes pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal. Also, please note that any invalid escape sequences in Python's
usage of the backslash in string literals now generate a :exc:`SyntaxWarning`
and in the future this will become a :exc:`SyntaxError`. This behaviour
will happen even if it is a valid escape sequence for a regular expression.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline.
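
As a minimal sketch of the difference (the names ``escaped``, ``raw`` and
``match`` are illustrative only, not part of the module), the following checks
that both spellings produce the same pattern and that this pattern matches a
single literal backslash::

   import re

   # The same pattern written as a regular and as a raw string literal.
   escaped = '\\\\'   # four backslashes in the source, two in the string
   raw = r'\\'        # two backslashes in the source, two in the string

   print(escaped == raw)    # True
   print(len(raw))          # 2

   # The two-character pattern matches one literal backslash.
   match = re.search(raw, 'a\\b')
   print(match.group(0) == '\\')    # True

   # r"\n" is a backslash followed by 'n'; "\n" is a single newline.
   print(len(r"\n"), len("\n"))     # 2 1
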
Usually patterns will be expressed in Python code using this raw
string notation.

Most regular expression operations are available both as
module-level functions and as methods on
:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
that don't require you to compile a regex object first, but lack some
fine-tuning parameters.

.. seealso::

   The third-party :pypi:`regex` module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contain low precedence
operations; boundary conditions between *A* and *B*; or have numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book [Frie09]_,
or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest
regular expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

Repetition operators or quantifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


The special characters are:

.. index:: single: . (dot); in regular expressions

``.``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline. ``(?s:.)`` matches any character regardless of flags.

.. index:: single: ^ (caret); in regular expressions

``^``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

.. index:: single: $ (dollar); in regular expressions

``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'.
   More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.

.. index:: single: * (asterisk); in regular expressions

``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

.. index:: single: + (plus); in regular expressions

``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

.. index:: single: ? (question mark); in regular expressions

``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

.. index::
   single: *?; in regular expressions
   single: +?; in regular expressions
   single: ??; in regular expressions

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` quantifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
   string, and not just ``'<a>'``. Adding ``?`` after the quantifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``'<a>'``.

.. index::
   single: *+; in regular expressions
   single: ++; in regular expressions
   single: ?+; in regular expressions

``*+``, ``++``, ``?+``
   Like the ``'*'``, ``'+'``, and ``'?'`` quantifiers, those where ``'+'`` is
   appended also match as many times as possible.
   However, unlike the true greedy quantifiers, these do not allow
   back-tracking when the expression following them fails to match.
   These are known as :dfn:`possessive` quantifiers.
   For example, ``a*a`` will match ``'aaaa'`` because the ``a*`` will match
   all 4 ``'a'``\ s, but, when the final ``'a'`` is encountered, the
   expression is backtracked so that in the end the ``a*`` ends up matching
   3 ``'a'``\ s total, and the fourth ``'a'`` is matched by the final ``'a'``.
   However, when ``a*+a`` is used to match ``'aaaa'``, the ``a*+`` will
   match all 4 ``'a'``\ s, but when the final ``'a'`` fails to find any more
   characters to match, the expression cannot be backtracked and will thus
   fail to match.
   ``x*+``, ``x++`` and ``x?+`` are equivalent to ``(?>x*)``, ``(?>x+)``
   and ``(?>x?)`` respectively.

   .. versionadded:: 3.11

.. index::
   single: {} (curly brackets); in regular expressions

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous quantifier.
   For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

``{m,n}+``
   Causes the resulting RE to match from *m* to *n* repetitions of the
   preceding RE, attempting to match as many repetitions as possible
   *without* establishing any backtracking points.
   This is the possessive version of the quantifier above.
   For example, on the 6-character string ``'aaaaaa'``, ``a{3,5}+aa``
   attempts to match 5 ``'a'`` characters, then, requiring 2 more ``'a'``\ s,
   will need more characters than available and thus fail, while
   ``a{3,5}aa`` will match with ``a{3,5}`` capturing 5, then 4 ``'a'``\ s
   by backtracking and then the final 2 ``'a'``\ s are matched by the final
   ``aa`` in the pattern.
   ``x{m,n}+`` is equivalent to ``(?>x{m,n})``.

   .. versionadded:: 3.11

.. index:: single: \ (backslash); in regular expressions

``\``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash must be doubled. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

.. index::
   single: [] (square brackets); in regular expressions

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   .. index:: single: - (minus); in regular expressions

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   .. index:: single: \ (backslash); in regular expressions

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on the flags_ used.

   .. index:: single: ^ (caret); in regular expressions

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a right bracket, as well as left bracket, braces,
     and parentheses.

   .. .. index:: single: --; in regular expressions
   .. .. index:: single: &&; in regular expressions
   .. .. index:: single: ~~; in regular expressions
   .. .. index:: single: ||; in regular expressions

   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future. This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
     avoid a warning, escape them with a backslash.

   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/

   .. versionchanged:: 3.7
      :exc:`FutureWarning` is raised if a character set contains constructs
      that will change semantically in the future.

.. index:: single: | (vertical bar); in regular expressions

``|``
   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
   will match either *A* or *B*. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once *A* matches, *B* will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

.. index::
   single: () (parentheses); in regular expressions

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below.
   To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

.. index:: single: (?; in regular expressions

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set
   ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``.)
   The group matches the empty string;
   the letters set the corresponding flags for the entire regular expression:

   * :const:`re.A` (ASCII-only matching)
   * :const:`re.I` (ignore case)
   * :const:`re.L` (locale dependent)
   * :const:`re.M` (multi-line)
   * :const:`re.S` (dot matches all)
   * :const:`re.U` (Unicode matching)
   * :const:`re.X` (verbose)

   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flag* argument to the
   :func:`re.compile` function.
   Flags should be used first in the expression string.

   .. versionchanged:: 3.11
      This construction can only be used at the start of the expression.

.. index:: single: (?:; in regular expressions

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.
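
   As an illustrative sketch (the sample pattern, string and variable names
   are assumptions chosen for this example), the difference from capturing
   parentheses shows up in what the match object exposes::

      import re

      # Capturing parentheses create group 1; it holds the last repetition.
      capturing = re.match(r'(ab)+c', 'ababc')
      # The non-capturing form matches the same text but creates no group.
      non_capturing = re.match(r'(?:ab)+c', 'ababc')

      print(capturing.groups())        # ('ab',)
      print(non_capturing.groups())    # ()
      print(non_capturing.group(0))    # ababc
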

``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set
   ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``,
   optionally followed by ``'-'`` followed by
   one or more letters from the set ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags for that part of the expression:

   * :const:`re.A` (ASCII-only matching)
   * :const:`re.I` (ignore case)
   * :const:`re.L` (locale dependent)
   * :const:`re.M` (multi-line)
   * :const:`re.S` (dot matches all)
   * :const:`re.U` (Unicode matching)
   * :const:`re.X` (verbose)

   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
   as inline flags, so they can't be combined or follow ``'-'``. Instead,
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default). In bytes patterns ``(?L:...)`` switches to locale dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.

   .. versionadded:: 3.6

   .. versionchanged:: 3.7
      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.

``(?>...)``
   Attempts to match ``...`` as if it was a separate regular expression, and
   if successful, continues to match the rest of the pattern following it.
   If the subsequent pattern fails to match, the stack can only be unwound
   to a point *before* the ``(?>...)`` because once exited, the expression,
   known as an :dfn:`atomic group`, has thrown away all stack points within
   itself.
   Thus, ``(?>.*).`` would never match anything because first the ``.*``
   would match all characters possible, then, having nothing left to match,
   the final ``.`` would fail to match.
   Since there are no stack points saved in the atomic group, and there is
   no stack point before it, the entire expression would thus fail to match.

   .. versionadded:: 3.11

.. index:: single: (?P<; in regular expressions

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and in :class:`bytes` patterns they can only contain
   bytes in the ASCII range. Each group name must be defined only once within
   a regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

   .. versionchanged:: 3.12
      In :class:`bytes` patterns, group *name* can only contain bytes
      in the ASCII range (``b'\x00'``-``b'\x7f'``).

.. index:: single: (?P=; in regular expressions

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

.. index:: single: (?#; in regular expressions

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

.. index:: single: (?=; in regular expressions

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a :dfn:`lookahead assertion`. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

.. index:: single: (?!; in regular expressions

``(?!...)``
   Matches if ``...`` doesn't match next. This is a :dfn:`negative lookahead assertion`.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

.. index:: single: (?<=; in regular expressions

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.
   Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

   .. versionchanged:: 3.5
      Added support for group references of fixed length.

.. index:: single: (?<!; in regular expressions

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length. Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

.. _re-conditional-expression:
.. index:: single: (?(; in regular expressions

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.

   .. versionchanged:: 3.12
      Group *id* can only contain ASCII digits.
      In :class:`bytes` patterns, group *name* can only contain bytes
      in the ASCII range (``b'\x00'``-``b'\x7f'``).


.. _re-special-sequences:

The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
character ``'$'``.

.. index:: single: \ (backslash); in regular expressions

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

.. index:: single: \A; in regular expressions

``\A``
   Matches only at the start of the string.

.. index:: single: \b; in regular expressions

``\b``
   Matches the empty string, but only at the beginning or end of a word.
   A word is defined as a sequence of word characters.
   Note that formally, ``\b`` is defined as the boundary
   between a ``\w`` and a ``\W`` character (or vice versa),
   or between ``\w`` and the beginning or end of the string.
   This means that ``r'\bat\b'`` matches ``'at'``, ``'at.'``, ``'(at)'``,
   and ``'as at ay'`` but not ``'attempt'`` or ``'atlas'``.

   The default word characters in Unicode (str) patterns
   are Unicode alphanumerics and the underscore,
   but this can be changed by using the :py:const:`~re.ASCII` flag.
   Word boundaries are determined by the current locale
   if the :py:const:`~re.LOCALE` flag is used.

   .. note::

      Inside a character class, ``\b`` represents the backspace character,
      for compatibility with Python's string literals.

.. index:: single: \B; in regular expressions

``\B``
   Matches the empty string,
   but only when it is *not* at the beginning or end of a word.
   This means that ``r'at\B'`` matches ``'athens'``, ``'atom'``,
   ``'attorney'``, but not ``'at'``, ``'at.'``, or ``'at!'``.
   ``\B`` is the opposite of ``\b``,
   so word characters in Unicode (str) patterns
   are Unicode alphanumerics or the underscore,
   although this can be changed by using the :py:const:`~re.ASCII` flag.
   Word boundaries are determined by the current locale
   if the :py:const:`~re.LOCALE` flag is used.

   .. note::

      Note that ``\B`` does not match an empty string, which differs from
      RE implementations in other programming languages such as Perl.
      This behavior is kept for compatibility reasons.

.. index:: single: \d; in regular expressions

``\d``
   For Unicode (str) patterns:
      Matches any Unicode decimal digit
      (that is, any character in Unicode character category `[Nd]`__).
      This includes ``[0-9]``, and also many other digit characters.

      Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.

      __ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153

   For 8-bit (bytes) patterns:
      Matches any decimal digit in the ASCII character set;
      this is equivalent to ``[0-9]``.

.. index:: single: \D; in regular expressions

``\D``
   Matches any character which is not a decimal digit.
   This is the opposite of ``\d``.

   Matches ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used.

.. index:: single: \s; in regular expressions

``\s``
   For Unicode (str) patterns:
      Matches Unicode whitespace characters (as defined by :py:meth:`str.isspace`).
      This includes ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many languages.

      Matches ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
      this is equivalent to ``[ \t\n\r\f\v]``.

.. index:: single: \S; in regular expressions

``\S``
   Matches any character which is not a whitespace character. This is
   the opposite of ``\s``.

   Matches ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.

.. index:: single: \w; in regular expressions

``\w``
   For Unicode (str) patterns:
      Matches Unicode word characters;
      this includes all Unicode alphanumeric characters
      (as defined by :py:meth:`str.isalnum`),
      as well as the underscore (``_``).

      Matches ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.

   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
      this is equivalent to ``[a-zA-Z0-9_]``.
      If the :py:const:`~re.LOCALE` flag is used,
      matches characters considered alphanumeric in the current locale and the underscore.

.. index:: single: \W; in regular expressions

``\W``
   Matches any character which is not a word character.
   This is the opposite of ``\w``.
   By default, matches non-underscore (``_``) characters
   for which :py:meth:`str.isalnum` returns ``False``.

   Matches ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.

   If the :py:const:`~re.LOCALE` flag is used,
   matches characters which are neither alphanumeric in the current locale
   nor the underscore.

.. index:: single: \Z; in regular expressions

``\Z``
   Matches only at the end of the string.

.. index::
   single: \a; in regular expressions
   single: \b; in regular expressions
   single: \f; in regular expressions
   single: \n; in regular expressions
   single: \N; in regular expressions
   single: \r; in regular expressions
   single: \t; in regular expressions
   single: \u; in regular expressions
   single: \U; in regular expressions
   single: \v; in regular expressions
   single: \x; in regular expressions
   single: \\; in regular expressions

Most of the :ref:`escape sequences <escape-sequences>` supported by Python
string literals are also accepted by the regular expression parser::

   \a      \b      \f      \n
   \N      \r      \t      \u
   \U      \v      \x      \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are
only recognized in Unicode (str) patterns.
In bytes patterns they are errors.
Unknown escapes of ASCII letters are reserved
for future use and treated as errors.

Octal escapes are included in a limited form. If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

.. versionchanged:: 3.3
   The ``'\u'`` and ``'\U'`` escape sequences have been added.

.. versionchanged:: 3.6
   Unknown escapes consisting of ``'\'`` and an ASCII letter are now errors.

.. versionchanged:: 3.8
   The :samp:`'\\N\\{{name}\\}'` escape sequence has been added. As in string literals,
   it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).


.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions.
Most non-trivial applications use the compiled
form.


Flags
^^^^^

.. versionchanged:: 3.6
   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
   :class:`enum.IntFlag`.


.. class:: RegexFlag

   An :class:`enum.IntFlag` class containing the regex options listed below.

   .. versionadded:: 3.11 - added to ``__all__``

.. data:: A
          ASCII

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching. This is only
   meaningful for Unicode (str) patterns, and is ignored for bytes patterns.

   Corresponds to the inline flag ``(?a)``.

   .. note::

      The :py:const:`~re.U` flag still exists for backward compatibility,
      but is redundant in Python 3 since
      matches are Unicode by default for ``str`` patterns,
      and Unicode matching isn't allowed for bytes patterns.
      :py:const:`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant.


.. data:: DEBUG

   Display debug information about the compiled expression.

   No corresponding inline flag.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching;
   expressions like ``[A-Z]`` will also match lowercase letters.
   Full Unicode matching (such as ``Ü`` matching ``ü``)
   also works unless the :py:const:`~re.ASCII` flag
   is used to disable non-ASCII matches.
   The current locale does not change the effect of this flag
   unless the :py:const:`~re.LOCALE` flag is also used.

   Corresponds to the inline flag ``(?i)``.
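
   A short sketch of these rules, using an arbitrary sample string (the
   string ``'HELLOworld'`` and the variable names are illustrative
   assumptions)::

      import re

      text = 'HELLOworld'

      # Without the flag, [A-Z]+ stops at the first lowercase letter.
      plain = re.match(r'[A-Z]+', text).group(0)
      # With IGNORECASE, lowercase letters are matched as well.
      ignoring_case = re.match(r'[A-Z]+', text, re.IGNORECASE).group(0)
      # The inline flag (?i) at the start of the pattern is equivalent.
      inline = re.match(r'(?i)[A-Z]+', text).group(0)

      print(plain)            # HELLO
      print(ignoring_case)    # HELLOworld
      print(inline)           # HELLOworld

      # Case-insensitive matching of non-ASCII letters works for str
      # patterns by default, but not when ASCII is also given.
      unicode_fold = re.fullmatch('Ü', 'ü', re.IGNORECASE)
      ascii_fold = re.fullmatch('Ü', 'ü', re.IGNORECASE | re.ASCII)
      print(unicode_fold is not None)    # True
      print(ascii_fold is not None)      # False
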

   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
   combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
   If the :py:const:`~re.ASCII` flag is used, only letters 'a' to 'z'
   and 'A' to 'Z' are matched.

.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
   dependent on the current locale.
   This flag can be used only with bytes patterns.

   Corresponds to the inline flag ``(?L)``.

   .. warning::

      This flag is discouraged; consider Unicode matching instead.
      The locale mechanism is very unreliable
      as it only handles one "culture" at a time
      and only works with 8-bit locales.
      Unicode matching is enabled by default for Unicode (str) patterns
      and it is able to handle different locales and languages.

   .. versionchanged:: 3.6
      :py:const:`~re.LOCALE` can be used only with bytes patterns
      and is not compatible with :py:const:`~re.ASCII`.

   .. versionchanged:: 3.7
      Compiled regular expression objects with the :py:const:`~re.LOCALE` flag
      no longer depend on the locale at compile time.
      Only the locale at matching time affects the result of matching.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.
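
The difference between the default and :const:`MULTILINE` anchoring can be
sketched as follows. The two-line sample text is an assumption made for
illustration; only :func:`findall` and the flag itself come from the module:

```python
import re

text = "first line\nsecond line\n"

# Default: '^' anchors only at the very start of the string.
default_starts = re.findall(r"^\w+", text)
# MULTILINE: '^' also anchors immediately after each newline.
multiline_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Default: '$' matches at the end of the string and just before a
# trailing newline, so only the last line's final word is found.
default_ends = re.findall(r"\w+$", text)
# MULTILINE: '$' also matches just before every newline.
multiline_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(default_starts, multiline_starts)
print(default_ends, multiline_ends)
```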

   Corresponds to the inline flag ``(?m)``.

.. data:: NOFLAG

   Indicates that no flag is applied; the value is ``0``. This flag may be used
   as a default value for a function keyword argument or as a base value that
   will be conditionally ORed with other flags. Example of use as a default
   value::

      def myfunc(pattern, text, flag=re.NOFLAG):
          return re.match(pattern, text, flag)

   .. versionadded:: 3.11

.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.

   Corresponds to the inline flag ``(?s)``.


.. data:: U
          UNICODE

   In Python 3, Unicode characters are matched by default
   for ``str`` patterns.
   This flag is therefore redundant with **no effect**
   and is only kept for backward compatibility.

   See :py:const:`~re.ASCII` to restrict matching to ASCII characters instead.

.. data:: X
          VERBOSE

   .. index:: single: # (hash); in regular expressions

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class, or when preceded by an unescaped backslash,
   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``. For example, ``(? :``
   and ``* ?`` are not allowed.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.
                               # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")

   Corresponds to the inline flag ``(?x)``.


Functions
^^^^^^^^^

.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a :ref:`regular expression object
   <re-objects>`, which can be used for matching using its
   :func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
   below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the `flags`_ variables, combined using bitwise OR
   (the ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.compile` and the module-level matching functions are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :class:`~re.Match`. Return
   ``None`` if no position in the string matches the pattern; note that this is
   different from finding a zero-length match at some point in the string.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the `flags`_ variables, combined using bitwise OR
   (the ``|`` operator).


..
function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :class:`~re.Match`. Return
   ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the `flags`_ variables, combined using bitwise OR
   (the ``|`` operator).


.. function:: fullmatch(pattern, string, flags=0)

   If the whole *string* matches the regular expression *pattern*, return a
   corresponding :class:`~re.Match`. Return ``None`` if the string does not match
   the pattern; note that this is different from a zero-length match.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the `flags`_ variables, combined using bitwise OR
   (the ``|`` operator).

   .. versionadded:: 3.4


.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list.
   ::

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', maxsplit=1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string. The same holds for
   the end of the string::

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list.

   Empty matches for the pattern split the string only when not adjacent
   to a previous empty match.

   .. code:: pycon

      >>> re.split(r'\b', 'Words, words, words.')
      ['', 'Words', ', ', 'words', ', ', 'words', '.']
      >>> re.split(r'\W*', '...words...')
      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
      >>> re.split(r'(\W*)', '...words...')
      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the `flags`_ variables, combined using bitwise OR
   (the ``|`` operator).

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.7
      Added support of splitting on a pattern that could match an empty string.

   .. deprecated:: 3.13
      Passing *maxsplit* and *flags* as positional arguments is deprecated.
      In future Python versions they will be
      :ref:`keyword-only parameters <keyword-only_parameter>`.


.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings or tuples.
   The *string* is scanned left-to-right, and matches
   are returned in the order found. Empty matches are included in the result.

   The result depends on the number of capturing groups in the pattern.
   If there are no groups, return a list of strings matching the whole
   pattern. If there is exactly one group, return a list of strings
   matching that group. If multiple groups are present, return a list
   of tuples of strings matching the groups. Non-capturing groups do not
   affect the form of the result.

   >>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
   ['foot', 'fell', 'fastest']
   >>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
   [('width', '20'), ('height', '10')]

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the `flags`_ variables, combined using bitwise OR
   (the ``|`` operator).

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.


.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :class:`~re.Match` objects over
   all non-overlapping matches for the RE *pattern* in *string*. The *string*
   is scanned left-to-right, and matches are returned in the order found. Empty
   matches are included in the result.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the `flags`_ variables, combined using bitwise OR
   (the ``|`` operator).

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.


.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged.
   *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth. Unknown escapes of ASCII letters are reserved for future use and
   treated as errors. Other unknown escapes such as ``\&`` are left alone.
   Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
   For example::

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single :class:`~re.Match` argument, and returns
   the replacement string. For example::

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      ...
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or a :class:`~re.Pattern`.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
   ``'-a-b--d-'``.

   .. index:: single: \g; in regular expressions

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax.
   ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the `flags`_ variables, combined using bitwise OR
   (the ``|`` operator).

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

   .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      now are errors.

   .. versionchanged:: 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
      now are errors.
      Empty matches for the pattern are replaced when adjacent to a previous
      non-empty match.

   .. versionchanged:: 3.12
      Group *id* can only contain ASCII digits.
      In :class:`bytes` replacement strings, group *name* can only contain bytes
      in the ASCII range (``b'\x00'``-``b'\x7f'``).

   .. deprecated:: 3.13
      Passing *count* and *flags* as positional arguments is deprecated.
      In future Python versions they will be
      :ref:`keyword-only parameters <keyword-only_parameter>`.


.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the `flags`_ variables, combined using bitwise OR
   (the ``|`` operator).


.. function:: escape(pattern)

   Escape special characters in *pattern*.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it. For example::

      >>> print(re.escape('https://www.python.org'))
      https://www\.python\.org

      >>> import string
      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print('[%s]+' % re.escape(legal_chars))
      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
      /|\-|\+|\*\*|\*

   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; only backslashes should be escaped there. For example::

      >>> digits_re = r'\d+'
      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
      /usr/sbin/sendmail - \d+ errors, \d+ warnings

   .. versionchanged:: 3.3
      The ``'_'`` character is no longer escaped.

   .. versionchanged:: 3.7
      Only characters that can have special meaning in a regular expression
      are escaped. As a result, ``'!'``, ``'"'``, ``'%'``, ``"'"``, ``','``,
      ``'/'``, ``':'``, ``';'``, ``'<'``, ``'='``, ``'>'``, ``'@'``, and
      ``"`"`` are no longer escaped.


.. function:: purge()

   Clear the regular expression cache.


Exceptions
^^^^^^^^^^

.. exception:: PatternError(msg, pattern=None, pos=None)

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern. The ``PatternError`` instance has
   the following additional attributes:

   .. attribute:: msg

      The unformatted error message.

   ..
attribute:: pattern

      The regular expression pattern.

   .. attribute:: pos

      The index in *pattern* where compilation failed (may be ``None``).

   .. attribute:: lineno

      The line corresponding to *pos* (may be ``None``).

   .. attribute:: colno

      The column corresponding to *pos* (may be ``None``).

   .. versionchanged:: 3.5
      Added additional attributes.

   .. versionchanged:: 3.13
      ``PatternError`` was originally named ``error``; the latter is kept as an alias for
      backward compatibility.

.. _re-objects:

Regular Expression Objects
--------------------------

.. class:: Pattern

   Compiled regular expression object returned by :func:`re.compile`.

   .. versionchanged:: 3.9
      :py:class:`re.Pattern` supports ``[]`` to indicate a Unicode (str) or bytes pattern.
      See :ref:`types-genericalias`.

.. method:: Pattern.search(string[, pos[, endpos]])

   Scan through *string* looking for the first location where this regular
   expression produces a match, and return a corresponding :class:`~re.Match`.
   Return ``None`` if no position in the string matches the pattern; note that
   this is different from finding a zero-length match at some point in the string.

   The optional second parameter *pos* gives an index in the string where the
   search is to start; it defaults to ``0``. This is not completely equivalent to
   slicing the string; the ``'^'`` pattern character matches at the real beginning
   of the string and at positions just after a newline, but not necessarily at the
   index where the search is to start.

   The optional parameter *endpos* limits how far the string will be searched; it
   will be as if the string is *endpos* characters long, so only the characters
   from *pos* to ``endpos - 1`` will be searched for a match.
   If *endpos* is less
   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
   expression object, ``rx.search(string, 0, 50)`` is equivalent to
   ``rx.search(string[:50], 0)``. ::

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <re.Match object; span=(0, 1), match='d'>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"


.. method:: Pattern.match(string[, pos[, endpos]])

   If zero or more characters at the *beginning* of *string* match this regular
   expression, return a corresponding :class:`~re.Match`. Return ``None`` if the
   string does not match the pattern; note that this is different from a
   zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <re.Match object; span=(1, 2), match='o'>

   If you want to locate a match anywhere in *string*, use
   :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).


.. method:: Pattern.fullmatch(string[, pos[, endpos]])

   If the whole *string* matches this regular expression, return a corresponding
   :class:`~re.Match`. Return ``None`` if the string does not match the pattern;
   note that this is different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o[gh]")
      >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.fullmatch("ogre")     # No match as not the full string matches.
      >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
      <re.Match object; span=(1, 3), match='og'>

   .. versionadded:: 3.4


.. method:: Pattern.split(string, maxsplit=0)

   Identical to the :func:`split` function, using the compiled pattern.


.. method:: Pattern.findall(string[, pos[, endpos]])

   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region like for :meth:`search`.


.. method:: Pattern.finditer(string[, pos[, endpos]])

   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region like for :meth:`search`.


.. method:: Pattern.sub(repl, string, count=0)

   Identical to the :func:`sub` function, using the compiled pattern.


.. method:: Pattern.subn(repl, string, count=0)

   Identical to the :func:`subn` function, using the compiled pattern.


.. attribute:: Pattern.flags

   The regex matching flags. This is a combination of the flags given to
   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
   flags such as :py:const:`~re.UNICODE` if the pattern is a Unicode string.


.. attribute:: Pattern.groups

   The number of capturing groups in the pattern.


.. attribute:: Pattern.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers. The dictionary is empty if no symbolic groups were used in the
   pattern.


.. attribute:: Pattern.pattern

   The pattern string from which the pattern object was compiled.


.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Compiled
   regular expression objects are considered atomic.


..
_match-objects:

Match Objects
-------------

Match objects always have a boolean value of ``True``.
Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
when there is no match, you can test whether there was a match with a simple
``if`` statement::

   match = re.search(pattern, string)
   if match:
       process(match)

.. class:: Match

   Match object returned by successful ``match``\ es and ``search``\ es.

   .. versionchanged:: 3.9
      :py:class:`re.Match` supports ``[]`` to indicate a Unicode (str) or bytes match.
      See :ref:`types-genericalias`.

.. method:: Match.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`~Pattern.sub` method.
   Escapes such as ``\n`` are converted to the appropriate characters,
   and numeric backreferences (``\1``, ``\2``) and named backreferences
   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
   corresponding group. The backreference ``\g<0>`` will be
   replaced by the entire match.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

.. method:: Match.group([group1, ...])

   Returns one or more subgroups of the match. If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument. Without arguments, *group1* defaults to zero
   (the whole match is returned). If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group. If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised.
   If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned. ::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name. If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

   Named groups can also be referred to by their index::

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

   If a group matches multiple times, only the last match is accessible::

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'


.. method:: Match.__getitem__(g)

   This is identical to ``m.group(g)``. This allows easier access to
   an individual group from a match::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m[0]       # The entire match
      'Isaac Newton'
      >>> m[1]       # The first parenthesized subgroup.
      'Isaac'
      >>> m[2]       # The second parenthesized subgroup.
      'Newton'

   Named groups are supported as well::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Isaac Newton")
      >>> m['first_name']
      'Isaac'
      >>> m['last_name']
      'Newton'

   .. versionadded:: 3.6


.. method:: Match.groups(default=None)

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern. The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``.

   For example::

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

   If we make the decimal place and everything after it optional, not all groups
   might participate in the match. These groups will default to ``None`` unless
   the *default* argument is given::

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')


.. method:: Match.groupdict(default=None)

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name. The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``. For example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}


.. method:: Match.start([group])
            Match.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match.
   For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string. For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

   An example that will remove *remove_this* from email addresses::

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


.. method:: Match.span([group])

   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   that if *group* did not contribute to the match, this is ``(-1, -1)``.
   *group* defaults to zero, the entire match.


.. attribute:: Match.pos

   The value of *pos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string at which the RE engine started looking for a match.


.. attribute:: Match.endpos

   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string beyond which the RE engine will not go.


.. attribute:: Match.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


..
attribute:: Match.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.


.. attribute:: Match.re

   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
   :meth:`~Pattern.search` method produced this match instance.


.. attribute:: Match.string

   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.


.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Match objects
   are considered atomic.


.. _re-examples:

Regular Expression Examples
---------------------------


Checking for a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following::

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as follows::

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~Match.group` method of the match object in the following manner::

   >>> pair = re.compile(r".*(.).*\1")
   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'


Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf (C function)

Python does not currently have an equivalent to :c:func:`!scanf`.  Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`!scanf` format strings.  The table below offers some more-or-less
equivalent mappings between :c:func:`!scanf` format tokens and regular
expressions.


+--------------------------------+---------------------------------------------+
| :c:func:`!scanf` Token         | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`!scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings


.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers different primitive operations based on regular expressions:

+ :func:`re.match` checks for a match only at the beginning of the string
+ :func:`re.search` checks for a match anywhere in the string
  (this is what Perl does by default)
+ :func:`re.fullmatch` checks whether the entire string is a match


For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <re.Match object; span=(2, 3), match='c'>
   >>> re.fullmatch("p.*n", "python") # Match
   <re.Match object; span=(0, 6), match='python'>
   >>> re.fullmatch("r.*n", "python") # No match

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <re.Match object; span=(0, 1), match='a'>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line. ::

   >>> re.match("X", "A\nB\nX", re.MULTILINE)  # No match
   >>> re.search("^X", "A\nB\nX", re.MULTILINE)  # Match
   <re.Match object; span=(4, 5), match='X'>


Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern.  The
function is invaluable for converting textual data into data structures that can be
easily read and modified by Python, as demonstrated in the following example that
creates a phonebook.

First, here is the input.  Normally it would come from a file; here we are using
triple-quoted string syntax:

.. doctest::

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address.  We use the ``maxsplit`` parameter of :func:`split`
because the address itself contains spaces, which are also our splitting pattern:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, maxsplit=3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, maxsplit=4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]


Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.  This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   ...
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'


Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does.  For example, if a writer wanted to
find all of the adverbs in some text, they might use :func:`findall` in
the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly\b", text)
   ['carefully', 'quickly']


Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :class:`~re.Match` objects
instead of strings.
Continuing with the previous example, if a writer wanted
to find all of the adverbs *and their positions* in some text, they would use
:func:`finditer` in the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly\b", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly


Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it.  For example, the two following lines of code are
functionally identical::

   >>> re.match(r"\W(.)\1\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression.  With raw string notation, this means ``r"\\"``.  Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical::

   >>> re.match(r"\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>
   >>> re.match("\\\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>


Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
analyzes a string to categorize groups of characters.  This is a useful first
step in writing a compiler or interpreter.

The text categories are specified with regular expressions.
The technique is
to combine those into a single master regular expression and to loop over
successive matches::

    from typing import NamedTuple
    import re

    class Token(NamedTuple):
        type: str
        value: str
        line: int
        column: int

    def tokenize(code):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
            ('ASSIGN',   r':='),           # Assignment operator
            ('END',      r';'),            # Statement terminator
            ('ID',       r'[A-Za-z]+'),    # Identifiers
            ('OP',       r'[+\-*/]'),      # Arithmetic operators
            ('NEWLINE',  r'\n'),           # Line endings
            ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
            ('MISMATCH', r'.'),            # Any other character
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        line_num = 1
        line_start = 0
        for mo in re.finditer(tok_regex, code):
            kind = mo.lastgroup
            value = mo.group()
            column = mo.start() - line_start
            if kind == 'NUMBER':
                value = float(value) if '.' in value else int(value)
            elif kind == 'ID' and value in keywords:
                kind = value
            elif kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1
                continue
            elif kind == 'SKIP':
                continue
            elif kind == 'MISMATCH':
                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
            yield Token(kind, value, line_num, column)

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)

The tokenizer produces the following output::

    Token(type='IF', value='IF', line=2, column=4)
    Token(type='ID', value='quantity', line=2, column=7)
    Token(type='THEN', value='THEN', line=2, column=16)
    Token(type='ID', value='total', line=3, column=8)
    Token(type='ASSIGN', value=':=', line=3, column=14)
    Token(type='ID', value='total', line=3, column=17)
    Token(type='OP', value='+', line=3, column=23)
    Token(type='ID', value='price', line=3, column=25)
    Token(type='OP', value='*', line=3, column=31)
    Token(type='ID', value='quantity', line=3, column=33)
    Token(type='END', value=';', line=3, column=41)
    Token(type='ID', value='tax', line=4, column=8)
    Token(type='ASSIGN', value=':=', line=4, column=12)
    Token(type='ID', value='price', line=4, column=15)
    Token(type='OP', value='*', line=4, column=21)
    Token(type='NUMBER', value=0.05, line=4, column=23)
    Token(type='END', value=';', line=4, column=27)
    Token(type='ENDIF', value='ENDIF', line=5, column=4)
    Token(type='END', value=';', line=5, column=9)


.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
   Media, 2009. The third edition of the book no longer covers Python at all,
   but the first edition covered writing good regular expression patterns in
   great detail.
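
Returning to the :c:func:`!scanf` example earlier in this section, the following
is a minimal sketch of how its regular expression might be applied from Python.
The helper name ``parse_report`` and the constant ``REPORT`` are hypothetical and
exist only for this illustration. Using :meth:`~Pattern.fullmatch` is one
possible design choice, assumed here so that trailing text is rejected.

```python
import re

# Pattern taken from the scanf example; REPORT and parse_report are
# hypothetical names introduced only for this sketch.
REPORT = re.compile(r"(\S+) - (\d+) errors, (\d+) warnings")


def parse_report(line):
    """Return (filename, errors, warnings), or None if the line does not match."""
    m = REPORT.fullmatch(line)
    if m is None:
        return None
    filename, error_count, warning_count = m.groups()
    # Unlike scanf's %d, a group only captures text, so convert explicitly.
    return filename, int(error_count), int(warning_count)


print(parse_report("/usr/sbin/sendmail - 0 errors, 4 warnings"))
# ('/usr/sbin/sendmail', 0, 4)
```

Checking for ``None`` before calling :meth:`~Match.groups` avoids the
:exc:`AttributeError` shown in the pair example when a line does not match.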