:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>


This module provides regular expression matching operations similar to
those found in Perl. Both patterns and strings to be searched can be
Unicode strings as well as 8-bit strings.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.

It is important to note that most regular expression operations are available as
module-level functions and :class:`RegexObject` methods. The functions are
shortcuts that don't require you to compile a regex object first, but miss some
fine-tuning parameters.

.. seealso::

   The third-party `regex <https://pypi.org/project/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contain low precedence
operations; boundary conditions between *A* and *B*; or have numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book referenced
below, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write RE's in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using the ``\number`` notation, e.g., ``'\x00'``.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


The special characters are:

``'.'``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``'^'``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``'$'``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.

``'*'``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

``'+'``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

``'?'``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.
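
   As a rough illustration of the ``'*'``, ``'+'`` and ``'?'`` qualifiers
   described above, the sketch below matches each pattern against two made-up
   sample strings. :func:`re.match` is used so that matching always starts at
   the beginning of the string; the variable names are only for this example::

      import re

      # ``ab*``: 'a' followed by zero or more 'b's.
      star_many = re.match(r'ab*', 'abbbc').group(0)      # 'abbb'
      star_none = re.match(r'ab*', 'ac').group(0)         # 'a'

      # ``ab+``: at least one 'b' is required after the 'a'.
      plus_many = re.match(r'ab+', 'abbbc').group(0)      # 'abbb'
      plus_none = re.match(r'ab+', 'ac')                  # None

      # ``ab?``: at most one 'b' is consumed.
      optional_one = re.match(r'ab?', 'abbbc').group(0)   # 'ab'
      optional_none = re.match(r'ab?', 'ac').group(0)     # 'a'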

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``<a> b <c>``, it will match the entire
   string, and not just ``<a>``. Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``<a>``.

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

``'\'``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.
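
   A small sketch of escaping, assuming a made-up sample string: preceded by
   a backslash, ``'*'`` and ``'?'`` match themselves instead of acting as
   qualifiers. Raw strings are used so that Python passes the backslashes
   through unchanged::

      import re

      text = 'what? 2*3'

      # Escaped, the special characters are matched literally.
      star = re.search(r'2\*3', text).group(0)         # '2*3'
      question = re.search(r'what\?', text).group(0)   # 'what?'

      # Unescaped, ``2*3`` means "zero or more '2's followed by '3'",
      # so only the final '3' of the sample string is matched.
      unescaped = re.search(r'2*3', text).group(0)     # '3'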

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be repeated twice. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
     it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`LOCALE` or :const:`UNICODE` mode is in force.

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, ``[()[\]{}]`` and
     ``[]()[{}]`` will both match a parenthesis.

``'|'``
   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
   will match either A or B. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once ``A`` matches, ``B`` will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?iLmsux)``
   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
   ``'u'``, ``'x'``.) The group matches the empty string; the letters
   set the corresponding flags: :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flag* argument to the
   :func:`re.compile` function.

   Note that the ``(?x)`` flag changes how the expression is parsed. It should be
   used first in the expression string, or after one or more whitespace characters.
   If there are non-whitespace characters before the flag, the results are
   undefined.

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object ``m``    | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the ``repl``    | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Group
   references are not supported even if they match strings of some fixed length.
   Note that patterns which start with positive lookbehind assertions will not
   match at the beginning of the string being searched; you will most likely want
   to use the :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search('(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length and shouldn't contain group references.
   Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or *name*
   exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
   can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
   matching pattern, which will match with ``'<user@host.com>'`` as well as
   ``'user@host.com'``, but not with ``'<user@host.com'``.

   .. versionadded:: 2.4

The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not on the list, then the resulting RE will match
the second character. For example, ``\$`` matches the character ``'$'``.

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word. A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
   a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
   of the string, so the precise set of characters deemed to be alphanumeric
   depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
   For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end of a
   word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
   but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so is also subject to the settings
   of ``LOCALE`` and ``UNICODE``.

``\d``
   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
   is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
   whatever is classified as a decimal digit in the Unicode character properties
   database.
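
   For instance, on plain ASCII text ``\d+`` and ``[0-9]+`` find the same
   runs of digits. The sample string and variable names below are invented
   for this sketch, and only ASCII input is used so that the result does not
   depend on the :const:`UNICODE` setting::

      import re

      text = 'Order 66 shipped 12 boxes'

      # Both patterns pick out the same runs of decimal digits.
      digits = re.findall(r'\d+', text)        # ['66', '12']
      explicit = re.findall(r'[0-9]+', text)   # ['66', '12']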

``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
   will match anything other than characters marked as digits in the Unicode
   character properties database.

``\s``
   When the :const:`UNICODE` flag is not specified, it matches any whitespace
   character; this is equivalent to the set ``[ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on matching of the space.
   If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
   plus whatever is classified as space in the Unicode character properties
   database.

``\S``
   When the :const:`UNICODE` flag is not specified, matches any non-whitespace
   character; this is equivalent to the set ``[^ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on non-whitespace match. If
   :const:`UNICODE` is set, then any character not marked as space in the
   Unicode character properties database is matched.


``\w``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any alphanumeric character and the underscore; this is equivalent to the set
   ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
   whatever characters are defined as alphanumeric for the current locale. If
   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
   is classified as alphanumeric in the Unicode character properties database.

``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
   this will match anything other than ``[0-9_]`` plus characters classified as
   not alphanumeric in the Unicode character properties database.

``\Z``
   Matches only at the end of the string.

If both the :const:`LOCALE` and :const:`UNICODE` flags are included for a
particular sequence, then the :const:`LOCALE` flag takes effect first, followed
by the :const:`UNICODE` flag.

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \v      \x
   \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

Octal escapes are included in a limited form: If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

.. seealso::

   Mastering Regular Expressions
      Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
      second edition of the book no longer covers Python at all, but the first
      edition covered writing good regular expression patterns in great detail.



.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications always use the compiled
form.


.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a regular expression object, which
   can be used for matching using its :func:`~RegexObject.match` and
   :func:`~RegexObject.search` methods, described below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


.. data:: DEBUG

   Display debug information about the compiled expression.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
   lowercase letters, too. This is not affected by the current locale. To
   get this effect on non-ASCII Unicode characters such as ``ü`` and ``Ü``,
   add the :const:`UNICODE` flag.


.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
   current locale.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.


.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.
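
   A minimal sketch of how this flag changes ``'.'``, and of combining flags
   with bitwise OR as described for :func:`compile`. The two-line sample
   string and the variable names are made up for illustration::

      import re

      text = 'first line\nsecond line'

      # By default '.' stops at the newline.
      default = re.match(r'.*', text).group(0)             # 'first line'

      # With DOTALL, '.' also matches the newline, so the whole string matches.
      dotall = re.match(r'.*', text, re.DOTALL).group(0)

      # Flags combine with '|': MULTILINE lets '^' match just after the
      # newline, and IGNORECASE lets 'SECOND' match 'second'.
      combined = re.search(r'^SECOND', text, re.MULTILINE | re.IGNORECASE)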


.. data:: U
          UNICODE

   Make the ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   sequences dependent on the Unicode character properties database. Also
   enables non-ASCII matching for :const:`IGNORECASE`.

   .. versionadded:: 2.0


.. data:: X
          VERBOSE

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class, or when preceded by an unescaped backslash,
   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")


.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :class:`MatchObject`
   instance. Return ``None`` if no position in the string matches the pattern; note
   that this is different from finding a zero-length match at some point in the
   string.


.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :class:`MatchObject` instance.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.
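
   A short sketch of these return values, using an invented sample string:
   :func:`match` only succeeds at the start of the string, :func:`search`
   looks at every position, and a zero-length match is still a match object
   rather than ``None``::

      import re

      text = 'spam and eggs'

      at_start = re.match(r'spam', text)       # match object
      not_at_start = re.match(r'eggs', text)   # None: 'eggs' is not at index 0
      found_later = re.search(r'eggs', text)   # match object starting at index 9

      # ``x*`` can match zero characters, so this is an empty match, not None.
      empty = re.match(r'x*', text)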

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).


.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored. This has been fixed in later releases.)

      >>> re.split('\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split('(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split('\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string. The same holds for
   the end of the string:

      >>> re.split('(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list (e.g., if there's one capturing group
   in the separator, the 0th, the 2nd and so forth).

   Note that *split* will never split a string on an empty pattern match.
   For example:

      >>> re.split('x*', 'foo')
      ['foo']
      >>> re.split("(?m)^$", "foo\n\nbar\n")
      ['foo\n\nbar\n']

   .. versionchanged:: 2.7
      Added the optional flags argument.



.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings. The *string* is scanned left-to-right, and matches are returned in
   the order found. If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group. Empty matches are included in the result.

   .. note::

      Due to the limitation of the current implementation the character
      following an empty match is not included in a next match, so
      ``findall(r'^|\w+', 'two words')`` returns ``['', 'wo', 'words']``
      (note missed "t"). This is changed in Python 3.7.

   .. versionadded:: 1.5.2

   .. versionchanged:: 2.4
      Added the optional flags argument.


.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :class:`MatchObject` instances over all
   non-overlapping matches for the RE *pattern* in *string*. The *string* is
   scanned left-to-right, and matches are returned in the order found. Empty
   matches are included in the result. See also the note about :func:`findall`.

   .. versionadded:: 2.2

   .. versionchanged:: 2.4
      Added the optional flags argument.


.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged. *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
658 For example: 659 660 >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):', 661 ... r'static PyObject*\npy_\1(void)\n{', 662 ... 'def myfunc():') 663 'static PyObject*\npy_myfunc(void)\n{' 664 665 If *repl* is a function, it is called for every non-overlapping occurrence of 666 *pattern*. The function takes a single match object argument, and returns the 667 replacement string. For example: 668 669 >>> def dashrepl(matchobj): 670 ... if matchobj.group(0) == '-': return ' ' 671 ... else: return '-' 672 >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files') 673 'pro--gram files' 674 >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE) 675 'Baked Beans & Spam' 676 677 The pattern may be a string or an RE object. 678 679 The optional argument *count* is the maximum number of pattern occurrences to be 680 replaced; *count* must be a non-negative integer. If omitted or zero, all 681 occurrences will be replaced. Empty matches for the pattern are replaced only 682 when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns 683 ``'-a-b-c-'``. 684 685 In string-type *repl* arguments, in addition to the character escapes and 686 backreferences described above, 687 ``\g<name>`` will use the substring matched by the group named ``name``, as 688 defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding 689 group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous 690 in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a 691 reference to group 20, not a reference to group 2 followed by the literal 692 character ``'0'``. The backreference ``\g<0>`` substitutes in the entire 693 substring matched by the RE. 694 695 .. versionchanged:: 2.7 696 Added the optional flags argument. 697 698 699.. function:: subn(pattern, repl, string, count=0, flags=0) 700 701 Perform the same operation as :func:`sub`, but return a tuple ``(new_string, 702 number_of_subs_made)``. 703 704 .. 
   .. versionchanged:: 2.7
      Added the optional flags argument.


.. function:: escape(pattern)

   Escape all the characters in *pattern* except ASCII letters and numbers.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it. For example::

      >>> print re.escape('python.exe')
      python\.exe

      >>> import string
      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print '[%s]+' % re.escape(legal_chars)
      [abcdefghijklmnopqrstuvwxyz0123456789\!\#\$\%\&\'\*\+\-\.\^\_\`\|\~\:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print '|'.join(map(re.escape, sorted(operators, reverse=True)))
      \/|\-|\+|\*\*|\*


.. function:: purge()

   Clear the regular expression cache.


.. exception:: error

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern.


.. _re-objects:

Regular Expression Objects
--------------------------

.. class:: RegexObject

   The :class:`RegexObject` class supports the following methods and attributes:

   .. method:: RegexObject.search(string[, pos[, endpos]])

      Scan through *string* looking for a location where this regular expression
      produces a match, and return a corresponding :class:`MatchObject` instance.
      Return ``None`` if no position in the string matches the pattern; note that this
      is different from finding a zero-length match at some point in the string.

      The optional second parameter *pos* gives an index in the string where the
      search is to start; it defaults to ``0``.
      This is not completely equivalent to
      slicing the string; the ``'^'`` pattern character matches at the real beginning
      of the string and at positions just after a newline, but not necessarily at the
      index where the search is to start.

      The optional parameter *endpos* limits how far the string will be searched; it
      will be as if the string is *endpos* characters long, so only the characters
      from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
      than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
      expression object, ``rx.search(string, 0, 50)`` is equivalent to
      ``rx.search(string[:50], 0)``.

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <_sre.SRE_Match object at ...>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"


   .. method:: RegexObject.match(string[, pos[, endpos]])

      If zero or more characters at the *beginning* of *string* match this regular
      expression, return a corresponding :class:`MatchObject` instance. Return
      ``None`` if the string does not match the pattern; note that this is different
      from a zero-length match.

      The optional *pos* and *endpos* parameters have the same meaning as for the
      :meth:`~RegexObject.search` method.

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <_sre.SRE_Match object at ...>

      If you want to locate a match anywhere in *string*, use
      :meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).


   .. method:: RegexObject.split(string, maxsplit=0)

      Identical to the :func:`split` function, using the compiled pattern.


   ..
   .. method:: RegexObject.findall(string[, pos[, endpos]])

      Similar to the :func:`findall` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region in the same way as for :meth:`~RegexObject.match`.


   .. method:: RegexObject.finditer(string[, pos[, endpos]])

      Similar to the :func:`finditer` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region in the same way as for :meth:`~RegexObject.match`.


   .. method:: RegexObject.sub(repl, string, count=0)

      Identical to the :func:`sub` function, using the compiled pattern.


   .. method:: RegexObject.subn(repl, string, count=0)

      Identical to the :func:`subn` function, using the compiled pattern.


   .. attribute:: RegexObject.flags

      The regex matching flags. This is a combination of the flags given to
      :func:`.compile` and any ``(?...)`` inline flags in the pattern.


   .. attribute:: RegexObject.groups

      The number of capturing groups in the pattern.


   .. attribute:: RegexObject.groupindex

      A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
      numbers. The dictionary is empty if no symbolic groups were used in the
      pattern.


   .. attribute:: RegexObject.pattern

      The pattern string from which the RE object was compiled.


.. _match-objects:

Match Objects
-------------

.. class:: MatchObject

   Match objects always have a boolean value of ``True``.
   Since :meth:`~RegexObject.match` and :meth:`~RegexObject.search` return ``None``
   when there is no match, you can test whether there was a match with a simple
   ``if`` statement::

      match = re.search(pattern, string)
      if match:
          process(match)

   Match objects support the following methods and attributes:


   ..
   .. method:: MatchObject.expand(template)

      Return the string obtained by doing backslash substitution on the template
      string *template*, as done by the :meth:`~RegexObject.sub` method. Escapes
      such as ``\n`` are converted to the appropriate characters, and numeric
      backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
      ``\g<name>``) are replaced by the contents of the corresponding group.


   .. method:: MatchObject.group([group1, ...])

      Return one or more subgroups of the match. If there is a single argument, the
      result is a single string; if there are multiple arguments, the result is a
      tuple with one item per argument. Without arguments, *group1* defaults to zero
      (the whole match is returned). If a *groupN* argument is zero, the corresponding
      return value is the entire matching string; if it is in the inclusive range
      [1..99], it is the string matching the corresponding parenthesized group. If a
      group number is negative or larger than the number of groups defined in the
      pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
      part of the pattern that did not match, the corresponding result is ``None``.
      If a group is contained in a part of the pattern that matched multiple times,
      the last match is returned.

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

      If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
      arguments may also be strings identifying groups by their group name. If a
      string argument is not used as a group name in the pattern, an :exc:`IndexError`
      exception is raised.
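
      The lookup rules for names and numbers can be checked with a short sketch.
      The pattern, the subject string and the group name ``missing`` are made up
      for illustration; the snippet relies only on the behaviour of
      :meth:`~MatchObject.group` described here::

         import re

         # Illustrative pattern with two named groups (names chosen for this sketch).
         m = re.match(r"(?P<key>\w+)=(?P<value>\w+)", "color=blue")

         by_name = m.group("key", "value")   # look the groups up by name
         by_index = m.group(1, 2)            # the same groups, by number

         try:
             m.group("missing")              # hypothetical name, not in the pattern
         except IndexError:
             unknown_name_raises = True
         else:
             unknown_name_raises = False

      Both lookups return the same tuple, and the unknown name ends up in the
      :exc:`IndexError` branch.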

      A moderately complicated example:

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

      Named groups can also be referred to by their index:

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

      If a group matches multiple times, only the last match is accessible:

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'


   .. method:: MatchObject.groups([default])

      Return a tuple containing all the subgroups of the match, from 1 up to however
      many groups are in the pattern. The *default* argument is used for groups that
      did not participate in the match; it defaults to ``None``. (Incompatibility
      note: in the original Python 1.5 release, if the tuple was one element long, a
      string would be returned instead. In later versions (from 1.5.1 on), a
      singleton tuple is returned in such cases.)

      For example:

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

      If we make the decimal place and everything after it optional, not all groups
      might participate in the match. These groups will default to ``None`` unless
      the *default* argument is given:

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')


   .. method:: MatchObject.groupdict([default])

      Return a dictionary containing all the *named* subgroups of the match, keyed by
      the subgroup name. The *default* argument is used for groups that did not
      participate in the match; it defaults to ``None``.
      For example:

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}


   .. method:: MatchObject.start([group])
               MatchObject.end([group])

      Return the indices of the start and end of the substring matched by *group*;
      *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
      *group* exists but did not contribute to the match. For a match object *m*, and
      a group *g* that did contribute to the match, the substring matched by group *g*
      (equivalent to ``m.group(g)``) is ::

         m.string[m.start(g):m.end(g)]

      Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
      null string. For example, after ``m = re.search('b(c?)', 'cba')``,
      ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
      2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

      An example that will remove *remove_this* from email addresses:

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


   .. method:: MatchObject.span([group])

      For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
      m.end(group))``. Note that if *group* did not contribute to the match, this is
      ``(-1, -1)``. *group* defaults to zero, the entire match.


   .. attribute:: MatchObject.pos

      The value of *pos* which was passed to the :meth:`~RegexObject.search` or
      :meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
      index into the string at which the RE engine started looking for a match.


   .. attribute:: MatchObject.endpos

      The value of *endpos* which was passed to the :meth:`~RegexObject.search` or
      :meth:`~RegexObject.match` method of the :class:`RegexObject`.
      This is the
      index into the string beyond which the RE engine will not go.


   .. attribute:: MatchObject.lastindex

      The integer index of the last matched capturing group, or ``None`` if no group
      was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
      ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
      the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
      string.


   .. attribute:: MatchObject.lastgroup

      The name of the last matched capturing group, or ``None`` if the group didn't
      have a name, or if no group was matched at all.


   .. attribute:: MatchObject.re

      The regular expression object whose :meth:`~RegexObject.match` or
      :meth:`~RegexObject.search` method produced this :class:`MatchObject`
      instance.


   .. attribute:: MatchObject.string

      The string passed to :meth:`~RegexObject.match` or
      :meth:`~RegexObject.search`.


Examples
--------


Checking For a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully:

.. testcode::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following:

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two cards of the same value.
To match this with a regular expression, one could use backreferences as follows:

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
manner:

.. doctest::

   >>> pair.match("717ak").group(1)
   '7'

   # Error because re.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       re.match(r".*(.).*\1", "718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'


Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings


.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr.
   <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <_sre.SRE_Match object at ...>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <_sre.SRE_Match object at ...>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line.

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <_sre.SRE_Match object at ...>


Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern. The
function is invaluable for converting textual data into data structures that can be
easily read and modified by Python, as demonstrated in the following example that
creates a phonebook.

First, here is the input. Normally it would come from a file; here we use
triple-quoted string syntax:

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines.
Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]


Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.
This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'


Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does. For example, if a writer wanted to
find all of the adverbs in some text, they might use :func:`findall` in
the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']


Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides instances of
:class:`MatchObject` instead of strings. Continuing with the previous example,
if a writer wanted to find all of the adverbs *and their positions*
in some text, they would use :func:`finditer` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
   07-16: carefully
   40-47: quickly


Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane.
Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it. For example, the two following lines of code are
functionally identical:

   >>> re.match(r"\W(.)\1\W", " ff ")
   <_sre.SRE_Match object at ...>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <_sre.SRE_Match object at ...>

When one wants to match a literal backslash, it must be escaped in the regular
expression. With raw string notation, this means ``r"\\"``. Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical:

   >>> re.match(r"\\", r"\\")
   <_sre.SRE_Match object at ...>
   >>> re.match("\\\\", r"\\")
   <_sre.SRE_Match object at ...>
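
The equivalence can also be checked at the string level, before the regular
expression engine is involved. The following is a minimal sketch; the variable
names are chosen here for illustration only::

   import re

   raw = r"\W(.)\1\W"
   escaped = "\\W(.)\\1\\W"

   # Both literals spell out exactly the same characters.
   same_pattern = (raw == escaped)

   # r"\\" holds two backslash characters; "\\" holds only one.
   raw_backslash_len = len(r"\\")
   plain_backslash_len = len("\\")

   # As a pattern, r"\\" matches a single literal backslash in the subject.
   matched = re.match(r"\\", "\\n").group(0)

Comparing the strings directly is a quick way to confirm that an escaped
pattern and its raw-string spelling are the same before passing either one to
:func:`match` or :func:`search`.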