• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1#
2# Secret Labs' Regular Expression Engine
3#
4# re-compatible interface for the sre matching engine
5#
6# Copyright (c) 1998-2001 by Secret Labs AB.  All rights reserved.
7#
8# This version of the SRE library can be redistributed under CNRI's
9# Python 1.6 license.  For any other use, please contact Secret Labs
10# AB (info@pythonware.com).
11#
12# Portions of this engine have been developed in cooperation with
13# CNRI.  Hewlett-Packard provided funding for 1.6 integration and
14# other compatibility work.
15#
16
17r"""Support for regular expressions (RE).
18
19This module provides regular expression matching operations similar to
20those found in Perl.  It supports both 8-bit and Unicode strings; both
21the pattern and the strings being processed can contain null bytes and
22characters outside the US ASCII range.
23
24Regular expressions can contain both special and ordinary characters.
25Most ordinary characters, like "A", "a", or "0", are the simplest
26regular expressions; they simply match themselves.  You can
27concatenate ordinary characters, so last matches the string 'last'.
28
29The special characters are:
30    "."      Matches any character except a newline.
31    "^"      Matches the start of the string.
32    "$"      Matches the end of the string or just before the newline at
33             the end of the string.
34    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
35             Greedy means that it will match as many repetitions as possible.
36    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
37    "?"      Matches 0 or 1 (greedy) of the preceding RE.
38    *?,+?,?? Non-greedy versions of the previous three special characters.
39    {m,n}    Matches from m to n repetitions of the preceding RE.
40    {m,n}?   Non-greedy version of the above.
41    "\\"     Either escapes special characters or signals a special sequence.
42    []       Indicates a set of characters.
43             A "^" as the first character indicates a complementing set.
44    "|"      A|B, creates an RE that will match either A or B.
45    (...)    Matches the RE inside the parentheses.
46             The contents can be retrieved or matched later in the string.
47    (?aiLmsux) The letters set the corresponding flags defined below.
48    (?:...)  Non-grouping version of regular parentheses.
49    (?P<name>...) The substring matched by the group is accessible by name.
50    (?P=name)     Matches the text matched earlier by the group named name.
51    (?#...)  A comment; ignored.
52    (?=...)  Matches if ... matches next, but doesn't consume the string.
53    (?!...)  Matches if ... doesn't match next.
54    (?<=...) Matches if preceded by ... (must be fixed length).
55    (?<!...) Matches if not preceded by ... (must be fixed length).
56    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
57                       the (optional) no pattern otherwise.
58
59The special sequences consist of "\\" and a character from the list
60below.  If the ordinary character is not on the list, then the
61resulting RE will match the second character.
62    \number  Matches the contents of the group of the same number.
63    \A       Matches only at the start of the string.
64    \Z       Matches only at the end of the string.
65    \b       Matches the empty string, but only at the start or end of a word.
66    \B       Matches the empty string, but not at the start or end of a word.
67    \d       Matches any decimal digit; equivalent to the set [0-9] in
68             bytes patterns or string patterns with the ASCII flag.
69             In string patterns without the ASCII flag, it will match the whole
70             range of Unicode digits.
71    \D       Matches any non-digit character; equivalent to [^\d].
72    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
73             bytes patterns or string patterns with the ASCII flag.
74             In string patterns without the ASCII flag, it will match the whole
75             range of Unicode whitespace characters.
76    \S       Matches any non-whitespace character; equivalent to [^\s].
77    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
78             in bytes patterns or string patterns with the ASCII flag.
79             In string patterns without the ASCII flag, it will match the
80             range of Unicode alphanumeric characters (letters plus digits
81             plus underscore).
82             With LOCALE, it will match the set [0-9_] plus characters defined
83             as letters for the current locale.
84    \W       Matches the complement of \w.
85    \\       Matches a literal backslash.
86
87This module exports the following functions:
88    match     Match a regular expression pattern to the beginning of a string.
89    fullmatch Match a regular expression pattern to all of a string.
90    search    Search a string for the presence of a pattern.
91    sub       Substitute occurrences of a pattern found in a string.
92    subn      Same as sub, but also return the number of substitutions made.
93    split     Split a string by the occurrences of a pattern.
94    findall   Find all occurrences of a pattern in a string.
95    finditer  Return an iterator yielding a Match object for each match.
96    compile   Compile a pattern into a Pattern object.
97    purge     Clear the regular expression cache.
98    escape    Backslash all non-alphanumerics in a string.
99
100Each function other than purge and escape can take an optional 'flags' argument
101consisting of one or more of the following module constants, joined by "|".
102A, L, and U are mutually exclusive.
103    A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
104                   match the corresponding ASCII character categories
105                   (rather than the whole Unicode categories, which is the
106                   default).
107                   For bytes patterns, this flag is the only available
108                   behaviour and needn't be specified.
109    I  IGNORECASE  Perform case-insensitive matching.
110    L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
111    M  MULTILINE   "^" matches the beginning of lines (after a newline)
112                   as well as the string.
113                   "$" matches the end of lines (before a newline) as well
114                   as the end of the string.
115    S  DOTALL      "." matches any character at all, including the newline.
116    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
117    U  UNICODE     For compatibility only. Ignored for string patterns (it
118                   is the default), and forbidden for bytes patterns.
119
120This module also defines exception 'PatternError', aliased to 'error' for
121backward compatibility.
122
123"""
124
125import enum
126from . import _compiler, _parser
127import functools
128import _sre
129
130
131# public symbols
132__all__ = [
133    "match", "fullmatch", "search", "sub", "subn", "split",
134    "findall", "finditer", "compile", "purge", "escape",
135    "error", "Pattern", "Match", "A", "I", "L", "M", "S", "X", "U",
136    "ASCII", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
137    "UNICODE", "NOFLAG", "RegexFlag", "PatternError"
138]
139
140__version__ = "2.2.1"
141
142@enum.global_enum
143@enum._simple_enum(enum.IntFlag, boundary=enum.KEEP)
144class RegexFlag:
145    NOFLAG = 0
146    ASCII = A = _compiler.SRE_FLAG_ASCII # assume ascii "locale"
147    IGNORECASE = I = _compiler.SRE_FLAG_IGNORECASE # ignore case
148    LOCALE = L = _compiler.SRE_FLAG_LOCALE # assume current 8-bit locale
149    UNICODE = U = _compiler.SRE_FLAG_UNICODE # assume unicode "locale"
150    MULTILINE = M = _compiler.SRE_FLAG_MULTILINE # make anchors look for newline
151    DOTALL = S = _compiler.SRE_FLAG_DOTALL # make dot match newline
152    VERBOSE = X = _compiler.SRE_FLAG_VERBOSE # ignore whitespace and comments
153    # sre extensions (experimental, don't rely on these)
154    DEBUG = _compiler.SRE_FLAG_DEBUG # dump pattern after compilation
155    __str__ = object.__str__
156    _numeric_repr_ = hex
157
158# sre exception
159PatternError = error = _compiler.PatternError
160
161# --------------------------------------------------------------------
162# public interface
163
164def match(pattern, string, flags=0):
165    """Try to apply the pattern at the start of the string, returning
166    a Match object, or None if no match was found."""
167    return _compile(pattern, flags).match(string)
168
169def fullmatch(pattern, string, flags=0):
170    """Try to apply the pattern to all of the string, returning
171    a Match object, or None if no match was found."""
172    return _compile(pattern, flags).fullmatch(string)
173
174def search(pattern, string, flags=0):
175    """Scan through string looking for a match to the pattern, returning
176    a Match object, or None if no match was found."""
177    return _compile(pattern, flags).search(string)
178
179class _ZeroSentinel(int):
180    pass
181_zero_sentinel = _ZeroSentinel()
182
183def sub(pattern, repl, string, *args, count=_zero_sentinel, flags=_zero_sentinel):
184    """Return the string obtained by replacing the leftmost
185    non-overlapping occurrences of the pattern in string by the
186    replacement repl.  repl can be either a string or a callable;
187    if a string, backslash escapes in it are processed.  If it is
188    a callable, it's passed the Match object and must return
189    a replacement string to be used."""
190    if args:
191        if count is not _zero_sentinel:
192            raise TypeError("sub() got multiple values for argument 'count'")
193        count, *args = args
194        if args:
195            if flags is not _zero_sentinel:
196                raise TypeError("sub() got multiple values for argument 'flags'")
197            flags, *args = args
198            if args:
199                raise TypeError("sub() takes from 3 to 5 positional arguments "
200                                "but %d were given" % (5 + len(args)))
201
202        import warnings
203        warnings.warn(
204            "'count' is passed as positional argument",
205            DeprecationWarning, stacklevel=2
206        )
207
208    return _compile(pattern, flags).sub(repl, string, count)
209sub.__text_signature__ = '(pattern, repl, string, count=0, flags=0)'
210
211def subn(pattern, repl, string, *args, count=_zero_sentinel, flags=_zero_sentinel):
212    """Return a 2-tuple containing (new_string, number).
213    new_string is the string obtained by replacing the leftmost
214    non-overlapping occurrences of the pattern in the source
215    string by the replacement repl.  number is the number of
216    substitutions that were made. repl can be either a string or a
217    callable; if a string, backslash escapes in it are processed.
218    If it is a callable, it's passed the Match object and must
219    return a replacement string to be used."""
220    if args:
221        if count is not _zero_sentinel:
222            raise TypeError("subn() got multiple values for argument 'count'")
223        count, *args = args
224        if args:
225            if flags is not _zero_sentinel:
226                raise TypeError("subn() got multiple values for argument 'flags'")
227            flags, *args = args
228            if args:
229                raise TypeError("subn() takes from 3 to 5 positional arguments "
230                                "but %d were given" % (5 + len(args)))
231
232        import warnings
233        warnings.warn(
234            "'count' is passed as positional argument",
235            DeprecationWarning, stacklevel=2
236        )
237
238    return _compile(pattern, flags).subn(repl, string, count)
239subn.__text_signature__ = '(pattern, repl, string, count=0, flags=0)'
240
241def split(pattern, string, *args, maxsplit=_zero_sentinel, flags=_zero_sentinel):
242    """Split the source string by the occurrences of the pattern,
243    returning a list containing the resulting substrings.  If
244    capturing parentheses are used in pattern, then the text of all
245    groups in the pattern are also returned as part of the resulting
246    list.  If maxsplit is nonzero, at most maxsplit splits occur,
247    and the remainder of the string is returned as the final element
248    of the list."""
249    if args:
250        if maxsplit is not _zero_sentinel:
251            raise TypeError("split() got multiple values for argument 'maxsplit'")
252        maxsplit, *args = args
253        if args:
254            if flags is not _zero_sentinel:
255                raise TypeError("split() got multiple values for argument 'flags'")
256            flags, *args = args
257            if args:
258                raise TypeError("split() takes from 2 to 4 positional arguments "
259                                "but %d were given" % (4 + len(args)))
260
261        import warnings
262        warnings.warn(
263            "'maxsplit' is passed as positional argument",
264            DeprecationWarning, stacklevel=2
265        )
266
267    return _compile(pattern, flags).split(string, maxsplit)
268split.__text_signature__ = '(pattern, string, maxsplit=0, flags=0)'
269
270def findall(pattern, string, flags=0):
271    """Return a list of all non-overlapping matches in the string.
272
273    If one or more capturing groups are present in the pattern, return
274    a list of groups; this will be a list of tuples if the pattern
275    has more than one group.
276
277    Empty matches are included in the result."""
278    return _compile(pattern, flags).findall(string)
279
280def finditer(pattern, string, flags=0):
281    """Return an iterator over all non-overlapping matches in the
282    string.  For each match, the iterator returns a Match object.
283
284    Empty matches are included in the result."""
285    return _compile(pattern, flags).finditer(string)
286
287def compile(pattern, flags=0):
288    "Compile a regular expression pattern, returning a Pattern object."
289    return _compile(pattern, flags)
290
291def purge():
292    "Clear the regular expression caches"
293    _cache.clear()
294    _cache2.clear()
295    _compile_template.cache_clear()
296
297
298# SPECIAL_CHARS
299# closing ')', '}' and ']'
300# '-' (a range in character set)
301# '&', '~', (extended character set operations)
302# '#' (comment) and WHITESPACE (ignored) in verbose mode
303_special_chars_map = {i: '\\' + chr(i) for i in b'()[]{}?*+-|^$\\.&~# \t\n\r\v\f'}
304
305def escape(pattern):
306    """
307    Escape special characters in a string.
308    """
309    if isinstance(pattern, str):
310        return pattern.translate(_special_chars_map)
311    else:
312        pattern = str(pattern, 'latin1')
313        return pattern.translate(_special_chars_map).encode('latin1')
314
315Pattern = type(_compiler.compile('', 0))
316Match = type(_compiler.compile('', 0).match(''))
317
318# --------------------------------------------------------------------
319# internals
320
321# Use the fact that dict keeps the insertion order.
322# _cache2 uses the simple FIFO policy which has better latency.
323# _cache uses the LRU policy which has better hit rate.
324_cache = {}  # LRU
325_cache2 = {}  # FIFO
326_MAXCACHE = 512
327_MAXCACHE2 = 256
328assert _MAXCACHE2 < _MAXCACHE
329
330def _compile(pattern, flags):
331    # internal: compile pattern
332    if isinstance(flags, RegexFlag):
333        flags = flags.value
334    try:
335        return _cache2[type(pattern), pattern, flags]
336    except KeyError:
337        pass
338
339    key = (type(pattern), pattern, flags)
340    # Item in _cache should be moved to the end if found.
341    p = _cache.pop(key, None)
342    if p is None:
343        if isinstance(pattern, Pattern):
344            if flags:
345                raise ValueError(
346                    "cannot process flags argument with a compiled pattern")
347            return pattern
348        if not _compiler.isstring(pattern):
349            raise TypeError("first argument must be string or compiled pattern")
350        p = _compiler.compile(pattern, flags)
351        if flags & DEBUG:
352            return p
353        if len(_cache) >= _MAXCACHE:
354            # Drop the least recently used item.
355            # next(iter(_cache)) is known to have linear amortized time,
356            # but it is used here to avoid a dependency from using OrderedDict.
357            # For the small _MAXCACHE value it doesn't make much of a difference.
358            try:
359                del _cache[next(iter(_cache))]
360            except (StopIteration, RuntimeError, KeyError):
361                pass
362    # Append to the end.
363    _cache[key] = p
364
365    if len(_cache2) >= _MAXCACHE2:
366        # Drop the oldest item.
367        try:
368            del _cache2[next(iter(_cache2))]
369        except (StopIteration, RuntimeError, KeyError):
370            pass
371    _cache2[key] = p
372    return p
373
374@functools.lru_cache(_MAXCACHE)
375def _compile_template(pattern, repl):
376    # internal: compile replacement pattern
377    return _sre.template(pattern, _parser.parse_template(repl, pattern))
378
379# register myself for pickling
380
381import copyreg
382
383def _pickle(p):
384    return _compile, (p.pattern, p.flags)
385
386copyreg.pickle(Pattern, _pickle, _compile)
387
388# --------------------------------------------------------------------
389# experimental stuff (see python-dev discussions for details)
390
391class Scanner:
392    def __init__(self, lexicon, flags=0):
393        from ._constants import BRANCH, SUBPATTERN
394        if isinstance(flags, RegexFlag):
395            flags = flags.value
396        self.lexicon = lexicon
397        # combine phrases into a compound pattern
398        p = []
399        s = _parser.State()
400        s.flags = flags
401        for phrase, action in lexicon:
402            gid = s.opengroup()
403            p.append(_parser.SubPattern(s, [
404                (SUBPATTERN, (gid, 0, 0, _parser.parse(phrase, flags))),
405                ]))
406            s.closegroup(gid, p[-1])
407        p = _parser.SubPattern(s, [(BRANCH, (None, p))])
408        self.scanner = _compiler.compile(p)
409    def scan(self, string):
410        result = []
411        append = result.append
412        match = self.scanner.scanner(string).match
413        i = 0
414        while True:
415            m = match()
416            if not m:
417                break
418            j = m.end()
419            if i == j:
420                break
421            action = self.lexicon[m.lastindex-1][1]
422            if callable(action):
423                self.match = m
424                action = action(self, m.group())
425            if action is not None:
426                append(action)
427            i = j
428        return result, string[i:]
429