# -*- coding: UTF-8 -*-
# BASED-ON: https://github.com/r1chardj0n3s/parse/parse.py
# VERSION: parse 1.12.0
# Same as original parse modules.
#
# pylint: disable=line-too-long, invalid-name, too-many-locals, too-many-arguments
# pylint: disable=redefined-builtin, too-few-public-methods, no-else-return
# pylint: disable=unused-variable, no-self-use, missing-docstring
# pylint: disable=unused-argument, unused-variable
# pylint: disable=too-many-branches, too-many-statements
# pylint: disable=all
#
# -- ORIGINAL-CODE STARTS-HERE ------------------------------------------------
r'''Parse strings using a specification based on the Python format() syntax.

   ``parse()`` is the opposite of ``format()``

The module is set up to only export ``parse()``, ``search()``, ``findall()``,
and ``with_pattern()`` when ``import \*`` is used:

>>> from parse import *

From there it's a simple thing to parse a string:

>>> parse("It's {}, I love it!", "It's spam, I love it!")
<Result ('spam',) {}>
>>> _[0]
'spam'

Or to search a string for some pattern:

>>> search('Age: {:d}\n', 'Name: Rufus\nAge: 42\nColor: red\n')
<Result (42,) {}>

Or find all the occurrences of some pattern in a string:

>>> ''.join(r.fixed[0] for r in findall(">{}<", "<p>the <b>bold</b> text</p>"))
'the bold text'

If you're going to use the same pattern to match lots of strings you can
compile it once:

>>> from parse import compile
>>> p = compile("It's {}, I love it!")
>>> print(p)
<Parser "It's {}, I love it!">
>>> p.parse("It's spam, I love it!")
<Result ('spam',) {}>

("compile" is not exported for ``import *`` usage as it would override the
built-in ``compile()`` function)

The default behaviour is to match strings case insensitively.
You may match with case by specifying `case_sensitive=True`:

>>> parse('SPAM', 'spam', case_sensitive=True) is None
True


Format Syntax
-------------

A basic version of the `Format String Syntax`_ is supported with anonymous
(fixed-position), named and formatted fields::

   {[field name]:[format spec]}

Field names must be valid Python identifiers, including dotted names;
element indexes imply dictionaries (see below for example).

Numbered fields are not supported: the result of parsing will include
the parsed fields in the order they are parsed.

The conversion of fields to types other than strings is done based on the
type in the format specification, which mirrors the ``format()`` behaviour.
There are no "!" field conversions like ``format()`` has.

Some simple parse() format string examples:

>>> parse("Bring me a {}", "Bring me a shrubbery")
<Result ('shrubbery',) {}>
>>> r = parse("The {} who say {}", "The knights who say Ni!")
>>> print(r)
<Result ('knights', 'Ni!') {}>
>>> print(r.fixed)
('knights', 'Ni!')
>>> r = parse("Bring out the holy {item}", "Bring out the holy hand grenade")
>>> print(r)
<Result () {'item': 'hand grenade'}>
>>> print(r.named)
{'item': 'hand grenade'}
>>> print(r['item'])
hand grenade
>>> 'item' in r
True

Note that `in` only works if you have named fields.
Dotted names and indexes are possible
though the application must make additional sense of the result:

>>> r = parse("Mmm, {food.type}, I love it!", "Mmm, spam, I love it!")
>>> print(r)
<Result () {'food.type': 'spam'}>
>>> print(r.named)
{'food.type': 'spam'}
>>> print(r['food.type'])
spam
>>> r = parse("My quest is {quest[name]}", "My quest is to seek the holy grail!")
>>> print(r)
<Result () {'quest': {'name': 'to seek the holy grail!'}}>
>>> print(r['quest'])
{'name': 'to seek the holy grail!'}
>>> print(r['quest']['name'])
to seek the holy grail!

If the text you're matching has braces in it you can match those by including
a double-brace ``{{`` or ``}}`` in your format string, just like format() does.


Format Specification
--------------------

Most often a straight format-less ``{}`` will suffice where a more complex
format specification might have been used.

Most of `format()`'s `Format Specification Mini-Language`_ is supported:

   [[fill]align][0][width][.precision][type]

The differences between `parse()` and `format()` are:

- The align operators will cause spaces (or specified fill character) to be
  stripped from the parsed value. The width is not enforced; it just indicates
  there may be whitespace or "0"s to strip.
- Numeric parsing will automatically handle a "0b", "0o" or "0x" prefix.
  That is, the "#" format character is handled automatically by d, b, o
  and x formats. For "d" any will be accepted, but for the others the correct
  prefix must be present if at all.
- Numeric sign is handled automatically.
- The thousands separator is handled automatically if the "n" type is used.
- The types supported are a slightly different mix to the format() types. Some
  format() types come directly over: "d", "n", "%", "f", "e", "b", "o" and "x".
  In addition some regular expression character group types "D", "w", "W", "s"
  and "S" are also available.
- The "e" and "g" types are case-insensitive so there is no need for
  the "E" or "G" types.

===== =========================================== ========
Type  Characters Matched                          Output
===== =========================================== ========
l     Letters (ASCII)                             str
w     Letters, numbers and underscore             str
W     Not letters, numbers and underscore         str
s     Whitespace                                  str
S     Non-whitespace                              str
d     Digits (effectively integer numbers)        int
D     Non-digit                                   str
n     Numbers with thousands separators (, or .)  int
%     Percentage (converted to value/100.0)       float
f     Fixed-point numbers                         float
F     Decimal numbers                             Decimal
e     Floating-point numbers with exponent        float
      e.g. 1.1e-10, NAN (all case insensitive)
g     General number format (either d, f or e)    float
b     Binary numbers                              int
o     Octal numbers                               int
x     Hexadecimal numbers (lower and upper case)  int
ti    ISO 8601 format date/time                   datetime
      e.g. 1972-01-20T10:21:36Z ("T" and "Z"
      optional)
te    RFC2822 e-mail format date/time             datetime
      e.g. Mon, 20 Jan 1972 10:21:36 +1000
tg    Global (day/month) format date/time         datetime
      e.g. 20/1/1972 10:21:36 AM +1:00
ta    US (month/day) format date/time             datetime
      e.g. 1/20/1972 10:21:36 PM +10:30
tc    ctime() format date/time                    datetime
      e.g. Sun Sep 16 01:03:52 1973
th    HTTP log format date/time                   datetime
      e.g. 21/Nov/2011:00:07:11 +0000
ts    Linux system log format date/time           datetime
      e.g. Nov 9 03:37:44
tt    Time                                        time
      e.g. 10:21:36 PM -5:30
===== =========================================== ========

Some examples of typed parsing with ``None`` returned if the typing
does not match:

>>> parse('Our {:d} {:w} are...', 'Our 3 weapons are...')
<Result (3, 'weapons') {}>
>>> parse('Our {:d} {:w} are...', 'Our three weapons are...')
>>> parse('Meet at {:tg}', 'Meet at 1/2/2011 11:00 PM')
<Result (datetime.datetime(2011, 2, 1, 23, 0),) {}>

And messing about with alignment:

>>> parse('with {:>} herring', 'with a herring')
<Result ('a',) {}>
>>> parse('spam {:^} spam', 'spam lovely spam')
<Result ('lovely',) {}>

Note that the "center" alignment does not test to make sure the value is
centered - it just strips leading and trailing whitespace.

Width and precision may be used to restrict the size of matched text
from the input. Width specifies a minimum size and precision specifies
a maximum. For example:

>>> parse('{:.2}{:.2}', 'look') # specifying precision
<Result ('lo', 'ok') {}>
>>> parse('{:4}{:4}', 'look at that') # specifying width
<Result ('look', 'at that') {}>
>>> parse('{:4}{:.4}', 'look at that') # specifying both
<Result ('look at ', 'that') {}>
>>> parse('{:2d}{:2d}', '0440') # parsing two contiguous numbers
<Result (4, 40) {}>

Some notes for the date and time types:

- the presence of the time part is optional (including ISO 8601, starting
  at the "T"). A full datetime object will always be returned; the time
  will be set to 00:00:00. You may also specify a time without seconds.
- when a seconds amount is present in the input fractions will be parsed
  to give microseconds.
- except in ISO 8601 the day and month digits may be 0-padded.
- the date separator for the tg and ta formats may be "-" or "/".
- named months (abbreviations or full names) may be used in the ta and tg
  formats in place of numeric months.
- as per RFC 2822 the e-mail format may omit the day (and comma), and the
  seconds but nothing else.
- hours greater than 12 will be happily accepted.
- the AM/PM are optional, and if PM is found then 12 hours will be added
  to the datetime object's hours amount - even if the hour is greater
  than 12 (for consistency.)
- in ISO 8601 the "Z" (UTC) timezone part may be a numeric offset
- timezones are specified as "+HH:MM" or "-HH:MM". The hour may be one or two
  digits (0-padded is OK.) Also, the ":" is optional.
- the timezone is optional in all except the e-mail format (it defaults to
  UTC.)
- named timezones are not handled yet.

Note: attempting to match too many datetime fields in a single parse() will
currently result in a resource allocation issue. A TooManyFields exception
will be raised in this instance. The current limit is about 15. It is hoped
that this limit will be removed one day.

.. _`Format String Syntax`:
  http://docs.python.org/library/string.html#format-string-syntax
.. _`Format Specification Mini-Language`:
  http://docs.python.org/library/string.html#format-specification-mini-language


Result and Match Objects
------------------------

The result of a ``parse()`` and ``search()`` operation is either ``None`` (no match), a
``Result`` instance or a ``Match`` instance if ``evaluate_result`` is False.

The ``Result`` instance has three attributes:

fixed
   A tuple of the fixed-position, anonymous fields extracted from the input.
named
   A dictionary of the named fields extracted from the input.
spans
   A dictionary mapping the names and fixed position indices matched to a
   2-tuple slice range of where the match occurred in the input.
   The span does not include any stripped padding (alignment or width).

The ``Match`` instance has one method:

evaluate_result()
   Generates and returns a ``Result`` instance for this ``Match`` object.



Custom Type Conversions
-----------------------

If you wish to have matched fields automatically converted to your own type you
may pass in a dictionary of type conversion information to ``parse()`` and
``compile()``.

The converter will be passed the field string matched. Whatever it returns
will be substituted in the ``Result`` instance for that field.

Your custom type conversions may override the builtin types if you supply one
with the same identifier.

>>> def shouty(string):
...     return string.upper()
...
>>> parse('{:shouty} world', 'hello world', dict(shouty=shouty))
<Result ('HELLO',) {}>

If the type converter has the optional ``pattern`` attribute, it is used as
a regular expression for better pattern matching (instead of the default one).

>>> def parse_number(text):
...     return int(text)
>>> parse_number.pattern = r'\d+'
>>> parse('Answer: {number:Number}', 'Answer: 42', dict(Number=parse_number))
<Result () {'number': 42}>
>>> _ = parse('Answer: {:Number}', 'Answer: Alice', dict(Number=parse_number))
>>> assert _ is None, "MISMATCH"

You can also use the ``with_pattern(pattern)`` decorator to add this
information to a type converter function:

>>> from parse import with_pattern
>>> @with_pattern(r'\d+')
... def parse_number(text):
...     return int(text)
>>> parse('Answer: {number:Number}', 'Answer: 42', dict(Number=parse_number))
<Result () {'number': 42}>

A more complete example of a custom type might be:

>>> yesno_mapping = {
...     "yes": True, "no": False,
...     "on": True, "off": False,
...     "true": True, "false": False,
... }
>>> @with_pattern(r"|".join(yesno_mapping))
... def parse_yesno(text):
...     return yesno_mapping[text.lower()]


If the type converter ``pattern`` uses regex-grouping (with parentheses),
you should indicate this by using the optional ``regex_group_count`` parameter
in the ``with_pattern()`` decorator:

>>> @with_pattern(r'((\d+))', regex_group_count=2)
... def parse_number2(text):
...     return int(text)
>>> parse('Answer: {:Number2} {:Number2}', 'Answer: 42 43', dict(Number2=parse_number2))
<Result (42, 43) {}>

Otherwise, this may cause parsing problems with unnamed/fixed parameters.


Potential Gotchas
-----------------

`parse()` will always match the shortest text necessary (from left to right)
to fulfil the parse pattern, so for example:

>>> pattern = '{dir1}/{dir2}'
>>> data = 'root/parent/subdir'
>>> sorted(parse(pattern, data).named.items())
[('dir1', 'root'), ('dir2', 'parent/subdir')]

So, even though `{'dir1': 'root/parent', 'dir2': 'subdir'}` would also fit
the pattern, the actual match represents the shortest successful match for
`dir1`.

----

**Version history (in brief)**:

- 1.12.1 Actually use the `case_sensitive` arg in compile (thanks @jacquev6)
- 1.12.0 Do not assume closing brace when an opening one is found (thanks @mattsep)
- 1.11.1 Revert having unicode char in docstring, it breaks Bamboo builds(?!)
- 1.11.0 Implement `__contains__` for Result instances.
- 1.10.0 Introduce a "letters" matcher, since "w" matches numbers
  also.
- 1.9.1 Fix deprecation warnings around backslashes in regex strings
  (thanks Mickael Schoentgen). Also fix some documentation formatting
  issues.
- 1.9.0 We now honor precision and width specifiers when parsing numbers
  and strings, allowing parsing of concatenated elements of fixed width
  (thanks Julia Signell)
- 1.8.4 Add LICENSE file at request of packagers.
  Correct handling of AM/PM to follow most common interpretation.
  Correct parsing of hexadecimal that looks like a binary prefix.
  Add ability to parse case sensitively.
  Add parsing of numbers to Decimal with "F" (thanks John Vandenberg)
- 1.8.3 Add regex_group_count to with_pattern() decorator to support
  user-defined types that contain brackets/parenthesis (thanks Jens Engel)
- 1.8.2 add documentation for including braces in format string
- 1.8.1 ensure bare hexadecimal digits are not matched
- 1.8.0 support manual control over result evaluation (thanks Timo Furrer)
- 1.7.0 parse dict fields (thanks Mark Visser) and adapted to allow
  more than 100 re groups in Python 3.5+ (thanks David King)
- 1.6.6 parse Linux system log dates (thanks Alex Cowan)
- 1.6.5 handle precision in float format (thanks Levi Kilcher)
- 1.6.4 handle pipe "|" characters in parse string (thanks Martijn Pieters)
- 1.6.3 handle repeated instances of named fields, fix bug in PM time
  overflow
- 1.6.2 fix logging to use local, not root logger (thanks Necku)
- 1.6.1 be more flexible regarding matched ISO datetimes and timezones in
  general, fix bug in timezones without ":" and improve docs
- 1.6.0 add support for optional ``pattern`` attribute in user-defined types
  (thanks Jens Engel)
- 1.5.3 fix handling of question marks
- 1.5.2 fix type conversion error with dotted names (thanks Sebastian Thiel)
- 1.5.1 implement handling of named datetime fields
- 1.5 add handling of dotted field names (thanks Sebastian Thiel)
- 1.4.1 fix parsing of "0" in int conversion (thanks James Rowe)
- 1.4 add __getitem__ convenience access on Result.
- 1.3.3 fix Python 2.5 setup.py issue.
- 1.3.2 fix Python 3.2 setup.py issue.
- 1.3.1 fix a couple of Python 3.2 compatibility issues.
- 1.3 added search() and findall(); removed compile() from ``import *``
  export as it overwrites builtin.
- 1.2 added ability for custom and override type conversions to be
  provided; some cleanup
- 1.1.9 to keep things simpler number sign is handled automatically;
  significant robustification in the face of edge-case input.
- 1.1.8 allow "d" fields to have number base "0x" etc. prefixes;
  fix up some field type interactions after stress-testing the parser;
  implement "%" type.
- 1.1.7 Python 3 compatibility tweaks (2.5 to 2.7 and 3.2 are supported).
- 1.1.6 add "e" and "g" field types; removed redundant "h" and "X";
  removed need for explicit "#".
- 1.1.5 accept textual dates in more places; Result now holds match span
  positions.
- 1.1.4 fixes to some int type conversion; implemented "=" alignment; added
  date/time parsing with a variety of formats handled.
- 1.1.3 type conversion is automatic based on specified field types. Also added
  "f" and "n" types.
- 1.1.2 refactored, added compile() and limited ``from parse import *``
- 1.1.1 documentation improvements
- 1.1.0 implemented more of the `Format Specification Mini-Language`_
  and removed the restriction on mixing fixed-position and named fields
- 1.0.0 initial release

This code is copyright 2012-2019 Richard Jones <richard@python.org>
See the end of the source file for the license of use.
'''

from __future__ import absolute_import
__version__ = '1.12.1'

# yes, I now have two problems
import re
import sys
from datetime import datetime, time, tzinfo, timedelta
from decimal import Decimal
from functools import partial
import logging

__all__ = 'parse search findall with_pattern'.split()

log = logging.getLogger(__name__)


def with_pattern(pattern, regex_group_count=None):
    r"""Attach a regular expression pattern matcher to a custom type converter
    function.

    This annotates the type converter with the :attr:`pattern` attribute.

    EXAMPLE:
        >>> import parse
        >>> @parse.with_pattern(r"\d+")
        ... def parse_number(text):
        ...     return int(text)

    is equivalent to:

        >>> def parse_number(text):
        ...     return int(text)
        >>> parse_number.pattern = r"\d+"

    :param pattern: regular expression pattern (as text)
    :param regex_group_count: Indicates how many regex-groups are in pattern.
    :return: wrapped function
    """
    def decorator(func):
        func.pattern = pattern
        func.regex_group_count = regex_group_count
        return func
    return decorator


def int_convert(base):
    '''Convert a string to an integer.

    The string may start with a sign.

    It may be of a base other than 10.

    It may start with a base indicator, 0#nnnn, which we assume should
    override the specified base.

    It may also have other non-numeric characters that we can ignore.
    '''
    CHARS = '0123456789abcdefghijklmnopqrstuvwxyz'

    def f(string, match, base=base):
        if string[0] == '-':
            sign = -1
        else:
            sign = 1

        if string[0] == '0' and len(string) > 2:
            if string[1] in 'bB':
                base = 2
            elif string[1] in 'oO':
                base = 8
            elif string[1] in 'xX':
                base = 16
            else:
                # just go with the base specified
                pass

        chars = CHARS[:base]
        string = re.sub('[^%s]' % chars, '', string.lower())
        return sign * int(string, base)
    return f


def percentage(string, match):
    return float(string[:-1]) / 100.


class FixedTzOffset(tzinfo):
    """Fixed offset in minutes east from UTC.
    """
    ZERO = timedelta(0)

    def __init__(self, offset, name):
        self._offset = timedelta(minutes=offset)
        self._name = name

    def __repr__(self):
        return '<%s %s %s>' % (self.__class__.__name__, self._name,
            self._offset)

    def utcoffset(self, dt):
        return self._offset

    def tzname(self, dt):
        return self._name

    def dst(self, dt):
        return self.ZERO

    def __eq__(self, other):
        return self._name == other._name and self._offset == other._offset


MONTHS_MAP = dict(
    Jan=1, January=1,
    Feb=2, February=2,
    Mar=3, March=3,
    Apr=4, April=4,
    May=5,
    Jun=6, June=6,
    Jul=7, July=7,
    Aug=8, August=8,
    Sep=9, September=9,
    Oct=10, October=10,
    Nov=11, November=11,
    Dec=12, December=12
)
DAYS_PAT = r'(Mon|Tue|Wed|Thu|Fri|Sat|Sun)'
MONTHS_PAT = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
ALL_MONTHS_PAT = r'(%s)' % '|'.join(MONTHS_MAP)
TIME_PAT = r'(\d{1,2}:\d{1,2}(:\d{1,2}(\.\d+)?)?)'
AM_PAT = r'(\s+[AP]M)'
TZ_PAT = r'(\s+[-+]\d\d?:?\d\d)'


def date_convert(string, match, ymd=None, mdy=None, dmy=None,
        d_m_y=None, hms=None, am=None, tz=None, mm=None, dd=None):
    '''Convert the incoming string containing some date / time info into a
    datetime instance.
    '''
    groups = match.groups()
    time_only = False
    if mm and dd:
        y = datetime.today().year
        m = groups[mm]
        d = groups[dd]
    elif ymd is not None:
        y, m, d = re.split(r'[-/\s]', groups[ymd])
    elif mdy is not None:
        m, d, y = re.split(r'[-/\s]', groups[mdy])
    elif dmy is not None:
        d, m, y = re.split(r'[-/\s]', groups[dmy])
    elif d_m_y is not None:
        d, m, y = d_m_y
        d = groups[d]
        m = groups[m]
        y = groups[y]
    else:
        time_only = True

    H = M = S = u = 0
    if hms is not None and groups[hms]:
        t = groups[hms].split(':')
        if len(t) == 2:
            H, M = t
        else:
            H, M, S = t
            if '.' in S:
                S, u = S.split('.')
                u = int(float('.' + u) * 1000000)
            S = int(S)
        H = int(H)
        M = int(M)

    if am is not None:
        am = groups[am]
        if am:
            am = am.strip()
        if am == 'AM' and H == 12:
            # correction for "12" hour functioning as "0" hour: 12:15 AM = 00:15 by 24 hr clock
            H -= 12
        elif am == 'PM' and H == 12:
            # no correction needed: 12PM is midday, 12:00 by 24 hour clock
            pass
        elif am == 'PM':
            H += 12

    if tz is not None:
        tz = groups[tz]
    if tz == 'Z':
        tz = FixedTzOffset(0, 'UTC')
    elif tz:
        tz = tz.strip()
        if tz.isupper():
            # TODO use the awesome python TZ module?
            pass
        else:
            sign = tz[0]
            if ':' in tz:
                tzh, tzm = tz[1:].split(':')
            elif len(tz) == 4:  # 'snnn'
                tzh, tzm = tz[1], tz[2:4]
            else:
                tzh, tzm = tz[1:3], tz[3:5]
            offset = int(tzm) + int(tzh) * 60
            if sign == '-':
                offset = -offset
            tz = FixedTzOffset(offset, tz)

    if time_only:
        d = time(H, M, S, u, tzinfo=tz)
    else:
        y = int(y)
        if m.isdigit():
            m = int(m)
        else:
            m = MONTHS_MAP[m]
        d = int(d)
        d = datetime(y, m, d, H, M, S, u, tzinfo=tz)

    return d


class TooManyFields(ValueError):
    pass


class RepeatedNameError(ValueError):
    pass


# note: {} are handled separately
# note: I don't use r'' here because Sublime Text 2 syntax highlight has a fit
REGEX_SAFETY = re.compile(r'([?\\\\.[\]()*+\^$!\|])')

# allowed field types
ALLOWED_TYPES = set(list('nbox%fFegwWdDsSl') +
    ['t' + c for c in 'ieahgcts'])


def extract_format(format, extra_types):
    '''Pull apart the format [[fill]align][0][width][.precision][type]
    '''
    fill = align = None
    if format[0] in '<>=^':
        align = format[0]
        format = format[1:]
    elif len(format) > 1 and format[1] in '<>=^':
        fill = format[0]
        align = format[1]
        format = format[2:]

    zero = False
    if format and format[0] == '0':
        zero = True
        format = format[1:]

    width = ''
    while format:
        if not format[0].isdigit():
            break
        width += format[0]
        format = format[1:]

    if format.startswith('.'):
        # Precision isn't needed but we need to capture it so that
        # the ValueError isn't raised.
        format = format[1:]  # drop the '.'
        precision = ''
        while format:
            if not format[0].isdigit():
                break
            precision += format[0]
            format = format[1:]

    # the rest is the type, if present
    type = format
    if type and type not in ALLOWED_TYPES and type not in extra_types:
        raise ValueError('format spec %r not recognised' % type)

    return locals()


PARSE_RE = re.compile(r"""({{|}}|{\w*(?:(?:\.\w+)|(?:\[[^\]]+\]))*(?::[^}]+)?})""")


class Parser(object):
    '''Encapsulate a format string that may be used to parse other strings.
    '''
    def __init__(self, format, extra_types=None, case_sensitive=False):
        # a mapping of a name as in {hello.world} to a regex-group compatible
        # name, like hello__world. It's used to prevent the transformation of
        # name-to-group and group-to-name from failing subtly, such as in:
        # hello_.world-> hello___world->hello._world
        self._group_to_name_map = {}
        # also store the original field name to group name mapping to allow
        # multiple instances of a name in the format string
        self._name_to_group_map = {}
        # and to sanity check the repeated instances store away the first
        # field type specification for the named field
        self._name_types = {}

        self._format = format
        if extra_types is None:
            extra_types = {}
        self._extra_types = extra_types
        if case_sensitive:
            self._re_flags = re.DOTALL
        else:
            self._re_flags = re.IGNORECASE | re.DOTALL
        self._fixed_fields = []
        self._named_fields = []
        self._group_index = 0
        self._type_conversions = {}
        self._expression = self._generate_expression()
        self.__search_re = None
        self.__match_re = None

        log.debug('format %r -> %r', format, self._expression)

    def __repr__(self):
        if len(self._format) > 20:
            return '<%s %r>' % (self.__class__.__name__,
                self._format[:17] + '...')
        return '<%s %r>' % (self.__class__.__name__, self._format)

    @property
    def _search_re(self):
        if self.__search_re is None:
            try:
                self.__search_re = re.compile(self._expression, self._re_flags)
            except AssertionError:
                # access error through sys to keep py3k and backward compat
                e = str(sys.exc_info()[1])
                if e.endswith('this version only supports 100 named groups'):
                    raise TooManyFields('sorry, you are attempting to parse '
                        'too many complex fields')
        return self.__search_re

    @property
    def _match_re(self):
        if self.__match_re is None:
            expression = r'^%s$' % self._expression
            try:
                self.__match_re = re.compile(expression, self._re_flags)
            except AssertionError:
                # access error through sys to keep py3k and backward compat
                e = str(sys.exc_info()[1])
                if e.endswith('this version only supports 100 named groups'):
                    raise TooManyFields('sorry, you are attempting to parse '
                        'too many complex fields')
            except re.error:
                raise NotImplementedError("Group names (e.g. (?P<name>) can "
                    "cause failure, as they are not escaped properly: '%s'" %
                    expression)
        return self.__match_re

    def parse(self, string, evaluate_result=True):
        '''Match my format to the string exactly.

        Return a Result or Match instance or None if there's no match.
        '''
        m = self._match_re.match(string)
        if m is None:
            return None

        if evaluate_result:
            return self.evaluate_result(m)
        else:
            return Match(self, m)

    def search(self, string, pos=0, endpos=None, evaluate_result=True):
        '''Search the string for my format.

        Optionally start the search at "pos" character index and limit the
        search to a maximum index of endpos - equivalent to
        search(string[:endpos]).

        If the ``evaluate_result`` argument is set to ``False`` a
        Match instance is returned instead of the actual Result instance.

        Return either a Result instance or None if there's no match.
        '''
        if endpos is None:
            endpos = len(string)
        m = self._search_re.search(string, pos, endpos)
        if m is None:
            return None

        if evaluate_result:
            return self.evaluate_result(m)
        else:
            return Match(self, m)

    def findall(self, string, pos=0, endpos=None, extra_types=None, evaluate_result=True):
        '''Search "string" for all occurrences of "format".

        Optionally start the search at "pos" character index and limit the
        search to a maximum index of endpos - equivalent to
        search(string[:endpos]).

        Returns an iterator that holds Result or Match instances for each format match
        found.
        '''
        if endpos is None:
            endpos = len(string)
        return ResultIterator(self, string, pos, endpos, evaluate_result=evaluate_result)

    def _expand_named_fields(self, named_fields):
        result = {}
        for field, value in named_fields.items():
            # split 'aaa[bbb][ccc]...' into 'aaa' and '[bbb][ccc]...'
            basename, subkeys = re.match(r'([^\[]+)(.*)', field).groups()

            # create nested dictionaries {'aaa': {'bbb': {'ccc': ...}}}
            d = result
            k = basename

            if subkeys:
                for subkey in re.findall(r'\[[^\]]+\]', subkeys):
                    d = d.setdefault(k, {})
                    k = subkey[1:-1]

            # assign the value to the last key
            d[k] = value

        return result

    def evaluate_result(self, m):
        '''Generate a Result instance for the given regex match object'''
        # ok, figure the fixed fields we've pulled out and type convert them
        fixed_fields = list(m.groups())
        for n in self._fixed_fields:
            if n in self._type_conversions:
                fixed_fields[n] = self._type_conversions[n](fixed_fields[n], m)
        fixed_fields = tuple(fixed_fields[n] for n in self._fixed_fields)

        # grab the named fields, converting where requested
        groupdict = m.groupdict()
        named_fields = {}
        name_map = {}
        for k in self._named_fields:
            korig = self._group_to_name_map[k]
            name_map[korig] = k
            if k in self._type_conversions:
                value = self._type_conversions[k](groupdict[k], m)
            else:
                value = groupdict[k]

            named_fields[korig] = value

        # now figure the match spans
        spans = dict((n, m.span(name_map[n])) for n in named_fields)
        spans.update((i, m.span(n + 1))
            for i, n in enumerate(self._fixed_fields))

        # and that's our result
        return Result(fixed_fields, self._expand_named_fields(named_fields), spans)

    def _regex_replace(self, match):
        return '\\' + match.group(1)

    def _generate_expression(self):
        # turn my _format attribute into the _expression attribute
        e = []
        for part in PARSE_RE.split(self._format):
            if not part:
                continue
            elif part == '{{':
                e.append(r'\{')
            elif part == '}}':
                e.append(r'\}')
            elif part[0] == '{' and part[-1] == '}':
                # this will be a braces-delimited field to handle
                e.append(self._handle_field(part))
            else:
                # just some text to match
                e.append(REGEX_SAFETY.sub(self._regex_replace, part))
        return ''.join(e)

    def _to_group_name(self, field):
        # return a version of field which can be used as capture group, even
        # though it might contain '.'
        group = field.replace('.', '_').replace('[', '_').replace(']', '_')

        # make sure we don't collide ("a.b" colliding with "a_b")
        n = 1
        while group in self._group_to_name_map:
            n += 1
            if '.' in field:
                group = field.replace('.', '_' * n)
            elif '_' in field:
                group = field.replace('_', '_' * n)
            else:
                raise KeyError('duplicated group name %r' % (field,))

        # save off the mapping
        self._group_to_name_map[group] = field
        self._name_to_group_map[field] = group
        return group

    def _handle_field(self, field):
        # first: lose the braces
        field = field[1:-1]

        # now figure whether this is an anonymous or named field, and whether
        # there's any format specification
        format = ''
        if field and field[0].isalpha():
            if ':' in field:
                name, format = field.split(':')
            else:
                name = field
            if name in self._name_to_group_map:
                if self._name_types[name] != format:
                    raise RepeatedNameError('field type %r for field "%s" '
                        'does not match previous seen type %r' % (format,
                        name, self._name_types[name]))
                group = self._name_to_group_map[name]
                # match previously-seen value
                return r'(?P=%s)' % group
            else:
                group = self._to_group_name(name)
                self._name_types[name] = format
            self._named_fields.append(group)
            # this will become a group, which must not contain dots
            wrap = r'(?P<%s>%%s)' % group
        else:
            self._fixed_fields.append(self._group_index)
            wrap = r'(%s)'
            if ':' in field:
                format = field[1:]
            group = self._group_index

        # simplest case: no type specifier ({} or {name})
        if not format:
            self._group_index += 1
            return wrap % r'.+?'

        # decode the format specification
        format = extract_format(format, self._extra_types)

        # figure type conversions, if any
        type = format['type']
        is_numeric = type and type in 'n%fegdobh'
        if type in self._extra_types:
            type_converter = self._extra_types[type]
            s = getattr(type_converter, 'pattern', r'.+?')
            regex_group_count = getattr(type_converter, 'regex_group_count', 0)
            if regex_group_count is None:
                regex_group_count = 0
            self._group_index += regex_group_count

            def f(string, m):
                return type_converter(string)
            self._type_conversions[group] = f
        elif type == 'n':
            s = r'\d{1,3}([,.]\d{3})*'
            self._group_index += 1
            self._type_conversions[group] = int_convert(10)
        elif type == 'b':
            s = r'(0[bB])?[01]+'
            self._type_conversions[group] = int_convert(2)
            self._group_index += 1
        elif type == 'o':
            s = r'(0[oO])?[0-7]+'
            self._type_conversions[group] = int_convert(8)
            self._group_index += 1
        elif type == 'x':
            s = r'(0[xX])?[0-9a-fA-F]+'
            self._type_conversions[group] = int_convert(16)
            self._group_index += 1
        elif type == '%':
            s = r'\d+(\.\d+)?%'
            self._group_index += 1
            self._type_conversions[group] = percentage
        elif type == 'f':
            s = r'\d+\.\d+'
            self._type_conversions[group] = lambda s, m: float(s)
        elif type == 'F':
            s = r'\d+\.\d+'
            self._type_conversions[group] = lambda s, m: Decimal(s)
        elif type == 'e':
            s = r'\d+\.\d+[eE][-+]?\d+|nan|NAN|[-+]?inf|[-+]?INF'
            self._type_conversions[group] = lambda s, m: float(s)
        elif type == 'g':
            s = r'\d+(\.\d+)?([eE][-+]?\d+)?|nan|NAN|[-+]?inf|[-+]?INF'
            self._group_index += 2
            self._type_conversions[group] = lambda s, m: float(s)
        elif type == 'd':
            if format.get('width'):
                width = r'{1,%s}' % int(format['width'])
            else:
                width = '+'
            s = r'\d{w}|0[xX][0-9a-fA-F]{w}|0[bB][01]{w}|0[oO][0-7]{w}'.format(w=width)
            self._type_conversions[group] = int_convert(10)
        elif type == 'ti':
            s = r'(\d{4}-\d\d-\d\d)((\s+|T)%s)?(Z|\s*[-+]\d\d:?\d\d)?' % \
                TIME_PAT
            n = self._group_index
            self._type_conversions[group] = partial(date_convert, ymd=n + 1,
                hms=n + 4, tz=n + 7)
            self._group_index += 7
        elif type == 'tg':
            s = r'(\d{1,2}[-/](\d{1,2}|%s)[-/]\d{4})(\s+%s)?%s?%s?' % (
                ALL_MONTHS_PAT, TIME_PAT, AM_PAT, TZ_PAT)
            n = self._group_index
            self._type_conversions[group] = partial(date_convert, dmy=n + 1,
                hms=n + 5, am=n + 8, tz=n + 9)
            self._group_index += 9
        elif type == 'ta':
            s = r'((\d{1,2}|%s)[-/]\d{1,2}[-/]\d{4})(\s+%s)?%s?%s?' % (
                ALL_MONTHS_PAT, TIME_PAT, AM_PAT, TZ_PAT)
            n = self._group_index
            self._type_conversions[group] = partial(date_convert, mdy=n + 1,
                hms=n + 5, am=n + 8, tz=n + 9)
            self._group_index += 9
        elif type == 'te':
            # this will allow microseconds through if they're present, but meh
            s = r'(%s,\s+)?(\d{1,2}\s+%s\s+\d{4})\s+%s%s' % (DAYS_PAT,
                MONTHS_PAT, TIME_PAT, TZ_PAT)
            n = self._group_index
            self._type_conversions[group] = partial(date_convert, dmy=n + 3,
                hms=n + 5, tz=n + 8)
            self._group_index += 8
        elif type == 'th':
            # slight flexibility here from the stock Apache format
            s = r'(\d{1,2}[-/]%s[-/]\d{4}):%s%s' % (MONTHS_PAT, TIME_PAT,
                TZ_PAT)
            n = self._group_index
            self._type_conversions[group] = partial(date_convert, dmy=n + 1,
                hms=n + 3, tz=n + 6)
            self._group_index += 6
        elif type == 'tc':
            s = r'(%s)\s+%s\s+(\d{1,2})\s+%s\s+(\d{4})' % (
                DAYS_PAT, MONTHS_PAT, TIME_PAT)
            n = self._group_index
            self._type_conversions[group] = partial(date_convert,
                d_m_y=(n + 4, n + 3, n + 8), hms=n + 5)
            self._group_index += 8
        elif type == 'tt':
            s = r'%s?%s?%s?' \
                % (TIME_PAT, AM_PAT, TZ_PAT)
            n = self._group_index
            self._type_conversions[group] = partial(date_convert, hms=n + 1,
                am=n + 4, tz=n + 5)
            self._group_index += 5
        elif type == 'ts':
            s = r'%s(\s+)(\d+)(\s+)(\d{1,2}:\d{1,2}:\d{1,2})?' % MONTHS_PAT
            n = self._group_index
            self._type_conversions[group] = partial(date_convert, mm=n+1, dd=n+3,
                hms=n + 5)
            self._group_index += 5
        elif type == 'l':
            s = r'[A-Za-z]+'
        elif type:
            s = r'\%s+' % type
        elif format.get('precision'):
            if format.get('width'):
                s = r'.{%s,%s}?' % (format['width'], format['precision'])
            else:
                s = r'.{1,%s}?' % format['precision']
        elif format.get('width'):
            s = r'.{%s,}?' % format['width']
        else:
            s = r'.+?'

        align = format['align']
        fill = format['fill']

        # handle some numeric-specific things like fill and sign
        if is_numeric:
            # prefix with something (align "=" trumps zero)
            if align == '=':
                # special case - align "=" acts like the zero above but with
                # configurable fill defaulting to "0"
                if not fill:
                    fill = '0'
                s = r'%s*' % fill + s

            # allow numbers to be prefixed with a sign
            s = r'[-+ ]?' + s

        if not fill:
            fill = ' '

        # Place into a group now - this captures the value we want to keep.
        # Everything else from now is just padding to be stripped off
        if wrap:
            s = wrap % s
            self._group_index += 1

        if format['width']:
            # all we really care about is that if the format originally
            # specified a width then there will probably be padding - without
            # an explicit alignment that'll mean right alignment with spaces
            # padding
            if not align:
                align = '>'

        if fill in r'.\+?*[](){}^$':
            fill = '\\' + fill

        # align "=" has been handled
        if align == '<':
            s = '%s%s*' % (s, fill)
        elif align == '>':
            s = '%s*%s' % (fill, s)
        elif align == '^':
            s = '%s*%s%s*' % (fill, s, fill)

        return s


class Result(object):
    '''The result of a parse() or search().

    Fixed results may be looked up using `result[index]`.

    Named results may be looked up using `result['name']`.

    Named results may be tested for existence using `'name' in result`.
    '''
    def __init__(self, fixed, named, spans):
        self.fixed = fixed
        self.named = named
        self.spans = spans

    def __getitem__(self, item):
        if isinstance(item, int):
            return self.fixed[item]
        return self.named[item]

    def __repr__(self):
        return '<%s %r %r>' % (self.__class__.__name__, self.fixed,
            self.named)

    def __contains__(self, name):
        return name in self.named


class Match(object):
    '''The result of a parse() or search() if no results are generated
    (``evaluate_result`` is False).

    This class is only used to expose the internally used regex match objects
    to the user and to use them for external Parser.evaluate_result calls.
    '''
    def __init__(self, parser, match):
        self.parser = parser
        self.match = match

    def evaluate_result(self):
        '''Generate results for this Match'''
        return self.parser.evaluate_result(self.match)


class ResultIterator(object):
    '''The result of a findall() operation.

    Each element is a Result instance, or a Match instance if
    ``evaluate_result`` is False.
    '''
    def __init__(self, parser, string, pos, endpos, evaluate_result=True):
        self.parser = parser
        self.string = string
        self.pos = pos
        self.endpos = endpos
        self.evaluate_result = evaluate_result

    def __iter__(self):
        return self

    def __next__(self):
        m = self.parser._search_re.search(self.string, self.pos, self.endpos)
        if m is None:
            raise StopIteration()
        self.pos = m.end()

        if self.evaluate_result:
            return self.parser.evaluate_result(m)
        else:
            return Match(self.parser, m)

    # pre-py3k compat
    next = __next__


def parse(format, string, extra_types=None, evaluate_result=True, case_sensitive=False):
    '''Using "format" attempt to pull values from "string".

    The format must match the string contents exactly. If the value
    you're looking for is instead just a part of the string use
    search().

    If ``evaluate_result`` is True the return value will be a Result instance with two attributes:

     .fixed - tuple of fixed-position values from the string
     .named - dict of named values from the string

    If ``evaluate_result`` is False the return value will be a Match instance with one method:

     .evaluate_result() - This will return a Result instance like you would get
                          with ``evaluate_result`` set to True

    The default behaviour is to match strings case insensitively. You may match with
    case by specifying case_sensitive=True.

    If the format is invalid a ValueError will be raised.

    See the module documentation for the use of "extra_types".

    In the case there is no match parse() will return None.
    '''
    p = Parser(format, extra_types=extra_types, case_sensitive=case_sensitive)
    return p.parse(string, evaluate_result=evaluate_result)


def search(format, string, pos=0, endpos=None, extra_types=None, evaluate_result=True,
        case_sensitive=False):
    '''Search "string" for the first occurrence of "format".

    The format may occur anywhere within the string. If
    instead you wish for the format to exactly match the string
    use parse().

    Optionally start the search at "pos" character index and limit the search
    to a maximum index of endpos - equivalent to search(string[:endpos]).

    If ``evaluate_result`` is True the return value will be a Result instance with two attributes:

     .fixed - tuple of fixed-position values from the string
     .named - dict of named values from the string

    If ``evaluate_result`` is False the return value will be a Match instance with one method:

     .evaluate_result() - This will return a Result instance like you would get
                          with ``evaluate_result`` set to True

    The default behaviour is to match strings case insensitively. You may match with
    case by specifying case_sensitive=True.

    If the format is invalid a ValueError will be raised.

    See the module documentation for the use of "extra_types".

    In the case there is no match search() will return None.
    '''
    p = Parser(format, extra_types=extra_types, case_sensitive=case_sensitive)
    return p.search(string, pos, endpos, evaluate_result=evaluate_result)


def findall(format, string, pos=0, endpos=None, extra_types=None, evaluate_result=True,
        case_sensitive=False):
    '''Search "string" for all occurrences of "format".

    You will be returned an iterator that holds Result instances
    for each format match found.

    Optionally start the search at "pos" character index and limit the search
    to a maximum index of endpos - equivalent to search(string[:endpos]).

    If ``evaluate_result`` is True each returned Result instance has two attributes:

     .fixed - tuple of fixed-position values from the string
     .named - dict of named values from the string

    If ``evaluate_result`` is False each returned value is a Match instance with one method:

     .evaluate_result() - This will return a Result instance like you would get
                          with ``evaluate_result`` set to True

    The default behaviour is to match strings case insensitively. You may match with
    case by specifying case_sensitive=True.

    If the format is invalid a ValueError will be raised.

    See the module documentation for the use of "extra_types".
    '''
    # reuse the parser built with case_sensitive; constructing a second Parser
    # here silently dropped the case_sensitive argument
    p = Parser(format, extra_types=extra_types, case_sensitive=case_sensitive)
    return p.findall(string, pos, endpos, evaluate_result=evaluate_result)


def compile(format, extra_types=None, case_sensitive=False):
    '''Create a Parser instance to parse "format".

    The resultant Parser has a method .parse(string) which
    behaves in the same manner as parse(format, string).

    The default behaviour is to match strings case insensitively. You may match with
    case by specifying case_sensitive=True.

    Use this function if you intend to parse many strings
    with the same format.

    See the module documentation for the use of "extra_types".

    Returns a Parser instance.
    '''
    return Parser(format, extra_types=extra_types, case_sensitive=case_sensitive)


# Copyright (c) 2012-2019 Richard Jones <richard@python.org>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
#  The above copyright notice and this permission notice shall be included in
#  all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# vim: set filetype=python ts=4 sw=4 et si tw=75
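

# -- ILLUSTRATIVE SKETCH; not part of the original parse module ---------------
# A minimal, standard-library-only sketch of the padding step at the end of
# _handle_field(): the value is captured in a group and the fill characters
# implied by the alignment are matched outside it, so they are stripped from
# the result. The name _sketch_padded_pattern is hypothetical, and re.escape()
# is swapped in here for the module's own fill-escaping check.
import re


def _sketch_padded_pattern(core, align, fill=' '):
    # capture the value itself; padding stays outside the group
    s = '(%s)' % core
    fill = re.escape(fill)
    if align == '<':
        # left-aligned value: padding follows it
        return '%s%s*' % (s, fill)
    if align == '>':
        # right-aligned value: padding precedes it
        return '%s*%s' % (fill, s)
    if align == '^':
        # centred value: padding on both sides
        return '%s*%s%s*' % (fill, s, fill)
    # no alignment: nothing to strip
    return s

# Usage, assuming a digits-only core pattern:
#   re.match(_sketch_padded_pattern(r'\d+', '>') + '$', '   42').group(1)
#   gives '42'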