pcre2pattern.html - OpenGrok cross reference for /external/pcre/doc/html/pcre2pattern.html

Lines Matching +full:posix +full:- +full:character +full:- +full:classes
17 <li><a name="TOC2" href="#SEC2">SPECIAL START-OF-PATTERN ITEMS</a>
18 <li><a name="TOC3" href="#SEC3">EBCDIC CHARACTER CODES</a>
24 <li><a name="TOC9" href="#SEC9">SQUARE BRACKETS AND CHARACTER CLASSES</a>
25 <li><a name="TOC10" href="#SEC10">POSIX CHARACTER CLASSES</a>
36 <li><a name="TOC21" href="#SEC21">NON-ATOMIC ASSERTIONS</a>
52 are described in detail below. There is a quick-reference syntax summary in the
70 using a different algorithm that is not Perl-compatible. Some of the features
77 <br><a name="SEC2" href="#TOC1">SPECIAL START-OF-PATTERN ITEMS</a><br>
80 by special items at the start of a pattern. These are not Perl-compatible, but
90 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
91 single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
92 specified for the 32-bit library, in which case it constrains the character
105 restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
115 such as \d and \w to use Unicode properties to determine character types,
140 Disabling auto-possessification
151 Disabling start-up optimizations
166 apply to patterns whose top-level branches all start with .* (match any number
227 strings: a single CR (carriage return) character, a single LF (linefeed)
228 character, the two-character sequence CRLF, any of the three preceding, any
229 Unicode newline sequence, or the NUL character (binary zero). The
245   (*NUL)       the NUL character (binary zero)
278 <br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
281 character code instead of ASCII or Unicode (typically a mainframe system). In
282 the sections below, character code values are ASCII or Unicode; in an EBCDIC
298 equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
305 character classes, alternatives, and repetitions in the pattern. These are
315   \      general escape character with several uses
318   .      match any character except newline (by default)
319   [      start character class definition
338 Part of a pattern that is in square brackets is called a "character class". In
339 a character class the only metacharacters are:
341   \      general escape character
342   ^      negate the class, but only if the first character
343   -      indicates character range
344   [      POSIX character class (if followed by POSIX syntax)
345   ]      terminates the character class
348 the pattern, other than in a character class, within a \Q...\E sequence, or
349 between a # outside a character class and the next newline, inclusive, are
351 character as part of the pattern. If the PCRE2_EXTENDED_MORE option is set, the
353 ignored inside a character class. Note: only these two characters are ignored,
355 character class. Option settings can be changed within a pattern; see the
365 The backslash character has several uses. Firstly, if it is followed by a
366 character that is not a digit or a letter, it takes away any special meaning
367 that character may have. This use of backslash as an escape character applies
368 both inside and outside character classes.
371 For example, if you want to match a * character, you must write \* in the
372 pattern. This escaping action applies whether or not the following character
374 precede a non-alphanumeric with backslash to specify that it stands for itself.
388 interpolation. Also, Perl does "double-quotish backslash interpolation" on any
391 other character. Note the following examples:
401 The \Q...\E sequence is recognized both inside and outside character classes.
405 a character class, this causes an error, because the character class is then
409 Non-printing characters
412 A second use of backslash provides a way of encoding non-printing characters
414 non-printing characters in a pattern, but when a pattern is being prepared by
416 instead of the binary character it represents. In an ASCII or Unicode
419   \a          alarm, that is, the BEL character (hex 07)
420   \cx         "control-x", where x is a non-control ASCII character
426   \0dd        character with octal code 0dd
427   \ddd        character with octal code ddd, or backreference
428   \o{ddd..}   character with octal code ddd..
429   \xhh        character with hex code hh
430   \x{hhh..}   character with hex code hhh..
431   \N{U+hhh..} character with Unicode hex code point hhh..
435 hexadecimal digits may appear between \x{ and }. If a character other than a
447 two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \x followed
449 recognized as a character escape. Otherwise it is interpreted as a literal "x"
450 character. In this mode, support for code points greater than 256 is provided
452 interpreted as a literal "u" character.
456 \u{hhh..} is recognized as the character specified by hexadecimal code point.
465 (curly bracket) it has an entirely different meaning, matching any character
472 (carriage return) character.
475 An error occurs if \c is not followed by a character whose ASCII code point
477 lower case letter, it is converted to upper case. Then bit 6 of the character
481 a compile-time error occurs.
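The ASCII rule for \cx quoted in the fragments above (fold a lower case letter to upper case, then invert bit 6 of the character) can be sketched in Python. This is an illustration only, not PCRE2's implementation. The helper name `control_escape` is hypothetical, and the accepted range of 32 to 126 and the hex 40 mask are taken from the full PCRE2 text, since those lines are not among the matches listed here.

```python
def control_escape(x: str) -> int:
    # Hypothetical helper: code point produced by \cx for an ASCII character x.
    code = ord(x)
    # Assumption from the PCRE2 text: only code points 32..126 may follow \c.
    if code < 32 or code > 126:
        raise ValueError("\\c must be followed by a printable ASCII character")
    # A lower case letter is first converted to upper case.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Then bit 6 (hex 40) of the character is inverted.
    return code ^ 0x40


if __name__ == "__main__":
    for ch in "@AZz[_?":
        print(ch, hex(control_escape(ch)))
```

Under this rule \c@ gives 0 and the letters give 1 to 26, which agrees with the ranges listed in the EBCDIC paragraph that follows.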
487 only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ],
488 ^, _, or ?. Any other character provokes a compile-time error. The sequence
489 \c@ encodes character code 0; after \c the letters (in either case) encode
490 characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
494 Thus, apart from \c?, these escapes generate the same character code values as
501 because 127 is not a control character in EBCDIC, Perl makes it generate the
502 APC character. Unfortunately, there are several variants of EBCDIC. In most of
503 them the APC character has the value 255 (hex FF), but in the one Perl calls
504 POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
510 specifies two binary zeros followed by a CR character (code value 13). Make
511 sure you supply two digits after the initial zero if the pattern character that
517 addition to Perl; it provides a way of specifying character code points as octal
524 character code points, and \g{...} to specify backreferences. The following
532 Outside a character class, PCRE2 reads the digit and any following digits as a
540 Otherwise, up to three octal digits are read to form a character code.
543 Inside a character class, PCRE2 handles \8 and \9 as the literal characters
545 backslash, using them to generate a data character. Any subsequent digits stand
546 for themselves. For example, outside a character class:
553   \0113  is a tab followed by the character "3"
554   \113   might be a backreference, otherwise the character with octal code 113
563 Constraints on character values
569   8-bit non-UTF mode    no greater than 0xff
570   16-bit non-UTF mode   no greater than 0xffff
571   32-bit non-UTF mode   no greater than 0xffffffff
575 so-called "surrogate" code points). The check for these can be disabled by the
577 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
578 and UTF-32 modes, because these values are not representable in UTF-16.
581 Escape sequences in character classes
584 All the sequences that define a single character value can be used both inside
585 and outside character classes. In addition, inside a character class, \b is
586 interpreted as the backspace character (hex 08).
589 When not followed by an opening brace, \N is not allowed in a character class.
590 \B, \R, and \X are not special inside a character class. Like other
592 character class, these sequences have different meanings.
602 character, and \u can be used to define a character by code point, as
620 For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
630 Generic character types
633 Another use of backslash is for specifying generic character types:
636   \D     any character that is not a decimal digit
637   \h     any horizontal white space character
638   \H     any character that is not a horizontal white space character
639   \N     any character that is not a newline
640   \s     any white space character
641   \S     any character that is not a white space character
642   \v     any vertical white space character
643   \V     any character that is not a vertical white space character
644   \w     any "word" character
645   \W     any "non-word" character
652 <a href="#digitsafterbackslash">"Non-printing characters"</a>
658 of characters into two disjoint sets. Any given character matches one, and only
659 one, of each pair. The sequences can appear both inside and outside character
660 classes. They each match one character of the appropriate type. If the current
662 there is no character to match.
667 vary if locale-specific matching is taking place. For example, in some locales
668 the "non-breaking space" character (\xA0) is recognized as white space, and in
669 others the VT character is not.
672 A "word" character is an underscore or any character that is a letter or digit.
674 low-valued character tables, and may vary if locale-specific matching is taking
679 page). For example, in a French locale such as "fr_FR" in Unix-like systems,
680 or "french" in Windows, some character codes greater than 127 are used for
687 for characters in the range 128-255 when locale-specific matching is happening.
691 determine character types, as follows:
693   \d  any character that matches \p{Nd} (decimal digit)
694   \s  any character that matches \p{Z} or \h or \v
695   \w  any character that matches \p{L}, \p{N}, \p{Mn}, or \p{Pc}
697 The addition of \p{Mn} (non-spacing mark) and the replacement of an explicit
704 other character categories. Note also that PCRE2_UCP affects \b, and
722   U+00A0     Non-break space
729   U+2004     Three-per-em space
730   U+2005     Four-per-em space
731   U+2006     Six-per-em space
736   U+202F     Narrow no-break space
750 In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
757 Outside a character class, by default, the escape sequence \R matches any
758 Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the
765 This particular group matches either the two-character sequence CR followed by
768 line, U+0085). Because this is an atomic group, the two-character sequence is
789 Note that these special settings, which are not Perl-compatible, are recognized
797 character class, \R is treated as an unrecognized escape sequence, and causes
801 Unicode character properties
806 can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
808 less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points
814 multistage table lookup in order to find a character's property. That is why
822   \p{<i>xx</i>}   a character with the <i>xx</i> property
823   \P{<i>xx</i>}   a character without the <i>xx</i> property
826 The property names represented by <i>xx</i> above are not case-sensitive, and in
829 general category properties, "Any", which matches any character (including
842 character has a basic script and, optionally, a list of other scripts ("Script
853 Unassigned characters (and in non-UTF 32-bit mode, characters with code points
856 of recognized script names and their 4-character abbreviations can be obtained
859   pcre2test -LS
867 Each character has exactly one Unicode general category property, specified by
868 a two-letter abbreviation. For compatibility with Perl, negation can be
900   Mn    Non-spacing mark
928 matches a character that has the Lu, Ll, or Lt property, in other words, a
934 character when PCRE2 is not in UTF mode (using the 16-bit or 32-bit library).
947 No character that is in the Unicode table has the Cn (unassigned) property.
964   pcre2test -LP
973   \p{Bidi_Class:&#60;class&#62;}   matches a character with the given class
974   \p{BC:&#60;class&#62;}           matches a character with the given class
976 The recognized classes are:
987   L           left-to-right
988   LRE         left-to-right embedding
989   LRI         left-to-right isolate
990   LRO         left-to-right override
991   NSM         non-spacing mark
995   R           right-to-left
996   RLE         right-to-left embedding
997   RLI         right-to-left isolate
998   RLO         right-to-left override
1003 case-insensitive; only the short names listed above are recognized.
1012 Unicode supports various kinds of composite character by giving each character
1017 Instead it introduced various emoji-specific properties. PCRE2 uses only the
1021 \X always matches at least one character. Then it decides whether to add
1028 2. Do not end between CR and LF; otherwise end after any control character.
1032 are of five types: L, V, T, LV, and LVT. An L character may be followed by an
1033 L, V, LV, or LVT character; an LV or V character may be followed by a V or T
1034 character; an LVT or T character may be followed only by a T character.
1037 4. Do not end before extending characters or spacing marks or the zero-width
1038 joiner (ZWJ) character. Characters with the "mark" property always have the
1045 6. Do not end within emoji modifier sequences or emoji ZWJ (zero-width
1046 joiner) sequences. An emoji ZWJ sequence consists of a character with the
1048 with the Extend property, followed by the ZWJ character, followed by another
1049 Extended_Pictographic character.
1065 and \s to use Unicode properties. PCRE2 uses these non-standard, non-Perl
1069   Xan   Any alphanumeric character
1070   Xps   Any POSIX space character
1071   Xsp   Any Perl space character
1072   Xwd   Any Perl "word" character
1076 carriage return, and any other character that has the Z (separator) property.
1079 those that match Mn (non-spacing mark) or Pc (connector punctuation, which
1083 There is another non-standard property, Xuc, which matches any character that
1084 can be represented by a Universal Character Name in C++ and other programming
1088 excluded. (Universal Character Names are of the form \uHHHH or \UHHHHHHHH
1125 <b>pcre2_compile()</b> to re-enable the previous behaviour. When this option is
1157 Inside a character class, \b has a different meaning; it matches the backspace
1158 character. If any other of these assertions appears in a character class, an
1162 A word boundary is a position in the subject string where the current character
1163 and the previous character do not both match \w or \W (i.e. one matches
1165 first or last character matches \w, respectively. When PCRE2 is built with
1179 argument of <b>pcre2_match()</b> is non-zero, indicating that matching is to
1189 <i>startoffset</i> is non-zero. By calling <b>pcre2_match()</b> multiple times
1195 character of the matching process, is subtly different from Perl's, which
1207 The circumflex and dollar metacharacters are zero-width assertions. That is,
1211 only the two-character sequence CRLF is recognized as a newline, isolated CR
1216 Outside a character class, in the default matching mode, the circumflex
1217 character is an assertion that is true only if the current matching point is at
1219 <b>pcre2_match()</b> is non-zero, or if PCRE2_NOTBOL is set, circumflex can
1220 never match if the PCRE2_MULTILINE option is unset. Inside a character class,
1225 Circumflex need not be the first character of the pattern if a number of
1234 The dollar character is an assertion that is true only if the current matching
1238 character of the pattern if a number of alternatives are involved, but it
1240 special meaning in a character class.
1249 PCRE2_MULTILINE option is set. When this is the case, a dollar character
1261 when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero. The
1267 below) recognizes the two-character sequence CRLF as a newline, this is
1281 Outside a character class, a dot in the pattern matches any one character in
1282 the subject string except (by default) a character that signifies the end of a
1288 Dot never matches a single line-ending character. When the two-character
1297 PCRE2_DOTALL option is set, a dot matches any one character, without exception.
1298 If the two-character sequence CRLF is present in the subject string, it takes
1304 special meaning in a character class.
1309 it matches any character except one that signifies the end of a line.
1314 <a href="#digitsafterbackslash">"Non-printing characters"</a>
1320 Outside a character class, the escape sequence \C matches any one code unit,
1321 whether or not a UTF mode is set. In the 8-bit library, one code unit is one
1322 byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
1323 32-bit unit. Unlike a dot, \C always matches line-ending characters. The
1324 feature is provided in Perl in order to match individual bytes in UTF-8 mode,
1329 with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
1330 with a malformed UTF character. This has undefined results, because PCRE2
1331 assumes that it is matching character by character in a valid UTF string (by
1343 in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
1346 The former gives a match-time error; the latter fails to optimize and so the
1350 In the 32-bit library, however, \C is always supported (when not explicitly
1351 locked out) because it always matches a single code unit, whether or not UTF-32
1356 it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
1357 lookahead to check the length of the next character, as in this pattern, which
1358 could be used with a UTF-8 string (ignore white space and line breaks):
1360   (?| (?=[\x00-\x7f])(\C) |
1361       (?=[\x80-\x{7ff}])(\C)(\C) |
1362       (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
1363       (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
1368 below). The assertions at the start of each branch check the next UTF-8
1369 character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
1370 character's individual bytes are then captured by the appropriate number of
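The four lookahead ranges in the pattern at lines 1360-1363 correspond to the UTF-8 encoded length of the next character. A minimal Python sketch of that mapping follows. It does not run the PCRE2 pattern itself, because Python's `re` module has no \C. The helper name `utf8_length` is hypothetical, and Python's own encoder is used only to cross-check the ranges.

```python
def utf8_length(code_point: int) -> int:
    # Hypothetical helper: number of UTF-8 bytes for a code point,
    # using the same boundaries as the lookaheads in the pattern.
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point outside the ranges used by the pattern")


if __name__ == "__main__":
    # Cross-check a few sample code points against Python's UTF-8 encoder.
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        print(hex(cp), utf8_length(cp), len(chr(cp).encode("utf-8")))
```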
1373 <br><a name="SEC9" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
1375 An opening square bracket introduces a character class, terminated by a closing
1378 the first data character in the class (after an initial circumflex, if present)
1384 A character class matches a single character in the subject. A matched
1385 character must be in the set of characters defined by the class, unless the
1386 first character in the class definition is a circumflex, in which case the
1387 subject character must not be in the set defined by the class. If a circumflex
1389 character, or escape it with a backslash.
1392 For example, the character class [aeiou] matches any lower case vowel, while
1393 [^aeiou] matches any character that is not a lower case vowel. Note that a
1396 circumflex is not an assertion; it still consumes a character from the subject
1407 are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S)
1412 when matching character classes, whatever line-ending sequence is in use, and
1417 The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
1418 \S, \v, \V, \w, and \W may appear in a character class, and add the
1422 outside a character class, as described in the section entitled
1423 <a href="#genericchartypes">"Generic character types"</a>
1424 above. The escape sequence \b has a different meaning inside a character
1425 class; it matches the backspace character. The sequences \B, \R, and \X are
1426 not special inside a character class. Like any other unrecognized escape
1431 The minus (hyphen) character can be used to specify a range of characters in a
1432 character class. For example, [d-m] matches any letter between d and m,
1433 inclusive. If a minus character is required in a class, it must be escaped with
1435 indicating a range, typically as the first or last character in the class,
1436 or immediately after a range. For example, [b-d-z] matches letters in the range
1437 b to d, a hyphen character, or z.
1440 Perl treats a hyphen as a literal if it appears before or after a POSIX class
1441 (see below) or before or after a character type escape such as \d or \H.
1442 However, unless the hyphen is the last character in the class, Perl outputs a
1447 It is not possible to have the literal character "]" as the end character of a
1448 range. A pattern such as [W-]46] is interpreted as a class of two characters
1449 ("W" and "-") followed by a literal string "46]", so it would match "W46]" or
1450 "-46]". However, if the "]" is escaped with a backslash it is interpreted as
1451 the end of range, so [W-\]46] is interpreted as a class containing a range
1458 example [\000-\037]. Ranges can include any characters that are valid for the
1459 current mode. In any UTF mode, the so-called "surrogate" characters (those
1462 this check). However, ranges such as [\x{d7ff}-\x{e000}], which include the
1469 example, [h-k] matches only four characters, even though the codes for h and k
1471 specified numerically, for example, [\x88-\x92] or [h-\x92], all code points
1476 matches the letters in either case. For example, [W-c] is equivalent to
1477 [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
1478 tables for a French locale are in use, [\xc8-\xcb] matches accented E
1482 A circumflex can conveniently be used with the upper case character types to
1485 whereas [\w] includes underscore. A positive character class should be read as
1490 The only metacharacters that are recognized in character classes are backslash,
1493 introducing a POSIX class name, or for a special compatibility feature - see
1495 escaping other non-alphanumeric characters does no harm.
1497 <br><a name="SEC10" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
1499 Perl supports the POSIX notation for character classes. This uses names
1505 matches "0", "1", any alphabetic character, or "%". The supported class names
1510   ascii    character codes 0 - 127
1524 and space (32). If locale-specific matching is taking place, the list of space
1530 5.8. Another Perl extension is negation, which is indicated by a ^ character
1535 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
1541 POSIX character classes, although this may be different for characters in the
1542 range 128-255 when locale-specific matching is happening. However, in UCP mode,
1543 unless certain options are set (see below), some of the classes are changed so
1544 that Unicode character properties are used. This is achieved by replacing
1545 POSIX classes with other sequences, as follows:
1557 Negated versions, such as [:^alpha:], use \P instead of \p. Four other POSIX
1558 classes are handled specially in UCP mode:
1568   U+2066 - U+2069  Various "isolate"s
1590 The other POSIX classes are unchanged by PCRE2_UCP, and match only characters
1594 There are two options that can be used to restrict the POSIX classes to ASCII
1597 (?aT) and (?-aT). The PCRE2_EXTRA_ASCII_POSIX option disables UCP processing
1598 for all POSIX classes, including [:digit:] and [:xdigit:]. Within a pattern,
1599 (?aP) and (?-aP) set and unset both these options for consistency.
1603 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
1610 Only these exact character sequences are recognized. A sequence such as
1611 [a[:&#60;:]b] provokes an error for an unrecognized POSIX class name. This support is
1616 above), and in a Perl-style pattern the preceding or following character
1618 used above in order to give exactly the POSIX behaviour. Note also that the
1620 it also affects these POSIX sequences.
1640 of letters enclosed between "(?" and ")". The following are Perl-compatible,
1654 example (?-im). The two "extended" options are not independent; unsetting
1658 A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
1665 If the first character following (? is a circumflex, it causes all of the above
1667 be re-instated, but a hyphen may not appear.
1670 Some PCRE2-specific options can be changed by the same mechanism using these
1683 (?-imnrsx). If 'a' is not followed by any of the upper case letters shown
1688 is set, but including it in (?aP) means that (?-aP) suppresses all ASCII
1689 restrictions for POSIX classes.
1714 a non-capturing group (see the next section), the option letters may
1723 <b>Note:</b> There are other PCRE2-specific options, applying to the whole
1780 a non-capturing group, the option letters may appear between the "?" and the
1795 itself a non-capturing group. For example, consider this pattern:
1809   # before  ---------------branch-reset----------- after
1824 A relative reference such as (?-1) is no different: it is just a convenient way
1830 for a group's having matched refers to a non-unique number, the test is
1850 alphanumeric characters and underscores, but must start with a non-digit. When
1855   ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
1870 name-to-number translation table from a compiled pattern, as well as
1887 compile-time error. However, there is still scope for confusion. Consider this
1911 either as a 3-letter abbreviation or as the full name, and in both cases you
1930 If you make a backreference to a non-unique named group from elsewhere in the
1941 If you make a subroutine call to a non-unique named group, the one that
1962   a literal data character
1967   any escape sequence that matches a single character
1968   a character class
1982 character. If the second number is omitted, but the comma is present, there is
2003 quantifier, the brace is taken as a literal character. In particular, this
2013 which is represented by a two-byte sequence in a UTF-8 string. Similarly,
2028 For convenience, the three most common quantifiers have single-character
2093 character position in the subject string, so there is no point in retrying the
2110 If the subject is "xyz123abc123" the match point is the fourth character. For
2142 re-evaluated to see if a different number of repeats allows the rest of the
2156 that once a group has matched, it is not to be re-evaluated in this way.
2192 additional + character following a quantifier. Using this notation, the
2229 matches an unlimited number of substrings that either consist of non-digits, or
2238 than a single character at the end, because both PCRE2 and Perl have an
2239 optimization that allows for fast failure when a single character is used. They
2240 remember the last single character that is required for a match, and fail early
2246 sequences of non-digits cannot be broken, and failure happens quickly.
2250 Outside a character class, a backslash followed by a digit greater than 0 (and
2266 interpreted as a character defined in octal. See the subsection entitled
2267 "Non-printing characters"
2287   (abc(def)ghi)\g{-1}
2289 The sequence \g{-1} is a reference to the capture group whose number is one
2291 the next group would be numbered 3) is it equivalent to \2, and \g{-2} would
2293 that group is included in the count, so in this example \g{-2} also refers to
2296   (A)(\g{-2}B)
2356 continues with a digit character, some delimiter must be used to terminate the
2373 the group, the backreference matches the character string corresponding to the
2403 The Perl-compatible lookaround assertions are atomic. If an assertion is true,
2405 assertion. However, there are some cases where non-atomic assertions can be
2407 <a href="#nonatomicassertions">"Non-atomic assertions"</a>
2408 below, but they are not Perl-compatible.
2421 such as (.)\g{-1} can be used to check that two adjacent characters are the
2515 If every top-level alternative matches a fixed length, for example
2526 which one or more top-level alternatives can match more than one string length,
2539 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which matches a
2548 as the called capture group matches a limited-length string. However,
2560 same character:
2579 there is no following "a"), it backtracks to match all but the last character,
2627 <br><a name="SEC21" href="#TOC1">NON-ATOMIC ASSERTIONS</a><br>
2631 assertion. However, there are some cases where non-atomic positive assertions
2637 Consider the problem of finding the right-most word in a string that also
2649 succeeds, it captures the right-most word in the string.
2657 assertion could not be re-entered, and the whole match would fail. The pattern
2661 Using a non-atomic lookahead, however, means that when the last word does not
2662 occur twice in the string, the lookahead can backtrack and find the second-last
2667 Two conditions must be met for a non-atomic assertion to be useful: the
2671 as before because nothing has changed, so using a non-atomic assertion just
2675 There is one exception to backtracking into a non-atomic assertion. If an
2680 Non-atomic assertions are not supported by the alternative matching function
2706 the matched characters in a sequence of non-spaces that follow white space are
2716 This works as long as the first character is expected to be a character in that
2721   \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
2738 support. A compile-time error is given if any of the above constructs is
2756   (?(condition)yes-pattern)
2757   (?(condition)yes-pattern|no-pattern)
2759 If the condition is satisfied, the yes-pattern is used; otherwise the
2760 no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
2762 group, a compile-time error occurs. Each of the two alternatives may itself
2773 recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
2787 this condition) can be referenced by (?(-1), the next most recent by (?(-2),
2790 value zero in any of these forms is not used; it provokes a compile-time error.
2793 Consider the following pattern, which contains non-significant white space to
2800 character is present, sets it as the first captured substring. The second part
2804 the condition is true, and so the yes-pattern is executed and a closing
2805 parenthesis is required. Otherwise, since no-pattern is not present, the
2807 sequence of non-parentheses, optionally enclosed in parentheses.
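The behaviour described above (a closing parenthesis is required only when an opening one was captured) can be tried with Python's `re` module, which accepts the absolute form `(?(1)...)` of a group-number condition. This is a sketch using Python as a stand-in, not PCRE2 itself. The relative form `(?(-1)...)` is PCRE2 syntax and is assumed not to be available in Python, so the group is referenced by its absolute number here.

```python
import re

# Verbose mode lets the pattern carry non-significant white space and comments.
PAREN = re.compile(
    r"""
    (\()?       # group 1: optional opening parenthesis
    [^()]+      # one or more characters that are not parentheses
    (?(1)\))    # if group 1 matched, require a closing parenthesis
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    for subject in ("abc", "(abc)", "(abc", "abc)"):
        print(repr(subject), bool(PAREN.fullmatch(subject)))
```

`fullmatch` is used so that an unbalanced subject is rejected outright instead of matching on a substring.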
2813   ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
2838 "Recursion" in this sense refers to any subroutine-like call from one part of
2892   (?(DEFINE) (?&#60;byte&#62; 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
2899 pattern uses references to the named group to match the four dot-separated
2927 <a href="#nonatomicassertions">non-atomic assertions.</a>
2930 Consider this pattern, again containing non-significant white space, and with
2933   (?(?=[^a-z]*[a-z])
2934   \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
2937 sequence of non-letters followed by a letter. In other words, it tests for the
2941 dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
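Python's `re` module does not accept an assertion as a condition, so the pattern at lines 2933-2934 cannot be run there unchanged. A hedged sketch of an equivalent rewrite: because the condition is a plain lookahead, `(?(?=C) yes | no)` can be expressed as `(?=C) yes | (?!C) no`. This is a swapped-in technique for illustration, not the conditional-group syntax itself.

```python
import re

# Emulates the conditional: the positive lookahead guards the yes-branch and
# the matching negative lookahead guards the no-branch.
DATE = re.compile(
    r"""
      (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter is present: dd-aaa-dd
    |
      (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter present: dd-dd-dd
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    for subject in ("12-jan-34", "12-34-56", "12-ab-34", "12-34-5a"):
        print(repr(subject), bool(DATE.fullmatch(subject)))
```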
2947 assertion, whether it succeeds or fails. (Compare non-conditional assertions,
2953 PCRE2. In both cases, the start of the comment must not be in a character
2961 PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
2963 the next newline character or character sequence in the pattern. Which
2971 default newline convention (a single linefeed character) is in force:
2975 On encountering the # character, <b>pcre2_compile()</b> skips along, looking for
2977 it does not terminate the comment. Only an actual character with the code value
3009 <a href="#groupsassubroutines">non-recursive subroutine</a>
3020 substrings which can either be a sequence of non-parentheses, or a recursive
3023 to avoid backtracking into sequences of non-parentheses.
3037 pattern above you can write (?-2) to refer to the second most recently opened
3047   (?|(a)|(b)) (c) (?-2)
3050 is number 2. When the reference (?-2) is encountered, the second most recently
3060 <a href="#groupsassubroutines">non-recursive subroutine</a>
3076 non-parentheses is important when applying the pattern to strings that do not
3109 alternatives for the recursive and non-recursive cases. The (?R) item is the
3121 once it had matched some of the subject string, it was never re-entered, even
3127 as atomic. That is, they can be re-entered to try unused alternatives if there
3138 The second branch in the group matches a single central character in the
3142 typical palindromic phrases, the pattern has to ignore all non-word characters,
3149 avoid backtracking into sequences of non-word characters. Without this, PCRE2
3178   (...(relative)...)...(?-1)...
3200 Processing options such as case-independence are fixed when a group is
3204   (abc)(?i:(?-1))
3218 For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
3229   (abc)(?i:\g&#60;-1&#62;)
3267 one side-effect is that sometimes callouts are skipped. If you need all
3323 is no longer Perl-compatible.
3329 \x{100} that define character code points. Character type escapes such as \d
3336 skipped, and #-comments are recognized, exactly as in the rest of the pattern.
3341 The maximum length of a name is 255 in the 8-bit library and 65535 in the
3342 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
3368 minimum length of matching subject, or that a particular character must be
3371 the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
3456 When a match succeeds, the name of the last-encountered mark name on the
3487 name is recorded and passed back if it is the last-encountered. This does not
3555 unless PCRE2's start-of-match optimizations are turned off, as shown in this
3569 the optimization that skips along to the first character. The pattern is now
3578 starting character then happens. Backtracking can occur as usual to the left of
3595 pattern is unanchored, the "bumpalong" advance is not to the next character,
3603 the first character in the string), the starting point skips on to start the
3606 first match attempt, the second attempt would start at the second character
3627 assertions, because they are never re-entered by backtracking. Compare the
3642 the second branch of the pattern to be tried at the first character position.
3645 matching attempt to start at the second character. This time, the (*MARK) is
3658 pattern-based if-then-else block:
3676 A group that does not contain a | character is just a part of the enclosing
3697 because only one is ever used. In other words, the | character in a conditional
3704 character "b" is matched, but "c" is not. At this point, matching does not
3706 character. The conditional group is part of the single alternative that
3714 starting position, but allowing an advance to the next character (for an
3716 than one character. (*COMMIT) is the strongest, causing the entire match to
3778 reach them. This means that, for the Perl-compatible assertions, their effect
3785 PCRE2 now supports non-atomic positive assertions, as described in the section
3787 <a href="#nonatomicassertions">"Non-atomic assertions"</a>
3789 not Perl-compatible. For these assertions, a later backtrack does jump back
3851 Copyright &copy; 1997-2024 University of Cambridge.