pcre2pattern.html - OpenGrok cross reference for /external/pcre/doc/html/pcre2pattern.html

Lines Matching +full:- +full:- +full:without +full:- +full:perl
17 <li><a name="TOC2" href="#SEC2">SPECIAL START-OF-PATTERN ITEMS</a>
36 <li><a name="TOC21" href="#SEC21">NON-ATOMIC ASSERTIONS</a>
52 are described in detail below. There is a quick-reference syntax summary in the
54 page. PCRE2 tries to match Perl syntax and semantics as closely as it can.
56 conflict with the Perl syntax) in order to provide some compatibility with
60 Perl's regular expressions are described in its own documentation, and regular
70 using a different algorithm that is not Perl-compatible. Some of the features
77 <br><a name="SEC2" href="#TOC1">SPECIAL START-OF-PATTERN ITEMS</a><br>
80 by special items at the start of a pattern. These are not Perl-compatible, but
90 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
91 single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
92 specified for the 32-bit library, in which case it constrains the character
105 restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
137 Disabling auto-possessification
148 Disabling start-up optimizations
163 apply to patterns whose top-level branches all start with .* (match any number
225 character, the two-character sequence CRLF, any of the three preceding, any
258 matches. By default, this is any Unicode newline sequence, for Perl
295 equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
328   -      indicates character range
359 precede a non-alphanumeric with backslash to specify that it stands for itself.
369 putting them between \Q and \E. This is different from Perl in that $ and @
370 are handled as literals in \Q...\E sequences in PCRE2, whereas in Perl, $ and
371 @ cause variable interpolation. Also, Perl does "double-quotish backslash
376   Pattern            PCRE2 matches   Perl matches
392 Non-printing characters
395 A second use of backslash provides a way of encoding non-printing characters
397 non-printing characters in a pattern, but when a pattern is being prepared by
403   \cx         "control-x", where x is any printable ASCII character
430 two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \x followed
445 UTF mode. Perl also uses \N{name} to specify characters by Unicode name; PCRE2
462 compile-time error occurs.
467 escape is processed as specified for Perl in the <b>perlebcdic</b> document. The
468 only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ],
469 ^, _, or ?. Any other character provokes a compile-time error. The sequence
471 characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
482 because 127 is not a control character in EBCDIC, Perl makes it generate the
484 them the APC character has the value 255 (hex FF), but in the one Perl calls
485 POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
498 addition to Perl; it provides a way of specifying character code points as octal
510 and Perl has changed over time, causing PCRE2 also to change.
550   8-bit non-UTF mode    no greater than 0xff
551   16-bit non-UTF mode   no greater than 0xffff
552   32-bit non-UTF mode   no greater than 0xffffffff
556 so-called "surrogate" code points). The check for these can be disabled by the
558 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
559 and UTF-32 modes, because these values are not representable in UTF-16.
579 In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its string
601 For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
605 Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
626   \W     any "non-word" character
633 <a href="#digitsafterbackslash">"Non-printing characters"</a>
634 above for details. Perl also uses \N{name} to specify characters by Unicode
648 vary if locale-specific matching is taking place. For example, in some locales
649 the "non-breaking space" character (\xA0) is recognized as white space, and in
655 low-valued character tables, and may vary if locale-specific matching is taking
660 page). For example, in a French locale such as "fr_FR" in Unix-like systems,
668 for characters in the range 128-255 when locale-specific matching is happening.
691   U+00A0     Non-break space
698   U+2004     Three-per-em space
699   U+2005     Four-per-em space
700   U+2006     Six-per-em space
705   U+202F     Narrow no-break space
719 In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
727 Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the
734 This particular group matches either the two-character sequence CR followed by
737 line, U+0085). Because this is an atomic group, the two-character sequence is
758 Note that these special settings, which are not Perl-compatible, are recognized
775 can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
777 less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points
792   \P{<i>xx</i>}   a character without the <i>xx</i> property
795 The property names represented by <i>xx</i> above are not case-sensitive, and in
802 Certain other Perl properties such as "InMusicalSymbols" are not supported by
817 colon. If a script name is given without a property type, for example,
818 \p{Adlam}, it is treated as \p{scx:Adlam}. Perl changed to this
822 Unassigned characters (and in non-UTF 32-bit mode, characters with code points
825 of recognized script names and their 4-character abbreviations can be obtained
828   pcre2test -LS
837 a two-letter abbreviation. For compatibility with Perl, negation can be
869   Mn    Non-spacing mark
903 character when PCRE2 is not in UTF mode (using the 16-bit or 32-bit library).
911 The long synonyms for property names that Perl supports (such as \p{Letter})
923 the behaviour of current versions of Perl.
933   pcre2test -LP
956   L           left-to-right
957   LRE         left-to-right embedding
958   LRI         left-to-right isolate
959   LRO         left-to-right override
960   NSM         non-spacing mark
964   R           right-to-left
965   RLE         right-to-left embedding
966   RLI         right-to-left isolate
967   RLO         right-to-left override
972 case-insensitive; only the short names listed above are recognized.
986 Instead it introduced various emoji-specific properties. PCRE2 uses only the
1006 4. Do not end before extending characters or spacing marks or the "zero-width
1032 and \s to use Unicode properties. PCRE2 uses these non-standard, non-Perl
1038   Xsp   Any Perl space character
1039   Xwd   Any Perl "word" character
1044 Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl
1045 compatibility, but Perl changed. Xwd matches the same characters as Xan, plus
1049 There is another non-standard property, Xuc, which matches any character that
1088 From version 5.32.0 Perl forbids the use of \K in lookaround assertions. From
1091 <b>pcre2_compile()</b> to re-enable the previous behaviour. When this option is
1110 without consuming any characters from the subject string. The use of
1134 nor Perl has a separate "start of word" or "end of word" metasequence. However,
1145 argument of <b>pcre2_match()</b> is non-zero, indicating that matching is to
1155 <i>startoffset</i> is non-zero. By calling <b>pcre2_match()</b> multiple times
1156 with appropriate arguments, you can mimic Perl's /g option, and it is in this
1161 character of the matching process, is subtly different from Perl's, which
1162 defines it as true at the end of the previous match. In Perl, these can be
1173 The circumflex and dollar metacharacters are zero-width assertions. That is,
1174 they test for a particular condition being true without consuming any
1177 only the two-character sequence CRLF is recognized as a newline, isolated CR
1185 <b>pcre2_match()</b> is non-zero, or if PCRE2_NOTBOL is set, circumflex can
1219 for compatibility with Perl. However, this can be changed by setting the
1227 when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero. The
1233 below) recognizes the two-character sequence CRLF as a newline, this is
1254 Dot never matches a single line-ending character. When the two-character
1263 PCRE2_DOTALL option is set, a dot matches any one character, without exception.
1264 If the two-character sequence CRLF is present in the subject string, it takes
1280 <a href="#digitsafterbackslash">"Non-printing characters"</a>
1281 above for details. Perl also uses \N{name} to specify characters by Unicode
1287 whether or not a UTF mode is set. In the 8-bit library, one code unit is one
1288 byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
1289 32-bit unit. Unlike a dot, \C always matches line-ending characters. The
1290 feature is provided in Perl in order to match individual bytes in UTF-8 mode,
1295 with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
1309 in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
1312 The former gives a match-time error; the latter fails to optimize and so the
1316 In the 32-bit library, however, \C is always supported (when not explicitly
1317 locked out) because it always matches a single code unit, whether or not UTF-32
1322 it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
1324 could be used with a UTF-8 string (ignore white space and line breaks):
1326   (?| (?=[\x00-\x7f])(\C) |
1327       (?=[\x80-\x{7ff}])(\C)(\C) |
1328       (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
1329       (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
1334 below). The assertions at the start of each branch check the next UTF-8
1373 are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S)
1378 when matching character classes, whatever line-ending sequence is in use, and
1398 character class. For example, [d-m] matches any letter between d and m,
1402 or immediately after a range. For example, [b-d-z] matches letters in the range
1406 Perl treats a hyphen as a literal if it appears before or after a POSIX class
1408 However, unless the hyphen is the last character in the class, Perl outputs a
1414 range. A pattern such as [W-]46] is interpreted as a class of two characters
1415 ("W" and "-") followed by a literal string "46]", so it would match "W46]" or
1416 "-46]". However, if the "]" is escaped with a backslash it is interpreted as
1417 the end of range, so [W-\]46] is interpreted as a class containing a range
1424 example [\000-\037]. Ranges can include any characters that are valid for the
1425 current mode. In any UTF mode, the so-called "surrogate" characters (those
1428 this check). However, ranges such as [\x{d7ff}-\x{e000}], which include the
1434 Perl, EBCDIC code points within the range that are not letters are omitted. For
1435 example, [h-k] matches only four characters, even though the codes for h and k
1437 specified numerically, for example, [\x88-\x92] or [h-\x92], all code points
1442 matches the letters in either case. For example, [W-c] is equivalent to
1443 [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
1444 tables for a French locale are in use, [\xc8-\xcb] matches accented E
1459 introducing a POSIX class name, or for a special compatibility feature - see
1461 escaping other non-alphanumeric characters does no harm.
1465 Perl supports the POSIX notation for character classes. This uses names
1476   ascii    character codes 0 - 127
1490 and space (32). If locale-specific matching is taking place, the list of space
1495 The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
1496 5.8. Another Perl extension is negation, which is indicated by a ^ character
1501 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
1508 range 128-255 when locale-specific matching is happening. However, if the
1534   U+2066 - U+2069  Various "isolate"s
1564 not compatible with Perl. It is provided to help migrations from other
1568 above), and in a Perl-style pattern the preceding or following character
1569 normally shows which is wanted, without the need for the assertions that are
1592 and ")". These options are Perl-compatible, and are described in detail in the
1605 example (?-im). The two "extended" options are not independent; unsetting either
1609 A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
1617 options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
1618 the circumflex to cause some options to be re-instated, but a hyphen may not
1622 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
1623 the same way as the Perl-compatible options by using the characters J and U
1648 a non-capturing group (see the next section), the option letters may
1657 <b>Note:</b> There are other PCRE2-specific options, applying to the whole
1679 matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
1701 There are often times when grouping is required without capturing. If an
1714 a non-capturing group, the option letters may appear between the "?" and the
1727 Perl 5.10 introduced a feature whereby each alternative in a group uses the
1729 itself a non-capturing group. For example, consider this pattern:
1740 any branch. The following example is taken from the Perl documentation. The
1743   # before  ---------------branch-reset----------- after
1758 A relative reference such as (?-1) is no different: it is just a convenient way
1764 for a group's having matched refers to a non-unique number, the test is
1776 the naming of capture groups. This feature was not added to Perl until release
1778 using the Python syntax. PCRE2 supports both the Perl and the Python syntax.
1782 (?'name'...) as in Perl, or (?P&#60;name&#62;...) as in Python. Names may be up to 32
1784 alphanumeric characters and underscores, but must start with a non-digit. When
1789   ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
1801 if the names were not present. In both PCRE2 and Perl, capture groups
1804 name-to-number translation table from a compiled pattern, as well as
1810 of them. Perl allows identically numbered groups to have different names.
1815 Perl allows this, with both names AA and BB as aliases of group 1. Thus, after
1821 compile-time error. However, there is still scope for confusion. Consider this
1845 either as a 3-letter abbreviation or as the full name, and in both cases you
1864 If you make a backreference to a non-unique named group from elsewhere in the
1875 If you make a subroutine call to a non-unique named group, the one that
1933 which is represented by a two-byte sequence in a UTF-8 string. Similarly,
1948 For convenience, the three most common quantifiers have single-character
1960 Earlier versions of Perl and PCRE1 used to give an error at compile time for
1969 (up to the maximum number of permitted times), without causing the rest of the
1999 If the PCRE2_UNGREEDY option is set (an option that is not available in Perl),
2011 to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
2062 re-evaluated to see if a different number of repeats allows the rest of the
2076 that once a group has matched, it is not to be re-evaluated in this way.
2085 Perl 5.28 introduced an experimental alphabetic form starting with (* which may
2129 The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
2132 package, and PCRE1 copied it from there. It found its way into Perl at release
2149 matches an unlimited number of substrings that either consist of non-digits, or
2158 than a single character at the end, because both PCRE2 and Perl have an
2166 sequences of non-digits cannot be broken, and failure happens quickly.
2187 "Non-printing characters"
2203 An unsigned number specifies an absolute reference without the ambiguity that
2207   (abc(def)ghi)\g{-1}
2209 The sequence \g{-1} is a reference to the most recently started capture group
2211 \g{-2} would be equivalent to \1. The use of relative references can be
2217 forward reference can be useful in patterns that repeat. Perl does not support
2240 groups. The .NET syntax \k{name} and the Perl syntax \k&#60;name&#62; or \k'name'
2241 are supported, as is the Python syntax (?P=name). Perl 5.10's unified
2315 The Perl-compatible lookaround assertions are atomic. If an assertion is true,
2317 assertion. However, there are some cases where non-atomic assertions can be
2319 <a href="#nonatomicassertions">"Non-atomic assertions"</a>
2320 below, but they are not Perl-compatible.
2333 such as (.)\g{-1} can be used to check that two adjacent characters are the
2369 specify lookaround assertions. Perl 5.28 introduced some experimental
2424 have a fixed length. However, if there are several top-level alternatives, they
2435 extension compared with Perl, which requires all branches to match the same
2440 is not permitted, because its single top-level branch can match two different
2441 lengths, but it is acceptable to PCRE2 if rewritten to use two top-level
2448 can be used instead of a lookbehind assertion to get round the fixed-length
2458 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which matches a
2467 as the called capture group matches a fixed-length string. However,
2473 Perl does not support backreferences in lookbehinds. PCRE2 does support them,
2486 specify efficient matching of fixed-length strings at the end of subject
2546 <br><a name="SEC21" href="#TOC1">NON-ATOMIC ASSERTIONS</a><br>
2548 The traditional Perl-compatible lookaround assertions are atomic. That is, if
2550 backtracking into the assertion. However, there are some cases where non-atomic
2557 Consider the problem of finding the right-most word in a string that also
2569 succeeds, it captures the right-most word in the string.
2577 assertion could not be re-entered, and the whole match would fail. The pattern
2581 Using a non-atomic lookahead, however, means that when the last word does not
2582 occur twice in the string, the lookahead can backtrack and find the second-last
2587 Two conditions must be met for a non-atomic assertion to be useful: the
2591 as before because nothing has changed, so using a non-atomic assertion just
2595 There is one exception to backtracking into a non-atomic assertion. If an
2600 Non-atomic assertions are not supported by the alternative matching function
2626 the matched characters in a sequence of non-spaces that follow white space are
2641   \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
2657 Support for script runs is not available if PCRE2 is compiled without Unicode
2658 support. A compile-time error is given if any of the above constructs is
2676   (?(condition)yes-pattern)
2677   (?(condition)yes-pattern|no-pattern)
2679 If the condition is satisfied, the yes-pattern is used; otherwise the
2680 no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
2682 group, a compile-time error occurs. Each of the two alternatives may itself
2693 recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
2706 referenced by (?(-1), the next most recent by (?(-2), and so on. Inside loops
2709 is not used; it provokes a compile-time error.)
2712 Consider the following pattern, which contains non-significant white space to
2723 the condition is true, and so the yes-pattern is executed and a closing
2724 parenthesis is required. Otherwise, since no-pattern is not present, the
2726 sequence of non-parentheses, optionally enclosed in parentheses.
2732   ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
2740 Perl uses the syntax (?(&#60;name&#62;)...) or (?('name')...) to test for a used
2742 had this facility before Perl, the syntax (?(name)...) is also recognized.
2757 "Recursion" in this sense refers to any subroutine-like call from one part of
2811   (?(DEFINE) (?&#60;byte&#62; 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
2818 pattern uses references to the named group to match the four dot-separated
2846 PCRE2-specific
2847 <a href="#nonatomicassertions">non-atomic assertions.</a>
2850 Consider this pattern, again containing non-significant white space, and with
2853   (?(?=[^a-z]*[a-z])
2854   \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
2857 sequence of non-letters followed by a letter. In other words, it tests for the
2861 dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
2867 assertion, whether it succeeds or fails. (Compare non-conditional assertions,
2903 unlimited nested parentheses. Without the use of recursion, the best that can
2908 For some time, Perl has provided a facility that allows regular expressions to
2909 recurse (amongst other things). It does this by interpolating Perl code in the
2910 expression at run time, and the code can refer to the expression itself. A Perl
2916 The (?p{...}) item interpolates Perl code at run time, and in this case refers
2920 Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it
2923 this kind of recursion was subsequently introduced into Perl at release 5.10.
2929 <a href="#groupsassubroutines">non-recursive subroutine</a>
2940 substrings which can either be a sequence of non-parentheses, or a recursive
2943 to avoid backtracking into sequences of non-parentheses.
2957 pattern above you can write (?-2) to refer to the second most recently opened
2967   (?|(a)|(b)) (c) (?-2)
2970 is number 2. When the reference (?-2) is encountered, the second most recently
2980 <a href="#groupsassubroutines">non-recursive subroutine</a>
2984 An alternative approach is to use named parentheses. The Perl syntax for this
2996 non-parentheses is important when applying the pattern to strings that do not
3029 alternatives for the recursive and non-recursive cases. The (?R) item is the
3033 Differences in recursion processing between PCRE2 and Perl
3036 Some former differences between PCRE2 and Perl no longer exist.
3039 Before release 10.30, recursion processing in PCRE2 differed from Perl in that
3041 once it had matched some of the subject string, it was never re-entered, even
3043 failure. (Historical note: PCRE implemented recursion before Perl did.)
3047 as atomic. That is, they can be re-entered to try unused alternatives if there
3049 Perl works. If you want a subroutine call to be atomic, you must explicitly
3062 typical palindromic phrases, the pattern has to ignore all non-word characters,
3069 avoid backtracking into sequences of non-word characters. Without this, PCRE2
3071 Perl takes so long that you think it has gone into a loop.
3074 Another way in which PCRE2 and Perl used to differ in their recursion
3075 processing is in the handling of captured values. Formerly in Perl, when a
3085 "b" and so the whole match succeeds. This match used to fail in Perl, but in
3098   (...(relative)...)...(?-1)...
3120 Processing options such as case-independence are fixed when a group is
3124   (abc)(?i:(?-1))
3138 For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
3149   (abc)(?i:\g&#60;-1&#62;)
3151 Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
3156 Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
3162 PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl
3176 scripts within patterns in a similar way to Perl.
3187 one side-effect is that sometimes callouts are skipped. If you need all
3232 There are a number of special "Backtracking Control Verbs" (to use Perl's
3239 By default, for compatibility with Perl, a name is any sequence of characters
3243 is no longer Perl-compatible.
3256 skipped, and #-comments are recognized, exactly as in the rest of the pattern.
3261 The maximum length of a name is 255 in the 8-bit library and 65535 in the
3262 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
3291 the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
3301 Experiments with Perl suggest that it too has similar optimizations, and like
3347 abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl
3349 (??{}). Those are, of course, Perl features that are not present in PCRE2. The
3376 When a match succeeds, the name of the last-encountered mark name on the
3388 verb without a NAME argument is ignored for this purpose. Here is an example of
3407 name is recorded and passed back if it is the last-encountered. This does not
3475 unless PCRE2's start-of-match optimizations are turned off, as shown in this
3490 applied starting at "x", and so the (*COMMIT) causes the match to fail without
3514 This verb, when given without a name, is like (*PRUNE), except that if the
3547 assertions, because they are never re-entered by backtracking. Compare the
3578 pattern-based if-then-else block:
3584 second alternative and tries COND2, without backtracking into COND1. If that
3652 not always the same as Perl's. It means that if two or more backtracking verbs
3666 PCRE2 sometimes differs from Perl in its handling of backtracking verbs in
3671 If the subject is "abac", Perl matches unless its optimizations are disabled,
3686 without any further processing; captured strings and a mark name (if set) are
3688 fail without any further processing; captured substrings and any mark name are
3698 reach them. This means that, for the Perl-compatible assertions, their effect
3699 is confined to the assertion, because Perl lookaround assertions are atomic. A
3705 PCRE2 now supports non-atomic positive assertions, as described in the section
3707 <a href="#nonatomicassertions">"Non-atomic assertions"</a>
3709 not Perl-compatible. For these assertions, a later backtrack does jump back
3724 the assertion to be true, without considering any further alternative branches.
3734 succeed without any further processing. Matching then continues after the
3735 subroutine call. Perl documents this behaviour. Perl's treatment of the other
3771 Copyright &copy; 1997-2022 University of Cambridge.