pcre2pattern.3 - OpenGrok cross reference for /external/pcre/doc/pcre2pattern.3

Lines Matching +full:- +full:- +full:without +full:- +full:perl
3 PCRE2 - Perl-compatible regular expressions (revised API)
8 are described in detail below. There is a quick-reference syntax summary in the
12 page. PCRE2 tries to match Perl syntax and semantics as closely as it can.
14 conflict with the Perl syntax) in order to provide some compatibility with
17 Perl's regular expressions are described in its own documentation, and regular
26 using a different algorithm that is not Perl-compatible. Some of the features
36 .SH "SPECIAL START-OF-PATTERN ITEMS"
40 by special items at the start of a pattern. These are not Perl-compatible, but
50 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
51 single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
52 specified for the 32-bit library, in which case it constrains the character
66 restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
97 .SS "Disabling auto-possessification"
110 .SS "Disabling start-up optimizations"
127 apply to patterns whose top-level branches all start with .* (match any number
189 character, the two-character sequence CRLF, any of the three preceding, any
225 matches. By default, this is any Unicode newline sequence, for Perl
269 equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
300   -      indicates character range
334 precede a non-alphanumeric with backslash to specify that it stands for itself.
342 putting them between \eQ and \eE. This is different from Perl in that $ and @
343 are handled as literals in \eQ...\eE sequences in PCRE2, whereas in Perl, $ and
344 @ cause variable interpolation. Also, Perl does "double-quotish backslash
349   Pattern            PCRE2 matches   Perl matches
368 .SS "Non-printing characters"
371 A second use of backslash provides a way of encoding non-printing characters
373 non-printing characters in a pattern, but when a pattern is being prepared by
379   \ecx         "control-x", where x is any printable ASCII character
404 two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \ex followed
417 UTF mode. Perl also uses \eN{name} to specify characters by Unicode name; PCRE2
432 compile-time error occurs.
436 escape is processed as specified for Perl in the \fBperlebcdic\fP document. The
437 only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
438 ^, _, or ?. Any other character provokes a compile-time error. The sequence
440 characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31
449 because 127 is not a control character in EBCDIC, Perl makes it generate the
451 them the APC character has the value 255 (hex FF), but in the one Perl calls
452 POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
463 addition to Perl; it provides a way of specifying character code points as octal
473 and Perl has changed over time, causing PCRE2 also to change.
525   8-bit non-UTF mode    no greater than 0xff
526   16-bit non-UTF mode   no greater than 0xffff
527   32-bit non-UTF mode   no greater than 0xffffffff
531 so-called "surrogate" code points). The check for these can be disabled by the
533 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
534 and UTF-32 modes, because these values are not representable in UTF-16.
553 In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string
581 For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
588 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
613   \eW     any "non-word" character
625 "Non-printing characters"
627 above for details. Perl also uses \eN{name} to specify characters by Unicode
639 vary if locale-specific matching is taking place. For example, in some locales
640 the "non-breaking space" character (\exA0) is recognized as white space, and in
645 low-valued character tables, and may vary if locale-specific matching is taking
655 page). For example, in a French locale such as "fr_FR" in Unix-like systems,
662 for characters in the range 128-255 when locale-specific matching is happening.
684   U+00A0     Non-break space
691   U+2004     Three-per-em space
692   U+2005     Four-per-em space
693   U+2006     Six-per-em space
698   U+202F     Narrow no-break space
712 In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
721 Unicode newline sequence. In 8-bit non-UTF-8 mode \eR is equivalent to the
731 This particular group matches either the two-character sequence CR followed by
734 line, U+0085). Because this is an atomic group, the two-character sequence is
753 Note that these special settings, which are not Perl-compatible, are recognized
771 can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
773 less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points
786   \eP{\fIxx\fP}   a character without the \fIxx\fP property
789 The property names represented by \fIxx\fP above are not case-sensitive, and in
799 Certain other Perl properties such as "InMusicalSymbols" are not supported by
815 colon. If a script name is given without a property type, for example,
816 \ep{Adlam}, it is treated as \ep{scx:Adlam}. Perl changed to this
819 Unassigned characters (and in non-UTF 32-bit mode, characters with code points
822 of recognized script names and their 4-character abbreviations can be obtained
825   pcre2test -LS
834 a two-letter abbreviation. For compatibility with Perl, negation can be
865   Mn    Non-spacing mark
898 character when PCRE2 is not in UTF mode (using the 16-bit or 32-bit library).
907 The long synonyms for property names that Perl supports (such as \ep{Letter})
917 the behaviour of current versions of Perl.
927   pcre2test -LP
948   L           left-to-right
949   LRE         left-to-right embedding
950   LRI         left-to-right isolate
951   LRO         left-to-right override
952   NSM         non-spacing mark
956   R           right-to-left
957   RLE         right-to-left embedding
958   RLI         right-to-left isolate
959   RLO         right-to-left override
964 case-insensitive; only the short names listed above are recognized.
981 Instead it introduced various emoji-specific properties. PCRE2 uses only the
996 4. Do not end before extending characters or spacing marks or the "zero-width
1019 and \es to use Unicode properties. PCRE2 uses these non-standard, non-Perl
1025   Xsp   Any Perl space character
1026   Xwd   Any Perl "word" character
1031 Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl
1032 compatibility, but Perl changed. Xwd matches the same characters as Xan, plus
1035 There is another non-standard property, Xuc, which matches any character that
1080 From version 5.32.0 Perl forbids the use of \eK in lookaround assertions. From
1083 \fBpcre2_compile()\fP to re-enable the previous behaviour. When this option is
1103 without consuming any characters from the subject string. The use of
1129 nor Perl has a separate "start of word" or "end of word" metasequence. However,
1139 argument of \fBpcre2_match()\fP is non-zero, indicating that matching is to
1148 \fIstartoffset\fP is non-zero. By calling \fBpcre2_match()\fP multiple times
1149 with appropriate arguments, you can mimic Perl's /g option, and it is in this
1153 character of the matching process, is subtly different from Perl's, which
1154 defines it as true at the end of the previous match. In Perl, these can be
1166 The circumflex and dollar metacharacters are zero-width assertions. That is,
1167 they test for a particular condition being true without consuming any
1170 only the two-character sequence CRLF is recognized as a newline, isolated CR
1177 \fBpcre2_match()\fP is non-zero, or if PCRE2_NOTBOL is set, circumflex can
1210 for compatibility with Perl. However, this can be changed by setting the
1217 when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The
1225 below) recognizes the two-character sequence CRLF as a newline, this is
1250 Dot never matches a single line-ending character. When the two-character
1258 PCRE2_DOTALL option is set, a dot matches any one character, without exception.
1259 If the two-character sequence CRLF is present in the subject string, it takes
1274 "Non-printing characters"
1276 above for details. Perl also uses \eN{name} to specify characters by Unicode
1284 whether or not a UTF mode is set. In the 8-bit library, one code unit is one
1285 byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
1286 32-bit unit. Unlike a dot, \eC always matches line-ending characters. The
1287 feature is provided in Perl in order to match individual bytes in UTF-8 mode,
1291 with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
1306 in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
1309 The former gives a match-time error; the latter fails to optimize and so the
1312 In the 32-bit library, however, \eC is always supported (when not explicitly
1313 locked out) because it always matches a single code unit, whether or not UTF-32
1317 it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
1319 could be used with a UTF-8 string (ignore white space and line breaks):
1321   (?| (?=[\ex00-\ex7f])(\eC) |
1322       (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
1323       (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
1324       (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
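The four branches above each pair a lookahead on the next character's code point range with a fixed number of \eC items, one per UTF-8 code unit. As a rough illustration outside PCRE2, the same range-to-length mapping can be sketched in Python. The function name utf8_unit_count is hypothetical, and the code does not call PCRE2; it only restates the range boundaries used in the pattern.

```python
def utf8_unit_count(ch: str) -> int:
    """Return how many UTF-8 code units (bytes) encode one character.

    The thresholds mirror the four lookahead ranges in the pattern above.
    This is an illustrative helper, not part of any PCRE2 API.
    """
    cp = ord(ch)
    if cp <= 0x7F:        # first branch: one code unit
        return 1
    if cp <= 0x7FF:       # second branch: two code units
        return 2
    if cp <= 0xFFFF:      # third branch: three code units
        return 3
    return 4              # fourth branch: four code units


if __name__ == "__main__":
    # Compare the range-based count with the length Python's encoder produces.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}", utf8_unit_count(ch), len(ch.encode("utf-8")))
```

The comparison against len(ch.encode("utf-8")) is there only as a sanity check on the thresholds.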
1332 below). The assertions at the start of each branch check the next UTF-8
1371 are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S)
1375 when matching character classes, whatever line-ending sequence is in use, and
1396 character class. For example, [d-m] matches any letter between d and m,
1400 or immediately after a range. For example, [b-d-z] matches letters in the range
1403 Perl treats a hyphen as a literal if it appears before or after a POSIX class
1405 However, unless the hyphen is the last character in the class, Perl outputs a
1410 range. A pattern such as [W-]46] is interpreted as a class of two characters
1411 ("W" and "-") followed by a literal string "46]", so it would match "W46]" or
1412 "-46]". However, if the "]" is escaped with a backslash it is interpreted as
1413 the end of range, so [W-\e]46] is interpreted as a class containing a range
1419 example [\e000-\e037]. Ranges can include any characters that are valid for the
1420 current mode. In any UTF mode, the so-called "surrogate" characters (those
1423 this check). However, ranges such as [\ex{d7ff}-\ex{e000}], which include the
1428 Perl, EBCDIC code points within the range that are not letters are omitted. For
1429 example, [h-k] matches only four characters, even though the codes for h and k
1431 specified numerically, for example, [\ex88-\ex92] or [h-\ex92], all code points
1435 matches the letters in either case. For example, [W-c] is equivalent to
1436 [][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
1437 tables for a French locale are in use, [\exc8-\excb] matches accented E
1450 introducing a POSIX class name, or for a special compatibility feature - see
1452 escaping other non-alphanumeric characters does no harm.
1458 Perl supports the POSIX notation for character classes. This uses names
1469   ascii    character codes 0 - 127
1483 and space (32). If locale-specific matching is taking place, the list of space
1487 The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
1488 5.8. Another Perl extension is negation, which is indicated by a ^ character
1493 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
1499 range 128-255 when locale-specific matching is happening. However, if the
1524   U+2066 - U+2069  Various "isolate"s
1552 not compatible with Perl. It is provided to help migrations from other
1559 above), and in a Perl-style pattern the preceding or following character
1560 normally shows which is wanted, without the need for the assertions that are
1591 and ")". These options are Perl-compatible, and are described in detail in the
1606 example (?-im). The two "extended" options are not independent; unsetting either
1609 A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
1616 options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
1617 the circumflex to cause some options to be re-instated, but a hyphen may not
1620 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
1621 the same way as the Perl-compatible options by using the characters J and U
1644 a non-capturing group (see the next section), the option letters may
1652 \fBNote:\fP There are other PCRE2-specific options, applying to the whole
1679 matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
1698 There are often times when grouping is required without capturing. If an
1710 a non-capturing group, the option letters may appear between the "?" and the
1726 Perl 5.10 introduced a feature whereby each alternative in a group uses the
1728 itself a non-capturing group. For example, consider this pattern:
1739 any branch. The following example is taken from the Perl documentation. The
1742   # before  ---------------branch-reset----------- after
1757 A relative reference such as (?-1) is no different: it is just a convenient way
1765 for a group's having matched refers to a non-unique number, the test is
1778 the naming of capture groups. This feature was not added to Perl until release
1780 using the Python syntax. PCRE2 supports both the Perl and the Python syntax.
1783 (?'name'...) as in Perl, or (?P<name>...) as in Python. Names may be up to 32
1785 alphanumeric characters and underscores, but must start with a non-digit. When
1790   ^[_A-Za-z][_A-Za-z0-9]*\ez   when PCRE2_UTF is not set
1810 if the names were not present. In both PCRE2 and Perl, capture groups
1813 name-to-number translation table from a compiled pattern, as well as
1818 of them. Perl allows identically numbered groups to have different names.
1823 Perl allows this, with both names AA and BB as aliases of group 1. Thus, after
1828 compile-time error. However, there is still scope for confusion. Consider this
1853 either as a 3-letter abbreviation or as the full name, and in both cases you
1871 If you make a backreference to a non-unique named group from elsewhere in the
1880 If you make a subroutine call to a non-unique named group, the one that
1944 which is represented by a two-byte sequence in a UTF-8 string. Similarly,
1963 For convenience, the three most common quantifiers have single-character
1975 Earlier versions of Perl and PCRE1 used to give an error at compile time for
1983 (up to the maximum number of permitted times), without causing the rest of the
2012 If the PCRE2_UNGREEDY option is set (an option that is not available in Perl),
2022 to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
2072 re-evaluated to see if a different number of repeats allows the rest of the
2085 that once a group has matched, it is not to be re-evaluated in this way.
2093 Perl 5.28 introduced an experimental alphabetic form starting with (* which may
2133 The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
2136 package, and PCRE1 copied it from there. It found its way into Perl at release
2151 matches an unlimited number of substrings that either consist of non-digits, or
2160 than a single character at the end, because both PCRE2 and Perl have an
2168 sequences of non-digits cannot be broken, and failure happens quickly.
2190 "Non-printing characters"
2208 An unsigned number specifies an absolute reference without the ambiguity that
2212   (abc(def)ghi)\eg{-1}
2214 The sequence \eg{-1} is a reference to the most recently started capture group
2216 \eg{-2} would be equivalent to \e1. The use of relative references can be
2221 forward reference can be useful in patterns that repeat. Perl does not support
2245 groups. The .NET syntax \ek{name} and the Perl syntax \ek<name> or \ek'name'
2246 are supported, as is the Python syntax (?P=name). Perl 5.10's unified
2327 The Perl-compatible lookaround assertions are atomic. If an assertion is true,
2329 assertion. However, there are some cases where non-atomic assertions can be
2333 "Non-atomic assertions"
2335 below, but they are not Perl-compatible.
2349 such as (.)\eg{-1} can be used to check that two adjacent characters are the
2385 specify lookaround assertions. Perl 5.28 introduced some experimental
2440 have a fixed length. However, if there are several top-level alternatives, they
2451 extension compared with Perl, which requires all branches to match the same
2456 is not permitted, because its single top-level branch can match two different
2457 lengths, but it is acceptable to PCRE2 if rewritten to use two top-level
2467 can be used instead of a lookbehind assertion to get round the fixed-length
2475 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a
2486 as the called capture group matches a fixed-length string. However,
2494 Perl does not support backreferences in lookbehinds. PCRE2 does support them,
2505 specify efficient matching of fixed-length strings at the end of subject
2566 .SH "NON-ATOMIC ASSERTIONS"
2569 The traditional Perl-compatible lookaround assertions are atomic. That is, if
2571 backtracking into the assertion. However, there are some cases where non-atomic
2578 Consider the problem of finding the right-most word in a string that also
2590 succeeds, it captures the right-most word in the string.
2597 assertion could not be re-entered, and the whole match would fail. The pattern
2600 Using a non-atomic lookahead, however, means that when the last word does not
2601 occur twice in the string, the lookahead can backtrack and find the second-last
2605 Two conditions must be met for a non-atomic assertion to be useful: the
2609 as before because nothing has changed, so using a non-atomic assertion just
2612 There is one exception to backtracking into a non-atomic assertion. If an
2616 Non-atomic assertions are not supported by the alternative matching function
2651 the matched characters in a sequence of non-spaces that follow white space are
2666   \es+(?=[0-9_.]*\ep{Latin})(*sr:\eS+)
2679 Support for script runs is not available if PCRE2 is compiled without Unicode
2680 support. A compile-time error is given if any of the above constructs is
2703   (?(condition)yes-pattern)
2704   (?(condition)yes-pattern|no-pattern)
2706 If the condition is satisfied, the yes-pattern is used; otherwise the
2707 no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
2709 group, a compile-time error occurs. Each of the two alternatives may itself
2718 recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
2735 referenced by (?(-1), the next most recent by (?(-2), and so on. Inside loops
2738 is not used; it provokes a compile-time error.)
2740 Consider the following pattern, which contains non-significant white space to
2751 the condition is true, and so the yes-pattern is executed and a closing
2752 parenthesis is required. Otherwise, since no-pattern is not present, the
2754 sequence of non-parentheses, optionally enclosed in parentheses.
2759   ...other stuff... ( \e( )?    [^()]+    (?(-1) \e) ) ...
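The optional-parenthesis conditional can be tried out with Python's re module, which also supports group-reference conditions. This is a sketch under one stated assumption: Python's re is not PCRE2 and has no relative form such as (?(-1)...), so the absolute reference (?(1)...) is used instead. The names OPTIONAL_PARENS and is_optionally_parenthesized are hypothetical.

```python
import re

# Group 1 optionally matches an opening parenthesis. The conditional
# (?(1)\)) then requires a closing parenthesis only if group 1 took part
# in the match. Python's re needs an absolute group number here, so
# (?(1)...) stands in for the relative (?(-1)...) form.
OPTIONAL_PARENS = re.compile(r"(\()?[^()]+(?(1)\))")


def is_optionally_parenthesized(text: str) -> bool:
    """True if text is a run of non-parentheses, optionally wrapped in ( )."""
    return OPTIONAL_PARENS.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("abc", "(abc)", "(abc", "abc)"):
        print(repr(sample), is_optionally_parenthesized(sample))
```

Using fullmatch keeps the check strict, so a stray parenthesis at either end makes the whole string fail instead of matching a shorter substring.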
2767 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
2769 had this facility before Perl, the syntax (?(name)...) is also recognized.
2784 "Recursion" in this sense refers to any subroutine-like call from one part of
2844   (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) )
2851 pattern uses references to the named group to match the four dot-separated
2879 PCRE2-specific
2882 non-atomic assertions.
2885 Consider this pattern, again containing non-significant white space, and with
2888   (?(?=[^a-z]*[a-z])
2889   \ed{2}-[a-z]{3}-\ed{2}  |  \ed{2}-\ed{2}-\ed{2} )
2892 sequence of non-letters followed by a letter. In other words, it tests for the
2896 dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
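Python's re module does not accept an assertion as a condition, but the same decision can be approximated there by guarding each alternative with the lookahead or its negation. The sketch below is an emulation in a different engine, not PCRE2's conditional syntax, and the names DATE_LIKE and matches_date_like are hypothetical.

```python
import re

# Emulates (?(?=A)X|Y) as (?=A)X|(?!A)Y: the first alternative is tried
# only when a letter appears later in the subject, the second only when
# no letter appears.
DATE_LIKE = re.compile(
    r"(?:(?=[^a-z]*[a-z])\d{2}-[a-z]{3}-\d{2}"
    r"|(?![^a-z]*[a-z])\d{2}-\d{2}-\d{2})"
)


def matches_date_like(text: str) -> bool:
    """True if text starts with dd-aaa-dd or dd-dd-dd under the rule above."""
    return DATE_LIKE.match(text) is not None


if __name__ == "__main__":
    for sample in ("12-abc-34", "12-34-56", "12-34-56x", "12-ab-34"):
        print(repr(sample), matches_date_like(sample))
```

In the third sample the trailing letter makes the lookahead succeed, so only the letter form is attempted and the match fails, which is the behaviour the conditional form is described as having.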
2901 assertion, whether it succeeds or fails. (Compare non-conditional assertions,
2945 unlimited nested parentheses. Without the use of recursion, the best that can
2949 For some time, Perl has provided a facility that allows regular expressions to
2950 recurse (amongst other things). It does this by interpolating Perl code in the
2951 expression at run time, and the code can refer to the expression itself. A Perl
2957 The (?p{...}) item interpolates Perl code at run time, and in this case refers
2960 Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it
2963 this kind of recursion was subsequently introduced into Perl at release 5.10.
2970 non-recursive subroutine
2981 substrings which can either be a sequence of non-parentheses, or a recursive
2984 to avoid backtracking into sequences of non-parentheses.
2996 pattern above you can write (?-2) to refer to the second most recently opened
3008   (?|(a)|(b)) (c) (?-2)
3011 is number 2. When the reference (?-2) is encountered, the second most recently
3022 non-recursive subroutine
3026 An alternative approach is to use named parentheses. The Perl syntax for this
3037 non-parentheses is important when applying the pattern to strings that do not
3070 alternatives for the recursive and non-recursive cases. The (?R) item is the
3075 .SS "Differences in recursion processing between PCRE2 and Perl"
3078 Some former differences between PCRE2 and Perl no longer exist.
3080 Before release 10.30, recursion processing in PCRE2 differed from Perl in that
3082 once it had matched some of the subject string, it was never re-entered, even
3084 failure. (Historical note: PCRE implemented recursion before Perl did.)
3087 as atomic. That is, they can be re-entered to try unused alternatives if there
3089 Perl works. If you want a subroutine call to be atomic, you must explicitly
3101 typical palindromic phrases, the pattern has to ignore all non-word characters,
3108 avoid backtracking into sequences of non-word characters. Without this, PCRE2
3110 Perl takes so long that you think it has gone into a loop.
3112 Another way in which PCRE2 and Perl used to differ in their recursion
3113 processing is in the handling of captured values. Formerly in Perl, when a
3123 "b" and so the whole match succeeds. This match used to fail in Perl, but in
3139   (...(relative)...)...(?-1)...
3159 Processing options such as case-independence are fixed when a group is
3163   (abc)(?i:(?-1))
3185 For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
3196   (abc)(?i:\eg<-1>)
3198 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
3205 Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
3210 PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl
3223 scripts within patterns in a similar way to Perl.
3232 one side-effect is that sometimes callouts are skipped. If you need all
3282 There are a number of special "Backtracking Control Verbs" (to use Perl's
3288 By default, for compatibility with Perl, a name is any sequence of characters
3292 is no longer Perl-compatible.
3303 skipped, and #-comments are recognized, exactly as in the rest of the pattern.
3307 The maximum length of a name is 255 in the 8-bit library and 65535 in the
3308 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
3345 the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
3359 Experiments with Perl suggest that it too has similar optimizations, and like
3403 abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl
3405 (??{}). Those are, of course, Perl features that are not present in PCRE2. The
3430 When a match succeeds, the name of the last-encountered mark name on the
3446 verb without a NAME argument is ignored for this purpose. Here is an example of
3464 name is recorded and passed back if it is the last-encountered. This does not
3529 unless PCRE2's start-of-match optimizations are turned off, as shown in this
3544 applied starting at "x", and so the (*COMMIT) causes the match to fail without
3567 This verb, when given without a name, is like (*PRUNE), except that if the
3598 assertions, because they are never re-entered by backtracking. Compare the
3628 pattern-based if-then-else block:
3634 second alternative and tries COND2, without backtracking into COND1. If that
3698 not always the same as Perl's. It means that if two or more backtracking verbs
3713 PCRE2 sometimes differs from Perl in its handling of backtracking verbs in
3718 If the subject is "abac", Perl matches unless its optimizations are disabled,
3733 without any further processing; captured strings and a mark name (if set) are
3735 fail without any further processing; captured substrings and any mark name are
3743 reach them. This means that, for the Perl-compatible assertions, their effect
3744 is confined to the assertion, because Perl lookaround assertions are atomic. A
3749 PCRE2 now supports non-atomic positive assertions, as described in the section
3753 "Non-atomic assertions"
3756 not Perl-compatible. For these assertions, a later backtrack does jump back
3769 the assertion to be true, without considering any further alternative branches.
3779 succeed without any further processing. Matching then continues after the
3780 subroutine call. Perl documents this behaviour. Perl's treatment of the other
3818 Copyright (c) 1997-2022 University of Cambridge.