pcre2pattern.3 - OpenGrok cross reference for /external/pcre/doc/pcre2pattern.3

Lines Matching full:are
7 The syntax and semantics of the regular expressions that are supported by PCRE2
17 Perl's regular expressions are described in its own documentation, and regular
18 expressions in general are covered in a number of books, some of which have
23 This document discusses the regular expression patterns that are supported by
27 discussed below are not available when DFA matching is used. The advantages and
29 function, are discussed in the
40 by special items at the start of a pattern. These are not Perl-compatible, but
41 are provided to make these options accessible to pattern writers who are not
153 These facilities are provided to catch runaway matches that are provoked by
174 \fBpcre2_dfa_match()\fP interpreters are used for matching. It does not apply
221 The newline convention affects where the circumflex and dollar assertions are
251 the sections below, character code values are ASCII or Unicode; in an EBCDIC
252 environment these characters may have different code values, and there are no
267 pattern), letters are matched independently of case. Note that there are two
269 equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
273 character classes, alternatives, and repetitions in the pattern. These are
275 for themselves but instead are interpreted in some special way.
277 There are two different sets of metacharacters: those that are recognized
278 anywhere in the pattern except within square brackets, and those that are
296 a character class the only metacharacters are:
306 outside a character class and the next newline, inclusive, are ignored. An
309 applies, but in addition unescaped space and horizontal tab characters are
310 ignored inside a character class. Note: only these two characters are ignored,
311 not the full set of pattern white space characters that are ignored outside a
338 other characters (in particular, those whose code points are greater than 127)
376 environment, these escapes are as follows:
393 digits are read (letters can be in upper or lower case). Any number of
398 Characters whose code points are less than 256 can be defined by either of the
400 they are handled. For example, \exdc is exactly the same as \ex{dc} or \e334.
422 There are some legacy applications where the escape sequence \er is expected to
437 only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
450 APC character. Unfortunately, there are several variants of EBCDIC. In most of
455 After \e0 up to two further octal digits are read. If there are fewer than two
456 digits, just those that are present are used. Thus the sequence \e0\ex\e015
477 if there are at least that many previous capture groups in the expression, the
489 Otherwise, up to three octal digits are read to form a character code.
498   \e40    is the same, provided there are fewer than 40
514 Note that octal values of 100 or greater that are specified using this syntax
516 digits are ever read.
522 Characters that are specified using octal or hexadecimal numbers are
530 Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
534 and UTF-32 modes, because these values are not representable in UTF-16.
545 \eB, \eR, and \eX are not special inside a character class. Like other
553 In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string
566 can be coded as \eg{name}. Backreferences are discussed
583 syntax for referencing a capture group as a subroutine. Details are discussed
588 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
637 The default \es characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
638 space (32), which are defined as white space in the "C" locale. This list may
656 or "french" in Windows, some character codes greater than 127 are used for
657 accented letters, and these are then matched by \ew. The use of locales with
660 By default, characters whose code points are greater than 127 never match \ed,
665 is set, the behaviour is changed so that Unicode properties are used to
675 \eB because they are defined in terms of \ew and \eW. Matching these sequences
680 points, whether or not PCRE2_UCP is set. The horizontal space characters are:
702 The vertical space characters are:
726 This is an example of an "atomic group", details of which are given
737 In other modes, two additional characters whose code points are greater than 255
753 Note that these special settings, which are not Perl-compatible, are recognized
770 sequences that match characters with specific properties are available. They
772 sequences are of course limited to testing characters whose code points are
774 greater than 0x10ffff (the Unicode limit) may be encountered. These are all
783 The extra escape sequences that provide property support are:
789 The property names represented by \fIxx\fP above are not case-sensitive, and in
791 underscores are ignored. There is support for Unicode script names, Unicode
799 Certain other Perl properties such as "InMusicalSymbols" are not supported by
808 There are three different syntax forms for matching a script. Each Unicode
814 property types are recognized, and an equals sign is an alternative to the
820 greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
821 part of an identified script are lumped together as "Common". The current list
840 of negation, the curly brackets in the escape sequence are optional; these two
846 The following general category property codes are supported:
896 The Cs (Surrogate) property applies only to characters whose code points are in
897 the range U+D800 to U+DFFF. These characters are no different to any other
899 However, they are not valid in Unicode strings and so cannot be tested by PCRE2
924 values are true or false. You can obtain a list of those that are recognized by
937 The recognized classes are:
963 An equals sign may be used instead of a colon. The class names are
964 case-insensitive; only the short names listed above are recognized.
978 define the boundaries of extended grapheme clusters. The rules are defined in
1004 Extend and ZWJ characters are allowed between the characters.
1007 regional indicator (RI) characters if there are an odd number of RI characters
1021 explicitly. These properties are:
1037 languages. These are the characters $, @, ` (grave accent), and all characters
1039 surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
1040 excluded. (Universal Character Names are of the form \euHHHH or \eUHHHHHHHH
1109 The backslashed assertions are:
1135 start and end of the subject string, whatever options are set. Thus, they are
1136 independent of multiline mode. These three assertions are not affected by the
1166 The circumflex and dollar metacharacters are zero-width assertions. That is,
1168 characters from the subject string. These two metacharacters are concerned with
1171 and LF characters are treated as ordinary data characters, and are not
1186 alternatives are involved, but it should be the first thing in each alternative
1190 "anchored" pattern. (There are also other constructs that can cause a pattern
1197 character of the pattern if a number of alternatives are involved, but it
1205 The meanings of the circumflex and dollar metacharacters are changed if the
1215 patterns that are anchored in single line mode because all branches start with
1216 ^ are not anchored in multiline mode, and a match for circumflex is possible
1226 preferred, even if the single characters CR and LF are also recognized as
1254 of CR or LF match dot. When all Unicode line endings are being recognized, dot
1334 character's individual bytes are then captured by the appropriate number of
1360 are in the class by enumerating those that are not. A class that starts with a
1369 match "A", whereas a caseful version would. Note that there are two ASCII
1374 Characters that might indicate line breaks are never treated in any special way
1390 class; it matches the backspace character. The sequences \eB, \eR, and \eX are
1419 example [\e000-\e037]. Ranges can include any characters that are valid for the
1424 surrogates, are always permitted.
1426 There is a special case in EBCDIC environments for ranges whose end points are
1428 Perl, EBCDIC code points within the range that are not letters are omitted. For
1437 tables for a French locale are in use, [\exc8-\excb] matches accented E
1447 The only metacharacters that are recognized in character classes are backslash,
1482 The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
1494 syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1495 supported, and an error is given if they are encountered.
1500 PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
1501 changed so that Unicode character properties are used. This is achieved by
1515 classes are handled specially in UCP mode:
1528 This matches the same characters as [:graph:] plus space characters that are
1536 The other POSIX classes are unchanged, and match only characters with code
1550 Only these exact character sequences are recognized. A sequence such as
1560 normally shows which is wanted, without the need for the assertions that are
1567 Vertical bar characters are used to separate alternative patterns. For example,
1575 that succeeds is used. If the alternatives are within a group
1591 and ")". These options are Perl-compatible, and are described in detail in the
1595 documentation. The option letters are:
1606 example (?-im). The two "extended" options are not independent; unsetting either
1622 respectively. However, these are not unset by (?^).
1643 As a convenient shorthand, if any option settings are required at the start of
1652 \fBNote:\fP There are other PCRE2-specific options, applying to the whole
1656 Details are given in the section entitled
1661 above. There are also the (*UTF) and (*UCP) leading sequences that can be used
1662 to set UTF and Unicode property modes; they are equivalent to setting the
1672 Groups are delimited by parentheses (round brackets), which can be nested.
1688 Opening parentheses are counted from left to right (starting from 1) to obtain
1694 the captured substrings are "red king", "red", and "king", and are numbered 1,
1698 There are often times when grouping is required without capturing. If an
1706 the captured substrings are "white queen" and "queen", and are numbered 1 and
1709 As a convenient shorthand, if any option settings are required at the start of
1716 match exactly the same set of strings. Because alternative branches are tried
1717 from left to right, and options are not reset until the end of the group is
1732 Because the two alternatives are inside a (?| group, both sets of capturing
1733 parentheses are numbered one. Thus, when the pattern matches, you can look
1736 alternatives. Inside a (?| group, parentheses are numbered as usual, but the
1809 Named capture groups are allocated numbers as well as names, exactly as
1811 are primarily identified by numbers; any names are just aliases for these
1819 Consider this pattern, where there are two capture groups, both numbered 1:
1864 There are five capture groups, but only one is ever set after a match. The
1872 pattern, the groups to which the name refers are checked in the order in which
1892 recursion, all groups with the same name are tested. If the condition is true
1928 no upper limit; if the second number and the comma are both omitted, the
1950 capture groups that are referenced as
1960 below). Except for parenthesized groups, items that have a {0} quantifier are
1976 such patterns. However, because there are cases where this can be useful, such
1977 patterns are now accepted, but whenever an iteration of such a group matches no
1982 By default, quantifiers are "greedy", that is, they match as much as possible
2013 the quantifiers are not greedy by default, but individual ones can be made
2032 However, there are some cases where the optimization cannot be used. When .*
2033 is inside capturing parentheses that are the subject of a backreference
2058 "tweedledee". However, if there are nested capture groups, the corresponding
2107 Atomic groups are not capture groups. Simple cases such as the above example
2109 So, while both \ed+ and \ed+? are prepared to adjust the number of digits they
2127 Possessive quantifiers are always greedy; the setting of the PCRE2_UNGREEDY
2128 option is ignored. They are a convenient notation for the simpler forms of
2181 always taken as a backreference, and causes an error only if there are not that
2197 there is no problem when named capture groups are used (see below).
2201 signed or unsigned number, optionally enclosed in braces. These examples are
2217 helpful in long patterns, and also in patterns that are created by joining
2244 There are several different ways of writing backreferences to named capture
2270 backslash are taken as part of a potential backreference number. If the pattern
2313 coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described
2319 More complicated assertions are coded as parenthesized groups. There are two
2327 The Perl-compatible lookaround assertions are atomic. If an assertion is true,
2329 assertion. However, there are some cases where non-atomic assertions can be
2335 below, but they are not Perl-compatible.
2345 Assertion groups are not capture groups. If an assertion contains capture
2346 groups within it, these are counted for the purposes of numbering the capture
2349 such as (.)\eg{-1} can be used to check that two adjacent characters are the
2353 captured are discarded (as happens with any pattern branch that fails to
2355 this means that no captured substrings are ever retained after a successful
2360 branch are retained, and matching continues with the next pattern item after
2367 (see below), captured substrings are retained, because matching continues with
2396 sections, the various assertions are described using the original symbolic
2420 (?!foo) is always true when the next three characters are "bar". A
2439 a lookbehind assertion are restricted such that all the strings it matches must
2440 have a fixed length. However, if there are several top-level alternatives, they
2472 match. If there are insufficient characters before the current position, the
2478 \eX and \eR escapes, which can match different numbers of code units, are never
2485 calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
2495 but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
2519 covers the entire string, from right to left, so we are no better off. However,
2538 matches "foo" preceded by three digits that are not "999". Notice that each of
2540 string. First there is a check that the previous three characters are all
2541 digits, and then there is a check that the same three characters are not "999".
2543 of which are digits and the last three of which are not "999". For example, it
2549 that the first three are digits, and then the second assertion checks that the
2550 preceding three characters are not "999".
2562 characters that are not "999".
2569 The traditional Perl-compatible lookaround assertions are atomic. That is, if
2571 backtracking into the assertion. However, there are some cases where non-atomic
2594 using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
2616 Non-atomic assertions are not supported by the alternative matching function
2617 \fBpcre2_dfa_match()\fP. They are supported by JIT, but only if they do not
2630 In concept, a script run is a sequence of characters that are all from the same
2631 Unicode script such as Latin or Greek. However, because some scripts are
2632 commonly used together, and because some diacritical and other marks are used
2646 parenthesis, it fails if the sequence of characters that it matches are not a
2648 used to detect spoofing attacks using characters that look the same, but are
2651 the matched characters in a sequence of non-spaces that follow white space are
2656 To be sure that they are all from the Latin script (for example), a lookahead
2664 digits, underscore, and dots are permitted at the start:
2681 encountered. Script runs are not supported by the alternate matching function,
2701 already been matched. The two possible forms of conditional group are:
2708 string (it always matches). If there are more than two alternatives in the
2712 itself. This pattern fragment is an example where the alternatives are complex:
2717 There are five kinds of condition: references to capture groups, references to
2748 matches one or more characters that are not parentheses. The third part is a
2771 digits are ambiguous (see the following section). Rewriting the above example
2824 At "top level", all these recursion test conditions are false.
2862 they are dealing with by using this condition to match a string such as
2896 dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
2902 for which captures are retained only for positive assertions that succeed.)
2909 There are two ways of including comments in patterns that are processed by
2916 closing parenthesis. Nested parentheses are not permitted. If the
2920 characters are interpreted as newlines is controlled by an option passed to the
3010 The first two capture groups (a) and (b) are both numbered 1, and group (c)
3014 reference (?1) was used. In other words, relative references are just a
3019 reference is not inside the parentheses that are referenced. They are always
3043 the match runs for a very long time indeed because there are so many different
3047 At the end of a match, the values of capturing parentheses are those from
3064 arbitrary nesting. Only digits are allowed in nested brackets (that is, when
3065 recursing), whereas any characters are permitted at the outer level.
3086 Starting with release 10.30, recursive subroutine calls are no longer treated
3098 palindrome when there are an odd number of characters, or nothing when there
3156 occur. However, any capturing parentheses that are set during the subroutine
3159 Processing options such as case-independence are fixed when a group is
3187 syntax for calling a group as a subroutine, possibly recursively. Here are two
3198 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
3215 entry point is set to NULL, callouts are disabled.
3218 function is to be called. There are two kinds of callout: those with a
3232 one side-effect is that sometimes callouts are skipped. If you need all
3235 programming interface to the callout function, are given in the
3252 callouts are automatically installed before each item in the pattern. They are
3282 There are a number of special "Backtracking Control Verbs" (to use Perl's
3286 present. The names are not required to be unique within the pattern.
3296 only backslash items that are permitted are \eQ, \eE, and sequences such as
3303 skipped, and #-comments are recognized, exactly as in the rest of the pattern.
3313 Since these verbs are specifically related to backtracking, most of them can be
3340 PCRE2 contains some optimizations that are used to speed up matching by running
3367 The following verbs act as soon as they are encountered.
3405 (??{}). Those are, of course, Perl features that are not present in PCRE2. The
3441 including those inside assertions and atomic groups. However, there are
3479 If you are interested in (*MARK) values after failed matches, you should
3491 The following verbs do nothing when they are encountered. Matching continues
3521 caller. However, (*SKIP:NAME) searches only for names that are set with
3529 unless PCRE2's start-of-match optimizations are turned off, as shown in this
3556 possessive quantifier, but there are some uses of (*PRUNE) that cannot be
3597 means that it does not see (*MARK) settings that are inside atomic groups or
3598 assertions, because they are never re-entered by backtracking. Compare the
3621 names that are set by other backtracking verbs.
3635 succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
3647 Consider this pattern, where A, B, etc. are complex pattern fragments that do
3652 If A and B are matched, but there is a failure in C, matching does not
3691 etc. are complex pattern fragments:
3718 If the subject is "abac", Perl matches unless its optimizations are disabled,
3733 without any further processing; captured strings and a mark name (if set) are
3735 fail without any further processing; captured substrings and any mark name are
3739 a positive assertion and false for a negative one; captured substrings are
3744 is confined to the assertion, because Perl lookaround assertions are atomic. A
3755 above. These assertions must be standalone (not used as conditions). They are
3764 The other backtracking verbs are not treated specially if they appear in a