Lines Matching full:is

8 are described in detail below. There is a quick-reference syntax summary in the
21 PCRE2's regular expressions is intended as reference material.
24 matching function, \fBpcre2_match()\fP, is used. PCRE2 also has an alternative
26 algorithm that is not Perl-compatible. Some of the features discussed below are
27 not available when DFA matching is used. The advantages and disadvantages of
54 built to include Unicode support (which is the default). When using UTF strings
56 pattern must start with the special sequence (*UTF), which is equivalent to
57 setting the relevant option. How setting a UTF mode affects pattern matching is
58 mentioned in several places below. There is also a summary of features in the
66 option is passed to \fBpcre2_compile()\fP, (*UTF) is not allowed, and its
73 Another special sequence that may appear at the start of a pattern is (*UCP).
80 restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
81 \fBpcre2_compile()\fP, (*UCP) is not allowed, and its appearance in a pattern
90 matching function is subsequently called to match the pattern. These options
101 default a+b is treated as a++b. For more details, see the
136 If a pattern that starts with (*NO_JIT) is successfully compiled, an attempt by
138 \fBpcre2_jit_compile()\fP is ignored.
145 internal \fBmatch()\fP function is called and on the maximum depth of
147 are provoked by patterns with huge matching trees (a typical example is a
149 by too much recursion. When one of these limits is reached, \fBpcre2_match()\fP
156 where d is any number of decimal digits. However, the value of the setting must
159 limits set by the programmer, but not raise them. If there is more than one
160 setting of one of these limits, the lower value is used.
182 It is also possible to specify a newline convention by starting a pattern
192 example, on a Unix system where LF is the default newline sequence, the pattern
196 changes the convention to CR. That pattern matches "a\enb" because LF is no
197 longer a newline. If more than one of these settings is present, the last one
202 PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
203 what the \eR escape sequence matches. By default, this is any Unicode newline
217 It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
220 (*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
237 A regular expression is a pattern that is matched against a subject string from
243 matches a portion of a subject string that is identical to itself. When
244 caseless matching is specified (the PCRE2_CASELESS option), letters are matched
273 Part of a pattern that is in square brackets is called a "character class". In
290 The backslash character has several uses. Firstly, if it is followed by a
291 character that is not a number or a letter, it takes away any special meaning
297 otherwise be interpreted as a metacharacter, so it is always safe to precede a
305 If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
312 can do so by putting them between \eQ and \eE. This is different from Perl in
324 The \eQ...\eE sequence is recognized both inside and outside character classes.
325 An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed
327 the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside
328 a character class, this causes an error, because the character class is not
337 in patterns in a visible manner. There is no restriction on the appearance of
338 non-printing characters in a pattern, but when a pattern is being prepared by
339 text editing, it is often easier to use one of the following escape sequences
343 \ea alarm, that is, the BEL character (hex 07)
344 \ecx "control-x", where x is any printable ASCII character
355 \euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
357 The precise effect of \ecx on ASCII characters is as follows: if x is a lower
358 case letter, it is converted to upper case. Then bit 6 of the character (hex
359 40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
360 but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
365 When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
366 generate the appropriate EBCDIC code values. The \ec escape is processed
376 differ. For example, \ecG always generates code value 7, which is BEL in ASCII
380 because 127 is not a control character in EBCDIC, Perl makes it generate the
383 POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
390 follows is itself an octal digit.
393 braces. An error occurs if this is not the case. This escape is a recent
398 For greater clarity and unambiguity, it is best to avoid following \e by a
403 The handling of a backslash followed by a digit other than 0 is complicated,
407 decimal number. If the number is less than 10, begins with the digit 8 or 9, or
409 expression, the entire sequence is taken as a \fIback reference\fP. A
410 description of how this works is given
427 \e040 is another way of writing an ASCII space
429 \e40 is the same, provided there are fewer than 40
431 \e7 is always a back reference
435 \e011 is always a tab
436 \e0113 is a tab followed by the character "3"
444 \e81 is always a back reference
450 By default, after \ex that is not followed by {, from zero to two hexadecimal
453 a hexadecimal digit appears between \ex{ and }, or if there is no terminating
456 If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just
457 described only when it is followed by two hexadecimal digits. Otherwise, it
459 greater than 256 is provided by \eu, which must be followed by four hexadecimal
462 Characters whose value is less than 256 can be defined by either of the two
463 syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in
464 the way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
489 and outside character classes. In addition, inside a character class, \eb is
492 \eN is not allowed in a character class. \eB, \eR, and \eX are not special
512 enclosed in braces, is an absolute or relative back reference. A named back
529 a number enclosed either in angle brackets or single quotes, is an alternative
536 synonymous. The former is a back reference; the latter is a
548 Another use of backslash is for specifying generic character types:
551 \eD any character that is not a decimal digit
553 \eH any character that is not a horizontal white space character
555 \eS any character that is not a white space character
557 \eV any character that is not a vertical white space character
561 There is also the single sequence \eN, which matches a non-newline character.
562 This is the same as
567 when PCRE2_DOTALL is not set. Perl also uses \eN to match characters by name;
574 matching point is at the end of the subject string, all of them fail, because
575 there is no character to match.
579 vary if locale-specific matching is taking place. For example, in some locales
580 the "non-breaking space" character (\exA0) is recognized as white space, and in
581 others the VT character is not.
583 A "word" character is an underscore or any character that is a letter or digit.
584 By default, the definition of letters and digits is controlled by PCRE2's
585 low-valued character tables, and may vary if locale-specific matching is taking
598 Unicode is discouraged.
602 for characters in the range 128-255 when locale-specific matching is happening.
605 is set, the behaviour is changed so that Unicode properties are used to
616 is noticeably slower when PCRE2_UCP is set.
620 points, whether or not PCRE2_UCP is set. The horizontal space characters are:
661 Unicode newline sequence. In 8-bit non-UTF-8 mode \eR is equivalent to the
666 This is an example of an "atomic group", details of which are given
674 line, U+0085). Because this is an atomic group, the two-character sequence is
679 Unicode support is not needed for these characters to be recognized.
681 It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
683 at compile time. (BSR is an abbreviation for "backslash R".) This can be made
684 the default when PCRE2 is built; if this is the case, the other behaviour can
685 be requested via the PCRE2_BSR_UNICODE option. It is also possible to specify
695 more than one of them is present, the last one is used. They can be combined
701 character class, \eR is treated as an unrecognized escape sequence, and causes
709 When PCRE2 is built with Unicode support (the default), three additional escape
739 "Common". The current list of scripts is:
876 name. For example, \ep{^Lu} is the same as \eP{Lu}.
878 If only one letter is specified with \ep or \eP, it includes all the general
932 The special property L& is also supported: it matches a character that has
933 the Lu, Ll, or Lt property, in other words, a letter that is not classified as
946 are not supported by PCRE2, nor is it permitted to prefix any of these
947 properties with "Is".
949 No character that is in the Unicode table has the Cn (unassigned) property.
950 Instead, this property is assumed for any code point that is not in the
954 example, \ep{Lu} always matches only upper case letters. This is different from
957 Matching characters by Unicode property is not fast, because PCRE2 has to do a
958 multistage table lookup in order to find a character's property. That is why
1003 properties internally when PCRE2_UCP is set. However, they may also be used
1014 Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl
1018 There is another non-standard property, Xuc, which matches any character that
1024 where H is a hexadecimal digit. Note that the Xuc property does not match these
1037 matches "foobar", but reports that it has matched "bar". This feature is
1054 matches "foobar", the first substring is still set to "foo".
1056 Perl documents that the use of \eK within assertions is "not well defined". In
1057 PCRE2, \eK is acted upon when it occurs inside positive assertions, but is
1067 The final use of backslash is for certain simple assertions. An assertion
1070 subpatterns for more complicated assertions is described
1087 "invalid escape sequence" error is generated.
1089 A word boundary is a position in the subject string where the current character
1093 of \ew and \eW can be changed by setting the PCRE2_UCP option. When this is
1096 determines which it is. For example, the fragment \eba matches "a" at the start
1105 argument of \fBpcre2_match()\fP is non-zero, indicating that matching is to
1107 The difference between \eZ and \ez is that \eZ matches before a newline at the
1111 The \eG assertion is true only when the current matching position is at the
1113 \fBpcre2_match()\fP. It differs from \eA when the value of \fIstartoffset\fP is
1115 arguments, you can mimic Perl's /g option, and it is in this kind of
1119 match, is subtly different from Perl's, which defines it as the end of the
1124 If all the alternatives of a pattern begin with \eG, the expression is anchored
1125 to the starting match position, and the "anchored" flag is set in the compiled
1132 The circumflex and dollar metacharacters are zero-width assertions. That is,
1135 matching the starts and ends of lines. If the newline convention is set so that
1136 only the two-character sequence CRLF is recognized as a newline, isolated CR
1141 character is an assertion that is true only if the current matching point is at
1143 \fBpcre2_match()\fP is non-zero, or if PCRE2_NOTBOL is set, circumflex can
1144 never match if the PCRE2_MULTILINE option is unset. Inside a character class,
1153 in which it appears if the pattern is ever to match that branch. If all
1154 possible alternatives start with a circumflex, that is, if the pattern is
1155 constrained to match only at the start of the subject, it is said to be an
1159 The dollar character is an assertion that is true only if the current matching
1160 point is at the end of the subject string, or immediately before a newline at
1161 the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
1172 PCRE2_MULTILINE option is set. When this is the case, a dollar character
1182 ^ are not anchored in multiline mode, and a match for circumflex is possible
1183 when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The
1184 PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
1191 below) recognizes the two-character sequence CRLF as a newline, this is
1193 newlines. For example, if the newline convention is "any", a multiline mode
1195 CR, even though CR on its own is a valid newline. (It also matches at the very
1200 \eA it is always anchored, whether or not PCRE2_MULTILINE is set.
1211 When a line ending is defined as a single character, dot never matches that
1212 character; when the two-character sequence CRLF is used, dot does not match CR
1213 if it is immediately followed by LF, but otherwise it matches all characters
1219 PCRE2_DOTALL option is set, a dot matches any one character, without exception.
1220 If the two-character sequence CRLF is present in the subject string, it takes
1223 The handling of dot is entirely independent of the handling of circumflex and
1227 The escape sequence \eN behaves like a dot, except that it is not affected by
1237 whether or not a UTF mode is set. In the 8-bit library, one code unit is one
1238 byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
1240 feature is provided in Perl in order to match individual bytes in UTF-8 mode,
1241 but it is unclear how it can usefully be used.
1246 assumes that it is matching character by character in a valid UTF string (by
1248 unless the PCRE2_NO_UTF_CHECK option is used).
1251 PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
1263 match is always run using the interpreter.
1265 In the 32-bit library, however, \eC is always supported (when not explicitly
1269 In general, the \eC escape sequence is best avoided. However, one way of using
1270 it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
1296 square bracket. A closing square bracket on its own is not special by default.
1297 If a closing square bracket is required as a member of the class, it should be
1300 be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing
1305 first character in the class definition is a circumflex, in which case the
1307 is actually required as a member of the class, ensure it is not the first
1311 [^aeiou] matches any character that is not a lower case vowel. Note that a
1312 circumflex is just a convenient notation for specifying the characters that
1314 circumflex is not an assertion; it still consumes a character from the subject
1315 string, and therefore it fails if the current pointer is at the end of the
1318 When caseless matching is set, any letters in a class represent both their
1324 when matching character classes, whatever line-ending sequence is in use, and
1325 whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
1330 inclusive. If a minus character is required in a class, it must be escaped with
1336 It is not possible to have the literal character "]" as the end character of a
1337 range. A pattern such as [W-]46] is interpreted as a class of two characters
1339 "-46]". However, if the "]" is escaped with a backslash it is interpreted as
1340 the end of range, so [W-\e]46] is interpreted as a class containing a range
1344 An error is generated if a POSIX character class (see below) or an escape
1346 where a range ending character is expected. For example, [z-\exff] is valid,
1354 There is a special case in EBCDIC environments for ranges whose end points are
1358 are 0x88 and 0x92, a range of 11 code points. However, if the range is
1362 If a range that includes letters is used when caseless matching is set, it
1363 matches the letters in either case. For example, [W-c] is equivalent to
1426 and space (32). If locale-specific matching is taking place, the list of space
1430 The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
1431 5.8. Another Perl extension is negation, which is indicated by a ^ character
1437 syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1438 supported, and an error is given if they are encountered.
1442 range 128-255 when locale-specific matching is happening. However, if the
1443 PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
1444 changed so that Unicode character properties are used. This is achieved by
1472 not controls, that is, characters with the Zs property.
1487 syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of
1490 [[:<:]] is converted to \eb(?=\ew)
1491 [[:>:]] is converted to \eb(?<=\ew)
1494 [a[:<:]b] provokes error for an unrecognized POSIX class name. This support is
1495 not compatible with Perl. It is provided to help migrations from other
1496 environments, and is best not used in any new patterns. Note that \eb matches
1503 normally shows which is wanted, without the need for the assertions that are
1516 and an empty alternative is permitted (matching the empty string). The matching
1518 that succeeds is used. If the alternatives are within a subpattern
1540 For example, (?im) sets caseless, multiline matching. It is also possible to
1543 PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
1544 permitted. If a letter appears both before and after the hyphen, the option is
1545 unset. An empty options setting "(?)" is allowed. Needless to say, it has no
1552 When one of these option changes occurs at top level (that is, not inside
1554 that follows. If the change is placed right at the start of a pattern, PCRE2
1563 matches abc and aBc and no other strings (assuming PCRE2_CASELESS is not used).
1571 branch is abandoned before the option setting. This is because the effects of
1585 application when the compiling function is called. The pattern can contain
1615 subpattern is passed back to the caller, separately from the portion that
1621 king" is matched against the pattern
1628 The fact that plain parentheses fulfil two functions is not always helpful.
1629 There are often times when a grouping subpattern is required without a
1630 capturing requirement. If an opening parenthesis is followed by a question mark
1631 and a colon, the subpattern does not do any capturing, and is not counted when
1633 the string "the white queen" is matched against the pattern
1638 2. The maximum number of capturing subpatterns is 65535.
1659 (?| and is itself a non-capturing subpattern. For example, consider this
1669 number is reset at the start of each branch. The numbers of any capturing
1671 any branch. The following example is taken from the Perl documentation. The
1678 A back reference to a numbered subpattern uses the most recent value that is
1690 A relative reference such as (?-1) is no different: it is just a convenient way
1698 for a subpattern's having matched refers to a non-unique number, the test is
1701 An alternative approach to using this "branch reset" feature is to use
1708 Identifying capturing parentheses by number is simple, but it can be very hard
1710 if an expression is modified, the numbers may change. To help with this
1742 By default, a name must be unique within a pattern, but it is possible to relax
1757 There are five capturing substrings, but only one is ever set after a match.
1758 (An alternative way of solving this problem is to use a "branch reset"
1767 in which they appear in the overall pattern. The first one that is set is used
1775 corresponds to the first occurrence of the name is used. In the absence of
1776 duplicate numbers (see the previous section) this is the one with the lowest
1787 recursion, all subpatterns with the same name are tested. If the condition is
1788 true for any one of them, the overall condition is true. This is the same
1798 matching. For this reason, an error is given at compile time if different names
1800 same name to subpatterns with the same number, even when PCRE2_DUPNAMES is not
1807 Repetition is specified by quantifiers, which can follow any of the following
1828 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
1829 character. If the second number is omitted, but the comma is present, there is
1840 where a quantifier is not allowed, or one that does not match the syntax of a
1841 quantifier, is taken as a literal character. For example, {,6} is not a
1846 which is represented by a two-byte sequence in a UTF-8 string. Similarly,
1850 The quantifier {0} is permitted, causing the expression to behave as if the
1868 * is equivalent to {0,}
1869 + is equivalent to {1,}
1870 ? is equivalent to {0,1}
1872 It is possible to construct infinite loops by following a subpattern that can
1880 match no characters, the loop is forcibly broken.
1882 By default, the quantifiers are "greedy", that is, they match as much as
1898 If a quantifier is followed by a question mark, it ceases to be greedy, and
1904 quantifiers is not otherwise changed, just the preferred number of matches.
1910 which matches one digit by preference, but can match two if that is the only
1913 If the PCRE2_UNGREEDY option is set (an option that is not available in Perl),
1918 When a parenthesized subpattern is quantified with a minimum repeat count that
1919 is greater than 1 or with a limited maximum, more memory is required for the
1923 to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
1925 character position in the subject string, so there is no point in retrying the
1929 In cases where it is known that the subject string contains no newlines, it is
1940 If the subject is "xyz123abc123" the match point is the fourth character. For
1941 this reason, such a pattern is not implicitly anchored.
1943 Another case where implicit anchoring is not applied is when the leading .* is
1950 (*PRUNE) and (*SKIP) also disable this optimization, and there is an option,
1953 When a capturing subpattern is repeated, the value captured is the substring
1958 has matched "tweedledum tweedledee" the value of the captured substring is
1965 matches "aba" the value of the second captured substring is "b".
1975 pattern to match. Sometimes it is useful to prevent this, either to change the
1977 the author of the pattern knows there is no point in carrying on.
1984 action of the matcher is to try again with only 5 digits matching the \ed+
1987 that once a subpattern has matched, it is not to be re-evaluated in this way.
1990 immediately on failing to match "foo" the first time. The notation is a kind of
1996 it has matched, and a failure further into the pattern is prevented from
2000 An alternative description is that a subpattern of this type matches exactly
2012 group is just a single repeated item, as in the example above, a simpler
2025 option is ignored. They are a convenient notation for the simpler forms of
2026 atomic group. However, there is no difference in the meaning of a possessive
2030 The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
2037 pattern constructs. For example, the sequence A+B is treated as A++B because
2038 there is no point in backtracking into a sequence of A's when B must follow.
2043 be repeated an unlimited number of times, the use of an atomic group is the
2051 quickly. However, if it is applied to
2055 it takes a long time before reporting failure. This is because the string can
2059 optimization that allows for fast failure when a single character is used. They
2060 remember the last single character that is required for a match, and fail early
2061 if it is not present in the string.) If the pattern is changed so that it uses
2074 possibly further digits) is a back reference to a capturing subpattern earlier
2075 (that is, to its left) in the pattern, provided there have been that many
2078 However, if the decimal number following the backslash is less than 8, it is
2083 when a repetition is involved and the subpattern to the right has participated
2086 It is not possible to have a numerical "forward back reference" to a subpattern
2087 whose number is 8 or more using this syntax because a sequence such as \e50 is
2094 for further details of the handling of digits following a backslash. There is
2096 subpattern is possible using named parentheses (see below).
2099 backslash is to use the \eg escape sequence. This escape must be followed by an
2108 is present in the older syntax. It is also useful when literal digits follow
2109 the reference. A negative number is a relative reference. Consider this
2114 The sequence \eg{-1} is a reference to the most recently started capturing
2115 subpattern before \eg, that is, it is equivalent to \e2 in this example.
2132 "sense and responsibility". If caseful matching is in force at the time of the
2133 back reference, the case of letters is relevant. For example,
2138 capturing subpattern is matched caselessly.
2142 \ek'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
2144 references, is also supported. We could rewrite the above example in any of
2152 A subpattern that is referenced by name may appear in the pattern before or
2162 PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a back reference to an
2168 terminate the back reference. If the PCRE2_EXTENDED option is set, this can be
2181 when the subpattern is first used, so, for example, (a\e1) never matches.
2208 An assertion is a test on the characters following or preceding the current
2218 that look behind it. An assertion subpattern is matched in the normal way,
2224 capturing is carried out only for positive assertions. (Perl sometimes, but not
2233 (1) If the quantifier is {0}, the assertion is never obeyed during matching.
2241 (2) If the quantifier is {0,n} where n is greater than zero, it is treated as if it
2242 were {0,1}. At run time, the rest of the pattern match is tried with and
2245 (3) If the minimum repetition is greater than zero, the quantifier is ignored.
2246 The assertion is obeyed just once when encountered during matching.
2262 matches any occurrence of "foo" that is not followed by "bar". Note that the
2267 does not find an occurrence of "bar" that is preceded by something other than
2269 (?!foo) is always true when the next three characters are "bar". A
2270 lookbehind assertion is needed to achieve the other effect.
2273 convenient way to do it is with (?!) because an empty string always matches, so
2275 The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
2287 does find an occurrence of "bar" that is not preceded by "foo". The contents of
2299 are permitted only at the top level of a lookbehind assertion. This is an
2306 lengths, but it is acceptable to PCRE2 if rewritten to use two top-level
2319 The implementation of lookbehind assertions is, for each alternative, to
2340 however, is not supported.
2350 what follows matches the rest of the pattern. If the pattern is specified as
2355 there is no following "a"), it backtracks to match all but the last character,
2358 if the pattern is written as
2377 the assertions is applied independently at the same point in the subject
2378 string. First there is a check that the previous three characters are all
2379 digits, and then there is a check that the same three characters are not "999".
2382 doesn't match "123abcfoo". A pattern to do that is
2394 matches an occurrence of "baz" that is preceded by "bar" which in turn is not
2407 It is possible to cause the matching process to obey a subpattern
2415 If the condition is satisfied, the yes-pattern is used; otherwise the
2416 no-pattern (if present) is used. If there are more than two alternatives in the
2420 the condition. This pattern fragment is an example where the alternatives are
2434 condition is true if a capturing subpattern of that number has previously
2435 matched. If there is more than one capturing subpattern with the same number
2442 the condition is true if any of them have matched. An alternative notation is
2444 number is relative rather than absolute. The most recently opened parentheses
2448 zero in any of these forms is not used; it provokes a compile-time error.)
2457 character is present, sets it as the first captured substring. The second part
2458 matches one or more characters that are not parentheses. The third part is a
2460 matched. If they did, that is, if subject started with an opening parenthesis,
2461 the condition is true, and so the yes-pattern is executed and a closing
2462 parenthesis is required. Otherwise, since no-pattern is not present, the
2479 this facility before Perl, the syntax (?(name)...) is also recognized.
2485 If the name used in a condition of this kind is a duplicate, the test is
2486 applied to all subpatterns of the same name, and is true if any one of them has
2493 If the condition is the string (R), and there is no subpattern with the name R,
2494 the condition is true if a recursive call to the whole pattern or any
2500 the condition is true if the most recent recursion is into a subpattern whose
2501 number or name is given. This condition does not check the entire recursion
2502 stack. If the name used in a condition of this kind is a duplicate, the test is
2503 applied to all subpatterns of the same name, and is true if any one of them is
2518 If the condition is the string (DEFINE), and there is no subpattern with the
2519 name DEFINE, the condition is always false. In this case, there may be only one
2520 alternative in the subpattern. It is always skipped if control reaches this
2521 point in the pattern; the idea of DEFINE is that it can be used to define
2534 The first part of the pattern is a DEFINE group inside which another group
2535 named "byte" is defined. This matches an individual component of an IPv4
2537 pattern is skipped because DEFINE acts like a false condition. The rest of the
2555 This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
2563 If the condition is not in any of the above formats, it must be an assertion.
2571 The condition is a positive lookahead assertion that matches an optional
2573 presence of at least one letter in the subject. If a letter is found, the
2574 subject is matched against the first alternative; otherwise it is matched
2591 PCRE2_EXTENDED option is set, an unescaped # character also introduces a
2594 interpreted as newlines is controlled by an option passed to the compiling
2601 above. Note that the end of this type of comment is a literal newline sequence
2603 count. For example, consider this pattern when PCRE2_EXTENDED is set, and the
2604 default newline convention (a single linefeed character) is in force:
2609 a newline in the pattern. The sequence \en is still literal at this stage, so
2620 be done is to use a pattern that matches up to some fixed depth of nesting. It
2640 closing parenthesis is a recursive subroutine call of the subpattern of the
2641 given number, provided that it occurs inside that subpattern. (If not, it is a
2646 call, which is described in the next section.) The special item (?R) or (?0) is
2650 PCRE2_EXTENDED option is set so that white space is ignored):
2656 match of the pattern itself (that is, a correctly parenthesized substring).
2657 Finally there is a closing parenthesis. Note the use of a possessive quantifier
2672 capturing parentheses leftwards from the point at which it is encountered.
2685 is number 2. When the reference (?-2) is encountered, the second most recently
2686 opened parentheses has the number 1, but it is the first such group (the (a)
2691 It is also possible to refer to subsequently opened parentheses, by writing
2693 reference is not inside the parentheses that are referenced. They are always
2700 An alternative approach is to use named parentheses. The Perl syntax for this
2701 is (?&name); PCRE1's earlier syntax (?P>name) is also supported. We could
2706 If there is more than one subpattern with the same name, the earliest one is
2711 non-parentheses is important when applying the pattern to strings that do not
2712 match. For example, when this pattern is applied to
2716 it yields "no match" quickly. However, if a possessive quantifier is not used,
2727 documentation). If the pattern above is matched against
2731 the value for the inner capturing parentheses (numbered 2) is "ef", which is
2732 the last value taken on at the top level. If a capturing subpattern is not
2733 matched at the top level, its final captured value is unset, even if it was
2742 arbitrary nesting. Only digits are allowed in nested brackets (that is, when
2747 In this pattern, (?(R) is the start of a conditional subpattern, with two
2757 (like Python, but unlike Perl), a recursive subpattern call is always treated
2758 as an atomic group. That is, once it has matched some of the subject string, it
2759 is never re-entered, even if it contains untried alternatives and there is a
2766 The idea is that it either matches a single character, or two identical
2768 it does not if the pattern is longer than three characters. Consider the
2771 At the top level, the first character is matched, but as it is not at the end
2772 of the string, the first alternative fails; the second alternative is taken
2777 Back at the top level, the next character ("c") is compared with what
2778 subpattern 2 matched, which was "a". This fails. Because the recursion is
2780 entire match fails. (Perl is able, at this point, to re-enter the recursion and
2781 try the second alternative.) However, if the pattern is written with the
2786 This time, the recursing alternative is tried first, and continues to recurse
2788 time we do have another alternative to try at the higher level. That is the big
2789 difference: in the previous case the remaining alternative is at a deeper
2793 those with an odd number of characters, it is tempting to change the pattern to
2800 order to match an empty string. The solution is to separate the two cases, and
2818 string does not start with a palindrome that is shorter than the entire string.
2819 For example, although "abcba" is correctly matched, if the subject is "ababa",
2824 The second way in which PCRE2 and Perl differ in their recursion processing is
2825 in the handling of captured values. In Perl, when a subpattern is called
2845 name) is used outside the parentheses to which it refers, it operates like a
2864 strings. Another example is given in the discussion of DEFINE above.
2867 groups. That is, once a subroutine has matched some of the subject string, it
2868 is never re-entered, even if it contains untried alternatives and there is a
2872 Processing options such as case-independence are fixed when a subpattern is
2873 defined, so if it is used as a subroutine, such options cannot be changed for
2887 a number enclosed either in angle brackets or single quotes, is an alternative
2894 PCRE2 supports an extension to Oniguruma: if a number is preceded by a
2895 plus or a minus sign it is taken as a relative reference. For example:
2900 synonymous. The former is a back reference; the latter is a subroutine call.
2909 same pair of parentheses when there is a repetition.
2912 code. The feature is called "callout". The caller of PCRE2 provides an external
2915 or \fBpcre2_dfa_match()\fP. If no match context is passed, or if the callout
2916 entry point is set to NULL, callouts are disabled.
2919 function is to be called. There are two kinds of callout: those with a
2921 argument is treated as (?C0). A numerical argument allows the application to
2926 During matching, when PCRE2 reaches a callout point, the external function is
2927 called. It is provided with the number or string argument of the callout, the
2928 position in the pattern, and one item of data that is also set in the match
2933 one side-effect is that sometimes callouts are skipped. If you need all
2952 If the PCRE2_AUTO_CALLOUT flag is passed to \fBpcre2_compile()\fP, numerical
2954 all numbered 255. If there is a conditional group in the pattern whose
2955 condition is an assertion, an additional callout is inserted just before the
2969 starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
2970 the same as the start, except for {, where the ending delimiter is }. If the
2971 ending delimiter is needed within the string, it must be doubled. For
2976 The doubling is removed before the string is passed to the callout function.
2992 depending on whether or not a name is present.
2994 By default, for compatibility with Perl, a name is any sequence of characters
2995 that does not include a closing parenthesis. The name is not processed in
2996 any way, and it is not possible to include a closing parenthesis in the name.
2997 However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash processing
3000 between \eQ and \eE. If the PCRE2_EXTENDED option is set, unescaped whitespace
3001 in verb names is skipped and #-comments are recognized, exactly as in the rest
3004 The maximum length of a name is 255 in the 8-bit library and 65535 in the
3005 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
3006 parenthesis immediately follows the colon, the effect is as if the colon were
3010 used only when the pattern is to be matched using the traditional matching
3029 (whether or not recursively) is documented below.
3043 (*NO_START_OPT). There is more discussion of this option in the section
3068 pattern. However, when it is inside a subpattern that is called as a
3069 subroutine, only that subpattern is ended successfully. Matching then continues
3073 If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For
3078 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
3083 This verb causes a matching failure, forcing backtracking to occur. It is
3084 equivalent to (?!) but easier to read. The Perl documentation notes that it is
3086 Perl features that are not present in PCRE2. The nearest equivalent is the
3091 A match with the string "aaaa" always fails, but the callout is taken before
3098 There is one verb whose main purpose is to track how a match was arrived at,
3104 A name is always required with this verb. There may be as many instances of
3108 (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to the
3118 documentation. Here is an example of \fBpcre2test\fP output, where the "mark"
3129 The (*MARK) name is tagged with "MK:" in this output, and in this example it
3130 indicates which of the two alternatives matched. This is a more efficient way
3134 If a verb with a name is encountered in a positive assertion that is true, the
3135 name is recorded and passed back if it is the last-encountered. This does not
3139 entire match process is returned. For example:
3145 Note that in this unanchored example the mark is retained from the match
3156 to ensure that the match is always attempted.
3163 with what follows, but if there is no subsequent match, causing a backtrack to
3164 the verb, a failure is forced. That is, backtracking cannot pass to the left of
3166 (which includes any group that is called as a subroutine) or in an assertion
3167 that is true, its effect is confined to that group, because once the group has
3168 been matched, there is never any backtracking into it. In this situation,
3172 reaches them. The behaviour described below is what happens when the verb is
3179 outright if there is a later matching failure that causes backtracking to reach
3180 it. Even if the pattern is unanchored, no further attempts to find a match by
3181 advancing the starting point take place. If (*COMMIT) is the only backtracking
3182 verb that is encountered, once it has been passed \fBpcre2_match()\fP is
3190 recently passed (*MARK) in the path is passed back when (*COMMIT) forces a
3193 If there is more than one backtracking verb in a pattern, a different one that
3197 Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
3212 the optimization that skips along to the first character. The pattern is now
3219 subject if there is a later matching failure that causes backtracking to reach
3220 it. If the pattern is unanchored, the normal "bumpalong" advance to the next
3222 (*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but
3223 if there is no match to the right, backtracking cannot cross (*PRUNE). In
3224 simple cases, the use of (*PRUNE) is just an alternative to an atomic group or
3229 The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
3230 It is like (*MARK:NAME) in that the name is remembered for passing back to the
3236 This verb, when given without a name, is like (*PRUNE), except that if the
3237 pattern is unanchored, the "bumpalong" advance is not to the next character,
3244 If the subject is "aaaac...", after the first match attempt fails (starting at
3253 When (*SKIP) has an associated name, its behaviour is modified. When it is
3254 triggered, the previous path through the pattern is searched for the most
3255 recent (*MARK) that has the same name. If one is found, the "bumpalong" advance
3257 (*SKIP) was encountered. If no (*MARK) with a matching name is found, the
3258 (*SKIP) is ignored.
3266 reaches it. That is, it cancels any further backtracking within the current
3272 If the COND1 pattern matches, FOO is tried (and possibly further items after
3275 succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
3276 more alternatives, so there is a backtrack to whatever came before the entire
3277 group. If (*THEN) is not inside an alternation, it acts like (*PRUNE).
3279 The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
3280 It is like (*MARK:NAME) in that the name is remembered for passing back to the
3284 A subpattern that does not contain a | character is just a part of the
3285 enclosing alternative; it is not a nested alternation with only one
3292 If A and B are matched, but there is a failure in C, matching does not
3293 backtrack into A; instead it moves to the next alternative, that is, D.
3294 However, if the subpattern containing (*THEN) is given an alternative, it
3299 The effect of (*THEN) is now confined to the inner subpattern. After a failure
3304 Note that a conditional subpattern is not considered as having two
3305 alternatives, because only one is ever used. In other words, the | character in
3311 If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
3313 character "b" is matched, but "c" is not. At this point, matching does not
3315 character. The conditional subpattern is part of the single alternative that
3320 subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
3323 unanchored pattern). (*SKIP) is similar, except that the advance may be more
3324 than one character. (*COMMIT) is the strongest, causing the entire match to
3331 If more than one backtracking verb is present in a pattern, the one that is
3339 the next alternative (ABD) to be tried. This behaviour is consistent, but is
3346 If there is a matching failure to the right, backtracking onto (*PRUNE) causes
3347 it to be triggered, and its action is taken. There can never be a backtrack
3360 If the subject is "abac", Perl matches, but PCRE2 fails because the (*COMMIT)
3377 innermost enclosing group that has alternations, whether or not this is within
3393 These behaviours occur whether or not the subpattern is called recursively.
3394 Perl's treatment of subroutines is different in some cases.
3407 the subpattern that has alternatives. If there is no such group within the