pcre2pattern.3 - OpenGrok cross reference for /external/pcre/doc/pcre2pattern.3

3 PCRE2 - Perl-compatible regular expressions (revised API)
8 are described in detail below. There is a quick-reference syntax summary in the
26 using a different algorithm that is not Perl-compatible. Some of the features
36 .SH "SPECIAL START-OF-PATTERN ITEMS"
40 by special items at the start of a pattern. These are not Perl-compatible, but
50 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
51 single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
52 specified for the 32-bit library, in which case it constrains the character
66 restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
76 such as \ed and \ew to use Unicode properties to determine character types,
103 .SS "Disabling auto-possessification"
116 .SS "Disabling start-up optimizations"
133 apply to patterns whose top-level branches all start with .* (match any number
194 strings: a single CR (carriage return) character, a single LF (linefeed)
195 character, the two-character sequence CRLF, any of the three preceding, any
196 Unicode newline sequence, or the NUL character (binary zero). The
216   (*NUL)       the NUL character (binary zero)
252 .SH "EBCDIC CHARACTER CODES"
256 character code instead of ASCII or Unicode (typically a mainframe system). In
257 the sections below, character code values are ASCII or Unicode; in an EBCDIC
275 equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
281 character classes, alternatives, and repetitions in the pattern. These are
290   \e      general escape character with several uses
293   .      match any character except newline (by default)
294   [      start character class definition
312 Part of a pattern that is in square brackets is called a "character class". In
313 a character class the only metacharacters are:
315   \e      general escape character
316   ^      negate the class, but only if the first character
317   -      indicates character range
318   [      POSIX character class (if followed by POSIX syntax)
319   ]      terminates the character class
322 the pattern, other than in a character class, within a \eQ...\eE sequence, or
323 between a # outside a character class and the next newline, inclusive, are
325 character as part of the pattern. If the PCRE2_EXTENDED_MORE option is set, the
327 ignored inside a character class. Note: only these two characters are ignored,
329 character class. Option settings can be changed within a pattern; see the
343 The backslash character has several uses. Firstly, if it is followed by a
344 character that is not a digit or a letter, it takes away any special meaning
345 that character may have. This use of backslash as an escape character applies
346 both inside and outside character classes.
348 For example, if you want to match a * character, you must write \e* in the
349 pattern. This escaping action applies whether or not the following character
351 precede a non-alphanumeric with backslash to specify that it stands for itself.
363 interpolation. Also, Perl does "double-quotish backslash interpolation" on any
366 other character. Note the following examples:
378 The \eQ...\eE sequence is recognized both inside and outside character classes.
382 a character class, this causes an error, because the character class is then
387 .SS "Non-printing characters"
390 A second use of backslash provides a way of encoding non-printing characters
392 non-printing characters in a pattern, but when a pattern is being prepared by
394 instead of the binary character it represents. In an ASCII or Unicode
397   \ea          alarm, that is, the BEL character (hex 07)
398   \ecx         "control-x", where x is a non-control ASCII character
404   \e0dd        character with octal code 0dd
405   \eddd        character with octal code ddd, or backreference
406   \eo{ddd..}   character with octal code ddd..
407   \exhh        character with hex code hh
408   \ex{hhh..}   character with hex code hhh..
409   \eN{U+hhh..} character with Unicode hex code point hhh..
413 hexadecimal digits may appear between \ex{ and }. If a character other than a
423 two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \ex followed
425 recognized as a character escape. Otherwise it is interpreted as a literal "x"
426 character. In this mode, support for code points greater than 255 is provided
428 interpreted as a literal "u" character.
431 \eu{hhh..} is recognized as the character specified by hexadecimal code point.
439 (curly bracket) it has an entirely different meaning, matching any character
445 (carriage return) character.
447 An error occurs if \ec is not followed by a character whose ASCII code point
449 lower case letter, it is converted to upper case. Then bit 6 of the character
453 a compile-time error occurs.
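The ASCII rule for \ecx described above (fold lower case to upper case, then invert bit 6) can be sketched in a few lines of Python. This is only an illustration and not PCRE2 code: the function name control_code is hypothetical, and the accepted range 32 to 126 is an assumption based on the printable ASCII range, because the sentence that gives the bounds is truncated in this listing.

```python
def control_code(x: str) -> int:
    """Return the code that the control escape followed by x would denote (ASCII only)."""
    if len(x) != 1:
        raise ValueError("expected exactly one character")
    code = ord(x)
    # Assumed bounds: only printable ASCII is accepted after the escape.
    if not 32 <= code <= 126:
        raise ValueError("character is outside the printable ASCII range")
    # Lower case letters are converted to upper case first.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Bit 6 (hex 40) is then inverted.
    return code ^ 0x40


if __name__ == "__main__":
    for ch in "Az@[?;":
        print(repr(ch), hex(control_code(ch)))
```

Under these assumptions the letters map to 1 through 26 in either case, "@" maps to 0, and "?" maps to 127, which agrees with the values listed for the EBCDIC case below.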
458 only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
459 ^, _, or ?. Any other character provokes a compile-time error. The sequence
460 \ec@ encodes character code 0; after \ec the letters (in either case) encode
461 characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31
464 Thus, apart from \ec?, these escapes generate the same character code values as
470 because 127 is not a control character in EBCDIC, Perl makes it generate the
471 APC character. Unfortunately, there are several variants of EBCDIC. In most of
472 them the APC character has the value 255 (hex FF), but in the one Perl calls
473 POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
478 specifies two binary zeros followed by a CR character (code value 13). Make
479 sure you supply two digits after the initial zero if the pattern character that
484 addition to Perl; it provides a way of specifying character code points as octal
490 character code points, and \eg{...} to specify backreferences. The following
496 Outside a character class, PCRE2 reads the digit and any following digits as a
510 Otherwise, up to three octal digits are read to form a character code.
512 Inside a character class, PCRE2 handles \e8 and \e9 as the literal characters
514 backslash, using them to generate a data character. Any subsequent digits stand
515 for themselves. For example, outside a character class:
526   \e0113  is a tab followed by the character "3"
529             character with octal code 113
540 .SS "Constraints on character values"
546   8-bit non-UTF mode    no greater than 0xff
547   16-bit non-UTF mode   no greater than 0xffff
548   32-bit non-UTF mode   no greater than 0xffffffff
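The non-UTF limits in the table above amount to a simple range check. The following Python sketch makes that explicit; the names MAX_NON_UTF_CODE_POINT and fits_non_utf are hypothetical, and the sketch covers only the three non-UTF rows shown here, not the additional restrictions that apply in UTF modes.

```python
# Largest character value accepted in non-UTF mode, keyed by code unit width.
MAX_NON_UTF_CODE_POINT = {
    8: 0xFF,
    16: 0xFFFF,
    32: 0xFFFFFFFF,
}


def fits_non_utf(code_point: int, width: int) -> bool:
    """Return True if code_point can appear in a pattern for the given library width."""
    if width not in MAX_NON_UTF_CODE_POINT:
        raise ValueError("width must be 8, 16, or 32")
    return 0 <= code_point <= MAX_NON_UTF_CODE_POINT[width]


if __name__ == "__main__":
    print(fits_non_utf(0x100, 8))   # too large for the 8-bit library
    print(fits_non_utf(0x100, 16))  # fine for the 16-bit library
```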
552 so-called "surrogate" code points). The check for these can be disabled by the
554 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
555 and UTF-32 modes, because these values are not representable in UTF-16.
558 .SS "Escape sequences in character classes"
561 All the sequences that define a single character value can be used both inside
562 and outside character classes. In addition, inside a character class, \eb is
563 interpreted as the backspace character (hex 08).
565 When not followed by an opening brace, \eN is not allowed in a character class.
566 \eB, \eR, and \eX are not special inside a character class. Like other
568 character class, these sequences have different meanings.
578 character, and \eu can be used to define a character by code point, as
602 For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
619 .SS "Generic character types"
622 Another use of backslash is for specifying generic character types:
625   \eD     any character that is not a decimal digit
626   \eh     any horizontal white space character
627   \eH     any character that is not a horizontal white space character
628   \eN     any character that is not a newline
629   \es     any white space character
630   \eS     any character that is not a white space character
631   \ev     any vertical white space character
632   \eV     any character that is not a vertical white space character
633   \ew     any "word" character
634   \eW     any "non-word" character
646 "Non-printing characters"
652 of characters into two disjoint sets. Any given character matches one, and only
653 one, of each pair. The sequences can appear both inside and outside character
654 classes. They each match one character of the appropriate type. If the current
656 there is no character to match.
660 vary if locale-specific matching is taking place. For example, in some locales
661 the "non-breaking space" character (\exA0) is recognized as white space, and in
662 others the VT character is not.
664 A "word" character is an underscore or any character that is a letter or digit.
666 low-valued character tables, and may vary if locale-specific matching is taking
676 page). For example, in a French locale such as "fr_FR" in Unix-like systems,
677 or "french" in Windows, some character codes greater than 127 are used for
683 for characters in the range 128-255 when locale-specific matching is happening.
687 determine character types, as follows:
689   \ed  any character that matches \ep{Nd} (decimal digit)
690   \es  any character that matches \ep{Z} or \eh or \ev
691   \ew  any character that matches \ep{L}, \ep{N}, \ep{Mn}, or \ep{Pc}
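As a rough illustration of the property-based definitions of \ed and \ew above, the following Python sketch classifies single characters using the general categories reported by the standard unicodedata module. It is not how PCRE2 implements the lookup; the function names are hypothetical, and the Unicode version is whatever the Python build ships, which may differ from the tables in a given PCRE2 release. The \es case is left out because the full \eh and \ev lists are not reproduced here.

```python
import unicodedata


def ucp_is_digit(ch: str) -> bool:
    """Approximate the UCP meaning of the digit escape: general category Nd."""
    return unicodedata.category(ch) == "Nd"


def ucp_is_word(ch: str) -> bool:
    """Approximate the UCP meaning of the word escape: L, N, Mn, or Pc."""
    category = unicodedata.category(ch)
    return category[0] in "LN" or category in ("Mn", "Pc")


if __name__ == "__main__":
    for ch in ["7", "\u0663", "\u00b2", "_", "\u00e9", "\u0301", "-"]:
        print(f"U+{ord(ch):04X}", ucp_is_digit(ch), ucp_is_word(ch))
```

Note that U+00B2 (superscript two) has category No, so in this sketch it counts as a word character but not as a decimal digit.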
693 The addition of \ep{Mn} (non-spacing mark) and the replacement of an explicit
699 other character categories. Note also that PCRE2_UCP affects \eb, and
718   U+00A0     Non-break space
725   U+2004     Three-per-em space
726   U+2005     Four-per-em space
727   U+2006     Six-per-em space
732   U+202F     Narrow no-break space
746 In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
754 Outside a character class, by default, the escape sequence \eR matches any
755 Unicode newline sequence. In 8-bit non-UTF-8 mode \eR is equivalent to the
765 This particular group matches either the two-character sequence CR followed by
768 line, U+0085). Because this is an atomic group, the two-character sequence is
787 Note that these special settings, which are not Perl-compatible, are recognized
795 character class, \eR is treated as an unrecognized escape sequence, and causes
800 .SS "Unicode character properties"
805 can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
807 less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points
812 multistage table lookup in order to find a character's property. That is why
819   \ep{\fIxx\fP}   a character with the \fIxx\fP property
820   \eP{\fIxx\fP}   a character without the \fIxx\fP property
823 The property names represented by \fIxx\fP above are not case-sensitive, and in
826 general category properties, "Any", which matches any character (including
843 character has a basic script and, optionally, a list of other scripts ("Script
853 Unassigned characters (and in non-UTF 32-bit mode, characters with code points
856 of recognized script names and their 4-character abbreviations can be obtained
859   pcre2test -LS
867 Each character has exactly one Unicode general category property, specified by
868 a two-letter abbreviation. For compatibility with Perl, negation can be
899   Mn    Non-spacing mark
927 matches a character that has the Lu, Ll, or Lt property, in other words, a
932 character when PCRE2 is not in UTF mode (using the 16-bit or 32-bit library).
945 No character that is in the Unicode table has the Cn (unassigned) property.
961   pcre2test -LP
968   \ep{Bidi_Class:<class>}   matches a character with the given class
969   \ep{BC:<class>}           matches a character with the given class
971 The recognized classes are:
982   L           left-to-right
983   LRE         left-to-right embedding
984   LRI         left-to-right isolate
985   LRO         left-to-right override
986   NSM         non-spacing mark
990   R           right-to-left
991   RLE         right-to-left embedding
992   RLI         right-to-left isolate
993   RLO         right-to-left override
998 case-insensitive; only the short names listed above are recognized.
1010 Unicode supports various kinds of composite character by giving each character
1015 Instead it introduced various emoji-specific properties. PCRE2 uses only the
1018 \eX always matches at least one character. Then it decides whether to add
1023 2. Do not end between CR and LF; otherwise end after any control character.
1026 are of five types: L, V, T, LV, and LVT. An L character may be followed by an
1027 L, V, LV, or LVT character; an LV or V character may be followed by a V or T
1028 character; an LVT or T character may be followed only by a T character.
1030 4. Do not end before extending characters or spacing marks or the zero-width
1031 joiner (ZWJ) character. Characters with the "mark" property always have the
1036 6. Do not end within emoji modifier sequences or emoji ZWJ (zero-width
1037 joiner) sequences. An emoji ZWJ sequence consists of a character with the
1039 with the Extend property, followed by the ZWJ character, followed by another
1040 Extended_Pictographic character.
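Rule 3 above, the Hangul syllable rule, is essentially a small transition table. The Python sketch below encodes only that rule: it takes a list of Hangul syllable type labels (L, V, T, LV, LVT) and splits it wherever the next type may not follow the previous one. The names HANGUL_FOLLOWERS, hangul_may_continue, and split_hangul are hypothetical, and the other \eX rules are deliberately ignored.

```python
# For each Hangul syllable type, the types that may follow it in one cluster.
HANGUL_FOLLOWERS = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def hangul_may_continue(previous: str, following: str) -> bool:
    """Return True if a character of type following may extend the cluster."""
    return following in HANGUL_FOLLOWERS.get(previous, set())


def split_hangul(types):
    """Group a sequence of syllable types into clusters using rule 3 only."""
    clusters = []
    for current in types:
        if clusters and hangul_may_continue(clusters[-1][-1], current):
            clusters[-1].append(current)
        else:
            clusters.append([current])
    return clusters


if __name__ == "__main__":
    print(split_hangul(["L", "V", "T", "L", "LV", "T", "T", "V"]))
```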
1055 and \es to use Unicode properties. PCRE2 uses these non-standard, non-Perl
1059   Xan   Any alphanumeric character
1060   Xps   Any POSIX space character
1061   Xsp   Any Perl space character
1062   Xwd   Any Perl "word" character
1066 carriage return, and any other character that has the Z (separator) property.
1069 those that match Mn (non-spacing mark) or Pc (connector punctuation, which
1072 There is another non-standard property, Xuc, which matches any character that
1073 can be represented by a Universal Character Name in C++ and other programming
1077 excluded. (Universal Character Names are of the form \euHHHH or \eUHHHHHHHH
1120 \fBpcre2_compile()\fP to re-enable the previous behaviour. When this option is
1156 Inside a character class, \eb has a different meaning; it matches the backspace
1157 character. If any other of these assertions appears in a character class, an
1160 A word boundary is a position in the subject string where the current character
1161 and the previous character do not both match \ew or \eW (i.e. one matches
1163 first or last character matches \ew, respectively. When PCRE2 is built with
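The word boundary definition above can be restated as: exactly one of the previous and current characters is a word character, where a missing character at either end of the subject counts as a non-word character. A minimal Python sketch of that test follows. It uses Python's re module with the ASCII flag purely to decide what a word character is; the function names are hypothetical, and locale-specific or Unicode-property behaviour is not modelled.

```python
import re


def is_word_char(ch: str) -> bool:
    """ASCII-only word character test: letter, digit, or underscore."""
    return re.fullmatch(r"\w", ch, re.ASCII) is not None


def is_word_boundary(subject: str, pos: int) -> bool:
    """Return True if pos (0..len(subject)) is a word boundary in subject."""
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    # A boundary exists when exactly one side is a word character.
    return before != after


if __name__ == "__main__":
    text = "ab, cd!"
    print([i for i in range(len(text) + 1) if is_word_boundary(text, i)])
```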
1176 argument of \fBpcre2_match()\fP is non-zero, indicating that matching is to
1185 \fIstartoffset\fP is non-zero. By calling \fBpcre2_match()\fP multiple times
1190 character of the matching process, is subtly different from Perl's, which
1203 The circumflex and dollar metacharacters are zero-width assertions. That is,
1207 only the two-character sequence CRLF is recognized as a newline, isolated CR
1211 Outside a character class, in the default matching mode, the circumflex
1212 character is an assertion that is true only if the current matching point is at
1214 \fBpcre2_match()\fP is non-zero, or if PCRE2_NOTBOL is set, circumflex can
1215 never match if the PCRE2_MULTILINE option is unset. Inside a character class,
1222 Circumflex need not be the first character of the pattern if a number of
1230 The dollar character is an assertion that is true only if the current matching
1234 character of the pattern if a number of alternatives are involved, but it
1236 special meaning in a character class.
1243 PCRE2_MULTILINE option is set. When this is the case, a dollar character
1254 when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The
1262 below) recognizes the two-character sequence CRLF as a newline, this is
1278 Outside a character class, a dot in the pattern matches any one character in
1279 the subject string except (by default) a character that signifies the end of a
1287 Dot never matches a single line-ending character. When the two-character
1295 PCRE2_DOTALL option is set, a dot matches any one character, without exception.
1296 If the two-character sequence CRLF is present in the subject string, it takes
1301 special meaning in a character class.
1305 it matches any character except one that signifies the end of a line.
1311 "Non-printing characters"
1320 Outside a character class, the escape sequence \eC matches any one code unit,
1321 whether or not a UTF mode is set. In the 8-bit library, one code unit is one
1322 byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
1323 32-bit unit. Unlike a dot, \eC always matches line-ending characters. The
1324 feature is provided in Perl in order to match individual bytes in UTF-8 mode,
1328 with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
1329 with a malformed UTF character. This has undefined results, because PCRE2
1330 assumes that it is matching character by character in a valid UTF string (by
1343 in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
1346 The former gives a match-time error; the latter fails to optimize and so the
1349 In the 32-bit library, however, \eC is always supported (when not explicitly
1350 locked out) because it always matches a single code unit, whether or not UTF-32
1354 it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
1355 lookahead to check the length of the next character, as in this pattern, which
1356 could be used with a UTF-8 string (ignore white space and line breaks):
1358   (?| (?=[\ex00-\ex7f])(\eC) |
1359       (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
1360       (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
1361       (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
1369 below). The assertions at the start of each branch check the next UTF-8
1370 character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
1371 character's individual bytes are then captured by the appropriate number of
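The four lookahead ranges in the pattern above correspond to the number of bytes that UTF-8 uses for a code point. The Python sketch below reproduces that mapping and compares it with the length of Python's own UTF-8 encoding. The function name utf8_length is hypothetical. The pattern's upper bound of 1fffff is kept as written, although Unicode itself stops at U+10FFFF.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for code_point, using the pattern's ranges."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point is outside the ranges used by the pattern")


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        encoded = chr(cp).encode("utf-8")
        print(hex(cp), utf8_length(cp), len(encoded))
```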
1376 .SH "SQUARE BRACKETS AND CHARACTER CLASSES"
1379 An opening square bracket introduces a character class, terminated by a closing
1382 the first data character in the class (after an initial circumflex, if present)
1387 A character class matches a single character in the subject. A matched
1388 character must be in the set of characters defined by the class, unless the
1389 first character in the class definition is a circumflex, in which case the
1390 subject character must not be in the set defined by the class. If a circumflex
1392 character, or escape it with a backslash.
1394 For example, the character class [aeiou] matches any lower case vowel, while
1395 [^aeiou] matches any character that is not a lower case vowel. Note that a
1398 circumflex is not an assertion; it still consumes a character from the subject
1408 are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S)
1412 when matching character classes, whatever line-ending sequence is in use, and
1416 The generic character type escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es,
1417 \eS, \ev, \eV, \ew, and \eW may appear in a character class, and add the
1421 outside a character class, as described in the section entitled
1424 "Generic character types"
1426 above. The escape sequence \eb has a different meaning inside a character
1427 class; it matches the backspace character. The sequences \eB, \eR, and \eX are
1428 not special inside a character class. Like any other unrecognized escape
1432 The minus (hyphen) character can be used to specify a range of characters in a
1433 character class. For example, [d-m] matches any letter between d and m,
1434 inclusive. If a minus character is required in a class, it must be escaped with
1436 indicating a range, typically as the first or last character in the class,
1437 or immediately after a range. For example, [b-d-z] matches letters in the range
1438 b to d, a hyphen character, or z.
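The [b-d-z] example can be tried out quickly. The sketch below uses Python's re module, which is a different engine from PCRE2; it is shown only because it happens to treat a hyphen that immediately follows a range as a literal in the same way, and that agreement should not be assumed for other class syntax.

```python
import re

# A range b-d, then a literal hyphen, then a literal z.
HYPHEN_CLASS = re.compile(r"[b-d-z]")


def members(candidates: str):
    """Return the characters from candidates that the class accepts."""
    return [ch for ch in candidates if HYPHEN_CLASS.fullmatch(ch)]


if __name__ == "__main__":
    print(members("abcde-yz"))
```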
1440 Perl treats a hyphen as a literal if it appears before or after a POSIX class
1441 (see below) or before or after a character type escape such as \ed or \eH.
1442 However, unless the hyphen is the last character in the class, Perl outputs a
1446 It is not possible to have the literal character "]" as the end character of a
1447 range. A pattern such as [W-]46] is interpreted as a class of two characters
1448 ("W" and "-") followed by a literal string "46]", so it would match "W46]" or
1449 "-46]". However, if the "]" is escaped with a backslash it is interpreted as
1450 the end of range, so [W-\e]46] is interpreted as a class containing a range
1456 example [\e000-\e037]. Ranges can include any characters that are valid for the
1457 current mode. In any UTF mode, the so-called "surrogate" characters (those
1460 this check). However, ranges such as [\ex{d7ff}-\ex{e000}], which include the
1466 example, [h-k] matches only four characters, even though the codes for h and k
1468 specified numerically, for example, [\ex88-\ex92] or [h-\ex92], all code points
1472 matches the letters in either case. For example, [W-c] is equivalent to
1473 [][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
1474 tables for a French locale are in use, [\exc8-\excb] matches accented E
1477 A circumflex can conveniently be used with the upper case character types to
1480 whereas [\ew] includes underscore. A positive character class should be read as
1484 The only metacharacters that are recognized in character classes are backslash,
1487 introducing a POSIX class name, or for a special compatibility feature - see
1489 escaping other non-alphanumeric characters does no harm.
1492 .SH "POSIX CHARACTER CLASSES"
1495 Perl supports the POSIX notation for character classes. This uses names
1501 matches "0", "1", any alphabetic character, or "%". The supported class names
1506   ascii    character codes 0 - 127
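For readers who need to port a pattern that uses POSIX class names to an engine without them, the idea can be sketched as a textual substitution. The Python example below is an assumption-laden illustration, not a PCRE2 feature: the table POSIX_ASCII and the function expand_posix are hypothetical, only a handful of names are covered, the ranges are ASCII-only, negated names such as [:^digit:] are not handled, and the substitution does not check that the item really sits inside a character class.

```python
import re

# ASCII-only stand-ins for a few POSIX class names (illustrative subset).
POSIX_ASCII = {
    "alpha": "A-Za-z",
    "digit": "0-9",
    "alnum": "A-Za-z0-9",
    "upper": "A-Z",
    "lower": "a-z",
    "xdigit": "0-9A-Fa-f",
    "ascii": r"\x00-\x7f",
}


def expand_posix(pattern: str) -> str:
    """Replace [:name:] items with ASCII ranges usable in a Python character class."""
    def replace(match):
        name = match.group(1)
        if name not in POSIX_ASCII:
            raise ValueError(f"unsupported POSIX class name: {name}")
        return POSIX_ASCII[name]

    return re.sub(r"\[:([a-z]+):\]", replace, pattern)


if __name__ == "__main__":
    converted = expand_posix("[01[:alpha:]%]")
    print(converted)
    print([ch for ch in "01q%2" if re.fullmatch(converted, ch)])
```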
1520 and space (32). If locale-specific matching is taking place, the list of space
1525 5.8. Another Perl extension is negation, which is indicated by a ^ character
1530 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
1535 POSIX character classes, although this may be different for characters in the
1536 range 128-255 when locale-specific matching is happening. However, in UCP mode,
1537 unless certain options are set (see below), some of the classes are changed so
1538 that Unicode character properties are used. This is achieved by replacing
1539 POSIX classes with other sequences, as follows:
1551 Negated versions, such as [:^alpha:] use \eP instead of \ep. Four other POSIX
1552 classes are handled specially in UCP mode:
1561   U+2066 - U+2069  Various "isolate"s
1578 The other POSIX classes are unchanged by PCRE2_UCP, and match only characters
1581 There are two options that can be used to restrict the POSIX classes to ASCII
1584 (?aT) and (?-aT). The PCRE2_EXTRA_ASCII_POSIX option disables UCP processing
1585 for all POSIX classes, including [:digit:] and [:xdigit:]. Within a pattern,
1586 (?aP) and (?-aP) set and unset both these options for consistency.
1592 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
1599 Only these exact character sequences are recognized. A sequence such as
1600 [a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is
1608 above), and in a Perl-style pattern the preceding or following character
1610 used above in order to give exactly the POSIX behaviour. Note also that the
1612 it also affects these POSIX sequences.
1640 of letters enclosed between "(?" and ")". The following are Perl-compatible,
1656 example (?-im). The two "extended" options are not independent; unsetting
1659 A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
1665 If the first character following (? is a circumflex, it causes all of the above
1667 be re-instated, but a hyphen may not appear.
1669 Some PCRE2-specific options can be changed by the same mechanism using these
1682 (?-imnrsx). If 'a' is not followed by any of the upper case letters shown
1686 is set, but including it in (?aP) means that (?-aP) suppresses all ASCII
1687 restrictions for POSIX classes.
1710 a non-capturing group (see the next section), the option letters may
1718 \fBNote:\fP There are other PCRE2-specific options, applying to the whole
1776 a non-capturing group, the option letters may appear between the "?" and the
1794 itself a non-capturing group. For example, consider this pattern:
1808   # before  ---------------branch-reset----------- after
1823 A relative reference such as (?-1) is no different: it is just a convenient way
1831 for a group's having matched refers to a non-unique number, the test is
1851 alphanumeric characters and underscores, but must start with a non-digit. When
1856   ^[_A-Za-z][_A-Za-z0-9]*\ez   when PCRE2_UTF is not set
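The non-UTF rule for group names can be checked with the same regular expression in most engines. The Python sketch below is a minimal example: fullmatch() stands in for the ^ and \ez anchors, the function name is hypothetical, and any length limit on names is not checked.

```python
import re

# Same shape as the pattern quoted above; fullmatch() supplies the anchoring.
GROUP_NAME = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")


def is_valid_group_name(name: str) -> bool:
    """Return True if name satisfies the non-UTF naming rule shown above."""
    return GROUP_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("byte", "_x1", "1st", "", "a-b"):
        print(repr(candidate), is_valid_group_name(candidate))
```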
1879 name-to-number translation table from a compiled pattern, as well as
1894 compile-time error. However, there is still scope for confusion. Consider this
1919 either as a 3-letter abbreviation or as the full name, and in both cases you
1937 If you make a backreference to a non-unique named group from elsewhere in the
1946 If you make a subroutine call to a non-unique named group, the one that
1974   a literal data character
1979   any escape sequence that matches a single character
1980   a character class
1994 character. If the second number is omitted, but the comma is present, there is
2014 quantifier, the brace is taken as a literal character. In particular, this
2022 which is represented by a two-byte sequence in a UTF-8 string. Similarly,
2041 For convenience, the three most common quantifiers have single-character
2102 character position in the subject string, so there is no point in retrying the
2117 If the subject is "xyz123abc123" the match point is the fourth character. For
2150 re-evaluated to see if a different number of repeats allows the rest of the
2163 that once a group has matched, it is not to be re-evaluated in this way.
2195 additional + character following a quantifier. Using this notation, the
2229 matches an unlimited number of substrings that either consist of non-digits, or
2238 than a single character at the end, because both PCRE2 and Perl have an
2239 optimization that allows for fast failure when a single character is used. They
2240 remember the last single character that is required for a match, and fail early
2246 sequences of non-digits cannot be broken, and failure happens quickly.
2253 Outside a character class, a backslash followed by a digit greater than 0 (and
2267 interpreted as a character defined in octal. See the subsection entitled
2268 "Non-printing characters"
2290   (abc(def)ghi)\eg{-1}
2292 The sequence \eg{-1} is a reference to the capture group whose number is one
2294 the next group would be numbered 3) it is equivalent to \e2, and \eg{-2} would
2296 that group is included in the count, so in this example \eg{-2} also refers to
2299   (A)(\eg{-2}B)
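The arithmetic behind relative references is small enough to write down. In the Python sketch below, groups_started is the number of capture groups whose opening parenthesis has been read when the reference is encountered, including any group that encloses the reference. The function name resolve_relative is hypothetical, and the sketch does no pattern parsing of its own.

```python
def resolve_relative(groups_started: int, offset: int) -> int:
    """Convert a negative relative reference into an absolute group number."""
    if offset >= 0:
        raise ValueError("this sketch handles negative offsets only")
    # The next group to be started would be numbered groups_started + 1.
    absolute = groups_started + 1 + offset
    if absolute < 1:
        raise ValueError("reference points before the first group")
    return absolute


if __name__ == "__main__":
    # (abc(def)ghi) followed by a -1 reference: two groups have started.
    print(resolve_relative(2, -1))
    # (A)( then a -2 reference inside the second group: two groups have started.
    print(resolve_relative(2, -2))
```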
2357 continues with a digit character, some delimiter must be used to terminate the
2377 the group, the backreference matches the character string corresponding to the
2413 The Perl-compatible lookaround assertions are atomic. If an assertion is true,
2415 assertion. However, there are some cases where non-atomic assertions can be
2419 "Non-atomic assertions"
2421 below, but they are not Perl-compatible.
2435 such as (.)\eg{-1} can be used to check that two adjacent characters are the
2528 If every top-level alternative matches a fixed length, for example
2538 which one or more top-level alternatives can match more than one string length,
2553 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a
2564 as the called capture group matches a limited-length string. However,
2578 same character:
2595 there is no following "a"), it backtracks to match all but the last character,
2644 .SH "NON-ATOMIC ASSERTIONS"
2649 assertion. However, there are some cases where non-atomic positive assertions
2655 Consider the problem of finding the right-most word in a string that also
2667 succeeds, it captures the right-most word in the string.
2674 assertion could not be re-entered, and the whole match would fail. The pattern
2677 Using a non-atomic lookahead, however, means that when the last word does not
2678 occur twice in the string, the lookahead can backtrack and find the second-last
2682 Two conditions must be met for a non-atomic assertion to be useful: the
2686 as before because nothing has changed, so using a non-atomic assertion just
2689 There is one exception to backtracking into a non-atomic assertion. If an
2693 Non-atomic assertions are not supported by the alternative matching function
2728 the matched characters in a sequence of non-spaces that follow white space are
2738 This works as long as the first character is expected to be a character in that
2743   \es+(?=[0-9_.]*\ep{Latin})(*sr:\eS+)
2757 support. A compile-time error is given if any of the above constructs is
2780   (?(condition)yes-pattern)
2781   (?(condition)yes-pattern|no-pattern)
2783 If the condition is satisfied, the yes-pattern is used; otherwise the
2784 no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
2786 group, a compile-time error occurs. Each of the two alternatives may itself
2795 recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
2813 this condition) can be referenced by (?(-1), the next most recent by (?(-2),
2816 value zero in any of these forms is not used; it provokes a compile-time error.
2818 Consider the following pattern, which contains non-significant white space to
2825 character is present, sets it as the first captured substring. The second part
2829 the condition is true, and so the yes-pattern is executed and a closing
2830 parenthesis is required. Otherwise, since no-pattern is not present, the
2832 sequence of non-parentheses, optionally enclosed in parentheses.
2837   ...other stuff... ( \e( )?    [^()]+    (?(-1) \e) ) ...
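The optional-parenthesis example can be tried in Python, whose re module also supports conditional groups that test whether a numbered group has matched. This is a sketch with a different engine: Python accepts only absolute numbers or names in the condition, so (?(1)...) is used here instead of the relative form, and the non-significant white space is removed.

```python
import re

# Optional "(", then non-parentheses, then ")" only if group 1 matched.
PAREN_OPTIONAL = re.compile(r"(\()?[^()]+(?(1)\))")


def is_balanced_item(text: str) -> bool:
    """Return True if text is a run of non-parentheses, optionally wrapped in ()."""
    return PAREN_OPTIONAL.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("(abc)", "abc", "(abc", "abc)"):
        print(sample, is_balanced_item(sample))
```

The last two samples fail because the closing parenthesis is required exactly when the opening one was matched.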
2862 "Recursion" in this sense refers to any subroutine-like call from one part of
2922   (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) )
2929 pattern uses references to the named group to match the four dot-separated
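Engines without DEFINE or subroutine calls can get a similar effect by building the pattern from a reusable string. The Python sketch below takes that approach with the same "byte" alternatives; it is a substitute technique, not the DEFINE mechanism itself, and the names BYTE, IPV4, and is_ipv4 are hypothetical.

```python
import re

# The same alternatives as the "byte" group, as a reusable non-capturing fragment.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four byte fields separated by dots; re.ASCII limits the digit escape to 0-9.
IPV4 = re.compile(rf"{BYTE}(?:\.{BYTE}){{3}}", re.ASCII)


def is_ipv4(text: str) -> bool:
    """Return True if text consists of four dot-separated values from 0 to 255."""
    return IPV4.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("192.168.1.1", "255.255.255.255", "256.1.1.1", "1.2.3"):
        print(sample, is_ipv4(sample))
```

Unlike a subroutine call, string interpolation copies the fragment four times into the compiled pattern, which is acceptable for a fragment this small.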
2959 non-atomic assertions.
2962 Consider this pattern, again containing non-significant white space, and with
2965   (?(?=[^a-z]*[a-z])
2966   \ed{2}-[a-z]{3}-\ed{2}  |  \ed{2}-\ed{2}-\ed{2} )
2969 sequence of non-letters followed by a letter. In other words, it tests for the
2973 dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
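Where an engine does not accept an assertion as a condition, the same selection can usually be written as two alternatives, one guarded by the positive lookahead and the other by its negation. The Python sketch below shows that rewriting for the example above. It is an emulation under that assumption, not the conditional construct itself, and the name DATE_LIKE is hypothetical.

```python
import re

# First alternative: a letter lies ahead, so require dd-aaa-dd.
# Second alternative: no letter lies ahead, so require dd-dd-dd.
DATE_LIKE = re.compile(
    r"(?=[^a-z]*[a-z])\d{2}-[a-z]{3}-\d{2}"
    r"|(?![^a-z]*[a-z])\d{2}-\d{2}-\d{2}"
)


if __name__ == "__main__":
    for sample in ("12-abc-34", "12-34-56", "12-ab-34", "12-34-56 x"):
        print(repr(sample), DATE_LIKE.match(sample) is not None)
```

The last sample does not match even though it starts with dd-dd-dd: the later letter makes the condition true, so only the first form is tried.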
2978 assertion, whether it succeeds or fails. (Compare non-conditional assertions,
2987 PCRE2. In both cases, the start of the comment must not be in a character
2994 PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
2996 the next newline character or character sequence in the pattern. Which
3007 default newline convention (a single linefeed character) is in force:
3011 On encountering the # character, \fBpcre2_compile()\fP skips along, looking for
3013 it does not terminate the comment. Only an actual character with the code value
3047 non-recursive subroutine
3058 substrings which can either be a sequence of non-parentheses, or a recursive
3061 to avoid backtracking into sequences of non-parentheses.
3073 pattern above you can write (?-2) to refer to the second most recently opened
3085   (?|(a)|(b)) (c) (?-2)
3088 is number 2. When the reference (?-2) is encountered, the second most recently
3099 non-recursive subroutine
3114 non-parentheses is important when applying the pattern to strings that do not
3147 alternatives for the recursive and non-recursive cases. The (?R) item is the
3159 once it had matched some of the subject string, it was never re-entered, even
3164 as atomic. That is, they can be re-entered to try unused alternatives if there
3174 The second branch in the group matches a single central character in the
3178 typical palindromic phrases, the pattern has to ignore all non-word characters,
3185 avoid backtracking into sequences of non-word characters. Without this, PCRE2
3216   (...(relative)...)...(?-1)...
3236 Processing options such as case-independence are fixed when a group is
3240   (abc)(?i:(?-1))
3262 For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
3273   (abc)(?i:\eg<-1>)
3309 one side-effect is that sometimes callouts are skipped. If you need all
3369 is no longer Perl-compatible.
3374 \ex{100} that define character code points. Character type escapes such as \ed
3380 skipped, and #-comments are recognized, exactly as in the rest of the pattern.
3384 The maximum length of a name is 255 in the 8-bit library and 65535 in the
3385 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
3419 minimum length of matching subject, or that a particular character must be
3422 the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
3507 When a match succeeds, the name of the last-encountered mark name on the
3541 name is recorded and passed back if it is the last-encountered. This does not
3606 unless PCRE2's start-of-match optimizations are turned off, as shown in this
3620 the optimization that skips along to the first character. The pattern is now
3629 starting character then happens. Backtracking can occur as usual to the left of
3645 pattern is unanchored, the "bumpalong" advance is not to the next character,
3653 the first character in the string), the starting point skips on to start the
3656 first match attempt, the second attempt would start at the second character
3675 assertions, because they are never re-entered by backtracking. Compare the
3690 the second branch of the pattern to be tried at the first character position.
3693 matching attempt to start at the second character. This time, the (*MARK) is
3705 pattern-based if-then-else block:
3721 A group that does not contain a | character is just a part of the enclosing
3741 because only one is ever used. In other words, the | character in a conditional
3748 character "b" is matched, but "c" is not. At this point, matching does not
3750 character. The conditional group is part of the single alternative that
3757 starting position, but allowing an advance to the next character (for an
3759 than one character. (*COMMIT) is the strongest, causing the entire match to
3820 reach them. This means that, for the Perl-compatible assertions, their effect
3826 PCRE2 now supports non-atomic positive assertions, as described in the section
3830 "Non-atomic assertions"
3833 not Perl-compatible. For these assertions, a later backtrack does jump back
3895 Copyright (c) 1997-2024 University of Cambridge.