Lines Matching +full:- +full:- +full:without +full:- +full:perl

1 -----------------------------------------------------------------------------
8 -----------------------------------------------------------------------------
16 PCRE2 - Perl-compatible regular expressions (revised API)
22 pattern matching using the same syntax and semantics as Perl, with just
25 API is more extensible, and it was simplified by abolishing the sepa-
30 As well as Perl-style regular expression patterns, some features that
31 appeared in Python and the original PCRE before they appeared in Perl
37 The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or
38 32-bit code units, which means that up to three separate libraries may
39 be installed. The original work to extend PCRE to 16-bit and 32-bit
40 code units was done by Zoltan Herczeg and Christian Persch, respec-
42 character per code unit, or as UTF-encoded Unicode, with support for
45 code units must be enabled explicitly at run time. The version of Uni-
48 pcre2test -C
51 ending in _8, _16, or _32, respectively (for example, pcre2_com-
57 In addition to the Perl-compatible matching function, PCRE2 contains an
58 alternative function that matches the same compiled patterns in a dif-
63 Details of exactly which Perl regular expression features are and are
70 client to discover which features are available. The features them-
71 selves are described in the pcre2build page. Documentation about build-
73 NON-AUTOTOOLS_BUILD files in the source distribution.
86 If you are using PCRE2 in a non-UTF application that permits users to
89 For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
90 mode, which interprets patterns and subjects as strings of UTF-8 code
91 units instead of individual 8-bit characters. This causes both the pat-
92 tern and any data against which it is matched to be checked for UTF-8
93 validity. If the data string is very long, such a check might use suf-
94 ficiently many resources as to cause your application to lose perfor-
97 One way of guarding against this possibility is to use the pcre2_pat-
100 calling pcre2_compile(). This causes a compile time error if the pat-
101 tern contains a UTF-setting sequence.
104 be enabled from within the pattern, by specifying "(*UCP)". This fea-
112 The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
114 middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C
116 a compile-time error if it is encountered. It is also possible to build
121 Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
124 pcre2_set_depth_limit() that can be used to restrict the amount of mem-
130 The user documentation for PCRE2 comprises a number of different sec-
136 (which is a program listing), and the short pages for individual func-
137 tions, are concatenated in pcre2.txt, for ease of searching. The sec-
141 pcre2-config show PCRE2 installation configuration information
145 pcre2compat discussion of Perl compatibility
148 pcre2grep description of the pcre2grep command (8-bit only)
149 pcre2jit discussion of just-in-time optimization support
156 pcre2posix the POSIX-compatible C API for the 8-bit library
181 Copyright (c) 1997-2018 University of Cambridge.
182 ------------------------------------------------------------------------------
190 PCRE2 - Perl-compatible regular expressions (revised API)
195 contains a description of all its native functions. See the pcre2 docu-
448 These functions provide a way of converting non-PCRE2 patterns into
455 PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
457 There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
460 for all three libraries. One, two, or all three can be installed simul-
461 taneously. On Unix-like systems the libraries are called libpcre2-8,
462 libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
485 macros are defined whose names are the generic forms such as pcre2_com-
487 PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
505 single library. For example, if you want to run a match using a pat-
511 their generic names, without the _8, _16, or _32 suffix.
517 There are also some wrapper functions for the 8-bit library that corre-
529 program against a non-dll PCRE2 library, you must define PCRE2_STATIC
533 and matching regular expressions in a Perl-compatible manner. A sample
540 passed as bits in an options argument. There are also some more compli-
546 Just-in-time (JIT) compiler support is an optional feature of PCRE2
561 less sanity checking. The JIT-specific functions are discussed in the
564 A second matching function, pcre2_dfa_match(), which is not Perl-com-
569 return captured substrings. A description of the two matching algo-
588 pcre2_substring_free() and pcre2_substring_list_free() are also pro-
590 functions is called with a NULL argument, the function returns immedi-
591 ately without doing anything.
600 Finally, there are functions for finding out information about a com-
615 ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
623 strings: a single CR (carriage return) character, a single LF (line-
624 feed) character, the two-character sequence CRLF, any of the three pre-
642 dollar metacharacters, the handling of #-comments in /x mode, and, when
643 CRLF is a recognized line ending sequence, the match position advance-
644 ment for a non-anchored pattern. There is more detail about this in the
654 In a multithreaded application it is important to keep thread-specific
656 library code itself is thread-safe: it contains no static or global
657 variables. The API is designed to be fairly simple for non-threaded
658 applications while at the same time ensuring that multithreaded appli-
661 There are several different blocks of data that are used to pass infor-
669 is thread-safe, that is, the same compiled pattern can be used by more
672 use them. However, if the just-in-time (JIT) optimization feature is
682 Get a read-only (shared) lock (mutex) for pointer
694 If JIT is being used, but the JIT compilation is not being done immedi-
700 obtain a private copy of the compiled code before calling the JIT com-
709 a PCRE2 function without using lots of arguments. The parameters that
713 In a multithreaded application, if the parameters in a context are val-
716 it must make its own thread-specific copy.
721 of a match. This includes details of what was matched, as well as addi-
730 memory management or non-standard character tables. To keep function
739 relevant for several PCRE2 operations, a compile-time context, and a
740 match-time context.
764 function may be NULL, in which case the system memory management func-
786 without doing anything.
790 A compile context is required if you want to provide an external func-
792 values of any of the following compile-time parameters:
801 A compile context is also required if you are using custom memory man-
802 agement. If none of these apply, just pass NULL as the context argu-
805 A compile context is created, copied, and freed by the following func-
833 only argument is a general context. This function builds a set of char-
839 As PCRE2 has developed, almost all the 32 option bits that are avail-
842 bits which are used for some newer, assumed rarer, options. This func-
854 largest number that a PCRE2_SIZE variable can hold, which is effec-
860 This specifies which characters or character sequences are to be recog-
863 two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
871 PCRE2_EXTENDED_MORE option, the newline convention affects the recogni-
883 limit applies to parentheses of all kinds, not just capturing parenthe-
899 nesting, and the second is user data that is set up by the last argu-
901 should return zero if all is well, or non-zero to force an error.
917 A match context is created, copied, and freed by the following func-
937 during a matching operation. Details are given in the pcre2callout doc-
957 option when calling pcre2_compile() so that when JIT is in use, differ-
958 ent code can be compiled. If a match is started with a non-default
959 offset limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is gener-
967 the first line and also within the offset limit. In other words, which-
976 also applies to pcre2_dfa_match(), which may use the heap when process-
978 atomic groups. This limit does not apply to matching with the JIT opti-
994 The pcre2_match() function starts out using a 20KiB vector on the sys-
995 tem stack for recording backtracking points. The more nested backtrack-
998 too small. If the heap limit is set to a value less than 21 (in partic-
1000 that do not have a lot of nested backtracking can be successfully pro-
1025 When pcre2_match() is called with a pattern that was successfully pro-
1088 CHECKING BUILD-TIME OPTIONS
1107 non-negative on success, or the negative error code PCRE2_ERROR_BADOP-
1108 TION if the value in the first argument is not recognized. The follow-
1122 unit widths were selected when PCRE2 was built. The 1-bit indicates
1123 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
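The bit assignments described in lines 1122-1123 can be tested with plain bit masks. The following is a minimal sketch in C, not part of PCRE2: the enum names and both helper functions are invented for illustration, and the mask is assumed to have been obtained already (in a real program it would presumably come from pcre2_config() with PCRE2_CONFIG_COMPILED_WIDTHS).

```c
#include <assert.h>
#include <stdint.h>

/* Bit values taken from the description above: 1 means 8-bit support,
   2 means 16-bit, 4 means 32-bit. The enum names are invented here. */
enum { SKETCH_WIDTH_8 = 1, SKETCH_WIDTH_16 = 2, SKETCH_WIDTH_32 = 4 };

/* Return 1 if `mask` reports support for code units of `bits` width
   (which must be 8, 16, or 32), otherwise 0. */
int width_supported(uint32_t mask, int bits)
{
    assert(bits == 8 || bits == 16 || bits == 32);
    switch (bits) {
    case 8:  return (mask & SKETCH_WIDTH_8) != 0;
    case 16: return (mask & SKETCH_WIDTH_16) != 0;
    default: return (mask & SKETCH_WIDTH_32) != 0;
    }
}

/* Count how many of the three libraries the mask reports. */
int count_widths(uint32_t mask)
{
    return width_supported(mask, 8) + width_supported(mask, 16)
         + width_supported(mask, 32);
}
```

A mask of 5, for instance, would indicate that the 8-bit and 32-bit libraries were built but the 16-bit library was not.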
1130 recursions, lookarounds, and atomic groups in pcre2_dfa_match(). Fur-
1143 just-in-time compiling is available; otherwise it is set to zero.
1162 the 16-bit library is compiled, a value of 3 is rounded up to 4, and
1163 when the 32-bit library is compiled, internal linkages always use 4
1166 The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1202 The output is a uint32_t integer that gives the maximum depth of nest-
1212 This parameter is obsolete and should not be used in new code. The out-
1220 without Unicode support, the buffer is filled with the text "Unicode
1236 PCRE2 version string, zero-terminated. The number of code units used is
1237 returned. This is the length of the string plus one unit for the termi-
1255 length (in code units). If the pattern is zero-terminated, the length
1260 If the compile context argument ccontext is NULL, memory for the com-
1265 NULL argument, it returns immediately, without doing anything.
1270 below), the JIT information cannot be copied (because it is position-
1271 dependent). The new copy can initially be used only for non-JIT match-
1276 a multithreaded application to acquire a private copy of shared com-
1285 pointing to the new tables. The memory for the new tables is automati-
1293 After running a match, you must not free a compiled pattern (or a sub-
1300 particular, those that are compatible with Perl, but some others as
1304 For those options that can be different in different parts of the pat-
1310 Other, less frequently required compile-time parameters (for example,
1314 If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1318 error has occurred. The values are not defined when compilation is suc-
1319 cessful and pcre2_compile() returns a non-NULL value.
1322 return if it finds an error in the pattern. There are also some nega-
1328 "Obtaining a textual error message" below) should be self-explanatory.
1332 The value returned in erroroffset is an indication of where in the pat-
1336 the failing assertion. For an invalid UTF-8 or UTF-16 string, the off-
1342 mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1345 This code fragment shows a typical straightforward call to pcre2_com-
1353 PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */
1368 only way to do it in Perl.
1372 By default, for compatibility with Perl, a closing square bracket that
1383 (1) \U matches an upper case "U" character; by default \U causes a com-
1384 pile time error (Perl uses \U to upper case subsequent characters).
1388 code point to match. By default, \u causes a compile time error (Perl
1393 code point to match. By default, as in Perl, a hexadecimal number is
1403 Perl. If you want a multiline circumflex also to match after a termi-
1408 By default, for compatibility with Perl, the name in any verb sequence
1417 whitespace in verb names is skipped and #-comments are recognized,
1423 items, all with number 255, before each pattern item, except immedi-
1424 ately before or after an explicit callout in the pattern. For discus-
1430 case letters in the subject. It is equivalent to Perl's /i option, and
1437 points (available only in 16-bit or 32-bit mode) are treated as not
1443 at the end of the subject string. Without this option, a dollar also
1447 Perl, and no way to set it within a pattern.
1453 ever matches one character, even if newlines are coded as CRLF. Without
1454 this option, a dot does not match when the current position in the sub-
1455 ject is at a newline. This option is equivalent to Perl's /s option,
1456 and it can be changed within a pattern by a (?s) option setting. A neg-
1458 escape sequence always matches a non-newline character, independent of
1475 patterns, a new match is then tried at the next starting point. How-
1486 which is the only way to do it in Perl.
1490 matches, which are necessarily substrings of the first one, must obvi-
1496 totally ignored except when escaped or inside a character class. How-
1498 introduce various parenthesized subpatterns, nor within numerical quan-
1500 item and a following quantifier and between a quantifier and a follow-
1502 Perl's /x option, and it can be changed within a pattern by a (?x)
1505 When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recog-
1507 256 that are flagged as white space in its low-character table. The ta-
1514 When PCRE2 is compiled with Unicode support, in addition to these char-
1515 acters, five more Unicode "Pattern White Space" characters are recog-
1516 nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1517 right mark), U+200F (right-to-left mark), U+2028 (line separator), and
1519 recognized by Perl's /x option. Note that the horizontal and vertical
1523 As well as ignoring most white space, PCRE2_EXTENDED also causes char-
1530 Which characters are interpreted as newlines can be specified by a set-
1532 special sequence at the start of the pattern, as described in the sec-
1542 character class. PCRE2_EXTENDED_MORE is equivalent to Perl's /xx
1543 option, and it can be changed within a pattern by a (?xx) option set-
1550 start of matching, though the matched text may continue over the new-
1551 line. If startoffset is non-zero, the limiting newline is not necessar-
1553 string is "abc\nxyz" (where \n represents a single-character newline) a
1562 If this option is set, all meta-characters in the pattern are disabled,
1565 you are doing a lot of literal matching and are worried about effi-
1580 fails by default, for Perl compatibility. Setting this option makes
1590 string, or before a terminating newline (except when PCRE2_DOL-
1593 behaviour (for ^, $, and dot) is the same as Perl.
1598 start and end. This is equivalent to Perl's /m option, and it can be
1601 subject, for compatibility with Perl. However, you can change this by
1608 This option locks out the use of \C in the pattern that is being com-
1609 piled. This escape can cause unpredictable behaviour in UTF-8 or
1610 UTF-16 modes, because it may leave the current matching point in the
1611 middle of a multi-code-unit character. This option may be useful in
1613 there is also a build-time option that permanently locks out the use of
1628 This option locks out interpretation of the pattern as UTF-8, UTF-16,
1629 or UTF-32, depending on which library is in use. In particular, it pre-
1632 applications that process patterns from external sources. The combina-
1637 If this option is set, it disables the use of numbered capturing paren-
1641 is the same as Perl's /n option. Note that, when this option is set,
1648 If this option is set, it disables "auto-possessification", which is an
1651 are in use, auto-possessification means that some callouts are never
1659 .* is the first significant item in a top-level branch of a pattern,
1662 atomic group or a capturing group that is the subject of a backrefer-
1663 ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
1679 the matching code searches the subject for that value, and fails imme-
1680 diately if it cannot find it, without actually running the main match-
1684 items are in use, these "start-up" optimizations can cause them to be
1685 skipped if the pattern is never actually used. The start-up optimiza-
1686 tions are in effect a pre-scan of the subject that takes place before
1689 The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1702 start-up optimization scans along the subject, finds "A" and runs the
1703 first match attempt from there. The (*COMMIT) item means that the pat-
1711 There are also other start-up optimizations. For example, a minimum
1728 UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
1742 Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1743 able the error that is given if an escape sequence for an invalid Uni-
1744 code code point is encountered in the pattern. In particular, the so-
1748 section entitled "Extra compile options" below. However, this is pos-
1749 sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1750 resentable in UTF-16.
1760 option is available only if PCRE2 has been compiled with Unicode sup-
1767 not compatible with Perl. It can also be set by a (?U) option setting
1773 is going to be used to set a non-default offset limit in a match con-
1775 offset limit is set without this option. For more details, see the
1783 instead of single-code-unit strings. It is available when PCRE2 is
1792 Unlike the main compile-time options, the extra options are not saved
1799 This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
1800 It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
1802 in UTF-16 to encode code points with values in the range 0x10000 to
1803 0x10ffff. The surrogates cannot therefore be represented in UTF-16.
1804 They can be represented in UTF-8 and UTF-32, but are defined as invalid
1805 code points, and cause errors if encountered in a UTF-8 or UTF-32
1810 when using PCRE2 to check for unwanted characters in UTF-8 strings,
1813 because it applies only to the testing of input strings for UTF valid-
1816 If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
1817 gate code point values in UTF-8 and UTF-32 patterns no longer provoke
1825 escape such as \j or a malformed one such as \x{2z} causes a compile-
1826 time error when detected by pcre2_compile(). Perl is somewhat inconsis-
1828 "j", and non-hexadecimal digits in \x{} are just ignored, though warn-
1829 ings are given in both cases if Perl's warning switch is enabled. How-
1831 Perl.
1835 treated as single-character escapes. For example, \j is a literal "j"
1837 option means that typos in patterns may go undetected and have unex-
1842 This option is provided for use by the -x option of pcre2grep. It
1844 automatically inserting the code for "^(?:" at the start of the com-
1851 This option is provided for use by the -w option of pcre2grep. It
1859 JUST-IN-TIME (JIT) COMPILATION
1879 just-in-time compiler is available, further processes a compiled pat-
1885 for patterns to be analyzed, and for one-off matches and simple pat-
1896 points are less than 256. By default, higher-valued code points never
1897 match escapes such as \w or \d. However, if PCRE2 is built with Uni-
1898 code support, all characters can be tested with \p and \P, or, alterna-
1901 the built-in tables.
1911 default "C" locale of the local system, which may cause them to be dif-
1914 The internal tables can be overridden by tables supplied by the appli-
1916 from the default. As more and more applications change to using Uni-
1933 The locale name "fr_FR" is used on Linux and other Unix-like systems;
1940 pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern, com-
1951 The first argument for pcre2_pattern_info() is a pointer to the com-
1957 the function is zero for success, or one of the following negative num-
1967 typical call of pcre2_pattern_info(), to obtain the length of the com-
1986 options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
1987 TIONS returns the compile options as modified by any top-level (*XXX)
1990 compile context by calling the pcre2_set_compile_extra_options() func-
1996 change within a pattern do not affect the result of PCRE2_INFO_ALLOP-
2000 A pattern compiled without PCRE2_ANCHORED is automatically anchored by
2001 PCRE2 if the first significant item in every top-level branch is one of
2007 .* sometimes - see below
2019 For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
2028 characters of the given group, but in addition, the check that a cap-
2041 Return the highest capturing subpattern number in the pattern. In pat-
2051 PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
2057 In the absence of a single first code unit for a non-anchored pattern,
2058 pcre2_compile() may construct a 256-bit table that defines a fixed set
2062 means "any code unit of value 255 or above". If such a table was con-
2069 a non-anchored pattern. The third argument should point to a uint32_t
2081 The third argument should point to a uint32_t variable. In the 8-bit
2082 library, the value is always less than 256. In the 16-bit library the
2083 value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
2084 value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2091 without the use of JIT. The third argument should point to a size_t
2112 (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2120 Return 1 if the (?J) or (?-J) option setting is used in the pattern,
2122 (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
2127 If the compiled pattern was successfully processed by pcre2_jit_com-
2147 PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2154 contains recursive subroutine calls it is not always possible to deter-
2155 mine whether or not it can match an empty string. PCRE2 takes a cau-
2164 PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
2170 Return the number of characters (not code units) in the longest lookbe-
2172 uint32_t integer. This information is useful when doing multi-segment
2174 assertions \b and \B require a one-character lookbehind. \A also regis-
2175 ters a one-character lookbehind, though it does not actually inspect
2177 from the old segment is retained when a new segment is processed. Oth-
2185 number of characters, which in UTF mode may be different from the num-
2195 PCRE2 supports the use of named as well as numbered capturing parenthe-
2196 ses. The names are just an additional way of identifying the parenthe-
2198 pcre2_substring_get_byname() are provided for extracting captured sub-
2202 do the conversion, you need to use the name-to-number map, which is
2205 The map consists of a number of fixed-size entries. PCRE2_INFO_NAME-
2211 This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit
2212 library, the first two bytes of each entry are the number of the cap-
2213 turing parenthesis, most significant byte first. In the 16-bit library,
2214 the pointer points to 16-bit code units, the first of which contains
2215 the parenthesis number. In the 32-bit library, the pointer points to
2216 32-bit code units, the first of which contains the parenthesis number.
2232 pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
2233 is set, so white space - including newlines - is ignored):
2235 (?<date> (?<year>(\d\d)?\d\d) -
2236 (?<month>\d\d) - (?<day>\d\d) )
2240 with non-printing bytes shown in hexadecimal, and undefined bytes shown
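The 8-bit entry layout described in lines 2211-2213 (a two-byte group number, most significant byte first, followed by the zero-terminated name) can be walked by hand. The sketch below is an illustration only: the function name and the sample table are hypothetical, and the table is hand-built here for the date pattern shown in lines 2235-2236 rather than read from a compiled pattern. The entry size of 8 and the count of 4 are assumptions chosen to fit those four names; real code should obtain the table, the entry size, and the count from the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hand-built table for the four named groups in the date pattern:
   date is group 1, year is 2, month is 4, day is 5 (group 3 is the
   unnamed (\d\d) inside year). Each entry is 8 bytes, padded with zeros. */
const uint8_t sample_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0, 0,
    0, 5, 'd', 'a', 'y', 0, 0, 0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0, 0,
};

/* Search an 8-bit style name table for `name`. Bytes 0 and 1 of each
   entry hold the group number, most significant byte first; the
   zero-terminated name starts at byte 2. Returns the group number,
   or -1 if the name is not present. */
int group_number_from_table(const uint8_t *table, size_t name_count,
                            size_t entry_size, const char *name)
{
    assert(entry_size > 2);
    for (size_t i = 0; i < name_count; i++) {
        const uint8_t *entry = table + i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

Stepping by the entry size rather than by the length of each name is the point of the fixed-size layout: entries can be indexed directly without scanning for terminators.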
2249 name-to-number map, remember that the length of the entries is likely
2263 This identifies the character sequence that will be recognized as mean-
2272 pcre2_compile() is getting memory in which to place the compiled pat-
2275 over-estimate. Processing a pattern with the JIT compiler does not
2291 which they appear. Its first argument is a pointer to a callout enumer-
2293 passed to pcre2_callout_enumerate(). The contents of the callout enu-
2303 PCRE2, with the same code unit width, and must also have the same endi-
2308 the serialized form. They are described in the pcre2serialize documen-
2309 tation. Note that PCRE2 serialization does not convert compiled pat-
2331 you must create a match data block by calling one of the creation func-
2338 pcre2_match_data_create(), so it is always possible to return the over-
2341 The second argument of pcre2_match_data_create() is a pointer to a gen-
2348 right size to hold all the substrings a pattern might capture. The sec-
2374 NULL argument, it returns immediately, without doing anything.
2387 order to find multiple matches in the subject string or to match dif-
2391 operates in a Perl-like manner. For specialist use there is also an
2407 If the subject string is zero-terminated, the length can be given as
2409 common matching parameters are to be changed. For details, see the sec-
2417 bytes for the 8-bit library, 16-bit code units for the 16-bit library,
2418 and 32-bit code units for the 32-bit library, whether or not UTF pro-
2424 by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2425 set must point to the start of a character, or to the end of the sub-
2426 ject (in UTF-32 mode, one code unit equals one character, so all off-
2430 A non-zero starting offset is useful when searching for another match
2445 string again, but with startoffset set to 4, it finds the second occur-
2450 match an empty string. It is possible to emulate Perl's /g behaviour by
2457 so, and the current character is CR followed by LF, advance the start-
2460 If a non-zero starting offset is passed when the pattern is anchored, a
2461 single attempt to match at the given offset is made. This can only suc-
2463 the subject. In other words, the anchoring must be the result of set-
2472 PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PAR-
2475 Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup-
2476 ported by the just-in-time (JIT) compiler. If it is set, JIT matching
2492 matches must be right at the end of the subject string. Note that set-
2499 match before it. Setting this without having set PCRE2_MULTILINE at
2507 in multiline mode) a newline immediately before it. Setting this with-
2509 match. This option affects only the behaviour of the dollar metacharac-
2545 called. If a non-zero starting offset is given, the check is applied
2546 only to that part of the subject that could be inspected during match-
2553 sequences \b and \B are one-character lookbehinds.
2559 validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the
2582 the caller is prepared to handle a partial match, but only if no com-
2587 PCRE2_ERROR_PARTIAL, without considering any other alternatives. In
2588 other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
2591 There is a more detailed discussion of partial and multi-segment match-
2597 When PCRE2 is built, a default newline convention is set; this is usu-
2602 pcre2pattern page. During matching, the newline choice affects the be-
2618 However, the pattern [\r\n]A does match that string, because it con-
2619 tains an explicit CR or LF reference, and so advances only by one char-
2625 not count, nor does \s, even though it includes CR and LF in the char-
2643 phrase "capturing subpattern" or "capturing group" is used for a frag-
2652 Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2656 pcre2_get_ovector_count() returns the number of pairs of values it con-
2659 Within the ovector, the first in each pair of values is set to the off-
2661 offset of the first code unit after the end of a substring. These val-
2663 are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit
2664 library, and 32-bit offsets in the 32-bit library.
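The pairing described in lines 2659-2664 (start offset, then the offset of the first code unit after the end) means a substring length is simply the difference of the two values. The following is a rough sketch for the 8-bit case, written without the library: SKETCH_UNSET and copy_group() are names made up here, the sentinel is assumed to mirror PCRE2_UNSET (all bits set in a size_t), and the offset array is filled in by hand rather than produced by a real match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET: assumed to be all bits set in a size_t. */
#define SKETCH_UNSET (~(size_t)0)

/* Copy the substring described by pair `n` of an ovector-style array
   into `out` as a zero-terminated string. Returns the substring length,
   or -1 if the group is unset, the pair is inverted, or `out` is too
   small to hold the substring plus its terminator. */
long copy_group(const char *subject, const size_t *ovector, size_t n,
                char *out, size_t out_size)
{
    assert(subject != NULL && ovector != NULL && out != NULL);
    size_t start = ovector[2 * n];      /* first code unit of substring */
    size_t end   = ovector[2 * n + 1];  /* first code unit after it */
    if (start == SKETCH_UNSET || end == SKETCH_UNSET || end < start)
        return -1;
    size_t len = end - start;
    if (len + 1 > out_size)
        return -1;
    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (long)len;
}
```

With offsets {3, 6} for pair 0 and an unset pair 1, copying pair 0 from "xyzdefghi" yields "def", while pair 1 is reported as unset instead of being read as an empty string.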
2672 the portion of the subject string that was matched by the entire pat-
2676 been captured, the returned value is 3. If there are no captured sub-
2699 2 is not. When this happens, both values in the offset pairs corre-
2705 are not matched. The return from the function is 2, because the high-
2706 est used capturing subpattern number is 1. The offsets for the sec-
2711 in the pattern are never changed. That is, if a pattern contains n cap-
2713 pcre2_match(). The other elements retain whatever values they previ-
2733 returns a pointer to the zero-terminated name, which is within the com-
2741 through the pattern. Instances of (*PRUNE) and (*THEN) without names
2754 Warning: By default, certain start-of-match optimizations are used to
2758 engine. This check fails for "bx", causing a match failure without see-
2759 ing any marks. You can disable the start-of-match optimizations by set-
2766 offset of the character at which the match started. For a non-partial
2779 If pcre2_match() fails, it returns a negative number. This can be con-
2780 verted to a text string by calling the pcre2_get_error_message() func-
2785 of UTF-specific negative error codes is returned. Details are given in
2800 PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
2807 a library of a different code unit width, for example, a pattern com-
2808 piled by the 8-bit library is passed to a 16-bit or 32-bit library
2849 using JIT is being matched, but the memory available for the just-in-
2850 time processing stack is not large enough. See the pcre2jit documenta-
2872 within the pattern. Specifically, it means that either the whole pat-
2875 might do this are detected and faulted at compile time, but more com-
2886 match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
2893 The returned message is terminated with a trailing zero, and the func-
2896 PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the mes-
2919 extracting captured substrings as new, separate, zero-terminated
2925 zero refers to the entire matched substring, with higher numbers refer-
2936 extracts a zero-length empty string.
2938 You can find the length in code units of a captured substring without
2945 The pcre2_substring_copy_bynumber() function copies a captured sub-
2948 function that was used for the match data block. The first two argu-
2988 pattern is (abc)|(def) and the subject is "def", and the ovector con-
2999 The pcre2_substring_list_get() function extracts all available sub-
3013 therefore need the lengths, you may supply NULL as the lengthsptr argu-
3015 function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
3022 This can be distinguished from a genuine zero-length substring by
3024 PCRE2_UNSET for unset substrings, or by calling pcre2_sub-
3044 To extract a substring by name, you first have to find associated num-
3051 the name by calling pcre2_substring_number_from_name(). The first argu-
3071 Warning: If the pattern uses the (?| feature to set up multiple subpat-
3092 given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
3100 pcre2_match(), except that the partial matching options are not permit-
3102 block is obtained and freed within this function, using memory manage-
3112 length, in code units, of the output buffer. If the function is suc-
3130 option is set, a dollar character is an escape character that can spec-
3140 brackets are required only if the following character would be inter-
3164 takes place in the original subject string (that is, previous replace-
3167 subject string. If an offset limit is set in the match context, search-
3171 the subject string by setting either or both of startoffset and an off-
3179 with zero length, an attempt to find a non-empty match at the same off-
3186 buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3188 continues to go through the motions of matching and substituting (with-
3189 out, of course, writing anything) in order to compute the size of buf-
3196 that the entire operation is carried out twice. Depending on the appli-
3198 the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
3215 replacement string. Without this option, only the dollar character is
3221 particular character codes, and backslash followed by any non-alphanu-
3228 current state: \U and \L change to upper or lower case forcing, respec-
3233 all inserted characters, including those from captured groups and let-
3236 Note that case forcing sequences such as \U...\E do not nest. For exam-
3244 ${<n>:-<string>}
3247 As before, <n> may be a group number or a name. The first form speci-
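The default-value form ${<n>:-<string>} can be approximated as follows. This is only a sketch in Python's re module, not PCRE2's pcre2_substitute(): the helper substitute_with_default is hypothetical, and it treats the default as a literal string, whereas PCRE2 expands it.

```python
import re

def substitute_with_default(pattern, subject, group, default):
    """Replace each match with the group's text, or default if it is unset.

    group may be a group number or a group name.
    """
    def repl(m):
        value = m.group(group)
        # None means the group did not take part in the match. An empty
        # string is a set group and is inserted unchanged.
        return default if value is None else value
    return re.sub(pattern, repl, subject)

# Group 1 is set for "abc" and unset for "def".
print(substitute_with_default(r"(abc)|def", "abc def", 1, "?"))
# The same thing, selecting the group by name.
print(substitute_with_default(r"(?P<w>abc)|def", "abc def", "w", "?"))
```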
3279 PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3282 PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3284 when the simple (non-extended) syntax is used and PCRE2_SUBSTI-
3294 PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP-
3295 MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI-
3336 point to the first and last entries in the name-to-number table for the
3349 The traditional matching function uses a similar algorithm to Perl,
3350 which stops when it finds the first match at a given point in the sub-
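The first-match behaviour described here can be illustrated with any backtracking matcher. The sketch below uses Python's re module as a stand-in, not PCRE2, and the brute-force helper all_matches_at is hypothetical. It contrasts the single match a backtracking engine reports with the full set of matches at one starting point, listed longest first.

```python
import re

def first_match(pattern, subject, pos=0):
    """Return the single match a backtracking engine reports at pos."""
    m = re.compile(pattern).match(subject, pos)
    return m.group(0) if m else None

def all_matches_at(pattern, subject, pos=0):
    """Hypothetical helper: every match starting at pos, longest first.

    Brute force: try each possible end offset and require the pattern
    to consume exactly subject[pos:end].
    """
    compiled = re.compile(pattern)
    found = []
    for end in range(len(subject), pos - 1, -1):
        if compiled.fullmatch(subject, pos, end):
            found.append(subject[pos:end])
    return found

print(first_match(r"cat|category|cater", "category"))     # cat
print(all_matches_at(r"cat|category|cater", "category"))  # ['category', 'cat']
```

The brute-force loop re-runs the matcher once per end offset, so it is only an illustration of the result, not of how a DFA matcher finds all matches in a single pass.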
3353 function (see below) instead. If you cannot use the alternative func-
3357 What you have to do is to insert a callout right at the end of the pat-
3358 tern. When your callout function is called, extract and save the cur-
3375 not backtrack. This has different characteristics to the normal algo-
3376 rithm, and is not compatible with Perl. Some of the features of PCRE2
3379 algorithms, and a list of features that pcre2_dfa_match() does not sup-
3384 is used in a different way, and this is described below. The other com-
3412 zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDAN-
3430 matches, but there is still at least one matching possibility. The por-
3433 more detailed discussion of partial and multi-segment matching, with
3439 stop as soon as it has found one match. Because of the way the alterna-
3455 When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3474 which is the number of matched substrings. The offsets of the sub-
3477 any capturing groups that may exist in the pattern, because DFA match-
3490 NOTE: PCRE2's "auto-possessification" optimization usually applies to
3553 Copyright (c) 1997-2018 University of Cambridge.
3554 ------------------------------------------------------------------------------
3562 PCRE2 - Perl-compatible regular expressions (revised API)
3567 the library in Unix-like environments using the applications known as
3572 systems. There is a lot more information about building PCRE2 without
3574 "by hand") in the text file called NON-AUTOTOOLS-BUILD. You should
3576 non-Unix-like environment.
3579 PCRE2 BUILD-TIME OPTIONS
3583 configure script, where the optional features are selected or dese-
3584 lected by providing options to configure before running the make com-
3585 mand. However, the same options can be selected in both Unix-like and
3586 non-Unix-like environments if you are using CMake instead of configure
3591 compiler, as described in NON-AUTOTOOLS-BUILD.
3597 ./configure --help
3600 names begin with --enable or --disable. Because of the way that config-
3601 ure works, --enable and --disable always come in pairs, so the comple-
3604 with --with. At the end of a configure run, a summary of the configura-
3608 BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
3610 By default, a library called libpcre2-8 is built, containing functions
3612 either as single-byte characters, or UTF-8 strings. You can also build
3613 two other libraries, called libpcre2-16 and libpcre2-32, which process
3614 strings that are contained in arrays of 16-bit and 32-bit code units,
3615 respectively. These can be interpreted either as single-unit characters
3616 or UTF-16/UTF-32 strings. To build these additional libraries, add one
3619 --enable-pcre2-16
3620 --enable-pcre2-32
3622 If you do not want the 8-bit library, add
3624 --disable-pcre2-8
3627 the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
3628 an 8-bit program. Neither of these is built if you select only the
3629 16-bit or 32-bit libraries.
3638 --disable-shared
3639 --disable-static
3647 strings. To build it without Unicode support, add
3649 --disable-unicode
3653 another without, in the same configuration.
3655 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
3656 UTF-16 or UTF-32. To do that, applications that use the library can set
3657 the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
3665 and Nd are supported. Details are given in the pcre2pattern documenta-
3677 mode, can cause unpredictable behaviour because it may leave the cur-
3678 rent matching point in the middle of a multi-code-unit character. The
3680 option when calling pcre2_compile(). There is also a build-time option
3682 --enable-never-backslash-C
3687 JUST-IN-TIME COMPILER SUPPORT
3689 Just-in-time (JIT) compiler support is included in the build by speci-
3692 --enable-jit
3698 --enable-jit=auto
3705 --enable-jit-sealloc
3712 --disable-pcre2grep-jit
3720 the end of a line. This is the normal newline character on Unix-like
3724 --enable-newline-is-cr
3726 to the configure command. There is also an --enable-newline-is-lf
3730 the two-character sequence CRLF (CR immediately followed by LF). If you
3733 --enable-newline-is-crlf
3737 --enable-newline-is-anycrlf
3742 --enable-newline-is-any
3745 newline sequences are the three just mentioned, plus the single charac-
3750 --enable-newline-is-nul
3752 which causes NUL (binary zero) to be set as the default line-ending
3766 --enable-bsr-anycrlf
3768 the default is changed so that \R matches only CR, LF, or CRLF. What-
3776 part to another (for example, from an opening parenthesis to an alter-
3777 nation metacharacter). By default, in the 8-bit and 16-bit libraries,
3778 two-byte values are used for these offsets, leading to a maximum size
3779 for a compiled pattern of around 64 thousand code units. This is suffi-
3782 compile PCRE2 to use three-byte or four-byte offsets by adding a set-
3785 --with-link-size=3
3788 16-bit library, a value of 3 is rounded up to 4. In these libraries,
3790 to load additional data when handling them. For the 32-bit library the
3791 value is always 4 and cannot be overridden; the value of --with-link-
3804 --with-match-limit=500000
3810 The pcre2_match() function starts out using a 20KiB vector on the sys-
3819 --with-heap-limit=500
3829 for --with-match-limit. You can set a lower default limit by adding,
3832 --with-match-limit-depth=10000
3844 for lookaround assertions, atomic groups, and recursion within pat-
3855 --enable-rebuild-chartables
3860 C run-time system. This method of replacing the tables does not work if
3871 compiled to run in an 8-bit EBCDIC environment by adding
3873 --enable-ebcdic --disable-unicode
3875 to the configure command. This setting implies --enable-rebuild-charta-
3879 It is not possible to support both EBCDIC and UTF-8 codes in the same
3880 version of the library. Consequently, --enable-unicode and --enable-
3887 --enable-ebcdic-nl25
3889 as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
3891 0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
3894 The options that select newline behaviour, such as --enable-newline-is-
3895 cr, and equivalent run-time options, refer to these character values in
3901 By default, on non-Windows systems, pcre2grep supports the use of call-
3904 This support can be disabled by adding --disable-pcre2grep-callout to
3914 --enable-pcre2grep-libz
3915 --enable-pcre2grep-libbz2
3917 to the configure command. These options naturally require that the rel-
3929 be processable is the notional buffer size. If a longer line is encoun-
3935 --with-pcre2grep-bufsize=51200
3936 --with-pcre2grep-max-bufsize=2097152
3939 values by using --buffer-size and --max-buffer-size on the command
3947 --enable-pcre2test-libreadline
3948 --enable-pcre2test-libedit
3952 it reads it using the readline() function. This provides line-editing
3953 and history facilities. Note that libreadline is GPL-licensed, so if
3958 Setting --enable-pcre2test-libreadline causes the -lreadline option to
3960 system-installed readline library this is sufficient. However, in some
3972 LIBS="-lncurses"
3981 --enable-debug
3991 --enable-valgrind
4005 --enable-coverage
4017 When --enable-coverage is used, the following additional targets are
4023 equivalent to running "make coverage-reset", "make coverage-baseline",
4024 "make check", and then "make coverage-report".
4026 make coverage-reset
4030 make coverage-baseline
4034 make coverage-report
4038 make coverage-clean-report
4040 This removes the generated coverage report without cleaning the cover-
4043 make coverage-clean-data
4045 This removes the captured coverage data without removing the coverage
4048 make coverage-clean
4051 For more information about code coverage, see the gcov and lcov docu-
4060 --enable-fuzz-support
4062 At present this applies only to the 8-bit library. If set, it causes an
4063 extra library called libpcre2-fuzzsupport.a to be built, but not
4064 installed. This contains a single function called LLVMFuzzerTestOneIn-
4071 Setting --enable-fuzz-support also causes a binary called pcre2fuz-
4086 --disable-stack-for-recursion
4095 pcre2api(3), pcre2-config(3).
4108 Copyright (c) 1997-2018 University of Cambridge.
4109 ------------------------------------------------------------------------------
4117 PCRE2 - Perl-compatible regular expressions (revised API)
4132 PCRE2 provides a feature called "callout", which is a means of tempo-
4143 ending delimiter is the same as the start, except for {, where the end-
4163 A(\d{2}|--)
4167 (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4170 alternation bar. If the pattern contains a conditional group whose con-
4184 information when you are trying to optimize the performance of a par-
4194 Auto-possessification
4196 At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4202 --->aaaa
4210 the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
4214 --->aaaa
4231 beginning of the subject, and pcre2_compile() remembers this. If a pat-
4232 tern has more than one top-level branch, automatic anchoring occurs if
4237 It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
4243 --->aa
4250 This shows that all match attempts start at the beginning of the sub-
4253 starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
4256 --->aa
4266 This shows more match attempts, starting at the second subject charac-
4283 string, and will immediately give a "no match" return without actually
4287 You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4297 to normal, DFA, and JIT matching. The first argument to the call-
4323 version 1, and the callout_flags field for version 2. If you are writ-
4332 contains the number of the callout, in the range 0-255. This is the
4339 callout_string points to the string that is contained within the com-
4347 delimiter as callout_string[-1] if you need it.
4371 The capture_last field contains the number of the most recently cap-
4373 number of the highest numbered captured substring so far. If no sub-
4379 The contents of ovector[2] to ovector[<capture_top>*2-1] can be
4388 was passed to the matching function in the match data block for call-
4414 parenthesis, the length includes meta characters that follow the paren-
4417 the length is one, unless a closing parenthesis is followed by a quan-
4426 are used by pcre2test to show the next item to be matched when display-
4430 zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
4432 Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
4437 pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
4452 starting position in the subject. Output from pcre2test does not indi-
4456 The information in the callout_flags field is provided so that applica-
4460 because there is no backtracking in DFA matching, and there is no sup-
4493 which they appear. Its first argument is a pointer to a callout enumer-
4495 passed to pcre2_callout_enumerate(). The data block contains the fol-
4513 non-zero minimum or a fixed maximum, the group is replicated inside the
4519 The callback function should normally return zero. If it returns a non-
4534 Copyright (c) 1997-2018 University of Cambridge.
4535 ------------------------------------------------------------------------------
4543 PCRE2 - Perl-compatible regular expressions (revised API)
4545 DIFFERENCES BETWEEN PCRE2 AND PERL
4547 This document describes the differences in the ways that PCRE2 and Perl
4549 respect to Perl versions 5.26, but as both Perl and PCRE2 are continu-
4552 1. PCRE2 has only a subset of Perl's Unicode support. Details of what
4555 2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
4559 PCRE2 optimizes this to run the assertion just once). Perl allows some
4563 3. Capturing subpatterns that occur inside negative lookaround asser-
4568 4. The following Perl escape sequences are not supported: \F, \l, \L,
4569 \u, \U, and \N when followed by a character name. \N on its own, match-
4570 ing a non-newline character, and \N{U+dd..}, matching a Unicode code
4572 letters are implemented by Perl's general string-handling and are not
4577 5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
4582 which Perl does not; the Perl documentation says "Because Perl hides
4583 the need for the user to understand the internal representation of Uni-
4584 code characters, there is no need to implement the somewhat messy con-
4589 from Perl in that $ and @ are also handled as literals inside the
4590 quotes. In Perl, they cause variable interpolation (but of course PCRE2
4591 does not have variables). Also, Perl does "double-quotish backslash
4592 interpolation" on any backslashes between \Q and \E which, its documen-
4597 Pattern PCRE2 matches Perl matches
4616 and backtracking into subroutine calls is now supported, as in Perl.
4620 effect is confined to that subpattern; it does not extend to the sur-
4621 rounding pattern. This is not always the case in Perl. In particular,
4630 in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
4638 matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
4641 13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
4642 pattern names is not as general as Perl's. This is a consequence of the
4648 distinguish which parentheses matched, because both names map to cap-
4652 14. Perl used to recognize comments in some places that PCRE2 does not,
4654 /x modifier is set, Perl allowed white space between ( and ? though the
4656 may still be some cases where Perl behaves differently.
4658 15. Perl, when in warning mode, gives warnings for character classes
4659 such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
4664 not affected when case-independent matching is specified. For example,
4665 \p{Lu} always matches an upper case letter. I think Perl has changed in
4670 17. PCRE2 provides some extensions to the Perl regular expression
4671 facilities. Perl 5.10 includes new features that are not in earlier
4672 versions of Perl, some of which (such as named parentheses) were in
4673 PCRE2 for some time before. This list is with respect to Perl 5.26:
4677 different length of string. Perl requires them all to have the same
4680 (b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
4681 ported in lookbehinds, provided that there is no possibility of refer-
4682 encing a non-unique number or name. Perl does not support backrefer-
4686 $ meta-character matches only at the very end of the string.
4689 faulted. (Perl can be made to issue a warning.)
4691 (e) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
4692 fiers is inverted, that is, by default they are not greedy, but if fol-
4699 PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
4704 (i) The callout facility is PCRE2-specific. Perl supports codeblocks
4707 (j) The partial matching facility is PCRE2-specific.
4710 different way and is not Perl-compatible.
4716 18. The Perl /a modifier restricts \d numbers to pure ASCII, and the
4717 /aa modifier restricts /i case-insensitive matching to pure ASCII,
4721 19. Perl has different limits than PCRE2. See the pcre2limit documenta-
4722 tion for details. Perl went with 5.10 from recursion to iteration keep-
4724 not fall into any stack-overflow limit. PCRE2 made a similar change at
4725 release 10.30, and also has many build-time and run-time customizable
4739 Copyright (c) 1997-2018 University of Cambridge.
4740 ------------------------------------------------------------------------------
4748 PCRE2 - Perl-compatible regular expressions (revised API)
4750 PCRE2 JUST-IN-TIME COMPILER SUPPORT
4752 Just-in-time compiling is a heavyweight optimization that can greatly
4753 speed up pattern matching. However, it comes at the cost of extra pro-
4755 the same pattern is going to be matched many times. This does not nec-
4757 anchored, matching attempts may take place many times at various posi-
4759 string is very long, it may still pay to use JIT even for one-off
4760 matches. JIT support is available for all of the 8-bit, 16-bit and
4761 32-bit PCRE2 libraries.
4763 JIT support applies only to the traditional Perl-compatible matching
4771 --enable-jit (or equivalent CMake option) must be set when PCRE2 is
4775 ARM 32-bit (v5, v7, and Thumb2)
4776 ARM 64-bit
4777 Intel x86 32-bit and 64-bit
4778 MIPS 32-bit and 64-bit
4779 Power PC 32-bit and 64-bit
4780 SPARC 32-bit
4782 If --enable-jit is set on an unsupported platform, compilation fails.
4784 A program can tell if JIT support is available by calling pcre2_con-
4788 falls back to the interpretive code if JIT is not available. For pro-
4790 path" API that is JIT-specific.
4799 second is zero or more of the following option bits: PCRE2_JIT_COM-
4810 the size of machine stack that it uses. The exact rules are not docu-
4815 PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com-
4816 plete matches. If you want to run partial matches using the PCRE2_PAR-
4821 pcre2_match() is called, the appropriate code is run if it is avail-
4826 the option bits. For example, you can call it once with PCRE2_JIT_COM-
4829 will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
4830 ing. If pcre2_jit_compile() is called with no option bits set, it imme-
4847 stack" below, even if you do not need to supply a non-default JIT
4849 be obeyed. If the match-time options are not right for JIT execution,
4852 If the JIT compiler finds an unsupported item, no JIT data is gener-
4855 option. A non-zero result means that JIT compilation was successful. A
4872 when running in a UTF mode, and a callout immediately before an asser-
4881 that the memory used for the JIT stack was insufficient. See "Control-
4901 The pcre2_jit_stack_create() function creates a JIT stack. Its argu-
4907 function returns immediately, without doing anything. (For the techni-
4919 The first argument is a pointer to a match context. When this is subse-
4921 JIT stack is used. If this argument is NULL, the function returns imme-
4922 diately, without doing anything. There are three cases for the values
4941 is not obeyed when pcre2_match() is called with options that are incom-
4949 up non-sequential matches in one thread is to use callouts: if a call-
4954 you assign or pass back NULL from a callback, that is thread-safe,
4956 or pass back a non-NULL JIT stack, this must be a different stack for
4957 each thread so that the application is thread-safe.
4959 Strictly speaking, even more is allowed. You can assign the same non-
4968 up non-default JIT stacks might operate:
4976 Use a one-line callback function
4987 PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
4989 child nodes. Allocating real machine stack on some platforms is diffi-
4997 address space instead of allocating memory. We can safely allocate mem-
4998 ory pages inside this address space, so the stack could grow without
5019 You can free compiled patterns, contexts, and stacks in any order, any-
5032 this without keeping a list of patterns.
5038 Especially on embedded systems, it might be a good idea to release mem-
5039 ory sometimes without freeing the stack. There is no API for this at
5041 allocated memory for any stack and another which allows releasing mem-
5055 The JIT executable allocator does not free all memory when it is possi-
5059 calling pcre2_jit_free_unused_memory(). Its argument is a general con-
5060 text, for custom memory management, or NULL for standard memory manage-
5066 This is a single-threaded example that specifies a JIT stack without
5111 number of other sanity checks are performed on the arguments. For exam-
5136 Copyright (c) 1997-2018 University of Cambridge.
5137 ------------------------------------------------------------------------------
5145 PCRE2 - Perl-compatible regular expressions (revised API)
5153 code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5157 (when building the 16-bit library, 3 is rounded up to 4). See the
5159 for details. In these cases the limit is substantially larger. How-
5160 ever, the speed of execution is slower. In the 32-bit library, the
5170 (that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-
5190 (*THEN) verb is 255 code units for the 8-bit library and 65535 code
5191 units for the 16-bit and 32-bit libraries.
5194 number a 32-bit unsigned integer can hold.
5207 Copyright (c) 1997-2017 University of Cambridge.
5208 ------------------------------------------------------------------------------
5216 PCRE2 - Perl-compatible regular expressions (revised API)
5223 pcre2_match() function. This works in the same way as Perl's matching
5224 function, and provides a Perl-compatible matching operation. The just-
5225 in-time (JIT) optimization that is described in the pcre2jit documenta-
5229 it operates in a different way, and is not Perl-compatible. This alter-
5250 The set of strings that are matched by a regular expression can be rep-
5255 tree: depth-first and breadth-first, and these correspond to the two
5261 In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
5263 depth-first search of the pattern tree. That is, it proceeds along a
5265 required. When there is a mismatch, the algorithm tries any alterna-
5274 that point the algorithm stops. Thus, if there is more than one possi-
5280 Because it ends up with a single path through the tree, it is rela-
5281 tively straightforward for this algorithm to keep track of the sub-
5288 This algorithm conducts a breadth-first search of the tree. Starting
5297 scans the subject string only once, without backtracking, there is one
5306 this algorithm finds all of them, and in particular, it finds the long-
5308 an option to stop the algorithm after the first match (which is neces-
5318 the fifth character of the subject. The algorithm does not automati-
5321 PCRE2's "auto-possessification" optimization usually applies to charac-
5322 ter repeats at the end of a pattern (as well as internally). For exam-
5327 either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
5331 not supported by the alternative matching algorithm. They are as fol-
5336 may affect auto-possessification, as just described). During matching,
5345 a non-possessive quantifier. Similarly, if an atomic group is present,
5353 algorithm does not attempt to do this. This means that no captured sub-
5356 3. Because no substrings are captured, backreferences within the pat-
5359 4. For the same reason, conditional expressions that use a backrefer-
5373 these modes, because the alternative algorithm moves through the sub-
5384 Using the alternative matching algorithm provides the following advan-
5387 1. All possible matches (at a single point in the subject) are automat-
5393 once, and never needs to backtrack (except for lookbehinds), it is pos-
5396 also possible to do multi-segment matching using the standard algo-
5397 rithm, by retaining partially matched substrings, it is more compli-
5399 and discusses multi-segment matching.
5426 Copyright (c) 1997-2014 University of Cambridge.
5427 ------------------------------------------------------------------------------
5435 PCRE2 - Perl-compatible regular expressions
5441 the entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
5454 reflecting the character that has been typed, for example. This immedi-
5467 If you want to use partial matching with just-in-time optimized code,
5473 PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
5483 shorter strings. This optimization is also disabled for partial match-
5490 the subject string is reached successfully, but matching cannot con-
5491 tinue because more characters are needed. However, at least one charac-
5495 of a matched string. The requirement for inspecting at least one char-
5496 acter exists because an empty string can always be matched; without
5502 the rest of the ovector are undefined. The appearance of \K in the pat-
5511 string "abc12", because all these characters are needed for a subse-
5512 quent re-match with additional characters.
5520 match, the partial match is remembered, but matching continues as nor-
5525 This option is "soft" because it prefers a complete match over a par-
5529 of the subject is treated as a non-alphanumeric.
5536 If this is matched against the subject string "abc123dog", both alter-
5546 returned as soon as a partial match is found, without continuing to
5557 The difference between the two partial matching options can be illus-
5565 However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
5586 without backtracking, searching for all possible matches simultane-
5587 ously. If the end of the subject is reached before the end of the pat-
5600 behaviour is different from the standard functions when PCRE2_PAR-
5614 boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-
5649 matched substrings. The remaining four strings do not match the com-
5657 MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
5661 and calling the function again with the same compiled regular expres-
5663 same working space as before, because this is where details of the pre-
5672 The first call has "23ja" as the subject, and requests partial match-
5675 last part is shown; PCRE2 does not retain the previously partially-
5685 this may or may not be what you want. The only way to allow for start-
5695 MULTI-SEGMENT MATCHING WITH pcre2_match()
5700 re-run, starting from the point where the partial match occurred. Ear-
5719 ISSUES WITH MULTI-SEGMENT MATCHING
5721 Certain types of pattern may give problems with multi-segment matching,
5727 option, but in practice when doing multi-segment matching you should be
5730 2. If a pattern contains a lookbehind assertion, characters that pre-
5742 retained. In a non-UTF or a 32-bit situation, moving back is just a
5743 subtraction, but in UTF-8 or UTF-16 you have to count characters while
5752 the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
5784 been found, continuation to a new subject segment is no longer possi-
5809 matching multi-segment data. The example above then behaves differ-
5849 re-running the entire match can also be used with the DFA matching
5866 Copyright (c) 1997-2014 University of Cambridge.
5867 ------------------------------------------------------------------------------
5875 PCRE2 - Perl-compatible regular expressions (revised API)
5880 by PCRE2 are described in detail below. There is a quick-reference syn-
5881 tax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax
5882 and semantics as closely as it can. PCRE2 also supports some alterna-
5883 tive regular expression syntax (which does not conflict with the Perl
5887 Perl's regular expressions are described in its own documentation, and
5897 different algorithm that is not Perl-compatible. Some of the features
5898 discussed below are not available when DFA matching is used. The advan-
5903 SPECIAL START-OF-PATTERN ITEMS
5906 set by special items at the start of a pattern. These are not Perl-com-
5908 writers who are not able to change the program that processes the pat-
5915 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
5916 as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
5917 can be specified for the 32-bit library, in which case it constrains
5928 restrict them to non-UTF data for security reasons. If the
5936 causes sequences such as \d and \w to use Unicode properties to deter-
5949 to whichever matching function is subsequently called to match the pat-
5953 Disabling auto-possessification
5961 Disabling start-up optimizations
5964 setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
5971 as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
5972 tions that apply to patterns whose top-level branches all start with .*
5992 These facilities are provided to catch runaway matches that are pro-
5993 voked by patterns with huge matching trees (a typical example is a pat-
6003 where d is any number of decimal digits. However, the value of the set-
6026 strings: a single CR (carriage return) character, a single LF (line-
6027 feed) character, the two-character sequence CRLF, any of the three pre-
6032 It is also possible to specify a newline convention by starting a pat-
6042 These override the default and the options given to the compiling func-
6052 The newline convention affects where the circumflex and dollar asser-
6053 tions are true. It also affects the interpretation of the dot metachar-
6057 sequence, for Perl compatibility. However, this can be changed; see the
6067 starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
6074 character code instead of ASCII or Unicode (typically a mainframe sys-
6075 tem). In the sections below, character code values are ASCII or Uni-
6098 There are two different sets of metacharacters: those that are recog-
6124 - indicates character range
6142 always safe to precede a non-alphanumeric with backslash to specify
6143 that it stands for itself. In particular, if you want to match a back-
6156 If you want to remove the special meaning from a sequence of charac-
6157 ters, you can do so by putting them between \Q and \E. This is differ-
6158 ent from Perl in that $ and @ are handled as literals in \Q...\E
6159 sequences in PCRE2, whereas in Perl, $ and @ cause variable interpola-
6160 tion. Also, Perl does "double-quotish backslash interpolation" on any
6165 Pattern PCRE2 matches Perl matches
6182 Non-printing characters
6184 A second use of backslash provides a way of encoding non-printing char-
6186 appearance of non-printing characters in a pattern, but when a pattern
6192 \cx "control-x", where x is any printable ASCII character
6207 option is set, that is, when PCRE2 is operating in a Unicode mode. Perl
6218 32 or greater than 126, a compile-time error occurs.
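The effect of \cx on ASCII characters can be worked through in a few lines. The sketch below follows the rule given in the PCRE2 pattern documentation (a lower case letter is converted to upper case, then bit 6, hex 40, is inverted). The function name control_char is hypothetical, and the code computes the resulting value itself because Python's re module has no \c escape.

```python
def control_char(x):
    """Code point produced by \\cx in an ASCII environment (a sketch)."""
    code = ord(x)
    if code < 32 or code > 126:
        # PCRE2 gives a compile-time error here; this sketch raises instead.
        raise ValueError("\\c must be followed by a printable ASCII character")
    if "a" <= x <= "z":
        code = ord(x.upper())   # lower case letters behave like upper case
    return code ^ 0x40          # invert bit 6

for ch in "Az{;?":
    print(repr(ch), hex(control_char(ch)))
```

Because only one bit is flipped, characters at or above hex 40 map down into the control range, while characters below hex 40 map up: \c; gives hex 7B and \c? gives hex 7F.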
6222 The \c escape is processed as specified for Perl in the perlebcdic doc-
6223 ument. The only characters that are allowed after \c are A-Z, a-z, or
6224 one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
6226 letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
6227 \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
6236 but because 127 is not a control character in EBCDIC, Perl makes it
6239 FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
6240 certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6251 recent addition to Perl; it provides a way of specifying character code
6256 a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
6257 cal character code points, and \g{} to specify backreferences. The fol-
6260 The handling of a backslash followed by a digit other than 0 is compli-
6261 cated, and Perl has changed over time, causing PCRE2 also to change.
6263 Outside a character class, PCRE2 reads the digit and any following dig-
6267 backreference. A description of how this works is given later, follow-
6271 Inside a character class, PCRE2 handles \8 and \9 as the literal char-
6272 acters "8" and "9", and otherwise reads up to three octal digits fol-
6273 lowing the backslash, using them to generate a data character. Any sub-
6295 By default, after \x that is not followed by {, from zero to two hexa-
6297 number of hexadecimal digits may appear between \x{ and }. If a charac-
6302 just described only when it is followed by two hexadecimal digits. Oth-
6305 by four hexadecimal digits; otherwise it matches a literal "u" charac-
6309 two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
6318 8-bit non-UTF mode no greater than 0xff
6319 16-bit non-UTF mode no greater than 0xffff
6320 32-bit non-UTF mode no greater than 0xffffffff
6324 (the so-called "surrogate" code points). The check for these can be
6327 UTF-8 and UTF-32 modes, because these values are not representable in
6328 UTF-16.
6343 In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
6358 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
6361 Details are discussed later. Note that \g{...} (Perl syntax) and
6362 \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
6379 \W any "non-word" character
6384 has a different meaning. See the section entitled "Non-printing charac-
6385 ters" above for details. Perl also uses \N{name} to specify characters
6388 Each pair of lower and upper case escape sequences partitions the com-
6398 locale. This list may vary if locale-specific matching is taking place.
6399 For example, in some locales the "non-breaking space" character (\xA0)
6403 or digit. By default, the definition of letters and digits is con-
6404 trolled by PCRE2's low-valued character tables, and may vary if locale-
6406 page). For example, in a French locale such as "fr_FR" in Unix-like
6413 be different for characters in the range 128-255 when locale-specific
6415 meanings from before Unicode support was available, mainly for effi-
6437 U+00A0 Non-break space
6444 U+2004 Three-per-em space
6445 U+2005 Four-per-em space
6446 U+2006 Six-per-em space
6451 U+202F Narrow no-break space
6465 In 8-bit, non-UTF-8 mode, only the characters with code points less
6471 any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
6477 below. This particular group matches either the two-character sequence
6479 U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
6481 atomic group, the two-character sequence is treated as a single unit
6485 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
6491 PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for "back-
6493 the case, the other behaviour can be requested via the PCRE2_BSR_UNI-
6500 These override the default and the options given to the compiling func-
6501 tion. Note that these special settings, which are not Perl-compatible,
6515 When PCRE2 is built with Unicode support (the default), three addi-
6517 are available. In 8-bit non-UTF-8 mode, these sequences are of course
6519 they do work in this mode. In 32-bit non-UTF mode, code points greater
6525 \P{xx} a character without the xx property
6531 (described in the next section). Other Perl properties such as "InMu-
6545 Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
6547 Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
6553 Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
6555 Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
6559 Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
6560 ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
6563 Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
6566 Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
6569 Each character has exactly one Unicode general category property, spec-
6570 ified by a two-letter abbreviation. For compatibility with Perl, nega-
6575 If only one letter is specified with \p or \P, it includes all the gen-
6602 Mn Non-spacing mark
6637 page). Perl does not support the Cs property.
6639 The long synonyms for property names that Perl supports (such as
6643 No character that is in the Unicode table has the Cn (unassigned) prop-
6649 different from the behaviour of current versions of Perl.
6652 to do a multistage table lookup in order to find a character's prop-
6667 properties that had been used for emojis. Instead it introduced vari-
6668 ous emoji-specific properties. PCRE2 uses only the Extended Picto-
6677 2. Do not end between CR and LF; otherwise end after any control char-
6687 "zero-width joiner" character. Characters with the "mark" property
6694 property. Extend and ZWJ characters are allowed between the charac-
6705 As well as the standard Unicode properties described above, PCRE2 sup-
6708 non-standard, non-Perl properties internally when PCRE2_UCP is set.
6713 Xsp Any Perl space character
6714 Xwd Any Perl "word" character
6716 Xan matches characters that have either the L (letter) or the N (num-
6720 exclude vertical tab, for Perl compatibility, but Perl changed. Xwd
6723 There is another non-standard property, Xuc, which matches any charac-
6730 Note that the Xuc property does not match these sequences but the char-
6747 mode), though it again reports the matched string as "bar". This fea-
6751 does not interfere with the setting of captured substrings. For exam-
6758 Perl documents that the use of \K within assertions is "not well
6762 be greater than the end of the match. Using \K in a lookbehind asser-
6763 tion at the start of a pattern can also lead to odd effects. For exam-
6775 The final use of backslash is for certain simple assertions. An asser-
6777 a match, without consuming any characters from the subject string. The
6799 PCRE2 nor Perl has a separate "start of word" or "end of word" metase-
6806 set. Thus, they are independent of multiline mode. These three asser-
6808 which affect only the behaviour of the circumflex and dollar metachar-
6809 acters. However, if the startoffset argument of pcre2_match() is non-
6816 the start point of the matching process, as specified by the startoff-
6818 startoffset is non-zero. By calling pcre2_match() multiple times with
6819 appropriate arguments, you can mimic Perl's /g option, and it is in
6824 Perl's, which defines it as true at the end of the previous match. In
6825 Perl, these can be different when the previously matched string was
6836 The circumflex and dollar metacharacters are zero-width assertions.
6837 That is, they test for a particular condition being true without con-
6839 are concerned with matching the starts and ends of lines. If the new-
6840 line convention is set so that only the two-character sequence CRLF is
6846 point is at the start of the subject string. If the startoffset argu-
6847 ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
6856 if the pattern is constrained to match only at the start of the sub-
6864 newline. Dollar need not be the last character of the pattern if a num-
6866 branch in which it appears. Dollar has no special meaning in a charac-
6878 a newline that ends the string, for compatibility with Perl. However,
6886 pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
6889 When the newline convention (see "Newline conventions" below) recog-
6890 nizes the two-character sequence CRLF as a newline, this is preferred,
6891 even if the single characters CR and LF are also recognized as new-
6905 Outside a character class, a dot in the pattern matches any one charac-
6906 ter in the subject string except (by default) a character that signi-
6910 that character; when the two-character sequence CRLF is used, dot does
6912 matches all characters (including isolated CRs and LFs). When any Uni-
6917 PCRE2_DOTALL option is set, a dot matches any one character, without
6918 exception. If the two-character sequence CRLF is present in the sub-
6921 The handling of dot is entirely independent of the handling of circum-
6931 the section entitled "Non-printing characters" above for details. Perl
6939 unit, whether or not a UTF mode is set. In the 8-bit library, one code
6940 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
6941 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
6942 line-ending characters. The feature is provided in Perl in order to
6943 match individual bytes in UTF-8 mode, but it is unclear how it can use-
6947 one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
6949 results, because PCRE2 assumes that it is matching character by charac-
6959 below) in UTF-8 or UTF-16 modes, because this would make it impossible
6962 these UTF modes. The former gives a match-time error; the latter fails
6965 In the 32-bit library, however, \C is always supported (when not
6967 whether or not UTF-32 is specified.
6970 using it that avoids the problem of malformed UTF-8 or UTF-16 charac-
6972 as in this pattern, which could be used with a UTF-8 string (ignore
6975 (?| (?=[\x00-\x7f])(\C) |
6976 (?=[\x80-\x{7ff}])(\C)(\C) |
6977 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
6978 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
6981 parentheses numbers in each alternative (see "Duplicate Subpattern Num-
6983 UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes,
6991 closing square bracket. A closing square bracket on its own is not spe-
7010 class that starts with a circumflex is not an assertion; it still con-
7016 letters in a class represent both their upper case and lower case ver-
7022 special way when matching character classes, whatever line-ending
7037 sequences, they cause an error. The same is true for \N when not fol-
7040 The minus (hyphen) character can be used to specify a range of charac-
7041 ters in a character class. For example, [d-m] matches any letter
7046 example, [b-d-z] matches letters in the range b to d, a hyphen charac-
7049 Perl treats a hyphen as a literal if it appears before or after a POSIX
7052 class, Perl outputs a warning in its warning mode, as this is most
7056 It is not possible to have the literal character "]" as the end charac-
7057 ter of a range. A pattern such as [W-]46] is interpreted as a class of
7058 two characters ("W" and "-") followed by a literal string "46]", so it
7059 would match "W46]" or "-46]". However, if the "]" is escaped with a
7060 backslash it is interpreted as the end of range, so [W-\]46] is inter-
7065 Ranges normally include all code points between the start and end char-
7067 numerically, for example [\000-\037]. Ranges can include any characters
7068 that are valid for the current mode. In any UTF mode, the so-called
7071 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How-
7072 ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7076 points are both specified as literal letters in the same case. For com-
7077 patibility with Perl, EBCDIC code points within the range that are not
7078 letters are omitted. For example, [h-k] matches only four characters,
7081 [\x88-\x92] or [h-\x92], all code points are included.
7084 it matches the letters in either case. For example, [W-c] is equivalent
7085 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
7086 character tables for a French locale are in use, [\xc8-\xcb] matches
7100 special compatibility feature - see the next two sections), and the
7101 terminating closing square bracket. However, escaping other non-
7107 Perl supports the POSIX notation for character classes. This uses names
7118 ascii character codes 0 - 127
7132 CR (13), and space (32). If locale-specific matching is taking place,
7136 The name "word" is a Perl extension, and "blank" is a GNU extension
7137 from Perl 5.8. Another Perl extension is negation, which is indicated
7142 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7147 the POSIX character classes, although this may be different for charac-
7148 ters in the range 128-255 when locale-specific matching is happening.
7168 when printed. In Unicode property terms, it matches all char-
7173 U+2066 - U+2069 Various "isolate"s
7180 [:punct:] This matches all characters that have the Unicode P (punctua-
7199 support is not compatible with Perl. It is provided to help migrations
7201 that \b matches at the start and the end of a word (see "Simple asser-
7202 tions" above), and in a Perl-style pattern the preceding or following
7203 character normally shows which is wanted, without the need for the
7204 assertions that are used above in order to give exactly the POSIX be-
7228 enclosed between "(?" and ")". These options are Perl-compatible, and
7229 are described in detail in the pcre2api documentation. The option let-
7239 For example, (?im) sets caseless, multiline matching. It is also possi-
7241 hyphen, for example (?-im). The two "extended" options are not indepen-
7244 A combined setting and unsetting such as (?im-sx), which sets
7248 the option is unset. An empty options setting "(?)" is allowed. Need-
7252 the above options to be unset. Thus, (?^) is equivalent to (?-imnsx).
7253 Letters may follow the circumflex to cause some options to be re-
7256 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
7257 changed in the same way as the Perl-compatible options by using the
7269 not used). By this means, options can be made to have different set-
7270 tings in different parts of the pattern. Any changes made in one alter-
7282 start of a non-capturing subpattern (see the next section), the option
7290 Note: There are other PCRE2-specific options that can be set by the
7291 application when the compiling function is called. The pattern can con-
7311 matches "cataract", "caterpillar", or "cat". Without the parentheses,
7327 the captured substrings are "red king", "red", and "king", and are num-
7332 without a capturing requirement. If an opening parenthesis is followed
7333 by a question mark and a colon, the subpattern does not do any captur-
7344 start of a non-capturing subpattern, the option letters may appear
7359 Perl 5.10 introduced a feature whereby each alternative in a subpattern
7361 starts with (?| and is itself a non-capturing subpattern. For example,
7366 Because the two alternatives are inside a (?| group, both sets of cap-
7370 not all, of one of a number of alternatives. Inside a (?| group, paren-
7373 subpattern start after the highest number used in any branch. The fol-
7374 lowing example is taken from the Perl documentation. The numbers under-
7377 # before ---------------branch-reset----------- after
7393 A relative reference such as (?-1) is no different: it is just a conve-
7396 If a condition test for a subpattern's having matched refers to a non-
7397 unique number, the test is true if any of the subpatterns of that num-
7407 very hard to keep track of the numbers in complicated patterns. Fur-
7409 with this difficulty, PCRE2 supports the naming of capturing subpat-
7410 terns. This feature was not added to Perl until release 5.10. Python
7412 the Python syntax. PCRE2 supports both the Perl and the Python syntax.
7415 (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
7417 must start with a non-digit. References to capturing parentheses from
7418 other parts of the pattern, such as backreferences, recursion, and con-
7422 exactly as if the names were not present. In both PCRE2 and Perl, cap-
7425 for extracting the complete name-to-number translation table from a
7426 compiled pattern, as well as convenience functions for extracting cap-
7431 to all of them. Perl allows identically numbered subpatterns to have
7437 Perl allows this, with both names AA and BB as aliases of group 1.
7442 number to be associated with more than one name. The example above pro-
7443 vokes a compile-time error. However, there is still scope for confu-
7453 By default, a name must be unique within a pattern, except that dupli-
7459 The duplicate name constraint can be disabled by setting the PCRE2_DUP-
7463 a weekday, either as a 3-letter abbreviation or as the full name, and
7478 problem is to use a "branch reset" subpattern, as described in the pre-
7481 If you make a backreference to a non-unique named subpattern from else-
7484 first one that is set is used for the reference. For example, this pat-
7490 If you make a subroutine call to a non-unique named subpattern, the one
7519 The general repetition quantifier specifies a minimum and maximum num-
7540 the syntax of a quantifier, is taken as a literal character. For exam-
7545 of which is represented by a two-byte sequence in a UTF-8 string. Simi-
7551 the previous item and the quantifier were not present. This may be use-
7557 For convenience, the three most common quantifiers have single-charac-
7570 Earlier versions of Perl and PCRE1 used to give an error at compile
7573 subpattern does in fact match no characters, the loop is forcibly bro-
7577 as possible (up to the maximum number of permitted times), without
7594 and instead matches the minimum number of times possible, so the pat-
7611 Perl), the quantifiers are not greedy by default, but individual ones
7621 (equivalent to Perl's /s) is set, thus allowing the dot to match new-
7628 In cases where it is known that the subject string contains no new-
7629 lines, it is worth setting PCRE2_DOTALL in order to obtain this opti-
7639 If the subject is "xyz123abc123" the match point is the fourth charac-
7642 Another case where implicit anchoring is not applied is when the lead-
7648 It matches "ab" in the subject "aab". The use of the backtracking con-
7652 When a capturing subpattern is repeated, the value captured is the sub-
7659 the corresponding captured values may have been set in previous itera-
7671 to be re-evaluated to see if a different number of repeats allows the
7687 to be re-evaluated in this way.
7695 This kind of parenthesis "locks up" the part of the pattern it con-
7706 must swallow everything it can. So, while both \d+ and \d+? are pre-
7732 The possessive quantifier syntax is an extension to the Perl 5.8 syn-
7736 found its way into Perl at release 5.10.
7741 when B must follow. This feature can be disabled by the PCRE2_NO_AUTO-
7751 matches an unlimited number of substrings that either consist of non-
7761 both PCRE2 and Perl have an optimization that allows for fast failure
7762 when a single character is used. They remember the last single charac-
7769 sequences of non-digits cannot be broken, and failure happens quickly.
7775 0 (and possibly further digits) is a backreference to a capturing sub-
7781 there are not that many capturing left parentheses in the entire pat-
7783 to the left of the reference for numbers less than 8. A "forward back-
7785 and the subpattern to the right has participated in an earlier itera-
7791 See the subsection entitled "Non-printing characters" above for further
7805 An unsigned number specifies an absolute reference without the ambigu-
7810 (abc(def)ghi)\g{-1}
7812 The sequence \g{-1} is a reference to the most recently started captur-
7813 ing subpattern before \g, that is, it is equivalent to \2 in this exam-
7814 ple. Similarly, \g{-2} would be equivalent to \1. The use of relative
7821 Perl does not support the use of + in this way.
7823 A backreference matches whatever actually matched the capturing subpat-
7832 time of the backreference, the case of letters is relevant. For exam-
7841 subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
7842 \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
7856 subpattern has not actually been used in a particular match, any back-
7862 the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
7865 Because there may be many capturing parentheses in a pattern, all dig-
7866 its following a backslash are taken as part of a potential backrefer-
7877 matches. However, such references can be useful inside repeated sub-
7882 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
7886 the backreference. This can be done using alternation, as in the exam-
7915 referenced in the usual way. For example, a sequence such as (.)\g{-1}
7922 retained after a successful negative assertion. When an assertion con-
7925 For a positive assertion, internally captured substrings in the suc-
7926 cessful branch are retained, and matching continues with the next pat-
7935 For compatibility with Perl, most assertion subpatterns may be
7938 useful. However, an assertion that forms the condition for a condi-
7939 tional subpattern may not be quantified. In practice, for other asser-
7948 tried with and without the assertion, the order depending on the greed-
7962 matches a word followed by a semicolon, but does not include the semi-
7992 strings it matches must have a fixed length. However, if there are sev-
7993 eral top-level alternatives, they do not all have to have the same
8004 This is an extension compared with Perl, which requires all branches to
8009 is not permitted, because its single top-level branch can match two
8011 two top-level branches:
8016 of a lookbehind assertion to get round the fixed-length restriction.
8020 then try to match. If there are insufficient characters before the cur-
8023 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
8026 the lookbehind. The \X and \R escapes, which can match different num-
8030 lookbehinds, as long as the subpattern matches a fixed-length string.
8034 Perl does not support backreferences in lookbehinds. PCRE2 does support
8046 assertions to specify efficient matching of fixed-length strings at the
8052 proceeds from left to right, PCRE2 will look for each "a" in the sub-
8067 quantifier; it can match only the entire string. The subsequent lookbe-
8082 three characters are not "999". This pattern does not match "foo" pre-
8084 three of which are not "999". For example, it doesn't match "123abc-
8108 It is possible to cause the matching process to obey a subpattern con-
8110 on the result of an assertion, or whether a specific capturing subpat-
8114 (?(condition)yes-pattern)
8115 (?(condition)yes-pattern|no-pattern)
8117 If the condition is satisfied, the yes-pattern is used; otherwise the
8118 no-pattern (if present) is used. An absent no-pattern is equivalent to
8119 an empty string (it always matches). If there are more than two alter-
8120 natives in the subpattern, a compile-time error occurs. Each of the two
8121 alternatives may itself contain nested subpatterns of any form, includ-
8129 There are five kinds of condition: references to subpatterns, refer-
8130 ences to recursion, two pseudo-conditions called DEFINE and VERSION,
8136 the condition is true if a capturing subpattern of that number has pre-
8139 numbers), the condition is true if any of them have matched. An alter-
8142 most recently opened parentheses can be referenced by (?(-1), the next
8143 most recent by (?(-2), and so on. Inside loops it can also make sense
8146 is not used; it provokes a compile-time error.)
8148 Consider the following pattern, which contains non-significant white
8155 character is present, sets it as the first captured substring. The sec-
8160 yes-pattern is executed and a closing parenthesis is required. Other-
8161 wise, since no-pattern is not present, the subpattern matches nothing.
8162 In other words, this pattern matches a sequence of non-parentheses,
8168 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
8175 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
8177 PCRE1, which had this facility before Perl, the syntax (?(name)...) is
8179 the letter R followed by digits are ambiguous (see the following sec-
8192 "Recursion" in this sense refers to any subroutine-like call from one
8193 part of the pattern to another, whether or not it is actually recur-
8231 be only one alternative in the subpattern. It is always skipped if con-
8233 can be used to define subroutines that can be referenced from else-
8234 where. (The use of subroutines is described below.) For example, a pat-
8238 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
8246 to match the four dot-separated components of an IPv4 address, insist-
8251 Programs that link with a PCRE2 library can check the version by call-
8253 that do not have access to the underlying code cannot do this. A spe-
8269 assertion. Consider this pattern, again containing non-significant
8272 (?(?=[^a-z]*[a-z])
8273 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
8276 optional sequence of non-letters followed by a letter. In other words,
8280 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
8285 for both positive and negative assertions, because matching always con-
8286 tinues after the assertion, whether it succeeds or fails. (Compare non-
8306 at the start of the pattern, as described in the section entitled "New-
8310 when PCRE2_EXTENDED is set, and the default newline convention (a sin-
8324 unlimited nested parentheses. Without the use of recursion, the best
8329 For some time, Perl has provided a facility that allows regular expres-
8331 Perl code in the expression at run time, and the code can refer to the
8332 expression itself. A Perl pattern using code interpolation to solve the
8337 The (?p{...}) item interpolates Perl code at run time, and in this case
8340 Obviously, PCRE2 cannot support the interpolation of Perl code.
8341 Instead, it supports special syntax for recursion of the entire pat-
8342 tern, and also for individual subpattern recursion. After its introduc-
8344 introduced into Perl at release 5.10.
8349 subpattern. (If not, it is a non-recursive subroutine call, which is
8359 substrings which can either be a sequence of non-parentheses, or a
8360 recursive match of the pattern itself (that is, a correctly parenthe-
8362 of a possessive quantifier to avoid backtracking into sequences of non-
8375 of (?1) in the pattern above you can write (?-2) to refer to the second
8380 Be aware however, that if duplicate subpattern numbers are in use, rel-
8384 (?|(a)|(b)) (c) (?-2)
8387 group (c) is number 2. When the reference (?-2) is encountered, the
8396 because the reference is not inside the parentheses that are refer-
8397 enced. They are always non-recursive subroutine calls, as described in
8400 An alternative approach is to use named parentheses. The Perl syntax
8401 for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup-
8409 The example pattern that we have been looking at contains nested unlim-
8411 strings of non-parentheses is important when applying the pattern to
8423 callout function can be used (see below and the pcre2callout documenta-
8429 which is the last value taken on at the top level. If a capturing sub-
8435 recursion. Consider this pattern, which matches text in angle brack-
8437 brackets (that is, when recursing), whereas any characters are permit-
8443 two different alternatives for the recursive and non-recursive cases.
8446 Differences in recursion processing between PCRE2 and Perl
8448 Some former differences between PCRE2 and Perl no longer exist.
8450 Before release 10.30, recursion processing in PCRE2 differed from Perl
8453 never re-entered, even if it contained untried alternatives and there
8455 recursion before Perl did.)
8458 treated as atomic. That is, they can be re-entered to try unused alter-
8460 now compatible with the way Perl works. If you want a subroutine call
8473 match fails. If you want to match typical palindromic phrases, the pat-
8474 tern has to ignore all non-word characters, which can be done like
8480 such as "A man, a plan, a canal: Panama!". Note the use of the posses-
8481 sive quantifier *+ to avoid backtracking into sequences of non-word
8482 characters. Without this, PCRE2 takes a great deal longer (ten times or
8483 more) to match typical phrases, and Perl takes so long that you think
8486 Another way in which PCRE2 and Perl used to differ in their recursion
8487 processing is in the handling of captured values. Formerly in Perl,
8489 next section), it had no access to any values that were captured out-
8499 to fail in Perl, but in later versions (I tried 5.024) it now works.
8513 (...(relative)...)...(?-1)...
8534 Processing options such as case-independence are fixed when a subpat-
8538 (abc)(?i:(?-1))
8550 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
8553 possibly recursively. Here are two of the examples used above, rewrit-
8562 (abc)(?i:\g<-1>)
8564 Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
8571 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
8572 Perl code to be obeyed in the middle of matching a regular expression.
8573 This makes it possible, amongst other things, to extract different sub-
8574 strings that match the same pair of parentheses when there is a repeti-
8577 PCRE2 provides a similar feature, but of course it cannot obey arbi-
8578 trary Perl code. The feature is called "callout". The caller of PCRE2
8582 passed, or if the callout entry point is set to NULL, callouts are dis-
8592 in a similar way to Perl.
8594 During matching, when PCRE2 reaches a callout point, the external func-
8601 time, and one side-effect is that sometimes callouts are skipped. If
8617 They are all numbered 255. If there is a conditional group in the pat-
8629 A delimited string may be used instead of a number as a callout argu-
8631 ending delimiter is the same as the start, except for {, where the end-
8644 Perl's terminology) that modify the behaviour of backtracking during
8649 By default, for compatibility with Perl, a name is any sequence of
8653 PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati-
8659 and sequences such as \x{100} that define character code points. Char-
8665 names is skipped, and #-comments are recognized, exactly as in the rest
8669 The maximum length of a name is 255 in the 8-bit library and 65535 in
8670 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
8672 the colon were not there. Any number of these verbs may occur in a pat-
8676 them can be used only when the pattern is to be matched using the tra-
8683 subpatterns called as subroutines (whether or not recursively) is docu-
8693 course, be processed. You can suppress the start-of-match optimizations
8694 by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
8699 Experiments with Perl suggest that it too has similar optimizations,
8711 then continues at the outer level. If (*ACCEPT) is triggered in a posi-
8715 If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
8720 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
8727 read. The Perl documentation notes that it is probably useful only when
8728 combined with (?{}) or (??{}). Those are, of course, Perl features that
8729 are not present in PCRE2. The nearest equivalent is the callout fea-
8752 When a match succeeds, the name of the last-encountered (*MARK:NAME) on
8753 the matching path is passed back to the caller as described in the sec-
8754 tion entitled "Other information about the match" in the pcre2api docu-
8775 The (*MARK) name is tagged with "MK:" in this output, and in this exam-
8777 efficient way of obtaining this information than putting each alterna-
8781 true, the name is recorded and passed back if it is the last-encoun-
8803 The following verbs do nothing when they are encountered. Matching con-
8805 causing a backtrack to the verb, a failure is forced. That is, back-
8809 group has been matched, there is never any backtracking into it. Back-
8813 These verbs differ in exactly what kind of failure occurs when back-
8815 when the verb is not in a subroutine or an assertion. Subsequent sec-
8821 matching failure that causes backtracking to reach it. Even if the pat-
8824 verb that is encountered, once it has been passed pcre2_match() is com-
8833 The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM-
8834 MIT). It is like (*MARK:NAME) in that the name is remembered for pass-
8845 anchor, unless PCRE2's start-of-match optimizations are turned off, as
8861 (*COMMIT) causes the match to fail without trying any other starting
8867 the subject if there is a later matching failure that causes backtrack-
8873 (*PRUNE) is just an alternative to an atomic group or possessive quan-
8885 This verb, when given without a name, is like (*PRUNE), except that if
8887 character, but to the position in the subject where (*SKIP) was encoun-
8896 skips on to start the next attempt at "c". Note that a possessive quan-
8907 found, the "bumpalong" advance is to the subject position that corre-
8913 atomic groups or assertions, because they are never re-entered by back-
8931 backtracks, and this causes a new matching attempt to start at the sec-
8942 This verb causes a skip to the next innermost alternative when back-
8945 that it can be used for a pattern-based if-then-else block:
8951 skips to the second alternative and tries COND2, without backtracking
8952 into COND1. If that succeeds and BAR fails, COND3 is tried. If subse-
8953 quently BAZ fails, there are no more alternatives, so there is a back-
8979 failure in C, matching moves to (*FAIL), which causes the whole subpat-
9010 that is backtracked onto first acts. For example, consider this pat-
9018 is consistent, but is not always the same as Perl's. It means that if
9030 PCRE2 sometimes differs from Perl in its handling of backtracking verbs
9035 If the subject is "abac", Perl matches unless its optimizations are
9047 succeed without any further processing; captured strings and a (*MARK)
9049 (*ACCEPT) causes the assertion to fail without any further processing;
9068 in a standalone positive assertion. In a conditional positive asser-
9070 or (*PRUNE) causes the condition to be false. However, for both stand-
9072 (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9077 These behaviours occur whether or not the subpattern is called recur-
9081 match to succeed without any further processing. Matching then contin-
9082 ues after the subroutine call. Perl documents this behaviour. Perl's
9089 when triggered by being backtracked to in a subpattern called as a sub-
9115 Copyright (c) 1997-2018 University of Cambridge.
9116 ------------------------------------------------------------------------------
9124 PCRE2 - Perl-compatible regular expressions (revised API)
9128 Two aspects of performance are discussed below: memory usage and pro-
9139 subpattern has a quantifier with a minimum greater than 1 and/or a lim-
9153 is not usually a problem. However, if the numbers are large, and par-
9159 uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
9161 limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
9162 libraries, and this is reached with the above pattern if the outer rep-
9168 of PCRE2's "subroutine" facility. Re-writing the above pattern as
9174 this kind of pattern is not always exactly equivalent, because any cap-
9177 process patterns that PCRE2 cannot otherwise handle. The matching per-
9179 same. (This applies from release 10.30 - things were different in ear-
9185 From release 10.30, the interpretive (non-JIT) version of pcre2_match()
9186 uses very little system stack at run time. In earlier releases recur-
9188 cause problems, but this usage has been eliminated. Backtracking posi-
9194 used. Rewriting patterns to be time-efficient, as described below, may
9203 has been re-factored to use heap memory when necessary for internal
9214 Certain items in regular expression patterns are processed more effi-
9216 [aeiou] than a set of single-character alternatives such as
9224 slow, because PCRE2 has to use a multi-stage table lookup whenever it
9234 pcre2_match(); the performance loss is less with a DFA matching func-
9237 When a pattern begins with .* not in atomic parentheses, nor in paren-
9241 multiple top-level branches, they must all be anchorable. The optimiza-
9247 subject string contains newlines, the pattern may match from the char-
9258 If you are using such a pattern with subject strings that do not con-
9261 explicit anchoring. That saves PCRE2 from having to scan along the sub-
9283 matching procedure, PCRE2 checks that there is a "b" later in the sub-
9284 ject string, and if there is not, it fails the match immediately. How-
9295 an atomic group or a possessive quantifier. This can often reduce mem-
9306 matched character. For a long string, a lot of memory is required. Con-
9312 This runs much faster, because sequences of characters that do not con-
9313 tain "<" are "swallowed" in one item inside the parentheses, and a pos-
9315 non-"<" characters. This version also uses a lot less memory because
9350 Copyright (c) 1997-2018 University of Cambridge.
9351 ------------------------------------------------------------------------------
9359 PCRE2 - Perl-compatible regular expressions (revised API)
9379 This set of functions provides a POSIX-style API for the PCRE2 regular
9380 expression 8-bit library. See the pcre2api documentation for a descrip-
9381 tion of PCRE2's native API, which contains much additional functional-
9382 ity. There are no POSIX-style wrappers for PCRE2's 16-bit and 32-bit
9388 called libpcre2-posix.a, so can be accessed by adding -lpcre2-posix to
9391 -lpcre2-8.
9402 PCRE2-specific features via the POSIX calling interface or to add BSD
9406 POSIX-like in style. The syntax and semantics of the regular expres-
9407 sions themselves are still those of Perl, subject to the setting of
9408 various PCRE2 options, as described below. "POSIX-like in style" means
9410 POSIX-compatible, and in multi-unit encoding domains it is probably
9416 two structure types, regex_t for compiled internal forms, and reg-
9417 match_t for returning captured substrings. It also defines some con-
9427 structure that is used as a base for storing information about the com-
9449 the defined POSIX behaviour for REG_NEWLINE (see the following sec-
9455 for compilation to the native function. This disables all meta charac-
9464 for matching, the nmatch and pmatch arguments are ignored, and no cap-
9474 now contain binary zeros, which are treated as data characters. Without
9496 all data strings used for matching it to be treated as UTF-8 strings.
9502 subject string is the Perl way, not the POSIX way. Note that setting
9504 It does not affect the way newlines are matched by the dot metacharac-
9507 The yield of regcomp() is zero on success, and non-zero otherwise. The
9513 NOTE: If the yield of regcomp() is non-zero, you must not attempt to
9520 This area is not simple, because POSIX and Perl take different views of
9524 Perl and PCRE2:
9534 This is the equivalent table for a POSIX-compatible pattern matcher:
9545 API. By default, PCRE2's behaviour is the same as Perl's, except that
9546 there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
9547 and Perl, there is no way to stop newline from matching [^a].
9552 action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg-
9554 and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL-
9567 The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
9574 standard. However, setting this option can give more POSIX-like behav-
9579 The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
9598 intended to be portable to other systems. Note that a non-zero rm_so
9620 Unused entries in the array have both structure members set to -1.
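The convention for unused entries can be checked with a short C sketch. It is illustrative only: it uses the system <regex.h> as a stand-in for PCRE2's POSIX wrapper, which is an assumption, and the helper names match_groups() and group_is_unset() are hypothetical. REG_EXTENDED is passed so that the system matcher accepts grouping parentheses.

```c
#include <regex.h>
#include <stddef.h>

/* Compile `pattern`, match it against `subject`, and fill
 * pmatch[0..nmatch-1]. Returns 0 on a successful match, otherwise the
 * non-zero code from regcomp() or regexec().
 * Assumption: the system <regex.h> stands in for PCRE2's wrapper. */
int match_groups(const char *pattern, const char *subject,
                 regmatch_t *pmatch, size_t nmatch)
{
    regex_t preg;
    int rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;              /* on failure, preg must not be used */
    rc = regexec(&preg, subject, nmatch, pmatch, 0);
    regfree(&preg);             /* release memory tied to preg */
    return rc;
}

/* An entry for a group that took no part in the match has both
 * structure members set to -1. */
int group_is_unset(const regmatch_t *m)
{
    return m->rm_so == -1 && m->rm_eo == -1;
}
```

Matching "(a)|(b)" against "b" with a three-entry array should leave entry 1 unset, while entry 2 covers offsets 0 to 1.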
9629 The regerror() function maps a non-zero errorcode from either regcomp()
9633 the first errbuf_size - 1 characters of the error message are used. The
9641 Compiling a regular expression causes memory to be allocated and asso-
9643 memory, after which preg may no longer be used as a compiled expres-
9657 Copyright (c) 1997-2017 University of Cambridge.
9658 ------------------------------------------------------------------------------
9666 PCRE2 - Perl-compatible regular expressions (revised API)
9674 can save this listing to re-create the contents of pcre2demo.c.
9679 used. If matching succeeds, the program outputs the portion of the sub-
9680 ject that matched, together with the contents of any captured sub-
9683 If the -g option is given on the command line, the program then goes on
9685 subject string. The logic is a little bit tricky because of the possi-
9689 The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit
9690 library. It handles strings and characters that are stored in 8-bit
9693 treated as UTF-8 strings, where characters may occupy multiple code
9697 for your operating system, you should be able to compile the demonstra-
9700 cc -o pcre2demo pcre2demo.c -lpcre2-8
9703 to the command line. For example, on a Unix-like system that has PCRE2
9707 cc -o pcre2demo -I/usr/local/include pcre2demo.c \
9708 -L/usr/local/lib -lpcre2-8
9714 ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
9718 expressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
9719 though not all three need be installed). The pcre2demo program is pro-
9726 ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
9729 This is caused by the way shared library support works on those sys-
9732 -R/usr/local/lib
9747 Copyright (c) 1997-2016 University of Cambridge.
9748 ------------------------------------------------------------------------------
9754 PCRE2 - Perl-compatible regular expressions (revised API)
9756 SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
9773 run. However, if you are using the just-in-time optimization feature,
9774 it is not possible to save and reload the JIT data, because it is posi-
9775 tion-dependent. The host on which the patterns are reloaded must be
9778 For example, patterns compiled on a 32-bit system using PCRE2's 16-bit
9779 library cannot be reloaded on a 64-bit system, nor can they be reloaded
9780 using the 8-bit library.
9787 linked with a fixed version of PCRE2 must be prepared to recompile pat-
9797 checking, not complete validation of what is being re-loaded. Corrupted
9809 in the byte stream (its size is 1088 bytes). For more details of char-
9810 acter tables, see the section on locale support in the pcre2api docu-
9816 the length of the vector. The third and fourth arguments point to vari-
9830 PCRE2_ERROR_BADMAGIC means either that a pattern's code has been cor-
9831 rupted, or that a slot in the vector does not point to a compiled pat-
9855 between binary and non-binary data, be sure that the file is opened for
9860 freed in the usual way by calling pcre2_code_free(). When you have fin-
9861 ished with the byte stream, it too must be freed by calling pcre2_seri-
9863 returns immediately without doing anything.
9866 RE-USING PRECOMPILED PATTERNS
9868 In order to re-use a set of saved patterns you must first make the
9869 serialized byte stream available in main memory (for example, by read-
9873 data without actually decoding the patterns:
9884 If this argument is NULL, malloc() and free() are used. After deserial-
9913 and a reference count is used to arrange for its memory to be automati-
9920 If a pattern was processed by pcre2_jit_compile() before being serial-
9936 Copyright (c) 1997-2018 University of Cambridge.
9937 ------------------------------------------------------------------------------
9945 PCRE2 - Perl-compatible regular expressions (revised API)
9949 The full syntax and semantics of the regular expressions that are sup-
9951 document contains a quick-reference summary of the syntax.
9956 \x where x is non-alphanumeric is a literal x
9965 \cx "control-x", where x is any ASCII printing character
9980 Note that \0dd is always an octal code. The treatment of backslash fol-
9981 lowed by a non-zero digit is complicated; for details see the section
9982 "Non-printing characters" in the pcre2pattern documentation, where
9989 read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
9991 matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol-
10006 \P{xx} a character without the xx property
10013 \W a "non-word" character
10017 middle of a UTF-8 or UTF-16 character. The application can lock out the
10021 By default, \d, \s, and \w match only ASCII characters, even in UTF-8
10022 mode or in the 16-bit and 32-bit libraries. However, if locale-specific
10024 points in the range 128-255. If the PCRE2_UCP option is set, the behav-
10049 Mn Non-spacing mark
10081 Xsp Perl space: property Z or tab, NL, VT, FF, CR
10082 Xuc Universally-named character: one that can be
10084 Xwd Perl word: property Xan or underscore
10086 Perl and POSIX space are now the same. Perl added VT to its space char-
10092 Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
10094 Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
10100 Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
10102 Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
10106 Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
10107 ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
10110 Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
10113 Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
10121 [x-y] range (can be used for hex characters)
10127 ascii 0-127
10197 (?<name>...) named capturing group (Perl)
10198 (?'name'...) named capturing group (Perl)
10200 (?:...) non-capturing group
10201 (?|...) non-capturing group; reset group numbers for
10207 (?>...) atomic, non-capturing group
10227 (?-...) unset option(s)
10231 a mixture of setting and unsetting such as (?i-x) is allowed, but there
10233 for example (?^in). An option setting may appear at the start of a non-
10245 (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
10248 (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
10289 Each top-level branch of a lookbehind must be of a fixed length.
10298 \g-n relative reference by number
10300 \g{-n} relative reference by number
10301 \k<name> reference by name (Perl)
10302 \k'name' reference by name (Perl)
10303 \g{name} reference by name (Perl)
10313 (?-n) call subpattern by relative number
10314 (?&name) call subpattern by name (Perl)
10322 \g<-n> call subpattern by relative number (PCRE2 extension)
10323 \g'-n' call subpattern by relative number (PCRE2 extension)
10328 (?(condition)yes-pattern)
10329 (?(condition)yes-pattern|no-pattern)
10333 (?(-n) relative reference condition
10334 (?(<name>) named reference condition (Perl)
10335 (?('name') named reference condition (Perl)
10361 The following act only when a subsequent match failure causes a back-
10363 what happens afterwards. Those that advance the start-of-match point do
10405 Copyright (c) 1997-2018 University of Cambridge.
10406 ------------------------------------------------------------------------------
10414 PCRE2 - Perl-compatible regular expressions (revised API)
10420 in UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
10425 (*UTF). When either of these is the case, both the pattern and any sub-
10427 instead of strings of individual one-code-unit characters. There are
10428 also some other changes to the way characters are handled, as docu-
10431 If you do not need Unicode support you can build PCRE2 without it, in
10443 names for properties are supported. For example, \p{L} matches a let-
10444 ter. Its Perl synonym, \p{Letter}, is not supported. Furthermore, in
10445 Perl, many properties may optionally be prefixed by "Is", for compati-
10446 bility with Perl 5.6. PCRE2 does not support this.
10458 allowed in non-UTF modes.
10468 multi-unit characters (see the description of \C in the pcre2pattern
10472 pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
10474 modes provokes a match-time error. Also, the JIT optimization does not
10475 support \C in these modes. If JIT optimization is requested for a UTF-8
10476 or UTF-16 pattern that contains \C, it will not succeed, and so when
10483 set as in non-UTF mode, all with code points less than 256. This
10489 Alternatively, if you set the PCRE2_UCP option, the way that the char-
10495 all low-valued characters, unless the PCRE2_UCP option is set.
10498 escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
10502 CASE-EQUIVALENCE IN UTF MODES
10504 Case-insensitive matching in a UTF mode makes use of Unicode properties
10506 at most two case-equivalent values. For these, a direct table lookup is
10508 than two code points that are case-equivalent, and these are treated as
10521 UTF-16 and UTF-32 strings can indicate their endianness by special code
10522 known as a byte-order mark (BOM). The PCRE2 functions do not handle
10526 case of pcre2_match() and pcre2_dfa_match() calls with a non-zero
10530 end of the subject. If there are no lookbehind assertions in the pat-
10534 the starting offset. Note that the sequences \b and \B are one-charac-
10539 the surrogate area. The so-called "non-character" code points are not
10544 UTF-16, where they are used in pairs to encode code points with values
10545 greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
10546 are available independently in the UTF-8 and UTF-32 encodings. (In
10547 other words, the whole surrogate thing is a fudge for UTF-16 which
10548 unfortunately messes up UTF-8 and UTF-32.)
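As a rough illustration of how UTF-16 uses surrogates in pairs, the following C sketch splits a code point above 0xFFFF into a high and a low surrogate. The function name to_surrogate_pair() is hypothetical and is not part of PCRE2. The arithmetic is the standard UTF-16 encoding rule.

```c
#include <stdint.h>

/* Split a code point above 0xFFFF into a UTF-16 surrogate pair.
 * Returns 1 and stores the pair on success, or 0 if cp is not in the
 * range 0x10000 to 0x10FFFF (such values need no pair or are invalid). */
int to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                              /* 20 bits remain */
    *high = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits */
    return 1;
}
```

For example, U+1F600 becomes the pair 0xD83D, 0xDE00. The same code point is stored directly in UTF-32 and as four bytes in UTF-8, with no surrogates involved.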
10551 and therefore want to skip these checks in order to improve perfor-
10553 scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
10569 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
10570 sible only in UTF-8 and UTF-32 modes, because these values are not rep-
10571 resentable in UTF-16.
10573 Errors in UTF-8 strings
10575 The following negative error codes are given for invalid UTF-8 strings:
10583 The string ends with a truncated UTF-8 character; the code specifies
10584 how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
10585 characters to be no longer than 4 bytes, the encoding scheme (origi-
10607 A 4-byte character has a value greater than 0x10ffff; these code points
10612 A 3-byte character has a value in the range 0xd800 to 0xdfff; this
10613 range of code points is reserved by RFC 3629 for use with UTF-16, and
10614 so are excluded from UTF-8.
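The two range restrictions (the upper limit of 0x10ffff set by RFC 3629, and the excluded surrogate area) amount to a simple test on a decoded code point. The C sketch below is only an illustration with hypothetical function names. It is not how PCRE2 performs its validity check.

```c
#include <stdint.h>

/* Surrogate area reserved for UTF-16: 0xd800 to 0xdfff inclusive. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* A code point may appear in UTF-8 only if it is no greater than
 * 0x10ffff and does not lie in the surrogate area. */
int is_encodable_code_point(uint32_t cp)
{
    return cp <= 0x10FFFF && !is_surrogate(cp);
}
```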
10622 A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
10624 For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
10630 binary value 0b10 (that is, the most significant bit is 1 and the sec-
10631 ond is 0). Such a byte can only validly occur as the second or subse-
10632 quent byte of a multi-byte character.
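Both the overlong case and the 0b10 rule can be made concrete with a few lines of C. This is a sketch with hypothetical helper names, shown only to illustrate the bit patterns. It handles 2-byte sequences alone and does no other validation.

```c
#include <stdint.h>

/* A continuation byte has 0b10 as its two most significant bits. */
int is_continuation_byte(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/* Combine the 5 payload bits of a 2-byte lead byte with the 6 payload
 * bits of its continuation byte. No validation is done here. */
uint32_t decode_two_bytes(uint8_t b0, uint8_t b1)
{
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}

/* A 2-byte sequence is overlong when its value would fit in a single
 * byte, that is, when the decoded value is below 0x80. */
int is_overlong_two_bytes(uint8_t b0, uint8_t b1)
{
    return decode_two_bytes(b0, b1) < 0x80;
}
```

With the bytes 0xc0, 0xae from the example above, decode_two_bytes() gives 0x2e, so the sequence is overlong. The bytes 0xc3, 0xa9 give 0xe9, which is not overlong.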
10637 can never occur in a valid UTF-8 string.
10639 Errors in UTF-16 strings
10641 The following negative error codes are given for invalid UTF-16
10649 Errors in UTF-32 strings
10651 The following negative error codes are given for invalid UTF-32
10668 Copyright (c) 1997-2018 University of Cambridge.
10669 ------------------------------------------------------------------------------