
Lines Matching +full:ipv4 +full:- +full:simple +full:- +full:service +full:- +full:config

1 -----------------------------------------------------------------------------
8 -----------------------------------------------------------------------------
16 PCRE2 - Perl-compatible regular expressions (revised API)
25 API is more extensible, and it was simplified by abolishing the sepa-
30 As well as Perl-style regular expression patterns, some features that
37 The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or
38 32-bit code units, which means that up to three separate libraries may
39 be installed. The original work to extend PCRE to 16-bit and 32-bit
40 code units was done by Zoltan Herczeg and Christian Persch, respec-
42 character per code unit, or as UTF-encoded Unicode, with support for
45 code units must be enabled explicitly at run time. The version of Uni-
48 pcre2test -C
51 ending in _8, _16, or _32, respectively (for example, pcre2_com-
57 In addition to the Perl-compatible matching function, PCRE2 contains an
58 alternative function that matches the same compiled patterns in a dif-
70 client to discover which features are available. The features them-
71 selves are described in the pcre2build page. Documentation about build-
73 NON-AUTOTOOLS_BUILD files in the source distribution.
86 If you are using PCRE2 in a non-UTF application that permits users to
89 For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
90 mode, which interprets patterns and subjects as strings of UTF-8 code
91 units instead of individual 8-bit characters. This causes both the pat-
92 tern and any data against which it is matched to be checked for UTF-8
93 validity. If the data string is very long, such a check might use suf-
94 ficiently many resources as to cause your application to lose perfor-
97 One way of guarding against this possibility is to use the pcre2_pat-
100 calling pcre2_compile(). This causes a compile time error if the pat-
101 tern contains a UTF-setting sequence.
104 be enabled from within the pattern, by specifying "(*UCP)". This fea-
112 The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
114 middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C
116 a compile-time error if it is encountered. It is also possible to build
121 Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
124 pcre2_set_depth_limit() that can be used to restrict the amount of mem-
130 The user documentation for PCRE2 comprises a number of different sec-
136 (which is a program listing), and the short pages for individual func-
137 tions, are concatenated in pcre2.txt, for ease of searching. The sec-
141 pcre2-config show PCRE2 installation configuration information
148 pcre2grep description of the pcre2grep command (8-bit only)
149 pcre2jit discussion of just-in-time optimization support
156 pcre2posix the POSIX-compatible C API for the 8-bit library
170 University Computing Service
181 Copyright (c) 1997-2018 University of Cambridge.
182 ------------------------------------------------------------------------------
190 PCRE2 - Perl-compatible regular expressions (revised API)
195 contains a description of all its native functions. See the pcre2 docu-
448 These functions provide a way of converting non-PCRE2 patterns into
455 PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
457 There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
460 for all three libraries. One, two, or all three can be installed simul-
461 taneously. On Unix-like systems the libraries are called libpcre2-8,
462 libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
485 macros are defined whose names are the generic forms such as pcre2_com-
487 PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
505 single library. For example, if you want to run a match using a pat-
517 There are also some wrapper functions for the 8-bit library that corre-
529 program against a non-dll PCRE2 library, you must define PCRE2_STATIC
533 and matching regular expressions in a Perl-compatible manner. A sample
540 passed as bits in an options argument. There are also some more compli-
543 blocks, described below). Simple applications do not need to make use
546 Just-in-time (JIT) compiler support is an optional feature of PCRE2
561 less sanity checking. The JIT-specific functions are discussed in the
564 A second matching function, pcre2_dfa_match(), which is not Perl-com-
569 return captured substrings. A description of the two matching algo-
588 pcre2_substring_free() and pcre2_substring_list_free() are also pro-
590 functions is called with a NULL argument, the function returns immedi-
600 Finally, there are functions for finding out information about a com-
615 ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
623 strings: a single CR (carriage return) character, a single LF (line-
624 feed) character, the two-character sequence CRLF, any of the three pre-
642 dollar metacharacters, the handling of #-comments in /x mode, and, when
643 CRLF is a recognized line ending sequence, the match position advance-
644 ment for a non-anchored pattern. There is more detail about this in the
654 In a multithreaded application it is important to keep thread-specific
656 library code itself is thread-safe: it contains no static or global
657 variables. The API is designed to be fairly simple for non-threaded
658 applications while at the same time ensuring that multithreaded appli-
661 There are several different blocks of data that are used to pass infor-
669 is thread-safe, that is, the same compiled pattern can be used by more
672 use them. However, if the just-in-time (JIT) optimization feature is
682 Get a read-only (shared) lock (mutex) for pointer
694 If JIT is being used, but the JIT compilation is not being done immedi-
700 obtain a private copy of the compiled code before calling the JIT com-
713 In a multithreaded application, if the parameters in a context are val-
716 it must make its own thread-specific copy.
721 of a match. This includes details of what was matched, as well as addi-
730 memory management or non-standard character tables. To keep function
739 relevant for several PCRE2 operations, a compile-time context, and a
740 match-time context.
764 function may be NULL, in which case the system memory management func-
790 A compile context is required if you want to provide an external func-
792 values of any of the following compile-time parameters:
801 A compile context is also required if you are using custom memory man-
802 agement. If none of these apply, just pass NULL as the context argu-
805 A compile context is created, copied, and freed by the following func-
833 only argument is a general context. This function builds a set of char-
839 As PCRE2 has developed, almost all the 32 option bits that are avail-
842 bits which are used for some newer, assumed rarer, options. This func-
854 largest number that a PCRE2_SIZE variable can hold, which is effec-
860 This specifies which characters or character sequences are to be recog-
863 two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
871 PCRE2_EXTENDED_MORE option, the newline convention affects the recogni-
883 limit applies to parentheses of all kinds, not just capturing parenthe-
899 nesting, and the second is user data that is set up by the last argu-
901 should return zero if all is well, or non-zero to force an error.
917 A match context is created, copied, and freed by the following func-
937 during a matching operation. Details are given in the pcre2callout doc-
957 option when calling pcre2_compile() so that when JIT is in use, differ-
958 ent code can be compiled. If a match is started with a non-default
959 match limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is gener-
967 the first line and also within the offset limit. In other words, which-
976 also applies to pcre2_dfa_match(), which may use the heap when process-
978 atomic groups. This limit does not apply to matching with the JIT opti-
994 The pcre2_match() function starts out using a 20KiB vector on the sys-
995 tem stack for recording backtracking points. The more nested backtrack-
998 too small. If the heap limit is set to a value less than 21 (in partic-
1000 that do not have a lot of nested backtracking can be successfully pro-
1025 When pcre2_match() is called with a pattern that was successfully pro-
1088 CHECKING BUILD-TIME OPTIONS
1107 non-negative on success, or the negative error code PCRE2_ERROR_BADOP-
1108 TION if the value in the first argument is not recognized. The follow-
1122 unit widths were selected when PCRE2 was built. The 1-bit indicates
1123 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
1130 recursions, lookarounds, and atomic groups in pcre2_dfa_match(). Fur-
1143 just-in-time compiling is available; otherwise it is set to zero.
1162 the 16-bit library is compiled, a value of 3 is rounded up to 4, and
1163 when the 32-bit library is compiled, internal linkages always use 4
1166 The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1202 The output is a uint32_t integer that gives the maximum depth of nest-
1212 This parameter is obsolete and should not be used in new code. The out-
1236 PCRE2 version string, zero-terminated. The number of code units used is
1237 returned. This is the length of the string plus one unit for the termi-
1255 length (in code units). If the pattern is zero-terminated, the length
1260 If the compile context argument ccontext is NULL, memory for the com-
1270 below), the JIT information cannot be copied (because it is position-
1271 dependent). The new copy can initially be used only for non-JIT match-
1276 a multithreaded application to acquire a private copy of shared com-
1285 pointing to the new tables. The memory for the new tables is automati-
1293 After running a match, you must not free a compiled pattern (or a sub-
1304 For those options that can be different in different parts of the pat-
1310 Other, less frequently required compile-time parameters (for example,
1314 If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1318 error has occurred. The values are not defined when compilation is suc-
1319 cessful and pcre2_compile() returns a non-NULL value.
1322 return if it finds an error in the pattern. There are also some nega-
1328 "Obtaining a textual error message" below) should be self-explanatory.
1332 The value returned in erroroffset is an indication of where in the pat-
1336 the failing assertion. For an invalid UTF-8 or UTF-16 string, the off-
1342 mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1345 This code fragment shows a typical straightforward call to pcre2_com-
1353 PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */
1383 (1) \U matches an upper case "U" character; by default \U causes a com-
1403 Perl. If you want a multiline circumflex also to match after a termi-
1417 whitespace in verb names is skipped and #-comments are recognized,
1423 items, all with number 255, before each pattern item, except immedi-
1424 ately before or after an explicit callout in the pattern. For discus-
1437 points (available only in 16-bit or 32-bit mode) are treated as not
1454 this option, a dot does not match when the current position in the sub-
1456 and it can be changed within a pattern by a (?s) option setting. A neg-
1458 escape sequence always matches a non-newline character, independent of
1475 patterns, a new match is then tried at the next starting point. How-
1490 matches, which are necessarily substrings of the first one, must obvi-
1496 totally ignored except when escaped or inside a character class. How-
1498 introduce various parenthesized subpatterns, nor within numerical quan-
1500 item and a following quantifier and between a quantifier and a follow-
1505 When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recog-
1507 256 that are flagged as white space in its low-character table. The ta-
1514 When PCRE2 is compiled with Unicode support, in addition to these char-
1515 acters, five more Unicode "Pattern White Space" characters are recog-
1516 nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1517 right mark), U+200F (right-to-left mark), U+2028 (line separator), and
1523 As well as ignoring most white space, PCRE2_EXTENDED also causes char-
1530 Which characters are interpreted as newlines can be specified by a set-
1532 special sequence at the start of the pattern, as described in the sec-
1543 option, and it can be changed within a pattern by a (?xx) option set-
1550 start of matching, though the matched text may continue over the new-
1551 line. If startoffset is non-zero, the limiting newline is not necessar-
1553 string is "abc\nxyz" (where \n represents a single-character newline) a
1562 If this option is set, all meta-characters in the pattern are disabled,
1565 you are doing a lot of literal matching and are worried about effi-
1590 string, or before a terminating newline (except when PCRE2_DOL-
1608 This option locks out the use of \C in the pattern that is being com-
1609 piled. This escape can cause unpredictable behaviour in UTF-8 or
1610 UTF-16 modes, because it may leave the current matching point in the
1611 middle of a multi-code-unit character. This option may be useful in
1613 there is also a build-time option that permanently locks out the use of
1628 This option locks out interpretation of the pattern as UTF-8, UTF-16,
1629 or UTF-32, depending on which library is in use. In particular, it pre-
1632 applications that process patterns from external sources. The combina-
1637 If this option is set, it disables the use of numbered capturing paren-
1648 If this option is set, it disables "auto-possessification", which is an
1651 are in use, auto-possessification means that some callouts are never
1659 .* is the first significant item in a top-level branch of a pattern,
1662 atomic group or a capturing group that is the subject of a backrefer-
1663 ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
1679 the matching code searches the subject for that value, and fails imme-
1680 diately if it cannot find it, without actually running the main match-
1684 items are in use, these "start-up" optimizations can cause them to be
1685 skipped if the pattern is never actually used. The start-up optimiza-
1686 tions are in effect a pre-scan of the subject that takes place before
1689 The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1702 start-up optimization scans along the subject, finds "A" and runs the
1703 first match attempt from there. The (*COMMIT) item means that the pat-
1711 There are also other start-up optimizations. For example, a minimum
1728 UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
1742 Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1743 able the error that is given if an escape sequence for an invalid Uni-
1744 code code point is encountered in the pattern. In particular, the so-
1748 section entitled "Extra compile options" below. However, this is pos-
1749 sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1750 resentable in UTF-16.
1760 option is available only if PCRE2 has been compiled with Unicode sup-
1773 is going to be used to set a non-default offset limit in a match con-
1783 instead of single-code-unit strings. It is available when PCRE2 is
1792 Unlike the main compile-time options, the extra options are not saved
1799 This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
1800 It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
1802 in UTF-16 to encode code points with values in the range 0x10000 to
1803 0x10ffff. The surrogates cannot therefore be represented in UTF-16.
1804 They can be represented in UTF-8 and UTF-32, but are defined as invalid
1805 code points, and cause errors if encountered in a UTF-8 or UTF-32
1810 when using PCRE2 to check for unwanted characters in UTF-8 strings,
1813 because it applies only to the testing of input strings for UTF valid-
1816 If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
1817 gate code point values in UTF-8 and UTF-32 patterns no longer provoke
1825 escape such as \j or a malformed one such as \x{2z} causes a compile-
1826 time error when detected by pcre2_compile(). Perl is somewhat inconsis-
1828 "j", and non-hexadecimal digits in \x{} are just ignored, though warn-
1829 ings are given in both cases if Perl's warning switch is enabled. How-
1835 treated as single-character escapes. For example, \j is a literal "j"
1837 option means that typos in patterns may go undetected and have unex-
1842 This option is provided for use by the -x option of pcre2grep. It
1844 automatically inserting the code for "^(?:" at the start of the com-
1851 This option is provided for use by the -w option of pcre2grep. It
1859 JUST-IN-TIME (JIT) COMPILATION
1879 just-in-time compiler is available, further processes a compiled pat-
1885 for patterns to be analyzed, and for one-off matches and simple pat-
1896 points are less than 256. By default, higher-valued code points never
1897 match escapes such as \w or \d. However, if PCRE2 is built with Uni-
1898 code support, all characters can be tested with \p and \P, or, alterna-
1901 the built-in tables.
1911 default "C" locale of the local system, which may cause them to be dif-
1914 The internal tables can be overridden by tables supplied by the appli-
1916 from the default. As more and more applications change to using Uni-
1933 The locale name "fr_FR" is used on Linux and other Unix-like systems;
1940 pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern, com-
1951 The first argument for pcre2_pattern_info() is a pointer to the com-
1957 the function is zero for success, or one of the following negative num-
1966 a simple check against passing an arbitrary memory pointer. Here is a
1967 typical call of pcre2_pattern_info(), to obtain the length of the com-
1986 options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
1987 TIONS returns the compile options as modified by any top-level (*XXX)
1990 compile context by calling the pcre2_set_compile_extra_options() func-
1996 change within a pattern do not affect the result of PCRE2_INFO_ALLOP-
2001 PCRE2 if the first significant item in every top-level branch is one of
2007 .* sometimes - see below
2019 For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
2028 characters of the given group, but in addition, the check that a cap-
2041 Return the highest capturing subpattern number in the pattern. In pat-
2051 PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
2057 In the absence of a single first code unit for a non-anchored pattern,
2058 pcre2_compile() may construct a 256-bit table that defines a fixed set
2062 means "any code unit of value 255 or above". If such a table was con-
2069 a non-anchored pattern. The third argument should point to an uint32_t
2081 The third argument should point to an uint32_t variable. In the 8-bit
2082 library, the value is always less than 256. In the 16-bit library the
2083 value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
2084 value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2112 (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2120 Return 1 if the (?J) or (?-J) option setting is used in the pattern,
2122 (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
2127 If the compiled pattern was successfully processed by pcre2_jit_com-
2147 PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2154 contains recursive subroutine calls it is not always possible to deter-
2155 mine whether or not it can match an empty string. PCRE2 takes a cau-
2164 PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
2170 Return the number of characters (not code units) in the longest lookbe-
2172 uint32_t integer. This information is useful when doing multi-segment
2173 matching using the partial matching facilities. Note that the simple
2174 assertions \b and \B require a one-character lookbehind. \A also regis-
2175 ters a one-character lookbehind, though it does not actually inspect
2177 from the old segment is retained when a new segment is processed. Oth-
2185 number of characters, which in UTF mode may be different from the num-
2195 PCRE2 supports the use of named as well as numbered capturing parenthe-
2196 ses. The names are just an additional way of identifying the parenthe-
2198 pcre2_substring_get_byname() are provided for extracting captured sub-
2202 do the conversion, you need to use the name-to-number map, which is
2205 The map consists of a number of fixed-size entries. PCRE2_INFO_NAME-
2211 This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit
2212 library, the first two bytes of each entry are the number of the cap-
2213 turing parenthesis, most significant byte first. In the 16-bit library,
2214 the pointer points to 16-bit code units, the first of which contains
2215 the parenthesis number. In the 32-bit library, the pointer points to
2216 32-bit code units, the first of which contains the parenthesis number.
2231 As a simple example of the name/number table, consider the following
2232 pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
2233 is set, so white space - including newlines - is ignored):
2235 (?<date> (?<year>(\d\d)?\d\d) -
2236 (?<month>\d\d) - (?<day>\d\d) )
2240 with non-printing bytes shown in hexadecimal, and undefined bytes shown
2249 name-to-number map, remember that the length of the entries is likely
2263 This identifies the character sequence that will be recognized as mean-
2272 pcre2_compile() is getting memory in which to place the compiled pat-
2275 over-estimate. Processing a pattern with the JIT compiler does not
2291 which they appear. Its first argument is a pointer to a callout enumer-
2293 passed to pcre2_callout_enumerate(). The contents of the callout enu-
2303 PCRE2, with the same code unit width, and must also have the same endi-
2308 the serialized form. They are described in the pcre2serialize documen-
2309 tation. Note that PCRE2 serialization does not convert compiled pat-
2331 you must create a match data block by calling one of the creation func-
2338 pcre2_match_data_create(), so it is always possible to return the over-
2341 The second argument of pcre2_match_data_create() is a pointer to a gen-
2348 right size to hold all the substrings a pattern might capture. The sec-
2387 order to find multiple matches in the subject string or to match dif-
2391 operates in a Perl-like manner. For specialist use there is also an
2395 Here is an example of a simple call to pcre2_match():
2407 If the subject string is zero-terminated, the length can be given as
2409 common matching parameters are to be changed. For details, see the sec-
2417 bytes for the 8-bit library, 16-bit code units for the 16-bit library,
2418 and 32-bit code units for the 32-bit library, whether or not UTF pro-
2424 by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2425 set must point to the start of a character, or to the end of the sub-
2426 ject (in UTF-32 mode, one code unit equals one character, so all off-
2430 A non-zero starting offset is useful when searching for another match
2445 string again, but with startoffset set to 4, it finds the second occur-
2457 so, and the current character is CR followed by LF, advance the start-
2460 If a non-zero starting offset is passed when the pattern is anchored, a
2461 single attempt to match at the given offset is made. This can only suc-
2463 the subject. In other words, the anchoring must be the result of set-
2472 PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PAR-
2475 Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup-
2476 ported by the just-in-time (JIT) compiler. If it is set, JIT matching
2492 matches must be right at the end of the subject string. Note that set-
2507 in multiline mode) a newline immediately before it. Setting this with-
2509 match. This option affects only the behaviour of the dollar metacharac-
2545 called. If a non-zero starting offset is given, the check is applied
2546 only to that part of the subject that could be inspected during match-
2553 sequences \b and \B are one-character lookbehinds.
2559 validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the
2582 the caller is prepared to handle a partial match, but only if no com-
2588 other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
2591 There is a more detailed discussion of partial and multi-segment match-
2597 When PCRE2 is built, a default newline convention is set; this is usu-
2602 pcre2pattern page. During matching, the newline choice affects the be-
2618 However, the pattern [\r\n]A does match that string, because it con-
2619 tains an explicit CR or LF reference, and so advances only by one char-
2625 not count, nor does \s, even though it includes CR and LF in the char-
2643 phrase "capturing subpattern" or "capturing group" is used for a frag-
2652 Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2656 pcre2_get_ovector_count() returns the number of pairs of values it con-
2659 Within the ovector, the first in each pair of values is set to the off-
2661 offset of the first code unit after the end of a substring. These val-
2663 are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit
2664 library, and 32-bit offsets in the 32-bit library.
2672 the portion of the subject string that was matched by the entire pat-
2676 been captured, the returned value is 3. If there are no captured sub-
2699 2 is not. When this happens, both values in the offset pairs corre-
2705 are not matched. The return from the function is 2, because the high-
2706 est used capturing subpattern number is 1. The offsets for the sec-
2711 in the pattern are never changed. That is, if a pattern contains n cap-
2713 pcre2_match(). The other elements retain whatever values they previ-
2733 returns a pointer to the zero-terminated name, which is within the com-
2754 Warning: By default, certain start-of-match optimizations are used to
2758 engine. This check fails for "bx", causing a match failure without see-
2759 ing any marks. You can disable the start-of-match optimizations by set-
2766 offset of the character at which the match started. For a non-partial
2779 If pcre2_match() fails, it returns a negative number. This can be con-
2780 verted to a text string by calling the pcre2_get_error_message() func-
2785 of UTF-specific negative error codes is returned. Details are given in
2800 PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
2807 a library of a different code unit width, for example, a pattern com-
2808 piled by the 8-bit library is passed to a 16-bit or 32-bit library
2849 using JIT is being matched, but the memory available for the just-in-
2850 time processing stack is not large enough. See the pcre2jit documenta-
2872 within the pattern. Specifically, it means that either the whole pat-
2874 the same position in the subject string. Some simple patterns that
2875 might do this are detected and faulted at compile time, but more com-
2886 match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
2893 The returned message is terminated with a trailing zero, and the func-
2896 PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the mes-
2919 extracting captured substrings as new, separate, zero-terminated
2925 zero refers to the entire matched substring, with higher numbers refer-
2936 extracts a zero-length empty string.
2945 The pcre2_substring_copy_bynumber() function copies a captured sub-
2948 function that was used for the match data block. The first two argu-
2988 pattern is (abc)|(def) and the subject is "def", and the ovector con-
2999 The pcre2_substring_list_get() function extracts all available sub-
3013 therefore need the lengths, you may supply NULL as the lengthsptr argu-
3015 function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
3022 This can be distinguished from a genuine zero-length substring by
3024 PCRE2_UNSET for unset substrings, or by calling pcre2_sub-
3044 To extract a substring by name, you first have to find associated num-
3051 the name by calling pcre2_substring_number_from_name(). The first argu-
3071 Warning: If the pattern uses the (?| feature to set up multiple subpat-
3092 given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
3100 pcre2_match(), except that the partial matching options are not permit-
3102 block is obtained and freed within this function, using memory manage-
3112 length, in code units, of the output buffer. If the function is suc-
3130 option is set, a dollar character is an escape character that can spec-
3140 brackets are required only if the following character would be inter-
3151 used to perform simple simultaneous substitutions, as this pcre2test
3164 takes place in the original subject string (that is, previous replace-
3167 subject string. If an offset limit is set in the match context, search-
3171 the subject string by setting either or both of startoffset and an off-
3179 with zero length, an attempt to find a non-empty match at the same off-
3186 buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3188 continues to go through the motions of matching and substituting (with-
3189 out, of course, writing anything) in order to compute the size of buf-
3196 that the entire operation is carried out twice. Depending on the appli-
3198 the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
3221 particular character codes, and backslash followed by any non-alphanu-
3228 current state: \U and \L change to upper or lower case forcing, respec-
3233 all inserted characters, including those from captured groups and let-
3236 Note that case forcing sequences such as \U...\E do not nest. For exam-
3244 ${<n>:-<string>}
3247 As before, <n> may be a group number or a name. The first form speci-
3279 PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3282 PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3284 when the simple (non-extended) syntax is used and PCRE2_SUBSTI-
3294 PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP-
3295 MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI-
3336 point to the first and last entries in the name-to-number table for the
3350 which stops when it finds the first match at a given point in the sub-
3353 function (see below) instead. If you cannot use the alternative func-
3357 What you have to do is to insert a callout right at the end of the pat-
3358 tern. When your callout function is called, extract and save the cur-
3375 not backtrack. This has different characteristics to the normal algo-
3379 algorithms, and a list of features that pcre2_dfa_match() does not sup-
3384 is used in a different way, and this is described below. The other com-
3394 Here is an example of a simple call to pcre2_dfa_match():
3412 zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDAN-
3430 matches, but there is still at least one matching possibility. The por-
3433 more detailed discussion of partial and multi-segment matching, with
3439 stop as soon as it has found one match. Because of the way the alterna-
3455 When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3474 which is the number of matched substrings. The offsets of the sub-
3477 any capturing groups that may exist in the pattern, because DFA match-
3490 NOTE: PCRE2's "auto-possessification" optimization usually applies to
3546 University Computing Service
3553 Copyright (c) 1997-2018 University of Cambridge.
3554 ------------------------------------------------------------------------------
3562 PCRE2 - Perl-compatible regular expressions (revised API)
3567 the library in Unix-like environments using the applications known as
3574 "by hand") in the text file called NON-AUTOTOOLS-BUILD. You should
3576 non-Unix-like environment.
3579 PCRE2 BUILD-TIME OPTIONS
3583 configure script, where the optional features are selected or dese-
3584 lected by providing options to configure before running the make com-
3585 mand. However, the same options can be selected in both Unix-like and
3586 non-Unix-like environments if you are using CMake instead of configure
3590 by editing the config.h file, or by passing parameter settings to the
3591 compiler, as described in NON-AUTOTOOLS-BUILD.
3597 ./configure --help
3600 names begin with --enable or --disable. Because of the way that config-
3601 ure works, --enable and --disable always come in pairs, so the comple-
3604 with --with. At the end of a configure run, a summary of the configura-
3608 BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
3610 By default, a library called libpcre2-8 is built, containing functions
3612 either as single-byte characters, or UTF-8 strings. You can also build
3613 two other libraries, called libpcre2-16 and libpcre2-32, which process
3614 strings that are contained in arrays of 16-bit and 32-bit code units,
3615 respectively. These can be interpreted either as single-unit characters
3616 or UTF-16/UTF-32 strings. To build these additional libraries, add one
3619 --enable-pcre2-16
3620 --enable-pcre2-32
3622 If you do not want the 8-bit library, add
3624 --disable-pcre2-8
3627 the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
3628 an 8-bit program. Neither of these is built if you select only the
3629 16-bit or 32-bit libraries.
3638 --disable-shared
3639 --disable-static
3649 --disable-unicode
3655 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
3656 UTF-16 or UTF-32. To do that, applications that use the library can set
3657 the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
3665 and Nd are supported. Details are given in the pcre2pattern documenta-
3677 mode, can cause unpredictable behaviour because it may leave the cur-
3678 rent matching point in the middle of a multi-code-unit character. The
3680 option when calling pcre2_compile(). There is also a build-time option
3682 --enable-never-backslash-C
3687 JUST-IN-TIME COMPILER SUPPORT
3689 Just-in-time (JIT) compiler support is included in the build by speci-
3692 --enable-jit
3698 --enable-jit=auto
3705 --enable-jit-sealloc
3712 --disable-pcre2grep-jit
3720 the end of a line. This is the normal newline character on Unix-like
3724 --enable-newline-is-cr
3726 to the configure command. There is also an --enable-newline-is-lf
3730 the two-character sequence CRLF (CR immediately followed by LF). If you
3733 --enable-newline-is-crlf
3737 --enable-newline-is-anycrlf
3742 --enable-newline-is-any
3745 newline sequences are the three just mentioned, plus the single charac-
3750 --enable-newline-is-nul
3752 which causes NUL (binary zero) to be set as the default line-ending
3766 --enable-bsr-anycrlf
3768 the default is changed so that \R matches only CR, LF, or CRLF. What-
3776 part to another (for example, from an opening parenthesis to an alter-
3777 nation metacharacter). By default, in the 8-bit and 16-bit libraries,
3778 two-byte values are used for these offsets, leading to a maximum size
3779 for a compiled pattern of around 64 thousand code units. This is suffi-
3782 compile PCRE2 to use three-byte or four-byte offsets by adding a set-
3785 --with-link-size=3
3788 16-bit library, a value of 3 is rounded up to 4. In these libraries,
3790 to load additional data when handling them. For the 32-bit library the
3791 value is always 4 and cannot be overridden; the value of --with-link-
3804 --with-match-limit=500000
3810 The pcre2_match() function starts out using a 20KiB vector on the sys-
3819 --with-heap-limit=500
3829 for --with-match-limit. You can set a lower default limit by adding,
3832 --with-match-limit-depth=10000
3844 for lookaround assertions, atomic groups, and recursion within pat-
3855 --enable-rebuild-chartables
3860 C run-time system. This method of replacing the tables does not work if
3871 compiled to run in an 8-bit EBCDIC environment by adding
3873 --enable-ebcdic --disable-unicode
3875 to the configure command. This setting implies --enable-rebuild-charta-
3879 It is not possible to support both EBCDIC and UTF-8 codes in the same
3880 version of the library. Consequently, --enable-unicode and --enable-
3887 --enable-ebcdic-nl25
3889 as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
3891 0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
3894 The options that select newline behaviour, such as --enable-newline-is-
3895 cr, and equivalent run-time options, refer to these character values in
3901 By default, on non-Windows systems, pcre2grep supports the use of call-
3904 This support can be disabled by adding --disable-pcre2grep-callout to
3914 --enable-pcre2grep-libz
3915 --enable-pcre2grep-libbz2
3917 to the configure command. These options naturally require that the rel-
3929 be processable is the notional buffer size. If a longer line is encoun-
3935 --with-pcre2grep-bufsize=51200
3936 --with-pcre2grep-max-bufsize=2097152
3939 values by using --buffer-size and --max-buffer-size on the command
3947 --enable-pcre2test-libreadline
3948 --enable-pcre2test-libedit
3952 it reads it using the readline() function. This provides line-editing
3953 and history facilities. Note that libreadline is GPL-licensed, so if
3958 Setting --enable-pcre2test-libreadline causes the -lreadline option to
3960 system-installed readline library this is sufficient. However, in some
3972 LIBS="-lncurses"
3981 --enable-debug
3991 --enable-valgrind
4005 --enable-coverage
4017 When --enable-coverage is used, the following additional targets are
4023 equivalent to running "make coverage-reset", "make coverage-baseline",
4024 "make check", and then "make coverage-report".
4026 make coverage-reset
4030 make coverage-baseline
4034 make coverage-report
4038 make coverage-clean-report
4040 This removes the generated coverage report without cleaning the cover-
4043 make coverage-clean-data
4048 make coverage-clean
4051 For more information about code coverage, see the gcov and lcov docu-
4060 --enable-fuzz-support
4062 At present this applies only to the 8-bit library. If set, it causes an
4063 extra library called libpcre2-fuzzsupport.a to be built, but not
4064 installed. This contains a single function called LLVMFuzzerTestOneIn-
4071 Setting --enable-fuzz-support also causes a binary called pcre2fuz-
4086 --disable-stack-for-recursion
4095 pcre2api(3), pcre2-config(3).
4101 University Computing Service
4108 Copyright (c) 1997-2018 University of Cambridge.
4109 ------------------------------------------------------------------------------
4117 PCRE2 - Perl-compatible regular expressions (revised API)
4132 PCRE2 provides a feature called "callout", which is a means of tempo-
4143 ending delimiter is the same as the start, except for {, where the end-
4163 A(\d{2}|--)
4167 (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4170 alternation bar. If the pattern contains a conditional group whose con-
4184 information when you are trying to optimize the performance of a par-
4194 Auto-possessification
4196 At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4202 --->aaaa
4210 the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
4214 --->aaaa
4231 beginning of the subject, and pcre2_compile() remembers this. If a pat-
4232 tern has more than one top-level branch, automatic anchoring occurs if
4237 It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
4243 --->aa
4250 This shows that all match attempts start at the beginning of the sub-
4253 starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
4256 --->aa
4266 This shows more match attempts, starting at the second subject charac-
4287 You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4297 to both normal, DFA, and JIT matching. The first argument to the call-
4323 version 1, and the callout_flags field for version 2. If you are writ-
4332 contains the number of the callout, in the range 0-255. This is the
4339 callout_string points to the string that is contained within the com-
4347 delimiter as callout_string[-1] if you need it.
4371 The capture_last field contains the number of the most recently cap-
4373 number of the highest numbered captured substring so far. If no sub-
4379 The contents of ovector[2] to ovector[<capture_top>*2-1] can be
4388 was passed to the matching function in the match data block for call-
4414 parenthesis, the length includes meta characters that follow the paren-
4417 the length is one, unless a closing parenthesis is followed by a quan-
4426 are used by pcre2test to show the next item to be matched when display-
4430 zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
4452 starting position in the subject. Output from pcre2test does not indi-
4456 The information in the callout_flags field is provided so that applica-
4460 because there is no backtracking in DFA matching, and there is no sup-
4493 which they appear. Its first argument is a pointer to a callout enumer-
4495 passed to pcre2_callout_enumerate(). The data block contains the fol-
4513 non-zero minimum or a fixed maximum, the group is replicated inside the
4519 The callback function should normally return zero. If it returns a non-
4527 University Computing Service
4534 Copyright (c) 1997-2018 University of Cambridge.
4535 ------------------------------------------------------------------------------
4543 PCRE2 - Perl-compatible regular expressions (revised API)
4549 respect to Perl versions 5.26, but as both Perl and PCRE2 are continu-
4555 2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
4563 3. Capturing subpatterns that occur inside negative lookaround asser-
4569 \u, \U, and \N when followed by a character name. \N on its own, match-
4570 ing a non-newline character, and \N{U+dd..}, matching a Unicode code
4572 letters are implemented by Perl's general string-handling and are not
4583 the need for the user to understand the internal representation of Uni-
4584 code characters, there is no need to implement the somewhat messy con-
4591 does not have variables). Also, Perl does "double-quotish backslash
4592 interpolation" on any backslashes between \Q and \E which, its documen-
4620 effect is confined to that subpattern; it does not extend to the sur-
4641 13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
4648 distinguish which parentheses matched, because both names map to cap-
4659 such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
4664 not affected when case-independent matching is specified. For example,
4680 (b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
4681 ported in lookbehinds, provided that there is no possibility of refer-
4682 encing a non-unique number or name. Perl does not support backrefer-
4686 $ meta-character matches only at the very end of the string.
4691 (e) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
4692 fiers is inverted, that is, by default they are not greedy, but if fol-
4704 (i) The callout facility is PCRE2-specific. Perl supports codeblocks
4707 (j) The partial matching facility is PCRE2-specific.
4710 different way and is not Perl-compatible.
4717 /aa modifier restricts /i case-insensitive matching to pure ascii,
4721 19. Perl has different limits than PCRE2. See the pcre2limit documenta-
4722 tion for details. Perl went with 5.10 from recursion to iteration keep-
4724 not fall into any stack-overflow limit. PCRE2 made a similar change at
4725 release 10.30, and also has many build-time and run-time customizable
4732 University Computing Service
4739 Copyright (c) 1997-2018 University of Cambridge.
4740 ------------------------------------------------------------------------------
4748 PCRE2 - Perl-compatible regular expressions (revised API)
4750 PCRE2 JUST-IN-TIME COMPILER SUPPORT
4752 Just-in-time compiling is a heavyweight optimization that can greatly
4753 speed up pattern matching. However, it comes at the cost of extra pro-
4755 the same pattern is going to be matched many times. This does not nec-
4757 anchored, matching attempts may take place many times at various posi-
4759 string is very long, it may still pay to use JIT even for one-off
4760 matches. JIT support is available for all of the 8-bit, 16-bit and
4761 32-bit PCRE2 libraries.
4763 JIT support applies only to the traditional Perl-compatible matching
4771 --enable-jit (or equivalent CMake option) must be set when PCRE2 is
4775 ARM 32-bit (v5, v7, and Thumb2)
4776 ARM 64-bit
4777 Intel x86 32-bit and 64-bit
4778 MIPS 32-bit and 64-bit
4779 Power PC 32-bit and 64-bit
4780 SPARC 32-bit
4782 If --enable-jit is set on an unsupported platform, compilation fails.
4784 A program can tell if JIT support is available by calling pcre2_con-
4786 available, and 0 otherwise. However, a simple program does not need to
4788 falls back to the interpretive code if JIT is not available. For pro-
4790 path" API that is JIT-specific.
4793 SIMPLE USE OF JIT
4799 second is zero or more of the following option bits: PCRE2_JIT_COM-
4810 the size of machine stack that it uses. The exact rules are not docu-
4815 PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com-
4816 plete matches. If you want to run partial matches using the PCRE2_PAR-
4821 pcre2_match() is called, the appropriate code is run if it is avail-
4826 the option bits. For example, you can call it once with PCRE2_JIT_COM-
4829 will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
4830 ing. If pcre2_jit_compile() is called with no option bits set, it imme-
4847 stack" below, even if you do not need to supply a non-default JIT
4849 be obeyed. If the match-time options are not right for JIT execution,
4852 If the JIT compiler finds an unsupported item, no JIT data is gener-
4855 option. A non-zero result means that JIT compilation was successful. A
4872 when running in a UTF mode, and a callout immediately before an asser-
4881 that the memory used for the JIT stack was insufficient. See "Control-
4901 The pcre2_jit_stack_create() function creates a JIT stack. Its argu-
4907 function returns immediately, without doing anything. (For the techni-
4919 The first argument is a pointer to a match context. When this is subse-
4921 JIT stack is used. If this argument is NULL, the function returns imme-
4941 is not obeyed when pcre2_match() is called with options that are incom-
4949 up non-sequential matches in one thread is to use callouts: if a call-
4954 you assign or pass back NULL from a callback, that is thread-safe,
4956 or pass back a non-NULL JIT stack, this must be a different stack for
4957 each thread so that the application is thread-safe.
4959 Strictly speaking, even more is allowed. You can assign the same non-
4968 up non-default JIT stacks might operate:
4976 Use a one-line callback function
4987 PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
4989 child nodes. Allocating real machine stack on some platforms is diffi-
4997 address space instead of allocating memory. We can safely allocate mem-
5019 You can free compiled patterns, contexts, and stacks in any order, any-
5038 Especially on embedded systems, it might be a good idea to release mem-
5041 allocated memory for any stack and another which allows releasing mem-
5055 The JIT executable allocator does not free all memory when it is possi-
5059 calling pcre2_jit_free_unused_memory(). Its argument is a general con-
5060 text, for custom memory management, or NULL for standard memory manage-
5066 This is a single-threaded example that specifies a JIT stack without
5111 number of other sanity checks are performed on the arguments. For exam-
5129 University Computing Service
5136 Copyright (c) 1997-2018 University of Cambridge.
5137 ------------------------------------------------------------------------------
5145 PCRE2 - Perl-compatible regular expressions (revised API)
5153 code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5157 (when building the 16-bit library, 3 is rounded up to 4). See the
5159 for details. In these cases the limit is substantially larger. How-
5160 ever, the speed of execution is slower. In the 32-bit library, the
5170 (that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-
5190 (*THEN) verb is 255 code units for the 8-bit library and 65535 code
5191 units for the 16-bit and 32-bit libraries.
5194 number a 32-bit unsigned integer can hold.
5200 University Computing Service
5207 Copyright (c) 1997-2017 University of Cambridge.
5208 ------------------------------------------------------------------------------
5216 PCRE2 - Perl-compatible regular expressions (revised API)
5224 function, and provide a Perl-compatible matching operation. The just-
5225 in-time (JIT) optimization that is described in the pcre2jit documenta-
5229 it operates in a different way, and is not Perl-compatible. This alter-
5250 The set of strings that are matched by a regular expression can be rep-
5255 tree: depth-first and breadth-first, and these correspond to the two
5261 In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
5263 depth-first search of the pattern tree. That is, it proceeds along a
5265 required. When there is a mismatch, the algorithm tries any alterna-
5274 that point the algorithm stops. Thus, if there is more than one possi-
5280 Because it ends up with a single path through the tree, it is rela-
5281 tively straightforward for this algorithm to keep track of the sub-
5288 This algorithm conducts a breadth-first search of the tree. Starting
5306 this algorithm finds all of them, and in particular, it finds the long-
5308 an option to stop the algorithm after the first match (which is neces-
5318 the fifth character of the subject. The algorithm does not automati-
5321 PCRE2's "auto-possessification" optimization usually applies to charac-
5322 ter repeats at the end of a pattern (as well as internally). For exam-
5327 either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
5331 not supported by the alternative matching algorithm. They are as fol-
5336 may affect auto-possessification, as just described). During matching,
5345 a non-possessive quantifier. Similarly, if an atomic group is present,
5353 algorithm does not attempt to do this. This means that no captured sub-
5356 3. Because no substrings are captured, backreferences within the pat-
5359 4. For the same reason, conditional expressions that use a backrefer-
5373 these modes, because the alternative algorithm moves through the sub-
5384 Using the alternative matching algorithm provides the following advan-
5387 1. All possible matches (at a single point in the subject) are automat-
5393 once, and never needs to backtrack (except for lookbehinds), it is pos-
5396 also possible to do multi-segment matching using the standard algo-
5397 rithm, by retaining partially matched substrings, it is more compli-
5399 and discusses multi-segment matching.
5419 University Computing Service
5426 Copyright (c) 1997-2014 University of Cambridge.
5427 ------------------------------------------------------------------------------
5435 PCRE2 - Perl-compatible regular expressions
5441 the entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
5454 reflecting the character that has been typed, for example. This immedi-
5467 If you want to use partial matching with just-in-time optimized code,
5473 PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
5483 shorter strings. This optimization is also disabled for partial match-
5490 the subject string is reached successfully, but matching cannot con-
5491 tinue because more characters are needed. However, at least one charac-
5495 of a matched string. The requirement for inspecting at least one char-
5502 the rest of the ovector are undefined. The appearance of \K in the pat-
5511 string "abc12", because all these characters are needed for a subse-
5512 quent re-match with additional characters.
5520 match, the partial match is remembered, but matching continues as nor-
5525 This option is "soft" because it prefers a complete match over a par-
5529 of the subject is treated as a non-alphanumeric.
5536 If this is matched against the subject string "abc123dog", both alter-
5557 The difference between the two partial matching options can be illus-
5565 However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
5586 without backtracking, searching for all possible matches simultane-
5587 ously. If the end of the subject is reached before the end of the pat-
5600 behaviour is different from the standard functions when PCRE2_PAR-
5614 boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-
5649 matched substrings. The remaining four strings do not match the com-
5657 MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
5661 and calling the function again with the same compiled regular expres-
5663 same working space as before, because this is where details of the pre-
5672 The first call has "23ja" as the subject, and requests partial match-
5675 last part is shown; PCRE2 does not retain the previously partially-
5685 this may or may not be what you want. The only way to allow for start-
5695 MULTI-SEGMENT MATCHING WITH pcre2_match()
5700 re-run, starting from the point where the partial match occurred. Ear-
5719 ISSUES WITH MULTI-SEGMENT MATCHING
5721 Certain types of pattern may give problems with multi-segment matching,
5727 option, but in practice when doing multi-segment matching you should be
5730 2. If a pattern contains a lookbehind assertion, characters that pre-
5742 retained. In a non-UTF or a 32-bit situation, moving back is just a
5743 subtraction, but in UTF-8 or UTF-16 you have to count characters while
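The point about moving back in UTF-8 can be sketched in C. This is only an illustration, not code from PCRE2: the function name utf8_move_back is hypothetical, and the buffer is assumed to hold valid UTF-8. It steps back one byte at a time and keeps going while it is on a continuation byte (bit pattern 10xxxxxx), so each outer step retreats by one whole character.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE2 API.
   Starting at byte offset pos in a valid UTF-8 buffer s, move back by up
   to nchars characters and return the new byte offset. The result is
   clamped at offset 0 if fewer characters are available. */
size_t utf8_move_back(const uint8_t *s, size_t pos, size_t nchars)
{
    while (nchars > 0 && pos > 0) {
        pos--;                                   /* step onto previous byte */
        while (pos > 0 && (s[pos] & 0xc0u) == 0x80u)
            pos--;                               /* skip continuation bytes */
        nchars--;
    }
    return pos;
}
```

In a non-UTF or 32-bit situation the same operation would simply be pos - nchars, clamped at zero.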
5752 the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
5784 been found, continuation to a new subject segment is no longer possi-
5809 matching multi-segment data. The example above then behaves differ-
5849 re-running the entire match can also be used with the DFA matching
5859 University Computing Service
5866 Copyright (c) 1997-2014 University of Cambridge.
5867 ------------------------------------------------------------------------------
5875 PCRE2 - Perl-compatible regular expressions (revised API)
5880 by PCRE2 are described in detail below. There is a quick-reference syn-
5882 and semantics as closely as it can. PCRE2 also supports some alterna-
5897 different algorithm that is not Perl-compatible. Some of the features
5898 discussed below are not available when DFA matching is used. The advan-
5903 SPECIAL START-OF-PATTERN ITEMS
5906 set by special items at the start of a pattern. These are not Perl-com-
5908 writers who are not able to change the program that processes the pat-
5915 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
5916 as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
5917 can be specified for the 32-bit library, in which case it constrains
5928 restrict them to non-UTF data for security reasons. If the
5936 causes sequences such as \d and \w to use Unicode properties to deter-
5949 to whichever matching function is subsequently called to match the pat-
5953 Disabling auto-possessification
5961 Disabling start-up optimizations
5964 setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
5971 as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
5972 tions that apply to patterns whose top-level branches all start with .*
5992 These facilities are provided to catch runaway matches that are pro-
5993 voked by patterns with huge matching trees (a typical example is a pat-
6003 where d is any number of decimal digits. However, the value of the set-
6026 strings: a single CR (carriage return) character, a single LF (line-
6027 feed) character, the two-character sequence CRLF, any of the three pre-
6032 It is also possible to specify a newline convention by starting a pat-
6042 These override the default and the options given to the compiling func-
6052 The newline convention affects where the circumflex and dollar asser-
6053 tions are true. It also affects the interpretation of the dot metachar-
6067 starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
6074 character code instead of ASCII or Unicode (typically a mainframe sys-
6075 tem). In the sections below, character code values are ASCII or Uni-
6098 There are two different sets of metacharacters: those that are recog-
6124 - indicates character range
6142 always safe to precede a non-alphanumeric with backslash to specify
6143 that it stands for itself. In particular, if you want to match a back-
6156 If you want to remove the special meaning from a sequence of charac-
6157 ters, you can do so by putting them between \Q and \E. This is differ-
6159 sequences in PCRE2, whereas in Perl, $ and @ cause variable interpola-
6160 tion. Also, Perl does "double-quotish backslash interpolation" on any
6182 Non-printing characters
6184 A second use of backslash provides a way of encoding non-printing char-
6186 appearance of non-printing characters in a pattern, but when a pattern
6192 \cx "control-x", where x is any printable ASCII character
6218 32 or greater than 126, a compile-time error occurs.
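As a rough illustration of the ASCII rule for \cx (a lower case letter is first converted to upper case, then bit 6, hex 40, is inverted), here is a minimal C sketch. The function name control_escape is hypothetical and the code is not taken from PCRE2. It returns -1 where PCRE2 would give a compile-time error.

```c
/* Hypothetical helper, not part of the PCRE2 API.
   Returns the character code that \cx stands for in an ASCII environment,
   or -1 if x is less than 32 or greater than 126. */
int control_escape(int x)
{
    if (x < 32 || x > 126)
        return -1;              /* PCRE2 reports a compile-time error here */
    if (x >= 'a' && x <= 'z')
        x = x - 'a' + 'A';      /* lower case letters are upper-cased first */
    return x ^ 0x40;            /* invert bit 6 (hex 40) */
}
```

With this rule, control_escape('A') through control_escape('Z') give hex 01 to hex 1A, control_escape('{') gives hex 3B, and control_escape(';') gives hex 7B.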
6222 The \c escape is processed as specified for Perl in the perlebcdic doc-
6223 ument. The only characters that are allowed after \c are A-Z, a-z, or
6224 one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
6226 letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
6227 \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
6239 FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
6240 certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6256 a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
6257 cal character code points, and \g{} to specify backreferences. The fol-
6260 The handling of a backslash followed by a digit other than 0 is compli-
6263 Outside a character class, PCRE2 reads the digit and any following dig-
6267 backreference. A description of how this works is given later, follow-
6271 Inside a character class, PCRE2 handles \8 and \9 as the literal char-
6272 acters "8" and "9", and otherwise reads up to three octal digits fol-
6273 lowing the backslash, using them to generate a data character. Any sub-
6295 By default, after \x that is not followed by {, from zero to two hexa-
6297 number of hexadecimal digits may appear between \x{ and }. If a charac-
6302 just described only when it is followed by two hexadecimal digits. Oth-
6305 by four hexadecimal digits; otherwise it matches a literal "u" charac-
6309 two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
6318 8-bit non-UTF mode no greater than 0xff
6319 16-bit non-UTF mode no greater than 0xffff
6320 32-bit non-UTF mode no greater than 0xffffffff
6324 (the so-called "surrogate" code points). The check for these can be
6327 UTF-8 and UTF-32 modes, because these values are not representable in
6328 UTF-16.
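The limits above can be summarized in a small C sketch. The function name code_point_allowed and its parameters are hypothetical, and the sketch assumes the default behaviour, in which surrogate code points (0xd800 to 0xdfff) are always rejected in UTF modes; it does not model the option that disables that check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE2 API.
   width is the code unit width of the library (8, 16, or 32);
   utf is true when a UTF mode is in force. */
bool code_point_allowed(int width, bool utf, uint32_t cp)
{
    if (utf)   /* any UTF mode: a valid, non-surrogate Unicode code point */
        return cp <= 0x10ffffu && !(cp >= 0xd800u && cp <= 0xdfffu);

    switch (width) {
    case 8:  return cp <= 0xffu;
    case 16: return cp <= 0xffffu;
    case 32: return true;       /* every 32-bit value is allowed */
    default: return false;      /* unknown code unit width */
    }
}
```

Note that in 16-bit non-UTF mode a value such as 0xd800 is accepted, because surrogates are only special when the data is interpreted as UTF.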
6358 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
6362 \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
6379 \W any "non-word" character
6384 has a different meaning. See the section entitled "Non-printing charac-
6388 Each pair of lower and upper case escape sequences partitions the com-
6398 locale. This list may vary if locale-specific matching is taking place.
6399 For example, in some locales the "non-breaking space" character (\xA0)
6403 or digit. By default, the definition of letters and digits is con-
6404 trolled by PCRE2's low-valued character tables, and may vary if locale-
6406 page). For example, in a French locale such as "fr_FR" in Unix-like
6413 be different for characters in the range 128-255 when locale-specific
6415 meanings from before Unicode support was available, mainly for effi-
6437 U+00A0 Non-break space
6444 U+2004 Three-per-em space
6445 U+2005 Four-per-em space
6446 U+2006 Six-per-em space
6451 U+202F Narrow no-break space
6465 In 8-bit, non-UTF-8 mode, only the characters with code points less
6471 any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
6477 below. This particular group matches either the two-character sequence
6479 U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
6481 atomic group, the two-character sequence is treated as a single unit
6485 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
6491 PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for "back-
6493 the case, the other behaviour can be requested via the PCRE2_BSR_UNI-
6500 These override the default and the options given to the compiling func-
6501 tion. Note that these special settings, which are not Perl-compatible,
6515 When PCRE2 is built with Unicode support (the default), three addi-
6517 are available. In 8-bit non-UTF-8 mode, these sequences are of course
6519 they do work in this mode. In 32-bit non-UTF mode, code points greater
6531 (described in the next section). Other Perl properties such as "InMu-
6545 Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
6547 Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
6553 Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
6555 Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
6559 Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
6560 ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
6563 Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
6566 Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
6569 Each character has exactly one Unicode general category property, spec-
6570 ified by a two-letter abbreviation. For compatibility with Perl, nega-
6575 If only one letter is specified with \p or \P, it includes all the gen-
6602 Mn Non-spacing mark
6643 No character that is in the Unicode table has the Cn (unassigned) prop-
6652 to do a multistage table lookup in order to find a character's prop-
6667 properties that had been used for emojis. Instead it introduced vari-
6668 ous emoji-specific properties. PCRE2 uses only the Extended Picto-
6677 2. Do not end between CR and LF; otherwise end after any control char-
6687 "zero-width joiner" character. Characters with the "mark" property
6694 property. Extend and ZWJ characters are allowed between the charac-
6705 As well as the standard Unicode properties described above, PCRE2 sup-
6708 non-standard, non-Perl properties internally when PCRE2_UCP is set.
6716 Xan matches characters that have either the L (letter) or the N (num-
6723 There is another non-standard property, Xuc, which matches any charac-
6730 Note that the Xuc property does not match these sequences but the char-
6747 mode), though it again reports the matched string as "bar". This fea-
6751 does not interfere with the setting of captured substrings. For exam-
6762 be greater than the end of the match. Using \K in a lookbehind asser-
6763 tion at the start of a pattern can also lead to odd effects. For exam-
6773 Simple assertions
6775 The final use of backslash is for certain simple assertions. An asser-
6799 PCRE2 nor Perl has a separate "start of word" or "end of word" metase-
6806 set. Thus, they are independent of multiline mode. These three asser-
6808 which affect only the behaviour of the circumflex and dollar metachar-
6809 acters. However, if the startoffset argument of pcre2_match() is non-
6816 the start point of the matching process, as specified by the startoff-
6818 startoffset is non-zero. By calling pcre2_match() multiple times with
6836 The circumflex and dollar metacharacters are zero-width assertions.
6837 That is, they test for a particular condition being true without con-
6839 are concerned with matching the starts and ends of lines. If the new-
6840 line convention is set so that only the two-character sequence CRLF is
6846 point is at the start of the subject string. If the startoffset argu-
6847 ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
6856 if the pattern is constrained to match only at the start of the sub-
6864 newline. Dollar need not be the last character of the pattern if a num-
6866 branch in which it appears. Dollar has no special meaning in a charac-
6886 pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
6889 When the newline convention (see "Newline conventions" below) recog-
6890 nizes the two-character sequence CRLF as a newline, this is preferred,
6891 even if the single characters CR and LF are also recognized as new-
6905 Outside a character class, a dot in the pattern matches any one charac-
6906 ter in the subject string except (by default) a character that signi-
6910 that character; when the two-character sequence CRLF is used, dot does
6912 matches all characters (including isolated CRs and LFs). When any Uni-
6918 exception. If the two-character sequence CRLF is present in the sub-
6921 The handling of dot is entirely independent of the handling of circum-
6931 the section entitled "Non-printing characters" above for details. Perl
6939 unit, whether or not a UTF mode is set. In the 8-bit library, one code
6940 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
6941 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
6942 line-ending characters. The feature is provided in Perl in order to
6943 match individual bytes in UTF-8 mode, but it is unclear how it can use-
6947 one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
6949 results, because PCRE2 assumes that it is matching character by charac-
6959 below) in UTF-8 or UTF-16 modes, because this would make it impossible
6962 these UTF modes. The former gives a match-time error; the latter fails
6965 In the 32-bit library, however, \C is always supported (when not
6967 whether or not UTF-32 is specified.
6970 using it that avoids the problem of malformed UTF-8 or UTF-16 charac-
6972 as in this pattern, which could be used with a UTF-8 string (ignore
6975 (?| (?=[\x00-\x7f])(\C) |
6976 (?=[\x80-\x{7ff}])(\C)(\C) |
6977 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
6978 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
6981 parentheses numbers in each alternative (see "Duplicate Subpattern Num-
6983 UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes,
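
The four code point ranges in the lookaheads above correspond to UTF-8 encodings of 1, 2, 3, and 4 bytes. Python's re module has no \C escape, so the sketch below does not run the pattern; it only confirms the byte counts with str.encode. The sample characters and the helper name utf8_length are chosen for illustration.

```python
# One sample character from each range used in the lookaheads above,
# mapped to the number of UTF-8 bytes that range implies.
SAMPLES = {
    "A": 1,           # U+0041, in \x00-\x7f
    "\u00e9": 2,      # U+00E9, in \x80-\x{7ff}
    "\u20ac": 3,      # U+20AC, in \x{800}-\x{ffff}
    "\U0001f600": 4,  # U+1F600, in \x{10000}-\x{1fffff}
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of ch."""
    return len(ch.encode("utf-8"))


for ch, expected in SAMPLES.items():
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s), expected {expected}")
```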
6991 closing square bracket. A closing square bracket on its own is not spe-
7010 class that starts with a circumflex is not an assertion; it still con-
7016 letters in a class represent both their upper case and lower case ver-
7022 special way when matching character classes, whatever line-ending
7037 sequences, they cause an error. The same is true for \N when not fol-
7040 The minus (hyphen) character can be used to specify a range of charac-
7041 ters in a character class. For example, [d-m] matches any letter
7046 example, [b-d-z] matches letters in the range b to d, a hyphen charac-
7056 It is not possible to have the literal character "]" as the end charac-
7057 ter of a range. A pattern such as [W-]46] is interpreted as a class of
7058 two characters ("W" and "-") followed by a literal string "46]", so it
7059 would match "W46]" or "-46]". However, if the "]" is escaped with a
7060 backslash it is interpreted as the end of range, so [W-\]46] is inter-
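
The difference between the unescaped and escaped forms can be tried in any Perl-style engine. The sketch below uses Python's re module as a stand-in, on the assumption that it parses these two classes the same way PCRE2 does; the variable names are illustrative.

```python
import re

# Unescaped "]" closes the class early: the class holds "W" and "-",
# and "46]" follows as a literal string.
unescaped = re.compile(r"[W-]46]")

# Escaped "]" is the end of the range W..], so the whole pattern is a
# single class containing that range plus "4" and "6".
escaped = re.compile(r"[W-\]46]")

print(bool(unescaped.fullmatch("W46]")), bool(unescaped.fullmatch("-46]")))
print(bool(escaped.fullmatch("Z")), bool(escaped.fullmatch("W46]")))
```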
7065 Ranges normally include all code points between the start and end char-
7067 numerically, for example [\000-\037]. Ranges can include any characters
7068 that are valid for the current mode. In any UTF mode, the so-called
7071 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How-
7072 ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7076 points are both specified as literal letters in the same case. For com-
7078 letters are omitted. For example, [h-k] matches only four characters,
7081 [\x88-\x92] or [h-\x92], all code points are included.
7084 it matches the letters in either case. For example, [W-c] is equivalent
7085 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
7086 character tables for a French locale are in use, [\xc8-\xcb] matches
7100 special compatibility feature - see the next two sections), and the
7101 terminating closing square bracket. However, escaping other non-
7118 ascii character codes 0 - 127
7132 CR (13), and space (32). If locale-specific matching is taking place,
7142 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7147 the POSIX character classes, although this may be different for charac-
7148 ters in the range 128-255 when locale-specific matching is happening.
7168 when printed. In Unicode property terms, it matches all char-
7173 U+2066 - U+2069 Various "isolate"s
7180 [:punct:] This matches all characters that have the Unicode P (punctua-
7201 that \b matches at the start and the end of a word (see "Simple asser-
7202 tions" above), and in a Perl-style pattern the preceding or following
7204 assertions that are used above in order to give exactly the POSIX be-
7228 enclosed between "(?" and ")". These options are Perl-compatible, and
7229 are described in detail in the pcre2api documentation. The option let-
7239 For example, (?im) sets caseless, multiline matching. It is also possi-
7241 hyphen, for example (?-im). The two "extended" options are not indepen-
7244 A combined setting and unsetting such as (?im-sx), which sets
7248 the option is unset. An empty options setting "(?)" is allowed. Need-
7252 the above options to be unset. Thus, (?^) is equivalent to (?-imnsx).
7253 Letters may follow the circumflex to cause some options to be re-
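
As a rough illustration of an inline option setting, the sketch below uses Python's re module, which accepts the same (?im) spelling at the start of a pattern. Python does not support the (?^) form, so only the combined setting is shown; the subject text is made up.

```python
import re

text = "first line\nABC\nlast line"

# (?im) sets caseless and multiline matching, so ^ and $ match at the
# internal line boundaries and "abc" matches "ABC".
with_options = re.search(r"(?im)^abc$", text)

# Without the options, ^ and $ match only at the ends of the subject and
# the comparison is case-sensitive, so there is no match.
without_options = re.search(r"^abc$", text)

print(with_options.group() if with_options else None)
print(without_options)
```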
7256 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
7257 changed in the same way as the Perl-compatible options by using the
7269 not used). By this means, options can be made to have different set-
7270 tings in different parts of the pattern. Any changes made in one alter-
7282 start of a non-capturing subpattern (see the next section), the option
7290 Note: There are other PCRE2-specific options that can be set by the
7291 application when the compiling function is called. The pattern can con-
7327 the captured substrings are "red king", "red", and "king", and are num-
7333 by a question mark and a colon, the subpattern does not do any captur-
7344 start of a non-capturing subpattern, the option letters may appear
7361 starts with (?| and is itself a non-capturing subpattern. For example,
7366 Because the two alternatives are inside a (?| group, both sets of cap-
7370 not all, of one of a number of alternatives. Inside a (?| group, paren-
7373 subpattern start after the highest number used in any branch. The fol-
7374 lowing example is taken from the Perl documentation. The numbers under-
7377 # before ---------------branch-reset----------- after
7393 A relative reference such as (?-1) is no different: it is just a conve-
7396 If a condition test for a subpattern's having matched refers to a non-
7397 unique number, the test is true if any of the subpatterns of that num-
7406 Identifying capturing parentheses by number is simple, but it can be
7407 very hard to keep track of the numbers in complicated patterns. Fur-
7409 with this difficulty, PCRE2 supports the naming of capturing subpat-
7417 must start with a non-digit. References to capturing parentheses from
7418 other parts of the pattern, such as backreferences, recursion, and con-
7422 exactly as if the names were not present. In both PCRE2 and Perl, cap-
7425 for extracting the complete name-to-number translation table from a
7426 compiled pattern, as well as convenience functions for extracting cap-
7442 number to be associated with more than one name. The example above pro-
7443 vokes a compile-time error. However, there is still scope for confu-
7453 By default, a name must be unique within a pattern, except that dupli-
7459 The duplicate name constraint can be disabled by setting the PCRE2_DUP-
7463 a weekday, either as a 3-letter abbreviation or as the full name, and
7478 problem is to use a "branch reset" subpattern, as described in the pre-
7481 If you make a backreference to a non-unique named subpattern from else-
7484 first one that is set is used for the reference. For example, this pat-
7490 If you make a subroutine call to a non-unique named subpattern, the one
7519 The general repetition quantifier specifies a minimum and maximum num-
7540 the syntax of a quantifier, is taken as a literal character. For exam-
7545 of which is represented by a two-byte sequence in a UTF-8 string. Simi-
7551 the previous item and the quantifier were not present. This may be use-
7557 For convenience, the three most common quantifiers have single-charac-
7573 subpattern does in fact match no characters, the loop is forcibly bro-
7594 and instead matches the minimum number of times possible, so the pat-
7621 (equivalent to Perl's /s) is set, thus allowing the dot to match new-
7628 In cases where it is known that the subject string contains no new-
7629 lines, it is worth setting PCRE2_DOTALL in order to obtain this opti-
7639 If the subject is "xyz123abc123" the match point is the fourth charac-
7642 Another case where implicit anchoring is not applied is when the lead-
7648 It matches "ab" in the subject "aab". The use of the backtracking con-
7652 When a capturing subpattern is repeated, the value captured is the sub-
7659 the corresponding captured values may have been set in previous itera-
7671 to be re-evaluated to see if a different number of repeats allows the
7687 to be re-evaluated in this way.
7695 This kind of parenthesis "locks up" the part of the pattern it con-
7704 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
7706 must swallow everything it can. So, while both \d+ and \d+? are pre-
7732 The possessive quantifier syntax is an extension to the Perl 5.8 syn-
7739 simple pattern constructs. For example, the sequence A+B is treated as
7741 when B must follow. This feature can be disabled by the PCRE2_NO_AUTO-
7751 matches an unlimited number of substrings that either consist of non-
7762 when a single character is used. They remember the last single charac-
7769 sequences of non-digits cannot be broken, and failure happens quickly.
7775 0 (and possibly further digits) is a backreference to a capturing sub-
7781 there are not that many capturing left parentheses in the entire pat-
7783 to the left of the reference for numbers less than 8. A "forward back-
7785 and the subpattern to the right has participated in an earlier itera-
7791 See the subsection entitled "Non-printing characters" above for further
7805 An unsigned number specifies an absolute reference without the ambigu-
7810 (abc(def)ghi)\g{-1}
7812 The sequence \g{-1} is a reference to the most recently started capturing
7813 subpattern before \g, that is, it is equivalent to \2 in this example.
7814 Similarly, \g{-2} would be equivalent to \1. The use of relative
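
Python's re module does not accept the \g{-1} syntax, so the sketch below uses the absolute form \2 that the text gives as its equivalent for this pattern. It is an illustration of the equivalence, not of PCRE2's own parser.

```python
import re

# Group 1 is (abc(def)ghi) and group 2 is (def); in this pattern the
# relative reference \g{-1} corresponds to the absolute reference \2.
pattern = re.compile(r"(abc(def)ghi)\2")

m = pattern.fullmatch("abcdefghidef")
print(m.group(1), m.group(2))
```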
7823 A backreference matches whatever actually matched the capturing subpat-
7832 time of the backreference, the case of letters is relevant. For exam-
7856 subpattern has not actually been used in a particular match, any back-
7862 the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
7865 Because there may be many capturing parentheses in a pattern, all dig-
7866 its following a backslash are taken as part of a potential backrefer-
7877 matches. However, such references can be useful inside repeated sub-
7882 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
7886 the backreference. This can be done using alternation, as in the exam-
7898 current matching point that does not consume any characters. The simple
7915 referenced in the usual way. For example, a sequence such as (.)\g{-1}
7922 retained after a successful negative assertion. When an assertion con-
7925 For a positive assertion, internally captured substrings in the suc-
7926 cessful branch are retained, and matching continues with the next pat-
7938 useful. However, an assertion that forms the condition for a condi-
7939 tional subpattern may not be quantified. In practice, for other asser-
7948 tried with and without the assertion, the order depending on the greed-
7962 matches a word followed by a semicolon, but does not include the semi-
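
A lookahead of this kind can be sketched with Python's re module, which uses the same (?=...) syntax. The pattern \w+(?=;) and the subject string are chosen here for illustration; the point is that the semicolon is tested but not consumed.

```python
import re

subject = "alpha beta; gamma"

# \w+ matches a word; (?=;) asserts that a semicolon follows without
# making it part of the match.
m = re.search(r"\w+(?=;)", subject)

print(m.group())        # the matched word
print(subject[m.end()])  # the character just after the match
```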
7992 strings it matches must have a fixed length. However, if there are sev-
7993 eral top-level alternatives, they do not all have to have the same
8009 is not permitted, because its single top-level branch can match two
8011 two top-level branches:
8016 of a lookbehind assertion to get round the fixed-length restriction.
8020 then try to match. If there are insufficient characters before the cur-
8023 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
8026 the lookbehind. The \X and \R escapes, which can match different num-
8030 lookbehinds, as long as the subpattern matches a fixed-length string.
8046 assertions to specify efficient matching of fixed-length strings at the
8047 end of subject strings. Consider a simple pattern such as
8052 proceeds from left to right, PCRE2 will look for each "a" in the sub-
8067 quantifier; it can match only the entire string. The subsequent lookbe-
8082 three characters are not "999". This pattern does not match "foo" pre-
8084 three of which are not "999". For example, it doesn't match "123abc-
8108 It is possible to cause the matching process to obey a subpattern con-
8110 on the result of an assertion, or whether a specific capturing subpat-
8114 (?(condition)yes-pattern)
8115 (?(condition)yes-pattern|no-pattern)
8117 If the condition is satisfied, the yes-pattern is used; otherwise the
8118 no-pattern (if present) is used. An absent no-pattern is equivalent to
8119 an empty string (it always matches). If there are more than two alter-
8120 natives in the subpattern, a compile-time error occurs. Each of the two
8121 alternatives may itself contain nested subpatterns of any form, includ-
8129 There are five kinds of condition: references to subpatterns, refer-
8130 ences to recursion, two pseudo-conditions called DEFINE and VERSION,
8136 the condition is true if a capturing subpattern of that number has pre-
8139 numbers), the condition is true if any of them have matched. An alter-
8142 most recently opened parentheses can be referenced by (?(-1), the next
8143 most recent by (?(-2), and so on. Inside loops it can also make sense
8146 is not used; it provokes a compile-time error.)
8148 Consider the following pattern, which contains non-significant white
8155 character is present, sets it as the first captured substring. The sec-
8160 yes-pattern is executed and a closing parenthesis is required. Other-
8161 wise, since no-pattern is not present, the subpattern matches nothing.
8162 In other words, this pattern matches a sequence of non-parentheses,
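
The behaviour described here can be sketched with Python's re module, which supports numbered conditions of the form (?(1)...). The pattern below is written in the extended (whitespace-insensitive) style and is an illustration of the technique, on the assumption that Python and PCRE2 agree for this simple case.

```python
import re

pattern = re.compile(
    r"""
    ( \( )?      # optional opening parenthesis, captured as group 1
    [^()]+       # one or more characters that are not parentheses
    (?(1) \) )   # closing parenthesis required only if group 1 matched
    """,
    re.VERBOSE,
)

for candidate in ["(abc)", "abc", "(abc", "abc)"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
```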
8168 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
8179 the letter R followed by digits are ambiguous (see the following sec-
8192 "Recursion" in this sense refers to any subroutine-like call from one
8193 part of the pattern to another, whether or not it is actually recur-
8231 be only one alternative in the subpattern. It is always skipped if con-
8233 can be used to define subroutines that can be referenced from else-
8234 where. (The use of subroutines is described below.) For example, a pat-
8235 tern to match an IPv4 address such as "192.168.23.245" could be written
8238 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
8243 an IPv4 address (a number less than 256). When matching takes place,
8246 to match the four dot-separated components of an IPv4 address, insist-
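
Python's re module supports neither (?(DEFINE)...) nor subroutine calls, so the sketch below emulates the idea by defining the "byte" sub-pattern once as a string and substituting it wherever it is needed. The names BYTE, IPV4, and is_ipv4 are illustrative, and re.ASCII is an added assumption so that \d matches only 0-9.

```python
import re

# The "byte" component: a number less than 256, using the same
# alternatives as the named group in the DEFINE pattern.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four dot-separated components with a word boundary at each end.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def is_ipv4(text: str) -> bool:
    """Return True if the whole of text is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None


print(is_ipv4("192.168.23.245"))
print(is_ipv4("256.1.1.1"))
```

Substituting the string repeats the sub-pattern in the compiled expression, whereas a DEFINE group is compiled once and called; the set of strings matched is intended to be the same.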
8251 Programs that link with a PCRE2 library can check the version by call-
8253 that do not have access to the underlying code cannot do this. A spe-
8269 assertion. Consider this pattern, again containing non-significant
8272 (?(?=[^a-z]*[a-z])
8273 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
8276 optional sequence of non-letters followed by a letter. In other words,
8280 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
8285 for both positive and negative assertions, because matching always con-
8286 tinues after the assertion, whether it succeeds or fails. (Compare non-
8306 at the start of the pattern, as described in the section entitled "New-
8310 when PCRE2_EXTENDED is set, and the default newline convention (a sin-
8329 For some time, Perl has provided a facility that allows regular expres-
8341 Instead, it supports special syntax for recursion of the entire pat-
8342 tern, and also for individual subpattern recursion. After its introduc-
8349 subpattern. (If not, it is a non-recursive subroutine call, which is
8359 substrings which can either be a sequence of non-parentheses, or a
8360 recursive match of the pattern itself (that is, a correctly parenthe-
8362 of a possessive quantifier to avoid backtracking into sequences of non-
8375 of (?1) in the pattern above you can write (?-2) to refer to the second
8380 Be aware however, that if duplicate subpattern numbers are in use, rel-
8384 (?|(a)|(b)) (c) (?-2)
8387 group (c) is number 2. When the reference (?-2) is encountered, the
8396 because the reference is not inside the parentheses that are refer-
8397 enced. They are always non-recursive subroutine calls, as described in
8401 for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup-
8409 The example pattern that we have been looking at contains nested unlim-
8411 strings of non-parentheses is important when applying the pattern to
8423 callout function can be used (see below and the pcre2callout documenta-
8429 which is the last value taken on at the top level. If a capturing sub-
8435 recursion. Consider this pattern, which matches text in angle brack-
8437 brackets (that is, when recursing), whereas any characters are permit-
8443 two different alternatives for the recursive and non-recursive cases.
8453 never re-entered, even if it contained untried alternatives and there
8458 treated as atomic. That is, they can be re-entered to try unused alter-
8473 match fails. If you want to match typical palindromic phrases, the pat-
8474 tern has to ignore all non-word characters, which can be done like
8480 such as "A man, a plan, a canal: Panama!". Note the use of the posses-
8481 sive quantifier *+ to avoid backtracking into sequences of non-word
8489 next section), it had no access to any values that were captured out-
8513 (...(relative)...)...(?-1)...
8534 Processing options such as case-independence are fixed when a subpat-
8538 (abc)(?i:(?-1))
8550 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
8553 possibly recursively. Here are two of the examples used above, rewrit-
8562 (abc)(?i:\g<-1>)
8573 This makes it possible, amongst other things, to extract different sub-
8574 strings that match the same pair of parentheses when there is a repeti-
8577 PCRE2 provides a similar feature, but of course it cannot obey arbi-
8582 passed, or if the callout entry point is set to NULL, callouts are dis-
8594 During matching, when PCRE2 reaches a callout point, the external func-
8601 time, and one side-effect is that sometimes callouts are skipped. If
8617 They are all numbered 255. If there is a conditional group in the pat-
8629 A delimited string may be used instead of a number as a callout argu-
8631 ending delimiter is the same as the start, except for {, where the end-
8653 PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati-
8659 and sequences such as \x{100} that define character code points. Char-
8665 names is skipped, and #-comments are recognized, exactly as in the rest
8669 The maximum length of a name is 255 in the 8-bit library and 65535 in
8670 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
8672 the colon were not there. Any number of these verbs may occur in a pat-
8676 them can be used only when the pattern is to be matched using the tra-
8683 subpatterns called as subroutines (whether or not recursively) is docu-
8693 course, be processed. You can suppress the start-of-match optimizations
8694 by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
8711 then continues at the outer level. If (*ACCEPT) is triggered in a posi-
8715 If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
8720 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
8729 are not present in PCRE2. The nearest equivalent is the callout fea-
8752 When a match succeeds, the name of the last-encountered (*MARK:NAME) on
8753 the matching path is passed back to the caller as described in the sec-
8754 tion entitled "Other information about the match" in the pcre2api docu-
8775 The (*MARK) name is tagged with "MK:" in this output, and in this exam-
8777 efficient way of obtaining this information than putting each alterna-
8781 true, the name is recorded and passed back if it is the last-encoun-
8803 The following verbs do nothing when they are encountered. Matching con-
8805 causing a backtrack to the verb, a failure is forced. That is, back-
8809 group has been matched, there is never any backtracking into it. Back-
8813 These verbs differ in exactly what kind of failure occurs when back-
8815 when the verb is not in a subroutine or an assertion. Subsequent sec-
8821 matching failure that causes backtracking to reach it. Even if the pat-
8824 verb that is encountered, once it has been passed pcre2_match() is com-
8833 The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM-
8834 MIT). It is like (*MARK:NAME) in that the name is remembered for pass-
8845 anchor, unless PCRE2's start-of-match optimizations are turned off, as
8867 the subject if there is a later matching failure that causes backtrack-
8872 right, backtracking cannot cross (*PRUNE). In simple cases, the use of
8873 (*PRUNE) is just an alternative to an atomic group or possessive quan-
8887 character, but to the position in the subject where (*SKIP) was encoun-
8896 skips on to start the next attempt at "c". Note that a possessive quan-
8907 found, the "bumpalong" advance is to the subject position that corre-
8913 atomic groups or assertions, because they are never re-entered by back-
8931 backtracks, and this causes a new matching attempt to start at the sec-
8942 This verb causes a skip to the next innermost alternative when back-
8945 that it can be used for a pattern-based if-then-else block:
8952 into COND1. If that succeeds and BAR fails, COND3 is tried. If subse-
8953 quently BAZ fails, there are no more alternatives, so there is a back-
8979 failure in C, matching moves to (*FAIL), which causes the whole subpat-
9010 that is backtracked onto first acts. For example, consider this pat-
9068 in a standalone positive assertion. In a conditional positive asser-
9070 or (*PRUNE) causes the condition to be false. However, for both stand-
9072 (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9077 These behaviours occur whether or not the subpattern is called recur-
9081 match to succeed without any further processing. Matching then contin-
9089 when triggered by being backtracked to in a subpattern called as a sub-
9108 University Computing Service
9115 Copyright (c) 1997-2018 University of Cambridge.
9116 ------------------------------------------------------------------------------
9124 PCRE2 - Perl-compatible regular expressions (revised API)
9128 Two aspects of performance are discussed below: memory usage and pro-
9136 code, so that most simple patterns do not use much memory for storing
9139 subpattern has a quantifier with a minimum greater than 1 and/or a lim-
9153 is not usually a problem. However, if the numbers are large, and par-
9155 an embarrassment. For example, the very simple pattern
9159 uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
9161 limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
9162 libraries, and this is reached with the above pattern if the outer rep-
9168 of PCRE2's "subroutine" facility. Re-writing the above pattern as
9174 this kind of pattern is not always exactly equivalent, because any cap-
9177 process patterns that PCRE2 cannot otherwise handle. The matching per-
9179 same. (This applies from release 10.30 - things were different in ear-
9185 From release 10.30, the interpretive (non-JIT) version of pcre2_match()
9186 uses very little system stack at run time. In earlier releases recur-
9188 cause problems, but this usage has been eliminated. Backtracking posi-
9194 used. Rewriting patterns to be time-efficient, as described below, may
9203 has been re-factored to use heap memory when necessary for internal
9214 Certain items in regular expression patterns are processed more effi-
9216 [aeiou] than a set of single-character alternatives such as
9224 slow, because PCRE2 has to use a multi-stage table lookup whenever it
9234 pcre2_match(); the performance loss is less with a DFA matching func-
9237 When a pattern begins with .* not in atomic parentheses, nor in paren-
9241 multiple top-level branches, they must all be anchorable. The optimiza-
9247 subject string contains newlines, the pattern may match from the char-
9258 If you are using such a pattern with subject strings that do not con-
9261 explicit anchoring. That saves PCRE2 from having to scan along the sub-
9278 An optimization catches some of the more simple cases such as
9283 matching procedure, PCRE2 checks that there is a "b" later in the sub-
9284 ject string, and if there is not, it fails the match immediately. How-
9295 an atomic group or a possessive quantifier. This can often reduce mem-
9306 matched character. For a long string, a lot of memory is required. Con-
9312 This runs much faster, because sequences of characters that do not con-
9313 tain "<" are "swallowed" in one item inside the parentheses, and a pos-
9315 non-"<" characters. This version also uses a lot less memory because
9343 University Computing Service
9350 Copyright (c) 1997-2018 University of Cambridge.
9351 ------------------------------------------------------------------------------
9359 PCRE2 - Perl-compatible regular expressions (revised API)
9379 This set of functions provides a POSIX-style API for the PCRE2 regular
9380 expression 8-bit library. See the pcre2api documentation for a descrip-
9381 tion of PCRE2's native API, which contains much additional functional-
9382 ity. There are no POSIX-style wrappers for PCRE2's 16-bit and 32-bit
9388 called libpcre2-posix.a, so can be accessed by adding -lpcre2-posix to
9391 -lpcre2-8.
9402 PCRE2-specific features via the POSIX calling interface or to add BSD
9406 POSIX-like in style. The syntax and semantics of the regular expres-
9408 various PCRE2 options, as described below. "POSIX-like in style" means
9410 POSIX-compatible, and in multi-unit encoding domains it is probably
9416 two structure types, regex_t for compiled internal forms, and reg-
9417 match_t for returning captured substrings. It also defines some con-
9427 structure that is used as a base for storing information about the com-
9449 the defined POSIX behaviour for REG_NEWLINE (see the following sec-
9455 for compilation to the native function. This disables all meta charac-
9464 for matching, the nmatch and pmatch arguments are ignored, and no cap-
9496 all data strings used for matching it to be treated as UTF-8 strings.
9504 It does not affect the way newlines are matched by the dot metacharac-
9507 The yield of regcomp() is zero on success, and non-zero otherwise. The
9513 NOTE: If the yield of regcomp() is non-zero, you must not attempt to
9520 This area is not simple, because POSIX and Perl take different views of
9534 This is the equivalent table for a POSIX-compatible pattern matcher:
9552 action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg-
9554 and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL-
9567 The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
9574 standard. However, setting this option can give more POSIX-like behav-
9579 The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
9598 intended to be portable to other systems. Note that a non-zero rm_so
9620 Unused entries in the array have both structure members set to -1.
9629 The regerror() function maps a non-zero errorcode from either regcomp()
9633 the first errbuf_size - 1 characters of the error message are used. The
9641 Compiling a regular expression causes memory to be allocated and asso-
9643 memory, after which preg may no longer be used as a compiled expres-
9650 University Computing Service
9657 Copyright (c) 1997-2017 University of Cambridge.
9658 ------------------------------------------------------------------------------
9666 PCRE2 - Perl-compatible regular expressions (revised API)
9670 A simple, complete demonstration program to get you started with using
9674 can save this listing to re-create the contents of pcre2demo.c.
9679 used. If matching succeeds, the program outputs the portion of the sub-
9680 ject that matched, together with the contents of any captured sub-
9683 If the -g option is given on the command line, the program then goes on
9685 subject string. The logic is a little bit tricky because of the possi-
9689 The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit
9690 library. It handles strings and characters that are stored in 8-bit
9693 treated as UTF-8 strings, where characters may occupy multiple code
9697 for your operating system, you should be able to compile the demonstra-
9700 cc -o pcre2demo pcre2demo.c -lpcre2-8
9703 to the command line. For example, on a Unix-like system that has PCRE2
9707 cc -o pcre2demo -I/usr/local/include pcre2demo.c \
9708 -L/usr/local/lib -lpcre2-8
9710 Once you have built the demonstration program, you can run simple tests
9714 ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
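
For comparison, the effect of searching a subject for every match of a pattern can be sketched in a few lines of Python. This uses Python's re module, not PCRE2, and the helper name all_matches is invented for illustration; it does not reproduce the output format of pcre2demo.

```python
import re


def all_matches(pattern: str, subject: str) -> list[str]:
    """Return every non-overlapping match of pattern in subject, in order."""
    return [m.group(0) for m in re.finditer(pattern, subject)]


print(all_matches("cat|dog", "the dog sat on the cat"))
```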
9718 expressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
9719 though not all three need be installed). The pcre2demo program is pro-
9720 vided as a relatively simple coding example.
9726 ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
9729 This is caused by the way shared library support works on those sys-
9732 -R/usr/local/lib
9740 University Computing Service
9747 Copyright (c) 1997-2016 University of Cambridge.
9748 ------------------------------------------------------------------------------
9754 PCRE2 - Perl-compatible regular expressions (revised API)
9756 SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
9773 run. However, if you are using the just-in-time optimization feature,
9774 it is not possible to save and reload the JIT data, because it is posi-
9775 tion-dependent. The host on which the patterns are reloaded must be
9778 For example, patterns compiled on a 32-bit system using PCRE2's 16-bit
9779 library cannot be reloaded on a 64-bit system, nor can they be reloaded
9780 using the 8-bit library.
9787 linked with a fixed version of PCRE2 must be prepared to recompile pat-
9796 arbitrary external sources. There is only some simple consistency
9797 checking, not complete validation of what is being re-loaded. Corrupted
9809 in the byte stream (its size is 1088 bytes). For more details of char-
9810 acter tables, see the section on locale support in the pcre2api docu-
9816 the length of the vector. The third and fourth arguments point to vari-
9830 PCRE2_ERROR_BADMAGIC means either that a pattern's code has been cor-
9831 rupted, or that a slot in the vector does not point to a compiled pat-
9855 between binary and non-binary data, be sure that the file is opened for
9860 freed in the usual way by calling pcre2_code_free(). When you have fin-
9861 ished with the byte stream, it too must be freed by calling pcre2_seri-
9866 RE-USING PRECOMPILED PATTERNS
9868 In order to re-use a set of saved patterns you must first make the
9869 serialized byte stream available in main memory (for example, by read-
9884 If this argument is NULL, malloc() and free() are used. After deserial-
9913 and a reference count is used to arrange for its memory to be automati-
9920 If a pattern was processed by pcre2_jit_compile() before being serial-
9929 University Computing Service
9936 Copyright (c) 1997-2018 University of Cambridge.
9937 ------------------------------------------------------------------------------
9945 PCRE2 - Perl-compatible regular expressions (revised API)
9949 The full syntax and semantics of the regular expressions that are sup-
9951 document contains a quick-reference summary of the syntax.
9956 \x where x is non-alphanumeric is a literal x
9965 \cx "control-x", where x is any ASCII printing character
9980 Note that \0dd is always an octal code. The treatment of backslash fol-
9981 lowed by a non-zero digit is complicated; for details see the section
9982 "Non-printing characters" in the pcre2pattern documentation, where
9989 read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
9991 matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol-
10013 \W a "non-word" character
10017 middle of a UTF-8 or UTF-16 character. The application can lock out the
10021 By default, \d, \s, and \w match only ASCII characters, even in UTF-8
10022 mode or in the 16-bit and 32-bit libraries. However, if locale-specific
10024 points in the range 128-255. If the PCRE2_UCP option is set, the behav-
10049 Mn Non-spacing mark
10082 Xuc Universally-named character: one that can be
10086 Perl and POSIX space are now the same. Perl added VT to its space char-
10092 Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
10094 Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
10100 Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
10102 Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
10106 Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
10107 ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
10110 Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
10113 Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
10121 [x-y] range (can be used for hex characters)
10127 ascii 0-127
10165 ANCHORS AND SIMPLE ASSERTIONS
10200 (?:...) non-capturing group
10201 (?|...) non-capturing group; reset group numbers for
10207 (?>...) atomic, non-capturing group
10227 (?-...) unset option(s)
10231 a mixture of setting and unsetting such as (?i-x) is allowed, but there
10233 for example (?^in). An option setting may appear at the start of a non-
10245 (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
10248 (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
10289 Each top-level branch of a lookbehind must be of a fixed length.
10298 \g-n relative reference by number
10300 \g{-n} relative reference by number
10313 (?-n) call subpattern by relative number
10322 \g<-n> call subpattern by relative number (PCRE2 extension)
10323 \g'-n' call subpattern by relative number (PCRE2 extension)
10328 (?(condition)yes-pattern)
10329 (?(condition)yes-pattern|no-pattern)
10333 (?(-n) relative reference condition
10361 The following act only when a subsequent match failure causes a back-
10363 what happens afterwards. Those that advance the start-of-match point do
10398 University Computing Service
10405 Copyright (c) 1997-2018 University of Cambridge.
10406 ------------------------------------------------------------------------------
10414 PCRE2 - Perl-compatible regular expressions (revised API)
10420 in UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
10425 (*UTF). When either of these is the case, both the pattern and any sub-
10427 instead of strings of individual one-code-unit characters. There are
10428 also some other changes to the way characters are handled, as docu-
10443 names for properties are supported. For example, \p{L} matches a let-
10445 Perl, many properties may optionally be prefixed by "Is", for compati-
10458 allowed in non-UTF modes.
10468 multi-unit characters (see the description of \C in the pcre2pattern
10472 pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
10474 modes provokes a match-time error. Also, the JIT optimization does not
10475 support \C in these modes. If JIT optimization is requested for a UTF-8
10476 or UTF-16 pattern that contains \C, it will not succeed, and so when
10483 set as in non-UTF mode, all with code points less than 256. This
10489 Alternatively, if you set the PCRE2_UCP option, the way that the char-
10495 all low-valued characters, unless the PCRE2_UCP option is set.
10498 escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
10502 CASE-EQUIVALENCE IN UTF MODES
10504 Case-insensitive matching in a UTF mode makes use of Unicode properties
10506 at most two case-equivalent values. For these, a direct table lookup is
10508 than two code points that are case-equivalent, and these are treated as
10521 UTF-16 and UTF-32 strings can indicate their endianness by special code
10522 known as a byte-order mark (BOM). The PCRE2 functions do not handle
10526 case of pcre2_match() and pcre2_dfa_match() calls with a non-zero
10530 end of the subject. If there are no lookbehind assertions in the pat-
10534 the starting offset. Note that the sequences \b and \B are one-charac-
10539 the surrogate area. The so-called "non-character" code points are not
10544 UTF-16, where they are used in pairs to encode code points with values
10545 greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
10546 are available independently in the UTF-8 and UTF-32 encodings. (In
10547 other words, the whole surrogate thing is a fudge for UTF-16 which
10548 unfortunately messes up UTF-8 and UTF-32.)
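The pairing described above can be illustrated with a small sketch of standard UTF-16 surrogate-pair encoding. This is not PCRE2 code; the function name utf16_encode_pair is hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): encode a code point greater
   than 0xFFFF as a UTF-16 surrogate pair.
   Returns 1 on success, 0 if cp is outside 0x10000..0x10FFFF. */
int utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;                              /* 20 bits remain */
    *high = (uint16_t)(0xD800 | (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* bottom 10 bits */
    return 1;
}
```

Both halves of the pair always fall in 0xd800 to 0xdfff, which is why that range cannot be used for ordinary characters in any UTF encoding.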
10551 and therefore want to skip these checks in order to improve perfor-
10553 scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
10569 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
10570 sible only in UTF-8 and UTF-32 modes, because these values are not rep-
10571 resentable in UTF-16.
10573 Errors in UTF-8 strings
10575 The following negative error codes are given for invalid UTF-8 strings:
10583 The string ends with a truncated UTF-8 character; the code specifies
10584 how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
10585 characters to be no longer than 4 bytes, the encoding scheme (origi-
10607 A 4-byte character has a value greater than 0x10ffff; these code points
10612 A 3-byte character has a value in the range 0xd800 to 0xdfff; this
10613 range of code points is reserved by RFC 3629 for use with UTF-16, and
10614 so are excluded from UTF-8.
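As a rough sketch of this check, the following decodes a 3-byte UTF-8 sequence and tests the result against the surrogate range. The names utf8_decode3 and is_surrogate are hypothetical, this is not how PCRE2 itself is implemented, and the decoder assumes the lead and continuation bytes have already been verified.

```c
#include <stdint.h>

/* Hypothetical helper: decode a 3-byte UTF-8 sequence of the form
   1110xxxx 10xxxxxx 10xxxxxx. Assumes the byte patterns were
   already validated by the caller. */
uint32_t utf8_decode3(uint8_t b0, uint8_t b1, uint8_t b2)
{
    return ((uint32_t)(b0 & 0x0F) << 12) |
           ((uint32_t)(b1 & 0x3F) << 6)  |
            (uint32_t)(b2 & 0x3F);
}

/* Returns 1 if cp lies in the range reserved for UTF-16 surrogates. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}
```

For instance, the bytes 0xed, 0xa0, 0x80 decode to 0xd800, which a validator following RFC 3629 must reject.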
10622 A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
10624 For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
10630 binary value 0b10 (that is, the most significant bit is 1 and the sec-
10631 ond is 0). Such a byte can only validly occur as the second or subse-
10632 quent byte of a multi-byte character.
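The bit test described here can be sketched in one line of C. The function name is_continuation_byte is hypothetical and is not a PCRE2 function.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if the two most significant bits of b
   are 0b10, that is, b can only be the second or a subsequent byte of
   a multi-byte UTF-8 character. */
int is_continuation_byte(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}
```

Bytes 0x80 through 0xbf satisfy the test; an ASCII byte such as 0x41 or a lead byte such as 0xc3 does not.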
10637 can never occur in a valid UTF-8 string.
10639 Errors in UTF-16 strings
10641 The following negative error codes are given for invalid UTF-16
10649 Errors in UTF-32 strings
10651 The following negative error codes are given for invalid UTF-32
10661 University Computing Service
10668 Copyright (c) 1997-2018 University of Cambridge.
10669 ------------------------------------------------------------------------------