Lines Matching +full:- +full:- +full:without +full:- +full:perl

1 -----------------------------------------------------------------------------
8 -----------------------------------------------------------------------------
16 PCRE2 - Perl-compatible regular expressions (revised API)
22 pattern matching using the same syntax and semantics as Perl, with just
25 API is more extensible, and it was simplified by abolishing the sepa-
31 As well as Perl-style regular expression patterns, some features that
32 appeared in Python and the original PCRE before they appeared in Perl
35 requesting some minor changes that give better ECMAScript (aka Java-
38 The source code for PCRE2 can be compiled to support strings of 8-bit,
39 16-bit, or 32-bit code units, which means that up to three separate li-
42 64-bit environment that also supports 32-bit applications, versions of
43 PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.
45 The original work to extend PCRE to 16-bit and 32-bit code units was
48 unit, or as UTF-encoded Unicode, with support for Unicode general cate-
54 pcre2test -C
57 ending in _8, _16, or _32, respectively (for example, pcre2_com-
63 In addition to the Perl-compatible matching function, PCRE2 contains an
64 alternative function that matches the same compiled patterns in a dif-
69 Details of exactly which Perl regular expression features are and are
76 client to discover which features are available. The features them-
77 selves are described in the pcre2build page. Documentation about build-
79 NON-AUTOTOOLS_BUILD files in the source distribution.
92 If you are using PCRE2 in a non-UTF application that permits users to
95 For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
96 mode, which interprets patterns and subjects as strings of UTF-8 code
97 units instead of individual 8-bit characters. This causes both the pat-
98 tern and any data against which it is matched to be checked for UTF-8
99 validity. If the data string is very long, such a check might use suf-
100 ficiently many resources as to cause your application to lose perfor-
103 One way of guarding against this possibility is to use the pcre2_pat-
106 calling pcre2_compile(). This causes a compile time error if the pat-
107 tern contains a UTF-setting sequence.
110 be enabled from within the pattern, by specifying "(*UCP)". This fea-
118 The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
120 middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C op-
122 compile-time error if it is encountered. It is also possible to build
127 Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
130 pcre2_set_depth_limit() that can be used to restrict the amount of mem-
136 The user documentation for PCRE2 comprises a number of different sec-
142 (which is a program listing), and the short pages for individual func-
143 tions, are concatenated in pcre2.txt, for ease of searching. The sec-
147 pcre2-config show PCRE2 installation configuration information
151 pcre2compat discussion of Perl compatibility
154 pcre2grep description of the pcre2grep command (8-bit only)
155 pcre2jit discussion of just-in-time optimization support
162 pcre2posix the POSIX-compatible C API for the 8-bit library
186 Copyright (c) 1997-2021 University of Cambridge.
187 ------------------------------------------------------------------------------
195 PCRE2 - Perl-compatible regular expressions (revised API)
200 contains a description of all its native functions. See the pcre2 docu-
461 These functions provide a way of converting non-PCRE2 patterns into
462 patterns that can be processed by pcre2_compile(). This facility is ex-
468 PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
470 There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
473 for all three libraries. One, two, or all three can be installed simul-
474 taneously. On Unix-like systems the libraries are called libpcre2-8,
475 libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
498 macros are defined whose names are the generic forms such as pcre2_com-
500 PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
518 single library. For example, if you want to run a match using a pat-
524 their generic names, without the _8, _16, or _32 suffix.
530 There are also some wrapper functions for the 8-bit library that corre-
542 program against a non-dll PCRE2 library, you must define PCRE2_STATIC
546 and matching regular expressions in a Perl-compatible manner. A sample
553 passed as bits in an options argument. There are also some more compli-
554 cated parameters such as custom memory management functions and re-
559 Just-in-time (JIT) compiler support is an optional feature of PCRE2
561 speeds up the matching performance of many patterns. Programs can re-
568 pcre2_jit_stack_assign() in order to control the JIT code's memory us-
574 less sanity checking. The JIT-specific functions are discussed in the
577 A second matching function, pcre2_dfa_match(), which is not Perl-com-
581 there are lookaround assertions). However, this algorithm does not re-
600 pcre2_substring_free() and pcre2_substring_list_free() are also pro-
602 functions is called with a NULL argument, the function returns immedi-
603 ately without doing anything.
612 Finally, there are functions for finding out information about a com-
627 ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
635 strings: a single CR (carriage return) character, a single LF (line-
636 feed) character, the two-character sequence CRLF, any of the three pre-
645 Unix standard. However, the newline convention can be changed by an ap-
646 plication when calling pcre2_compile(), or it can be specified by spe-
648 settings. See the pcre2pattern page for details of the special charac-
654 dollar metacharacters, the handling of #-comments in /x mode, and, when
655 CRLF is a recognized line ending sequence, the match position advance-
656 ment for a non-anchored pattern. There is more detail about this in the
666 In a multithreaded application it is important to keep thread-specific
668 library code itself is thread-safe: it contains no static or global
669 variables. The API is designed to be fairly simple for non-threaded ap-
670 plications while at the same time ensuring that multithreaded applica-
673 There are several different blocks of data that are used to pass infor-
681 is thread-safe, that is, the same compiled pattern can be used by more
684 use them. However, if the just-in-time (JIT) optimization feature is
695 Get a read-only (shared) lock (mutex) for pointer
707 The reason for checking the pointer a second time is as follows: Sev-
723 Get a read-only (shared) lock (mutex) for pointer
736 If JIT is being used, but the JIT compilation is not being done immedi-
741 pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to ob-
742 tain a private copy of the compiled code before calling the JIT com-
751 a PCRE2 function without using lots of arguments. The parameters that
755 In a multithreaded application, if the parameters in a context are val-
758 it must make its own thread-specific copy.
763 of a match. This includes details of what was matched, as well as addi-
772 memory management or non-standard character tables. To keep function
776 that holds the parameter values. Applications that do not need to ad-
781 relevant for several PCRE2 operations, a compile-time context, and a
782 match-time context.
786 At present, this context just contains pointers to (and data for) ex-
806 function may be NULL, in which case the system memory management func-
809 might be.) The private_malloc() function is used (if supplied) to ob-
828 without doing anything.
832 A compile context is required if you want to provide an external func-
834 values of any of the following compile-time parameters:
843 A compile context is also required if you are using custom memory man-
844 agement. If none of these apply, just pass NULL as the context argu-
847 A compile context is created, copied, and freed by the following func-
875 only argument is a general context. This function builds a set of char-
881 As PCRE2 has developed, almost all the 32 option bits that are avail-
884 bits which are used for some newer, assumed rarer, options. This func-
886 It does not modify any existing setting. The available options are de-
896 largest number that a PCRE2_SIZE variable can hold, which is effec-
902 This specifies which characters or character sequences are to be recog-
905 two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
912 When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EX-
924 stops rogue patterns using up too much system stack when being com-
925 piled. The limit applies to parentheses of all kinds, not just captur-
941 nesting, and the second is user data that is set up by the last argu-
943 should return zero if all is well, or non-zero to force an error.
959 A match context is created, copied, and freed by the following func-
979 during a matching operation. Details are given in the pcre2callout doc-
986 This sets up a callout function for PCRE2 to call after each substitu-
987 tion made by pcre2_substitute(). Details are given in the section enti-
993 The offset_limit parameter limits how far an unanchored search can ad-
995 pcre2_match() and pcre2_dfa_match() functions return PCRE2_ERROR_NO-
1005 When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT op-
1007 code can be compiled. If a match is started with a non-default match
1015 the first line and also within the offset limit. In other words, which-
1024 also applies to pcre2_dfa_match(), which may use the heap when process-
1026 atomic groups. This limit does not apply to matching with the JIT opti-
1038 where ddd is a decimal number. However, such a setting is ignored un-
1044 pcre2_match() uses the heap are given in the pcre2perform documenta-
1047 For pcre2_dfa_match(), a vector on the system stack is used when pro-
1055 The match_limit parameter provides a means of preventing PCRE2 from us-
1069 When pcre2_match() is called with a pattern that was successfully pro-
1076 The default value for the limit can be set when PCRE2 is built; the de-
1083 where ddd is a decimal number. However, such a setting is ignored un-
1110 If the depth of internal recursive function calls is great enough, lo-
1111 cal workspace vectors are allocated on the heap from version 10.32 on-
1115 deal of memory. However, it is probably better to limit heap usage di-
1127 where ddd is a decimal number. However, such a setting is ignored un-
1132 CHECKING BUILD-TIME OPTIONS
1142 required. The second argument is a pointer to memory into which the in-
1151 non-negative on success, or the negative error code PCRE2_ERROR_BADOP-
1152 TION if the value in the first argument is not recognized. The follow-
1159 PCRE2_BSR_UNICODE means that \R matches any Unicode line ending se-
1166 unit widths were selected when PCRE2 was built. The 1-bit indicates
1167 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
1174 recursions, lookarounds, and atomic groups in pcre2_dfa_match(). Fur-
1187 just-in-time compiling is available; otherwise it is set to zero.
1195 compiler is configured, for example "x86 32bit (little endian + un-
1206 the 16-bit library is compiled, a value of 3 is rounded up to 4, and
1207 when the 32-bit library is compiled, internal linkages always use 4
1210 The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1246 The output is a uint32_t integer that gives the maximum depth of nest-
1250 take into account the stack that may already be used by the calling ap-
1256 This parameter is obsolete and should not be used in new code. The out-
1261 The output is a uint32_t integer that gives the length of PCRE2's char-
1270 without Unicode support, the buffer is filled with the text "Unicode
1286 PCRE2 version string, zero-terminated. The number of code units used is
1287 returned. This is the length of the string plus one unit for the termi-
1305 length (in code units). If the pattern is zero-terminated, the length
1307 pointer to a block of memory that contains the compiled pattern and re-
1310 If the compile context argument ccontext is NULL, memory for the com-
1311 piled pattern is obtained by calling malloc(). Otherwise, it is ob-
1312 tained from the same memory function that was used for the compile con-
1314 it is no longer needed. If pcre2_code_free() is called with a NULL ar-
1315 gument, it returns immediately, without doing anything.
1319 However, if the code has been processed by the JIT compiler (see be-
1320 low), the JIT information cannot be copied (because it is position-de-
1321 pendent). The new copy can initially be used only for non-JIT match-
1326 a multithreaded application to acquire a private copy of shared com-
1335 pointing to the new tables. The memory for the new tables is automati-
1347 described in the section entitled "Option bits for pcre2_match()" be-
1351 that affect the compilation. It should be zero if none of them are re-
1353 particular, those that are compatible with Perl, but some others as
1354 well) can also be set and unset from within the pattern (see the de-
1357 For those options that can be different in different parts of the pat-
1363 Some additional options and less frequently required compile-time pa-
1364 rameters (for example, the newline setting) can be provided in a com-
1367 If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1369 error code and an offset (number of code units) within the pattern, re-
1370 spectively, when pcre2_compile() returns NULL because a compilation er-
1373 There are nearly 100 positive error codes that pcre2_compile() may re-
1375 error codes that are used for invalid UTF strings when validity check-
1378 There is no separate documentation for the positive error codes, be-
1380 pcre2_get_error_message() function (see "Obtaining a textual error mes-
1381 sage" below) should be self-explanatory. Macro names starting with
1384 that returns the message "no error" if passed to pcre2_get_error_mes-
1387 The value returned in erroroffset is an indication of where in the pat-
1389 non-zero value is not necessarily the furthest point in the pattern
1392 assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of
1398 mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1401 This code fragment shows a typical straightforward call to pcre2_com-
1409 PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */
1427 only way to do it in Perl.
1431 By default, for compatibility with Perl, a closing square bracket that
1442 (1) \U matches an upper case "U" character; by default \U causes a com-
1443 pile time error (Perl uses \U to upper case subsequent characters).
1447 code point to match. By default, \u causes a compile time error (Perl
1452 code point to match. By default, as in Perl, a hexadecimal number is
1457 using the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile op-
1459 to patterns. Neither of these options affects the processing of re-
1468 Perl. If you want a multiline circumflex also to match after a termi-
1473 By default, for compatibility with Perl, the name in any verb sequence
1474 such as (*MARK:NAME) is any sequence of characters that does not in-
1476 it is not possible to include a closing parenthesis in the name. How-
1477 ever, if the PCRE2_ALT_VERBNAMES option is set, normal backslash pro-
1478 cessing is applied to verb names and only an unescaped closing paren-
1482 whitespace in verb names is skipped and #-comments are recognized, ex-
1488 items, all with number 255, before each pattern item, except immedi-
1489 ately before or after an explicit callout in the pattern. For discus-
1495 case letters in the subject. It is equivalent to Perl's /i option, and
1500 characters, K and S, that, in addition to their lower case ASCII equiv-
1501 alents, are case-equivalent with U+212A (Kelvin sign) and U+017F (long
1505 higher code points (available only in 16-bit or 32-bit mode) are
1511 at the end of the subject string. Without this option, a dollar also
1515 Perl, and no way to set it within a pattern.
1521 ever matches one character, even if newlines are coded as CRLF. Without
1522 this option, a dot does not match when the current position in the sub-
1523 ject is at a newline. This option is equivalent to Perl's /s option,
1524 and it can be changed within a pattern by a (?s) option setting. A neg-
1526 escape sequence always matches a non-newline character, independent of
1543 patterns, a new match is then tried at the next starting point. How-
1554 which is the only way to do it in Perl.
1558 matches, which are necessarily substrings of the first one, must obvi-
1563 If this bit is set, most white space characters in the pattern are to-
1567 {1,3}. Ignorable white space is permitted between an item and a follow-
1568 ing quantifier and between a quantifier and a following + that indi-
1569 cates possessiveness. PCRE2_EXTENDED is equivalent to Perl's /x option,
1572 When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recog-
1574 256 that are flagged as white space in its low-character table. The ta-
1581 When PCRE2 is compiled with Unicode support, in addition to these char-
1582 acters, five more Unicode "Pattern White Space" characters are recog-
1583 nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1584 right mark), U+200F (right-to-left mark), U+2028 (line separator), and
1586 recognized by Perl's /x option. Note that the horizontal and vertical
1590 As well as ignoring most white space, PCRE2_EXTENDED also causes char-
1597 Which characters are interpreted as newlines can be specified by a set-
1599 special sequence at the start of the pattern, as described in the sec-
1605 This option has the effect of PCRE2_EXTENDED, but, in addition, un-
1606 escaped space and horizontal tab characters are ignored inside a char-
1608 set of pattern white space characters that are ignored outside a char-
1609 acter class. PCRE2_EXTENDED_MORE is equivalent to Perl's /xx option,
1616 start of matching, though the matched text may continue over the new-
1617 line. If startoffset is non-zero, the limiting newline is not necessar-
1619 string is "abc\nxyz" (where \n represents a single-character newline) a
1628 If this option is set, all meta-characters in the pattern are disabled,
1631 you are doing a lot of literal matching and are worried about effi-
1636 PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EX-
1644 sequences. This facility is not supported for DFA matching. For de-
1651 alternative to fail). A pattern such as (\1)(a) succeeds when this op-
1653 fails by default, for Perl compatibility. Setting this option makes
1663 string, or before a terminating newline (except when PCRE2_DOLLAR_EN-
1665 character" metacharacter (.) does not match at a newline. This behav-
1666 iour (for ^, $, and dot) is the same as Perl.
1671 start and end. This is equivalent to Perl's /m option, and it can be
1674 subject, for compatibility with Perl. However, you can change this by
1681 This option locks out the use of \C in the pattern that is being com-
1682 piled. This escape can cause unpredictable behaviour in UTF-8 or
1683 UTF-16 modes, because it may leave the current matching point in the
1684 middle of a multi-code-unit character. This option may be useful in ap-
1686 is also a build-time option that permanently locks out the use of \C.
1700 This option locks out interpretation of the pattern as UTF-8, UTF-16,
1701 or UTF-32, depending on which library is in use. In particular, it pre-
1703 by starting the pattern with (*UTF). This option may be useful in ap-
1709 If this option is set, it disables the use of numbered capturing paren-
1713 is the same as Perl's /n option. Note that, when this option is set,
1720 If this option is set, it disables "auto-possessification", which is an
1723 are in use, auto-possessification means that some callouts are never
1731 .* is the first significant item in a top-level branch of a pattern,
1751 the matching code searches the subject for that value, and fails imme-
1752 diately if it cannot find it, without actually running the main match-
1756 items are in use, these "start-up" optimizations can cause them to be
1757 skipped if the pattern is never actually used. The start-up optimiza-
1758 tions are in effect a pre-scan of the subject that takes place before
1761 The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1774 start-up optimization scans along the subject, finds "A" and runs the
1775 first match attempt from there. The (*COMMIT) item means that the pat-
1780 (*COMMIT) prevents any further matches being tried, so the overall re-
1783 As another start-up optimization makes use of a minimum length for a
1790 match "BB", which is long enough. In the process, (*MARK:2) is encoun-
1792 found, but there is only one character left, so there are no more at-
1805 UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
1811 PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in-
1819 Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1820 able the error that is given if an escape sequence for an invalid Uni-
1821 code code point is encountered in the pattern. In particular, the so-
1825 section entitled "Extra compile options" below. However, this is pos-
1826 sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1827 resentable in UTF-16.
1834 PCRE2_UCP is set, Unicode properties are used instead to classify char-
1839 The second effect of PCRE2_UCP is to force the use of Unicode proper-
1841 greater than 127, even when PCRE2_UTF is not set. This makes it possi-
1842 ble, for example, to process strings in the 16-bit UCS-2 code. This op-
1850 not compatible with Perl. It can also be set by a (?U) option setting
1856 is going to be used to set a non-default offset limit in a match con-
1858 offset limit is set without this option. For more details, see the de-
1866 instead of single-code-unit strings. It is available when PCRE2 is
1868 support is not available, the use of this option provokes an error. De-
1881 assertions, following Perl's lead. This option is provided to re-enable
1887 This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
1888 It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
1890 in UTF-16 to encode code points with values in the range 0x10000 to
1891 0x10ffff. The surrogates cannot therefore be represented in UTF-16.
1892 They can be represented in UTF-8 and UTF-32, but are defined as invalid
1893 code points, and cause errors if encountered in a UTF-8 or UTF-32
1898 when using PCRE2 to check for unwanted characters in UTF-8 strings, ex-
1900 PCRE2_NO_UTF_CHECK option does not disable the error that occurs, be-
1903 If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
1904 gate code point values in UTF-8 and UTF-32 patterns no longer provoke
1912 \x in the way that ECMAScript (aka JavaScript) does. Additional func-
1915 as a hexadecimal character code, where hhh.. is any number of hexadeci-
1921 escape such as \j or a malformed one such as \x{2z} causes a compile-
1922 time error when detected by pcre2_compile(). Perl is somewhat inconsis-
1924 "j", and non-hexadecimal digits in \x{} are just ignored, though warn-
1925 ings are given in both cases if Perl's warning switch is enabled. How-
1927 Perl.
1931 treated as single-character escapes. For example, \j is a literal "j"
1932 and \x{2z} is treated as the literal string "x{2z}". Setting this op-
1937 is not supported in a character class. To reiterate: this is a danger-
1945 of a CR (carriage return) character. The option does not affect a lit-
1951 This option is provided for use by the -x option of pcre2grep. It
1953 automatically inserting the code for "^(?:" at the start of the com-
1955 the matched line may be in the middle of the subject string. This op-
1960 This option is provided for use by the -w option of pcre2grep. It
1968 JUST-IN-TIME (JIT) COMPILATION
1988 just-in-time compiler is available, further processes a compiled pat-
1994 for patterns to be analyzed, and for one-off matches and simple pat-
2010 code points are less than 256. By default, higher-valued code points
2016 \w and friends to use Unicode property support instead of the built-in
2017 tables. PCRE2_UCP also causes upper/lower casing operations on charac-
2025 PCRE2 contains a built-in set of character tables that are used by de-
2026 fault. These are sufficient for many applications. Normally, the in-
2029 default "C" locale of the local system, which may cause them to be dif-
2032 The built-in tables can be overridden by tables supplied by the appli-
2034 from the default. As more and more applications change to using Uni-
2055 The locale name "fr_FR" is used on Linux and other Unix-like systems;
2074 or whether the processor is 32-bit or 64-bit. A copy of the result of
2076 re-used later, even in a different program or on another computer. The
2081 used stand-alone to create a file that contains a set of binary tables.
2091 The first argument for pcre2_pattern_info() is a pointer to the com-
2093 is required, and the third argument is a pointer to a variable to re-
2097 the function is zero for success, or one of the following negative num-
2107 typical call of pcre2_pattern_info(), to obtain the length of the com-
2125 to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the op-
2126 tions that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
2127 TIONS returns the compile options as modified by any top-level (*XXX)
2130 compile context by calling the pcre2_set_compile_extra_options() func-
2133 For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EX-
2140 A pattern compiled without PCRE2_ANCHORED is automatically anchored by
2141 PCRE2 if the first significant item in every top-level branch is one of
2147 .* sometimes - see below
2159 For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
2190 been set, the call to pcre2_pattern_info() returns the error PCRE2_ER-
2197 In the absence of a single first code unit for a non-anchored pattern,
2198 pcre2_compile() may construct a 256-bit table that defines a fixed set
2202 means "any code unit of value 255 or above". If such a table was con-
2209 a non-anchored pattern. The third argument should point to a uint32_t
2221 The third argument should point to a uint32_t variable. In the 8-bit
2222 library, the value is always less than 256. In the 16-bit library the
2223 value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
2224 value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2231 without the use of JIT. The third argument should point to a size_t
2233 in the pattern. Each additional capture group adds two PCRE2_SIZE vari-
2246 \r or \n or one of the equivalent hexadecimal or octal escape se-
2252 (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2260 Return 1 if the (?J) or (?-J) option setting is used in the pattern,
2262 (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
2267 If the compiled pattern was successfully processed by pcre2_jit_com-
2287 PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2293 third argument should point to a uint32_t variable. When a pattern con-
2295 whether or not it can match an empty string. PCRE2 takes a cautious ap-
2301 (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third ar-
2303 set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UN-
2305 less than the limit set or defaulted by the caller of the match func-
2311 code units) when it starts to process each of its branches. This re-
2313 should point to a uint32_t integer. The simple assertions \b and \B re-
2314 quire a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to
2315 return 1 in the absence of anything longer. \A also registers a one-
2319 Note that this information is useful for multi-segment matching only if
2321 (?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is pro-
2323 character, then the nested lookbehind also moves back by two charac-
2325 at the start. PCRE2_INFO_MAXLOOKBEHIND is really only useful as a de-
2327 multi-segment matching.
2344 PCRE2 supports the use of named as well as numbered capturing parenthe-
2345 ses. The names are just an additional way of identifying the parenthe-
2347 pcre2_substring_get_byname() are provided for extracting captured sub-
2351 do the conversion, you need to use the name-to-number map, which is de-
2354 The map consists of a number of fixed-size entries. PCRE2_INFO_NAME-
2360 This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li-
2361 brary, the first two bytes of each entry are the number of the captur-
2362 ing parenthesis, most significant byte first. In the 16-bit library,
2363 the pointer points to 16-bit code units, the first of which contains
2364 the parenthesis number. In the 32-bit library, the pointer points to
2365 32-bit code units, the first of which contains the parenthesis number.
2369 capture groups with the same number, as described in the section on du-
2374 Duplicate names for capture groups with different numbers are permit-
2378 necessarily the case because later capture groups may have lower num-
2382 pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
2383 is set, so white space - including newlines - is ignored):
2385 (?<date> (?<year>(\d\d)?\d\d) -
2386 (?<month>\d\d) - (?<day>\d\d) )
2390 with non-printing bytes shown in hexadecimal, and undefined bytes shown
2399 name-to-number map, remember that the length of the entries is likely
2413 This identifies the character sequence that will be recognized as mean-
2418 Return the size of the compiled pattern in bytes (for all three li-
2422 pcre2_compile() is getting memory in which to place the compiled pat-
2423 tern may be slightly larger than the value returned by this option, be-
2425 over-estimate. Processing a pattern with the JIT compiler does not al-
2441 which they appear. Its first argument is a pointer to a callout enumer-
2443 passed to pcre2_callout_enumerate(). The contents of the callout enu-
2453 PCRE2, with the same code unit width, and must also have the same endi-
2458 the serialized form. They are described in the pcre2serialize documen-
2459 tation. Note that PCRE2 serialization does not convert compiled pat-
2480 you must create a match data block by calling one of the creation func-
2487 to record the matched portion of the subject plus three captured sub-
2501 The second argument of pcre2_match_data_create() is a pointer to a gen-
2511 general context, but in this case if NULL is passed, the memory is ob-
2517 after a match operation has finished, using functions that are de-
2521 match block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ER-
2522 ROR_PARTIAL, or one of the error codes for an invalid UTF string. Ex-
2532 described in the section entitled "Option bits for pcre2_match()" be-
2537 NULL argument, it returns immediately, without doing anything.
2550 order to find multiple matches in the subject string or to match dif-
2553 This function is the main matching facility of the library, and it op-
2554 erates in a Perl-like manner. For specialist use there is also an al-
2570 If the subject string is zero-terminated, the length can be given as
2572 common matching parameters are to be changed. For details, see the sec-
2580 bytes for the 8-bit library, 16-bit code units for the 16-bit library,
2581 and 32-bit code units for the 32-bit library, whether or not UTF pro-
2583 zero, the subject is assumed to be an empty string. If length is non-
2589 by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2590 set must point to the start of a character, or to the end of the sub-
2591 ject (in UTF-32 mode, one code unit equals one character, so all off-
2592 sets are valid). Like the pattern string, the subject may contain bi-
2595 A non-zero starting offset is useful when searching for another match
2615 match an empty string. It is possible to emulate Perl's /g behaviour by
2622 so, and the current character is CR followed by LF, advance the start-
2625 If a non-zero starting offset is passed when the pattern is anchored, a
2626 single attempt to match at the given offset is made. This can only suc-
2628 the subject. In other words, the anchoring must be the result of set-
2636 PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
2641 Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup-
2642 ported by the just-in-time (JIT) compiler. If it is set, JIT matching
2660 must not be freed until all such operations are complete. For some ap-
2669 also automatically freed if the match data block is re-used for another
2675 matches must be right at the end of the subject string. Note that set-
2682 match before it. Setting this without having set PCRE2_MULTILINE at
2690 in multiline mode) a newline immediately before it. Setting this with-
2692 match. This option affects only the behaviour of the dollar metacharac-
2714 subject is permitted. If the pattern is anchored, such a match can oc-
2729 The latter special case is discussed in detail in the pcre2unicode doc-
2732 In the default case, if a non-zero starting offset is given, the check
2740 that the sequences \b and \B are one-character lookbehinds.
2746 validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the
2756 PCRE2_NO_UTF_CHECK is set at match time the effect of passing an in-
2757 valid string as a subject, or an invalid value of startoffset, is unde-
2758 fined. Your program may crash or loop indefinitely or give wrong re-
2764 These options turn on the partial matching feature. A partial match oc-
2766 there are not enough subject characters to complete the match. In addi-
2771 If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PAR-
2772 TIAL_HARD) is set, matching continues by testing any remaining alterna-
2774 returned instead of PCRE2_ERROR_NOMATCH. In other words, PCRE2_PAR-
2780 PCRE2_ERROR_PARTIAL, without considering any other alternatives. In
2781 other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
2784 There is a more detailed discussion of partial and multi-segment match-
2790 When PCRE2 is built, a default newline convention is set; this is usu-
2795 pcre2pattern page. During matching, the newline choice affects the be-
2808 expected. For example, if the pattern is .+A (and the PCRE2_DOTALL op-
2811 However, the pattern [\r\n]A does match that string, because it con-
2812 tains an explicit CR or LF reference, and so advances only by one char-
2818 not count, nor does \s, even though it includes CR and LF in the char-
2836 phrase "capture group" (Perl terminology) is used for a fragment of a
2845 Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2849 pcre2_get_ovector_count() returns the number of pairs of values it con-
2852 Within the ovector, the first in each pair of values is set to the off-
2854 offset of the first code unit after the end of a substring. These val-
2856 are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit li-
2857 brary, and 32-bit offsets in the 32-bit library.
2865 the portion of the subject string that was matched by the entire pat-
2869 been captured, the returned value is 3. If there are no captured sub-
2878 If a capture group is matched repeatedly within a single match opera-
2879 tion, it is the last portion of the subject that it matched that is re-
2895 Offset values that correspond to unused groups at the end of the ex-
2904 in the pattern are never changed. That is, if a pattern contains n cap-
2906 pcre2_match(). The other elements retain whatever values they previ-
2927 returns a pointer to the zero-terminated name, which is within the com-
2935 backtracking verbs without names do not count. Thus, for example, if
2937 After a "no match" or a partial match, the last encountered name is re-
2947 Warning: By default, certain start-of-match optimizations are used to
2950 for the presence of "c" in the subject before running the matching en-
2951 gine. This check fails for "bx", causing a match failure without seeing
2952 any marks. You can disable the start-of-match optimizations by setting
2959 offset of the character at which the match started. For a non-partial
2972 If pcre2_match() fails, it returns a negative number. This can be con-
2973 verted to a text string by calling the pcre2_get_error_message() func-
2978 of UTF-specific negative error codes is returned. Details are given in
2993 PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
3000 a library of a different code unit width, for example, a pattern com-
3001 piled by the 8-bit library is passed to a 16-bit or 32-bit library
3041 This error is returned when a pattern that was successfully studied us-
3042 ing JIT is being matched, but the memory available for the just-in-time
3056 also returned if PCRE2_COPY_MATCHED_SUBJECT is set and memory alloca-
3066 within the pattern. Specifically, it means that either the whole pat-
3069 might do this are detected and faulted at compile time, but more com-
3080 match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
3087 The returned message is terminated with a trailing zero, and the func-
3089 zero. If the error number is unknown, the negative error code PCRE2_ER-
3113 extracting captured substrings as new, separate, zero-terminated
3119 zero refers to the entire matched substring, with higher numbers refer-
3130 extracts a zero-length empty string.
3132 You can find the length in code units of a captured substring without
3139 The pcre2_substring_copy_bynumber() function copies a captured sub-
3142 function that was used for the match data block. The first two argu-
3159 code is returned. If a substring number greater than zero is used af-
3182 pattern is (abc)|(def) and the subject is "def", and the ovector con-
3193 The pcre2_substring_list_get() function extracts all available sub-
3195 builds a second list that contains their lengths (in code units), ex-
3207 therefore need the lengths, you may supply NULL as the lengthsptr argu-
3209 function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
3216 be distinguished from a genuine zero-length substring by inspecting the
3237 To extract a substring by name, you first have to find the associated num-
3244 the name by calling pcre2_substring_number_from_name(). The first argu-
3253 the "bynumber" functions, the only difference being that the second ar-
3261 than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re-
3267 group numbers in the pcre2pattern page, you cannot use names to distin-
3286 can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As
3287 a special case, if replacement is NULL and rlength is zero, the re-
3288 placement is assumed to be an empty string. If rlength is non-zero, an
3291 There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
3294 that requests multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL be-
3299 never greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A nega-
3304 error return. For global replacements, matches in which \K in a lookbe-
3309 pcre2_match(), except that the partial matching options are not permit-
3311 block is obtained and freed within this function, using memory manage-
3318 will always be a no-match error. The contents of the ovector within the
3326 arguments. The data in the match_data block (return code, offset vec-
3328 pcre2_match() from within pcre2_substitute(). This allows an applica-
3329 tion to check for a match before choosing to substitute, without having
3333 changed when PCRE2_SUBSTITUTE_MATCHED is set. If PCRE2_SUBSTI-
3334 TUTE_GLOBAL is also set, pcre2_match() is called after the first sub-
3335 stitution to check for further matches, but this is done using an in-
3339 The code argument is not used for matching before the first substitu-
3341 even when PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in-
3342 formation such as the UTF setting and the number of capturing parenthe-
3346 subject string with matched substrings replaced. However, if PCRE2_SUB-
3352 The outlengthptr argument of pcre2_substitute() must point to a vari-
3358 If the function is not successful, the value set via outlengthptr de-
3360 string, the value is the offset in the replacement string where the er-
3361 ror was detected. For other errors, the value is PCRE2_UNSET by de-
3362 fault. This includes the case of the output buffer being too small, un-
3366 buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3368 continues to go through the motions of matching and substituting (with-
3369 out, of course, writing anything) in order to compute the size of buf-
3371 variable, with the result of the function still being PCRE2_ER-
3376 that the entire operation is carried out twice. Depending on the appli-
3378 the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
3383 invalid UTF replacement string causes an immediate return with the rel-
3386 If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is not in-
3387 terpreted in any way. By default, however, a dollar character is an es-
3388 cape character that can specify the insertion of characters from cap-
3389 ture groups and names from (*MARK) or other control verbs in the pat-
3397 brackets are required only if the following character would be inter-
3417 takes place in the original subject string (that is, previous replace-
3420 subject string. If an offset limit is set in the match context, search-
3424 the subject string by setting either or both of startoffset and an off-
3432 with zero length, an attempt to find a non-empty match at the same off-
3443 PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un-
3446 not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN-
3451 replacement string. Without this option, only the dollar character is
3457 particular character codes, and backslash followed by any non-alphanu-
3464 current state: \U and \L change to upper or lower case forcing, respec-
3469 all inserted characters, including those from capture groups and let-
3474 Note that case forcing sequences such as \U...\E do not nest. For exam-
3476 \E has no effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EX-
3483 ${<n>:-<string>}
3486 As before, <n> may be a group number or a name. The first form speci-
3507 substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause un-
3511 PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
3520 PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3523 PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3525 when the simple (non-extended) syntax is used and PCRE2_SUBSTITUTE_UN-
3539 the replacement string, with more particular errors being PCRE2_ER-
3548 obtained by calling the pcre2_get_error_message() function (see "Ob-
3565 callout block structure, which contains the following fields, not nec-
3582 first callout, 2 for the second, and so on. The input and output point-
3597 If the value is zero, the replacement is accepted, and, if PCRE2_SUB-
3599 match. If the value is not zero, the current replacement is not ac-
3603 copied to the output and the call to pcre2_substitute() exits, return-
3613 capture groups are not required to be unique. Duplicate names are al-
3619 match, only one of each set of identically-named groups participates.
3624 to the given name that is set. Only if none are set is PCRE2_ERROR_UN-
3625 SET returned. The pcre2_substring_number_from_name() function re-
3637 point to the first and last entries in the name-to-number table for the
3650 The traditional matching function uses a similar algorithm to Perl,
3651 which stops when it finds the first match at a given point in the sub-
3654 function (see below) instead. If you cannot use the alternative func-
3658 What you have to do is to insert a callout right at the end of the pat-
3659 tern. When your callout function is called, extract and save the cur-
3677 different characteristics to the normal algorithm, and is not compati-
3678 ble with Perl. Some of the features of PCRE2 patterns are not sup-
3686 is used in a different way, and this is described below. The other com-
3715 PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
3732 matches, but there is still at least one matching possibility. The por-
3735 more detailed discussion of partial and multi-segment matching, with
3741 stop as soon as it has found one match. Because of the way the alterna-
3757 When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3776 which is the number of matched substrings. The offsets of the sub-
3782 Calls to the convenience functions that extract substrings by name re-
3783 turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
3792 NOTE: PCRE2's "auto-possessification" optimization usually applies to
3795 matching, this means that only one possible match is found. If you re-
3796 ally do want multiple matches in such cases, either use an ungreedy re-
3797 peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when com-
3862 Copyright (c) 1997-2022 University of Cambridge.
3863 ------------------------------------------------------------------------------
3871 PCRE2 - Perl-compatible regular expressions (revised API)
3876 the library in Unix-like environments using the applications known as
3878 CMake instead of configure. The text file README contains general in-
3879 formation about building with Autotools (some of which is repeated be-
3881 systems. There is a lot more information about building PCRE2 without
3883 "by hand") in the text file called NON-AUTOTOOLS-BUILD. You should
3885 non-Unix-like environment.
3888 PCRE2 BUILD-TIME OPTIONS
3892 configure script, where the optional features are selected or dese-
3893 lected by providing options to configure before running the make com-
3894 mand. However, the same options can be selected in both Unix-like and
3895 non-Unix-like environments if you are using CMake instead of configure
3900 compiler, as described in NON-AUTOTOOLS-BUILD.
3903 ones such as the selection of the installation directory) can be ob-
3906 ./configure --help
3909 names begin with --enable or --disable. Because of the way that config-
3910 ure works, --enable and --disable always come in pairs, so the comple-
3913 with --with. At the end of a configure run, a summary of the configura-
3917 BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
3919 By default, a library called libpcre2-8 is built, containing functions
3921 either as single-byte characters, or UTF-8 strings. You can also build
3922 two other libraries, called libpcre2-16 and libpcre2-32, which process
3923 strings that are contained in arrays of 16-bit and 32-bit code units,
3924 respectively. These can be interpreted either as single-unit characters
3925 or UTF-16/UTF-32 strings. To build these additional libraries, add one
3928 --enable-pcre2-16
3929 --enable-pcre2-32
3931 If you do not want the 8-bit library, add
3933 --disable-pcre2-8
3936 the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
3937 an 8-bit program. Neither of these is built if you select only the
3938 16-bit or 32-bit libraries.
3947 --disable-shared
3948 --disable-static
3956 strings. To build it without Unicode support, add
3958 --disable-unicode
3961 It is not possible to build one library with Unicode support and an-
3962 other without in the same configuration.
3964 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
3965 UTF-16 or UTF-32. To do that, applications that use the library can set
3966 the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
3974 and Nd, script names, and some bi-directional properties are supported.
3986 mode, can cause unpredictable behaviour because it may leave the cur-
3987 rent matching point in the middle of a multi-code-unit character. The
3988 application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op-
3989 tion when calling pcre2_compile(). There is also a build-time option
3991 --enable-never-backslash-C
3996 JUST-IN-TIME COMPILER SUPPORT
3998 Just-in-time (JIT) compiler support is included in the build by speci-
4001 --enable-jit
4007 --enable-jit=auto
4014 --enable-jit-sealloc
4021 --disable-pcre2grep-jit
4029 the end of a line. This is the normal newline character on Unix-like
4033 --enable-newline-is-cr
4035 to the configure command. There is also an --enable-newline-is-lf op-
4039 the two-character sequence CRLF (CR immediately followed by LF). If you
4042 --enable-newline-is-crlf
4046 --enable-newline-is-anycrlf
4051 --enable-newline-is-any
4054 newline sequences are the three just mentioned, plus the single charac-
4059 --enable-newline-is-nul
4061 which causes NUL (binary zero) to be set as the default line-ending
4075 --enable-bsr-anycrlf
4077 the default is changed so that \R matches only CR, LF, or CRLF. What-
4085 part to another (for example, from an opening parenthesis to an alter-
4086 nation metacharacter). By default, in the 8-bit and 16-bit libraries,
4087 two-byte values are used for these offsets, leading to a maximum size
4088 for a compiled pattern of around 64 thousand code units. This is suffi-
4091 compile PCRE2 to use three-byte or four-byte offsets by adding a set-
4094 --with-link-size=3
4097 16-bit library, a value of 3 is rounded up to 4. In these libraries,
4099 to load additional data when handling them. For the 32-bit library the
4100 value is always 4 and cannot be overridden; the value of --with-link-
4113 --with-match-limit=500000
4127 --with-heap-limit=500
4137 for --with-match-limit. You can set a lower default limit by adding,
4140 --with-match-limit-depth=10000
4147 This limit was more useful in versions before 10.30, where function re-
4152 for lookaround assertions, atomic groups, and recursion within pat-
4163 --enable-rebuild-chartables
4166 Instead, a program called pcre2_dftables is compiled and run. This out-
4168 your C run-time system. This method of replacing the tables does not
4178 cc src/pcre2_dftables.c -o pcre2_dftables
4182 want to specify a locale, you must use the -L option:
4184 LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
4186 You can also specify -b (with or without -L). This causes the tables to
4188 can be loaded into memory by an application and passed to pcre2_com-
4190 The tables are just a string of bytes, independent of hardware charac-
4201 compiled to run in an 8-bit EBCDIC environment by adding
4203 --enable-ebcdic --disable-unicode
4205 to the configure command. This setting implies --enable-rebuild-charta-
4206 bles. You should only use it if you know that you are in an EBCDIC en-
4209 It is not possible to support both EBCDIC and UTF-8 codes in the same
4210 version of the library. Consequently, --enable-unicode and --enable-
4217 --enable-ebcdic-nl25
4219 as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
4221 0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
4224 The options that select newline behaviour, such as --enable-newline-is-
4225 cr, and equivalent run-time options, refer to these character values in
4232 within the patterns it is matching. There are two kinds: one that gen-
4233 erates output using local code, and another that calls an external pro-
4234 gram or script. If --disable-pcre2grep-callout-fork is added to the
4236 --disable-pcre2grep-callout is used, all callouts are completely ig-
4237 nored. For more details of pcre2grep callouts, see the pcre2grep docu-
4247 --enable-pcre2grep-libz
4248 --enable-pcre2grep-libbz2
4250 to the configure command. These options naturally require that the rel-
4262 be processable is the notional buffer size. If a longer line is encoun-
4268 --with-pcre2grep-bufsize=51200
4269 --with-pcre2grep-max-bufsize=2097152
4272 values by using --buffer-size and --max-buffer-size on the command
4280 --enable-pcre2test-libreadline
4281 --enable-pcre2test-libedit
4283 to the configure command, pcre2test is linked with the libreadline or-
4285 it reads it using the readline() function. This provides line-editing
4286 and history facilities. Note that libreadline is GPL-licensed, so if
4291 Setting --enable-pcre2test-libreadline causes the -lreadline option to
4293 system-installed readline library this is sufficient. However, in some
4305 LIBS="-lncurses"
4314 --enable-debug
4324 --enable-valgrind
4327 certain memory regions as unaddressable. This allows it to detect in-
4337 --enable-coverage
4349 When --enable-coverage is used, the following additional targets are
4355 equivalent to running "make coverage-reset", "make coverage-baseline",
4356 "make check", and then "make coverage-report".
4358 make coverage-reset
4362 make coverage-baseline
4366 make coverage-report
4370 make coverage-clean-report
4372 This removes the generated coverage report without cleaning the cover-
4375 make coverage-clean-data
4377 This removes the captured coverage data without removing the coverage
4380 make coverage-clean
4383 For more information about code coverage, see the gcov and lcov docu-
4397 --disable-percent-zt
4409 --enable-fuzz-support
4411 At present this applies only to the 8-bit library. If set, it causes an
4412 extra library called libpcre2-fuzzsupport.a to be built, but not in-
4413 stalled. This contains a single function called LLVMFuzzerTestOneIn-
4420 Setting --enable-fuzz-support also causes a binary called pcre2fuz-
4435 --disable-stack-for-recursion
4444 pcre2api(3), pcre2-config(3).
4457 Copyright (c) 1997-2022 University of Cambridge.
4458 ------------------------------------------------------------------------------
4466 PCRE2 - Perl-compatible regular expressions (revised API)
4481 PCRE2 provides a feature called "callout", which is a means of tempo-
4487 When using the pcre2_substitute() function, an additional callout fea-
4497 ending delimiter is the same as the start, except for {, where the end-
4517 A(\d{2}|--)
4521 (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4524 alternation bar. If the pattern contains a conditional group whose con-
4538 information when you are trying to optimize the performance of a par-
4548 Auto-possessification
4550 At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4556 --->aaaa
4564 the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
4568 --->aaaa
4585 beginning of the subject, and pcre2_compile() remembers this. If a pat-
4586 tern has more than one top-level branch, automatic anchoring occurs if
4591 It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
4597 --->aa
4604 This shows that all match attempts start at the beginning of the sub-
4605 ject. In other words, the pattern is anchored. You can disable this op-
4607 starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
4610 --->aa
4620 This shows more match attempts, starting at the second subject charac-
4637 string, and will immediately give a "no match" return without actually
4641 You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4651 to normal, DFA, and JIT matching. The first argument to the call-
4677 version 1, and the callout_flags field for version 2. If you are writ-
4686 contains the number of the callout, in the range 0-255. This is the
4693 callout_string points to the string that is contained within the com-
4701 delimiter as callout_string[-1] if you need it.
4717 For calls to pcre2_match(), the offset_vector field is not (since re-
4719 matching function in the match data block. Instead it points to an in-
4725 The capture_last field contains the number of the most recently cap-
4727 number of the highest numbered captured substring so far. If no sub-
4733 The contents of ovector[2] to ovector[<capture_top>*2-1] can be in-
4742 was passed to the matching function in the match data block for call-
4753 at which the current match attempt started. However, if the escape se-
4768 parenthesis, the length includes meta characters that follow the paren-
4771 the length is one, unless a closing parenthesis is followed by a quan-
4774 was that of the entire group, and before an alternation bar or a clos-
4780 are used by pcre2test to show the next item to be matched when display-
4784 zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
4786 Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
4791 pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
4806 starting position in the subject. Output from pcre2test does not indi-
4810 The information in the callout_flags field is provided so that applica-
4814 because there is no backtracking in DFA matching, and there is no sup-
4828 Negative values should normally be chosen from the set of PCRE2_ER-
4846 which they appear. Its first argument is a pointer to a callout enumer-
4848 passed to pcre2_callout_enumerate(). The data block contains the fol-
4866 non-zero minimum or a fixed maximum, the group is replicated inside the
4872 The callback function should normally return zero. If it returns a non-
4887 Copyright (c) 1997-2019 University of Cambridge.
4888 ------------------------------------------------------------------------------
4896 PCRE2 - Perl-compatible regular expressions (revised API)
4898 DIFFERENCES BETWEEN PCRE2 AND PERL
4901 and Perl handle regular expressions. The differences described here are
4902 with respect to Perl version 5.34.0, but as both Perl and PCRE2 are
4905 1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier) is not set,
4906 the behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.'
4907 matches the next character unless it is the start of a newline se-
4910 and NL (either 0x15 or 0x25) when using EBCDIC. In Perl, '.' appears
4913 2. PCRE2 has only a subset of Perl's Unicode support. Details of what
4916 3. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
4918 does not assert that the next three characters are not "a". It just as-
4920 PCRE2 optimizes this to run the assertion just once). Perl allows some
4923 on non-lookaround assertions.
4928 the condition is false). Perl may set such capture groups in other
4931 5. The following Perl escape sequences are not supported: \F, \l, \L,
4932 \u, \U, and \N when followed by a character name. \N on its own, match-
4933 ing a non-newline character, and \N{U+dd..}, matching a Unicode code
4935 letters are implemented by Perl's general string-handling and are not
4941 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
4946 PCRE2 and Perl support the Cs (surrogate) property, but in PCRE2 its
4948 long synonyms for property names that Perl supports (such as \p{Let-
4954 from Perl in that $ and @ are also handled as literals inside the
4955 quotes. In Perl, they cause variable interpolation (PCRE2 does not have
4956 variables). Also, Perl does "double-quotish backslash interpolation" on
4961 Pattern PCRE2 matches Perl matches
4971 classes by both PCRE2 and Perl.
4980 and backtracking into subroutine calls is now supported, as in Perl.
4984 their effect is confined to that group; it does not extend to the sur-
4985 rounding pattern. This is not always the case in Perl. In particular,
4994 in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
4999 matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 un-
5003 not as general as Perl's. This is a consequence of the fact that PCRE2
5004 works internally just with numbers, using an external table to trans-
5012 14. Perl used to recognize comments in some places that PCRE2 does not,
5014 modifier is set, Perl allowed white space between ( and ? though the
5016 may still be some cases where Perl behaves differently.
5018 15. Perl, when in warning mode, gives warnings for character classes
5019 such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
5024 not affected when case-independent matching is specified. For example,
5025 \p{Lu} always matches an upper case letter. I think Perl has changed in
5030 17. From release 5.32.0, Perl locks out the use of \K in lookaround as-
5032 there is an option for re-enabling the previous behaviour. When this
5036 18. PCRE2 provides some extensions to the Perl regular expression fa-
5037 cilities. Perl 5.10 included new features that were not in earlier
5038 versions of Perl, some of which (such as named parentheses) were in
5039 PCRE2 for some time before. This list is with respect to Perl 5.34:
5043 match a different length of string. Perl used to require them all to
5047 (b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
5048 ported in lookbehinds, provided that there is no possibility of refer-
5049 encing a non-unique number or name. Perl does not support backrefer-
5053 $ meta-character matches only at the very end of the string.
5056 faulted. (Perl can be made to issue a warning.)
5058 (e) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
5059 fiers is inverted, that is, by default they are not greedy, but if fol-
5066 PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
5071 (i) The callout facility is PCRE2-specific. Perl supports codeblocks
5074 (j) The partial matching facility is PCRE2-specific.
5077 different way and is not Perl-compatible.
5083 (m) PCRE2 supports non-atomic positive lookaround assertions. This is
5084 an extension to the lookaround facilities. The default, Perl-compatible
5087 19. The Perl /a modifier restricts \d numbers to pure ASCII, and the
5088 /aa modifier restricts /i case-insensitive matching to pure ASCII, ig-
5092 20. Perl has different limits than PCRE2. See the pcre2limit documenta-
5093 tion for details. With 5.10, Perl moved from recursion to iteration, keep-
5095 not fall into any stack-overflow limit. PCRE2 made a similar change at
5096 release 10.30, and also has many build-time and run-time customizable
5110 Copyright (c) 1997-2021 University of Cambridge.
5111 ------------------------------------------------------------------------------
5119 PCRE2 - Perl-compatible regular expressions (revised API)
5121 PCRE2 JUST-IN-TIME COMPILER SUPPORT
5123 Just-in-time compiling is a heavyweight optimization that can greatly
5124 speed up pattern matching. However, it comes at the cost of extra pro-
5126 the same pattern is going to be matched many times. This does not nec-
5128 anchored, matching attempts may take place many times at various posi-
5130 string is very long, it may still pay to use JIT even for one-off
5131 matches. JIT support is available for all of the 8-bit, 16-bit and
5132 32-bit PCRE2 libraries.
5134 JIT support applies only to the traditional Perl-compatible matching
5142 --enable-jit (or equivalent CMake option) must be set when PCRE2 is
5146 ARM 32-bit (v5, v7, and Thumb2)
5147 ARM 64-bit
5149 Intel x86 32-bit and 64-bit
5150 MIPS 32-bit and 64-bit
5151 Power PC 32-bit and 64-bit
5152 SPARC 32-bit
5154 If --enable-jit is set on an unsupported platform, compilation fails.
5156 A program can tell if JIT support is available by calling pcre2_con-
5160 falls back to the interpretive code if JIT is not available. For pro-
5162 path" API that is JIT-specific.
5171 second is zero or more of the following option bits: PCRE2_JIT_COM-
5182 the size of machine stack that it uses. The exact rules are not docu-
5183 mented because they may change at any time, in particular, when new op-
5187 PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com-
5188 plete matches. If you want to run partial matches using the PCRE2_PAR-
5193 pcre2_match() is called, the appropriate code is run if it is avail-
5198 the option bits. For example, you can call it once with PCRE2_JIT_COM-
5201 will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
5202 ing. If pcre2_jit_compile() is called with no option bits set, it imme-
5210 are described in the section entitled "Controlling the JIT stack" be-
5219 stack" below, even if you do not need to supply a non-default JIT
5221 be obeyed. If the match-time options are not right for JIT execution,
5224 If the JIT compiler finds an unsupported item, no JIT data is gener-
5226 pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op-
5227 tion. A non-zero result means that JIT compilation was successful. A
5236 are normally expected to be a valid sequence of UTF code units. By de-
5237 fault, this is checked at the start of matching and an error is gener-
5245 PCRE2_MATCH_INVALID_UTF option has two effects: it tells the inter-
5246 preter in pcre2_match() to support invalid UTF, and, if pcre2_jit_com-
5247 pile() is called, the compiled JIT code also supports invalid UTF. De-
5252 PCRE2_JIT_INVALID_UTF, which currently exists only for backward compat-
5270 when running in a UTF mode, and a callout immediately before an asser-
5279 that the memory used for the JIT stack was insufficient. See "Control-
5293 large or complicated patterns need more than this. The error PCRE2_ER-
5294 ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func-
5299 The pcre2_jit_stack_create() function creates a JIT stack. Its argu-
5305 function returns immediately, without doing anything. (For the techni-
5317 The first argument is a pointer to a match context. When this is subse-
5319 JIT stack is used. If this argument is NULL, the function returns imme-
5320 diately, without doing anything. There are three cases for the values
5339 is not obeyed when pcre2_match() is called with options that are incom-
5341 determine whether a match operation was executed by JIT or by the in-
5347 up non-sequential matches in one thread is to use callouts: if a call-
5352 you assign or pass back NULL from a callback, that is thread-safe, be-
5354 pass back a non-NULL JIT stack, this must be a different stack for each
5355 thread so that the application is thread-safe.
5357 Strictly speaking, even more is allowed. You can assign the same non-
5366 up non-default JIT stacks might operate:
5374 Use a one-line callback function
5385 PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
5387 child nodes. Allocating real machine stack on some platforms is diffi-
5394 Modern operating systems have a nice feature: they can reserve an ad-
5396 pages inside this address space, so the stack could grow without moving
5397 memory data (this is important because of pointers). Thus we can allo-
5417 You can free compiled patterns, contexts, and stacks in any order, any-
5430 this without keeping a list of patterns.
5436 Especially on embedded systems, it might be a good idea to release mem-
5437 ory sometimes without freeing the stack. There is no API for this at
5439 allocated memory for any stack and another which allows releasing mem-
5453 The JIT executable allocator does not free all memory when it is possi-
5457 calling pcre2_jit_free_unused_memory(). Its argument is a general con-
5458 text, for custom memory management, or NULL for standard memory manage-
5464 This is a single-threaded example that specifies a JIT stack without
5501 The fast path function is called pcre2_jit_match(), and it takes ex-
5503 must be specified with a length; PCRE2_ZERO_TERMINATED is not sup-
5504 ported. Unsupported option bits (for example, PCRE2_ANCHORED, PCRE2_EN-
5507 pcre2_match(), plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (par-
5511 number of other sanity checks are performed on the arguments. For exam-
5512 ple, if the subject pointer is NULL but the length is non-zero, an im-
5537 Copyright (c) 1997-2021 University of Cambridge.
5538 ------------------------------------------------------------------------------
5546 PCRE2 - Perl-compatible regular expressions (revised API)
5554 code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5555 the default internal linkage size, which is 2 bytes for these li-
5558 (when building the 16-bit library, 3 is rounded up to 4). See the
5560 for details. In these cases the limit is substantially larger. How-
5561 ever, the speed of execution is slower. In the 32-bit library, the in-
5569 the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un-
5571 is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-termi-
5590 (*THEN) verb is 255 code units for the 8-bit library and 65535 code
5591 units for the 16-bit and 32-bit libraries.
5594 number a 32-bit unsigned integer can hold.
5611 Copyright (c) 1997-2022 University of Cambridge.
5612 ------------------------------------------------------------------------------
5620 PCRE2 - Perl-compatible regular expressions (revised API)
5627 pcre2_match() function. This works in the same way as Perl's matching
5628 function, and provides a Perl-compatible matching operation. The just-
5629 in-time (JIT) optimization that is described in the pcre2jit documenta-
5633 it operates in a different way, and is not Perl-compatible. This alter-
5634 native has advantages and disadvantages compared with the standard al-
5654 The set of strings that are matched by a regular expression can be rep-
5659 tree: depth-first and breadth-first, and these correspond to the two
5665 In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
5667 depth-first search of the pattern tree. That is, it proceeds along a
5669 required. When there is a mismatch, the algorithm tries any alterna-
5678 that point the algorithm stops. Thus, if there is more than one possi-
5681 on the way the alternations and the greedy or ungreedy repetition quan-
5684 Because it ends up with a single path through the tree, it is rela-
5685 tively straightforward for this algorithm to keep track of the sub-
5692 This algorithm conducts a breadth-first search of the tree. Starting
5701 scans the subject string only once, without backtracking, there is one
5703 following or preceding the current point have to be independently in-
5710 this algorithm finds all of them, and in particular, it finds the long-
5718 the match data block is therefore not advisable when doing DFA match-
5728 the fifth character of the subject. The algorithm does not automati-
5731 PCRE2's "auto-possessification" optimization usually applies to charac-
5732 ter repeats at the end of a pattern (as well as internally). For exam-
5737 either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
5741 not supported or behave differently in the alternative matching func-
5744 1. Because the algorithm finds all possible matches, the greedy or un-
5746 affect auto-possessification, as just described). During matching,
5755 a non-possessive quantifier. Similarly, if an atomic group is present,
5763 algorithm does not attempt to do this. This means that no captured sub-
5766 3. Because no substrings are captured, backreferences within the pat-
5769 4. For the same reason, conditional expressions that use a backrefer-
5775 6. Because many paths through the tree may be active, the \K escape se-
5784 these modes, because the alternative algorithm moves through the sub-
5792 10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not sup-
5806 matching and discusses multi-segment matching.
5836 Copyright (c) 1997-2021 University of Cambridge.
5837 ------------------------------------------------------------------------------
5845 PCRE2 - Perl-compatible regular expressions
5860 Another example is checking a user input string as it is typed, to en-
5864 Partial matching is a PCRE2-specific feature; it is not Perl-compati-
5866 PCRE2_PARTIAL_SOFT options when calling a matching function. The dif-
5872 If you want to use partial matching with just-in-time optimized code,
5874 you must also call pcre2_jit_compile() with one or both of these op-
5880 PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
5885 Setting a partial matching option disables two of PCRE2's standard op-
5886 timization hints. PCRE2 remembers the last literal code unit in a pat-
5902 Example 1: if the pattern is /abc/ and the subject is "ab", more char-
5908 what is matched. In this case, only PCRE2_PARTIAL_HARD returns a par-
5919 assertions and the \K escape sequence provide ways of inspecting char-
5922 (2) The pattern contains one or more lookbehind assertions. This condi-
5923 tion exists in case there is a lookbehind that inspects characters be-
5930 because adding more characters might result in a non-empty match,
5932 "there is going to be a match at this point, but until some more char-
5933 acters are added, we do not know if it will be an empty string or some-
5943 A complete match has been found, starting and ending within this sub-
5955 the rest of the ovector are undefined. The appearance of \K in the pat-
5964 string "abc12", because all these characters are needed for a subse-
5965 quent re-match with additional characters.
5972 If this is matched against the subject string "abc123dog", both alter-
5975 and 9, identifying "123dog" as the first partial match. (In this exam-
5985 as a partial match is found, without continuing to search for possible
5987 partial match over a later complete match. For this reason, the assump-
5989 true end of the available data, which is why \z, \Z, \b, \B, and $ al-
5994 tried. If no complete match can be found, PCRE2_ERROR_PARTIAL is re-
5997 items in a pattern behave as if the subject string is potentially com-
5999 for \b and \B the end of the subject is treated as a non-alphanumeric.
6001 The difference between the two partial matching options can be illus-
6009 However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
6056 MULTI-SEGMENT MATCHING WITH pcre2_match()
6058 PCRE was not originally designed with multi-segment matching in mind.
6060 multi-segment matching possible have been added. A very long string can
6062 with the aim of achieving the same results that would happen if the en-
6074 When a partial match occurs, the next segment must be added to the cur-
6075 rent subject and the match re-run, using the startoffset argument of
6091 buffer is discarded, the second half is moved to the start of the buf-
6095 If there are memory constraints, you may want to discard text that pre-
6114 of characters that must be retained in order to get the right match re-
6119 use that to decide how much text to retain. The only lookbehind infor-
6124 maximum number of characters (not code units) that any individual look-
6129 In a non-UTF or a 32-bit case, moving back is just a subtraction, but
6130 in UTF-8 or UTF-16 you have to count characters while moving back
6137 without backtracking, searching for all possible matches simultane-
6138 ously. If the end of the subject is reached before the end of the pat-
6149 there is no difference between greedy and ungreedy repetition, its be-
6160 MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
6164 and calling the function again with the same compiled regular expres-
6166 same working space as before, because this is where details of the pre-
6177 The first call has "23ja" as the subject, and requests partial match-
6180 last part is shown; PCRE2 does not retain the previously partially-
6194 match at one point in the subject are remembered. Depending on the ap-
6199 complete match, as described for pcre2_match() above. Another possibil-
6216 Copyright (c) 1997-2019 University of Cambridge.
6217 ------------------------------------------------------------------------------
6225 PCRE2 - Perl-compatible regular expressions (revised API)
6230 by PCRE2 are described in detail below. There is a quick-reference syn-
6231 tax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax
6232 and semantics as closely as it can. PCRE2 also supports some alterna-
6233 tive regular expression syntax (which does not conflict with the Perl
6237 Perl's regular expressions are described in its own documentation, and
6239 of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex-
6244 This document discusses the regular expression patterns that are sup-
6248 not Perl-compatible. Some of the features discussed below are not
6250 of the alternative function, and how it differs from the normal func-
6254 SPECIAL START-OF-PATTERN ITEMS
6257 set by special items at the start of a pattern. These are not Perl-com-
6259 writers who are not able to change the program that processes the pat-
6260 tern. Any number of these items may appear, but they must all be to-
6266 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
6267 as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
6268 can be specified for the 32-bit library, in which case it constrains
6279 restrict them to non-UTF data for security reasons. If the
6280 PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not al-
6287 causes sequences such as \d and \w to use Unicode properties to deter-
6289 less than 256 via a lookup table. It also causes upper/lower casing op-
6302 to whichever matching function is subsequently called to match the pat-
6303 tern. These options lock out the matching of empty strings, either en-
6306 Disabling auto-possessification
6314 Disabling start-up optimizations
6317 setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
6324 as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
6325 tions that apply to patterns whose top-level branches all start with .*
6345 These facilities are provided to catch runaway matches that are pro-
6346 voked by patterns with huge matching trees. A common example is a pat-
6356 where d is any number of decimal digits. However, the value of the set-
6379 strings: a single CR (carriage return) character, a single LF (line-
6380 feed) character, the two-character sequence CRLF, any of the three pre-
6385 It is also possible to specify a newline convention by starting a pat-
6395 These override the default and the options given to the compiling func-
6396 tion. For example, on a Unix system where LF is the default newline se-
6405 The newline convention affects where the circumflex and dollar asser-
6406 tions are true. It also affects the interpretation of the dot metachar-
6409 escape sequence matches. By default, this is any Unicode newline se-
6410 quence, for Perl compatibility. However, this can be changed; see the
6420 starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
6427 character code instead of ASCII or Unicode (typically a mainframe sys-
6428 tem). In the sections below, character code values are ASCII or Uni-
6446 their lower case ASCII equivalents, are case-equivalent with Unicode
6456 There are two different sets of metacharacters: those that are recog-
6479 - indicates character range
6487 or a # character as part of the pattern. If the PCRE2_EXTENDED_MORE op-
6488 tion is set, the same applies, but in addition unescaped space and hor-
6492 settings can be changed within a pattern; see the section entitled "In-
6508 always safe to precede a non-alphanumeric with backslash to specify
6509 that it stands for itself. In particular, if you want to match a back-
6512 Only ASCII digits and letters have any special meaning after a back-
6517 do so by putting them between \Q and \E. This is different from Perl in
6519 whereas in Perl, $ and @ cause variable interpolation. Also, Perl does
6520 "double-quotish backslash interpolation" on any backslashes between \Q
6522 PCRE2 treats a backslash between \Q and \E just like any other charac-
6525 Pattern PCRE2 matches Perl matches
6542 Non-printing characters
6544 A second use of backslash provides a way of encoding non-printing char-
6546 appearance of non-printing characters in a pattern, but when a pattern
6548 following escape sequences instead of the binary character it repre-
6549 sents. In an ASCII or Unicode environment, these escapes are as fol-
6553 \cx "control-x", where x is any printable ASCII character
6566 By default, after \x that is not followed by {, from zero to two hexa-
6568 number of hexadecimal digits may appear between \x{ and }. If a charac-
6573 of the two syntaxes for \x or by an octal sequence. There is no differ-
6578 Support is available for some ECMAScript (aka JavaScript) escape se-
6579 quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se-
6581 two hexadecimal digits is it recognized as a character escape. Other-
6582 wise it is interpreted as a literal "x" character. In this mode, sup-
6587 PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in ad-
6588 dition, \u{hhh..} is recognized as the character specified by hexadeci-
6592 The \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper-
6593 ating in UTF mode. Perl also uses \N{name} to specify characters by
6595 followed by an opening brace (curly bracket) it has an entirely differ-
6598 There are some legacy applications where the escape sequence \r is ex-
6608 32 or greater than 126, a compile-time error occurs.
6612 The \c escape is processed as specified for Perl in the perlebcdic doc-
6613 ument. The only characters that are allowed after \c are A-Z, a-z, or
6614 one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
6616 letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
6617 \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be-
6626 but because 127 is not a control character in EBCDIC, Perl makes it
6629 FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
6630 certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6634 than two digits, just those that are present are used. Thus the se-
6641 recent addition to Perl; it provides a way of specifying character code
6646 a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
6647 cal character code points, and \g{} to specify backreferences. The fol-
6650 The handling of a backslash followed by a digit other than 0 is compli-
6651 cated, and Perl has changed over time, causing PCRE2 also to change.
6653 Outside a character class, PCRE2 reads the digit and any following dig-
6656 groups in the expression, the entire sequence is taken as a backrefer-
6658 discussion of parenthesized groups. Otherwise, up to three octal dig-
6661 Inside a character class, PCRE2 handles \8 and \9 as the literal char-
6662 acters "8" and "9", and otherwise reads up to three octal digits fol-
6663 lowing the backslash, using them to generate a data character. Any sub-
6690 8-bit non-UTF mode no greater than 0xff
6691 16-bit non-UTF mode no greater than 0xffff
6692 32-bit non-UTF mode no greater than 0xffffffff
6696 (the so-called "surrogate" code points). The check for these can be
6699 UTF-8 and UTF-32 modes, because these values are not representable in
6700 UTF-16.
6715 In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
6718 However, if either of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX op-
6724 The sequence \g followed by a signed or unsigned number, optionally en-
6731 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
6734 Details are discussed later. Note that \g{...} (Perl syntax) and
6735 \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
6752 \W any "non-word" character
6757 has a different meaning. See the section entitled "Non-printing charac-
6758 ters" above for details. Perl also uses \N{name} to specify characters
6761 Each pair of lower and upper case escape sequences partitions the com-
6770 (13), and space (32), which are defined as white space in the "C" lo-
6771 cale. This list may vary if locale-specific matching is taking place.
6772 For example, in some locales the "non-breaking space" character (\xA0)
6776 or digit. By default, the definition of letters and digits is con-
6777 trolled by PCRE2's low-valued character tables, and may vary if locale-
6779 page). For example, in a French locale such as "fr_FR" in Unix-like
6786 be different for characters in the range 128-255 when locale-specific
6788 meanings from before Unicode support was available, mainly for effi-
6810 U+00A0 Non-break space
6817 U+2004 Three-per-em space
6818 U+2005 Four-per-em space
6819 U+2006 Six-per-em space
6824 U+202F Narrow no-break space
6838 In 8-bit, non-UTF-8 mode, only the characters with code points less
6844 any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
6849 This is an example of an "atomic group", details of which are given be-
6850 low. This particular group matches either the two-character sequence
6852 U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
6854 atomic group, the two-character sequence is treated as a single unit
6858 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
6864 PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for "back-
6866 the case, the other behaviour can be requested via the PCRE2_BSR_UNI-
6873 These override the default and the options given to the compiling func-
6874 tion. Note that these special settings, which are not Perl-compatible,
6877 used. They can be combined with a change of newline convention; for ex-
6883 Inside a character class, \R is treated as an unrecognized escape se-
6888 When PCRE2 is built with Unicode support (the default), three addi-
6890 are available. They can be used in any mode, though in 8-bit and 16-bit
6891 non-UTF modes these sequences are of course limited to testing charac-
6893 In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode
6894 limit) may be encountered. These are all treated as being in the Un-
6898 to do a multistage table lookup in order to find a character's prop-
6907 \P{xx} a character without the xx property
6910 The property names represented by xx above are not case-sensitive, and
6914 (including newline), Bidi_Class, a number of binary (yes/no) proper-
6916 other Perl properties such as "InMusicalSymbols" are not supported by
6922 There are three different syntax forms for matching a script. Each Uni-
6929 sign is an alternative to the colon. If a script name is given without
6930 a property type, for example, \p{Adlam}, it is treated as \p{scx:Ad-
6931 lam}. Perl changed to this interpretation at release 5.26 and PCRE2
6934 Unassigned characters (and in non-UTF 32-bit mode, characters with code
6936 that are not part of an identified script are lumped together as "Com-
6937 mon". The current list of recognized script names and their 4-character
6940 pcre2test -LS
6945 Each character has exactly one Unicode general category property, spec-
6946 ified by a two-letter abbreviation. For compatibility with Perl, nega-
6951 If only one letter is specified with \p or \P, it includes all the gen-
6978 Mn Non-spacing mark
7010 points are in the range U+D800 to U+DFFF. These characters are no dif-
7012 16-bit or 32-bit library). However, they are not valid in Unicode
7013 strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid-
7017 The long synonyms for property names that Perl supports (such as
7021 No character that is in the Unicode table has the Cn (unassigned) prop-
7027 different from the behaviour of current versions of Perl.
7036 pcre2test -LP
7055 L left-to-right
7056 LRE left-to-right embedding
7057 LRI left-to-right isolate
7058 LRO left-to-right override
7059 NSM non-spacing mark
7063 R right-to-left
7064 RLE right-to-left embedding
7065 RLI right-to-left isolate
7066 RLO right-to-left override
7071 case-insensitive; only the short names listed above are recognized.
7082 properties that had been used for emojis. Instead it introduced vari-
7083 ous emoji-specific properties. PCRE2 uses only the Extended Picto-
7092 2. Do not end between CR and LF; otherwise end after any control char-
7098 be followed by a V or T character; an LVT or T character may be fol-
7102 "zero-width joiner" character. Characters with the "mark" property al-
7109 property. Extend and ZWJ characters are allowed between the charac-
7112 7. Do not break within emoji flag sequences. That is, do not break be-
7120 As well as the standard Unicode properties described above, PCRE2 sup-
7121 ports four more that make it possible to convert traditional escape se-
7123 non-standard, non-Perl properties internally when PCRE2_UCP is set.
7128 Xsp Any Perl space character
7129 Xwd Any Perl "word" character
7131 Xan matches characters that have either the L (letter) or the N (num-
7134 (separator) property. Xsp is the same as Xps; in PCRE1 it used to ex-
7135 clude vertical tab, for Perl compatibility, but Perl changed. Xwd
7138 There is another non-standard property, Xuc, which matches any charac-
7145 Note that the Xuc property does not match these sequences but the char-
7151 characters not to be included in the final matched sequence that is re-
7162 mode), though it again reports the matched string as "bar". This fea-
7166 does not interfere with the setting of captured substrings. For exam-
7173 From version 5.32.0 Perl forbids the use of \K in lookaround asser-
7176 pcre2_compile() to re-enable the previous behaviour. When this option
7193 The final use of backslash is for certain simple assertions. An asser-
7195 a match, without consuming any characters from the subject string. The
7216 changed by setting the PCRE2_UCP option. When this is done, it also af-
7217 fects \b and \B. Neither PCRE2 nor Perl has a separate "start of word"
7225 set. Thus, they are independent of multiline mode. These three asser-
7227 which affect only the behaviour of the circumflex and dollar metachar-
7228 acters. However, if the startoffset argument of pcre2_match() is non-
7235 the start point of the matching process, as specified by the startoff-
7237 startoffset is non-zero. By calling pcre2_match() multiple times with
7238 appropriate arguments, you can mimic Perl's /g option, and it is in
7243 Perl's, which defines it as true at the end of the previous match. In
7244 Perl, these can be different when the previously matched string was
7255 The circumflex and dollar metacharacters are zero-width assertions.
7256 That is, they test for a particular condition being true without con-
7258 are concerned with matching the starts and ends of lines. If the new-
7259 line convention is set so that only the two-character sequence CRLF is
7265 point is at the start of the subject string. If the startoffset argu-
7266 ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
7268 character class, circumflex has an entirely different meaning (see be-
7275 if the pattern is constrained to match only at the start of the sub-
7280 matching point is at the end of the subject string, or immediately be-
7281 fore a newline at the end of the string (by default), unless PCRE2_NO-
7282 TEOL is set. Note, however, that it does not actually match the new-
7285 branch in which it appears. Dollar has no special meaning in a charac-
7297 a newline that ends the string, for compatibility with Perl. However,
7305 pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
7308 When the newline convention (see "Newline conventions" below) recog-
7309 nizes the two-character sequence CRLF as a newline, this is preferred,
7310 even if the single characters CR and LF are also recognized as new-
7324 Outside a character class, a dot in the pattern matches any one charac-
7325 ter in the subject string except (by default) a character that signi-
7329 Dot never matches a single line-ending character. When the two-charac-
7338 PCRE2_DOTALL option is set, a dot matches any one character, without
7339 exception. If the two-character sequence CRLF is present in the sub-
7342 The handling of dot is entirely independent of the handling of circum-
7352 the section entitled "Non-printing characters" above for details. Perl
7360 unit, whether or not a UTF mode is set. In the 8-bit library, one code
7361 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
7362 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
7363 line-ending characters. The feature is provided in Perl in order to
7364 match individual bytes in UTF-8 mode, but it is unclear how it can use-
7368 one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
7369 string may start with a malformed UTF character. This has undefined re-
7371 in a valid UTF string (by default it checks the subject string's valid-
7380 below) in UTF-8 or UTF-16 modes, because this would make it impossible
7383 these UTF modes. The former gives a match-time error; the latter fails
7386 In the 32-bit library, however, \C is always supported (when not ex-
7388 whether or not UTF-32 is specified.
7391 using it that avoids the problem of malformed UTF-8 or UTF-16 charac-
7393 as in this pattern, which could be used with a UTF-8 string (ignore
7396 (?| (?=[\x00-\x7f])(\C) |
7397 (?=[\x80-\x{7ff}])(\C)(\C) |
7398 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
7399 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
7403 below). The assertions at the start of each branch check the next UTF-8
7404 character for values whose encoding uses 1, 2, 3, or 4 bytes, respec-
7405 tively. The character's individual bytes are then captured by the ap-
7412 closing square bracket. A closing square bracket on its own is not spe-
7431 class that starts with a circumflex is not an assertion; it still con-
7437 letters in a class represent both their upper case and lower case ver-
7440 would. Note that there are two ASCII characters, K and S, that, in ad-
7441 dition to their lower case ASCII equivalents, are case-equivalent with
7442 Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when ei-
7446 special way when matching character classes, whatever line-ending se-
7454 matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option af-
7459 backspace character. The sequences \B, \R, and \X are not special in-
7464 The minus (hyphen) character can be used to specify a range of charac-
7465 ters in a character class. For example, [d-m] matches any letter be-
7470 [b-d-z] matches letters in the range b to d, a hyphen character, or z.
7472 Perl treats a hyphen as a literal if it appears before or after a POSIX
7475 class, Perl outputs a warning in its warning mode, as this is most
7479 It is not possible to have the literal character "]" as the end charac-
7480 ter of a range. A pattern such as [W-]46] is interpreted as a class of
7481 two characters ("W" and "-") followed by a literal string "46]", so it
7482 would match "W46]" or "-46]". However, if the "]" is escaped with a
7483 backslash it is interpreted as the end of range, so [W-\]46] is inter-
7488 Ranges normally include all code points between the start and end char-
7489 acters, inclusive. They can also be used for code points specified nu-
7490 merically, for example [\000-\037]. Ranges can include any characters
7491 that are valid for the current mode. In any UTF mode, the so-called
7494 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How-
7495 ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7499 points are both specified as literal letters in the same case. For com-
7500 patibility with Perl, EBCDIC code points within the range that are not
7501 letters are omitted. For example, [h-k] matches only four characters,
7504 [\x88-\x92] or [h-\x92], all code points are included.
7507 it matches the letters in either case. For example, [W-c] is equivalent
7508 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
7509 character tables for a French locale are in use, [\xc8-\xcb] matches
7523 special compatibility feature - see the next two sections), and the
7524 terminating closing square bracket. However, escaping other non-al-
7530 Perl supports the POSIX notation for character classes. This uses names
7541 ascii character codes 0 - 127
7555 CR (13), and space (32). If locale-specific matching is taking place,
7559 The name "word" is a Perl extension, and "blank" is a GNU extension
7560 from Perl 5.8. Another Perl extension is negation, which is indicated
7565 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7570 the POSIX character classes, although this may be different for charac-
7571 ters in the range 128-255 when locale-specific matching is happening.
7574 This is achieved by replacing certain POSIX classes with other se-
7591 when printed. In Unicode property terms, it matches all char-
7596 U+2066 - U+2069 Various "isolate"s
7603 [:punct:] This matches all characters that have the Unicode P (punctua-
7622 support is not compatible with Perl. It is provided to help migrations
7624 that \b matches at the start and the end of a word (see "Simple asser-
7625 tions" above), and in a Perl-style pattern the preceding or following
7626 character normally shows which is wanted, without the need for the as-
7627 sertions that are used above in order to give exactly the POSIX behav-
7650 can be changed from within the pattern by a sequence of letters en-
7651 closed between "(?" and ")". These options are Perl-compatible, and
7652 are described in detail in the pcre2api documentation. The option let-
7662 For example, (?im) sets caseless, multiline matching. It is also possi-
7663 ble to unset these options by preceding the relevant letters with a hy-
7664 phen, for example (?-im). The two "extended" options are not indepen-
7667 A combined setting and unsetting such as (?im-sx), which sets
7671 the option is unset. An empty options setting "(?)" is allowed. Need-
7675 the above options to be unset. Thus, (?^) is equivalent to (?-imnsx).
7676 Letters may follow the circumflex to cause some options to be re-in-
7679 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
7680 changed in the same way as the Perl-compatible options by using the
7683 When one of these option changes occurs at top level (that is, not in-
7692 not used). By this means, options can be made to have different set-
7693 tings in different parts of the pattern. Any changes made in one alter-
7705 start of a non-capturing group (see the next section), the option let-
7713 Note: There are other PCRE2-specific options, applying to the whole
7714 pattern, which can be set by the application when the compiling func-
7720 are equivalent to setting the PCRE2_UTF and PCRE2_UCP options, respec-
7735 matches "cataract", "caterpillar", or "cat". Without the parentheses,
7738 2. It creates a "capture group". This means that, when the whole pat-
7750 the captured substrings are "red king", "red", and "king", and are num-
7754 helpful. There are often times when grouping is required without cap-
7766 start of a non-capturing group, the option letters may appear between
7774 the group is reached, an option setting in one branch does affect sub-
7775 sequent branches, so the above patterns match "SUNDAY" as well as "Sat-
7781 Perl 5.10 introduced a feature whereby each alternative in a group uses
7783 with (?| and is itself a non-capturing group. For example, consider
7788 Because the two alternatives are inside a (?| group, both sets of cap-
7792 not all, of one of a number of alternatives. Inside a (?| group, paren-
7795 whole group start after the highest number used in any branch. The fol-
7796 lowing example is taken from the Perl documentation. The numbers under-
7799 # before ---------------branch-reset----------- after
7814 A relative reference such as (?-1) is no different: it is just a conve-
7817 If a condition test for a group's having matched refers to a non-unique
7830 was not added to Perl until release 5.10. Python had the feature ear-
7832 PCRE2 supports both the Perl and the Python syntax.
7835 (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
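
As an illustration of the Python spelling, the following sketch uses Python's own re module rather than PCRE2; the group names year and month are made up for the example. Python's re module does not accept the (?'name'...) form, so only the (?P<name>...) form is shown.

```python
import re

# (?P<name>...) is the Python spelling of a named capture group.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

m = DATE.fullmatch("2022-12")
year, month = m.group("year"), m.group("month")

# Named groups are numbered as well, so group 1 is the same as "year".
same_as_number = m.group(1) == year
print(year, month, same_as_number)
```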
7838 must start with a non-digit. When PCRE2_UTF is set, the syntax of group
7842 ^[_A-Za-z][_A-Za-z0-9]*\z when PCRE2_UTF is not set
7850 if the names were not present. In both PCRE2 and Perl, capture groups
7853 complete name-to-number translation table from a compiled pattern, as
7857 Warning: When more than one capture group has the same number, as de-
7859 all of them. Perl allows identically numbered groups to have different
7865 Perl allows this, with both names AA and BB as aliases of group 1.
7870 number to be associated with more than one name. The example above pro-
7871 vokes a compile-time error. However, there is still scope for confu-
7880 By default, a name must be unique within a pattern, except that dupli-
7885 The duplicate name constraint can be disabled by setting the PCRE2_DUP-
7891 of a weekday, either as a 3-letter abbreviation or as the full name,
7909 If you make a backreference to a non-unique named group from elsewhere
7918 If you make a subroutine call to a non-unique named group, the one that
7926 true. This is the same behaviour as testing by number. For further de-
7947 The general repetition quantifier specifies a minimum and maximum num-
7968 the syntax of a quantifier, is taken as a literal character. For exam-
7973 of which is represented by a two-byte sequence in a UTF-8 string. Simi-
7979 the previous item and the quantifier were not present. This may be use-
7980 ful for capture groups that are referenced as subroutines from else-
7981 where in the pattern (but see also the section entitled "Defining cap-
7986 For convenience, the three most common quantifiers have single-charac-
7999 Earlier versions of Perl and PCRE1 used to give an error at compile
8004 does not prevent backtracking into any of the iterations if a subse-
8008 possible (up to the maximum number of permitted times), without causing
8010 gives problems is in trying to match comments in C programs. These ap-
8011 pear between /* and */ and within the comment, individual * and / char-
8012 acters may appear. An attempt to match C comments by applying the pat-
8040 Perl), the quantifiers are not greedy by default, but individual ones
8045 that is greater than 1 or with a limited maximum, more memory is re-
8046 quired for the compiled pattern, in proportion to the size of the mini-
8050 (equivalent to Perl's /s) is set, thus allowing the dot to match new-
8053 so there is no point in retrying the overall match at any position af-
8057 In cases where it is known that the subject string contains no new-
8058 lines, it is worth setting PCRE2_DOTALL in order to obtain this opti-
8068 If the subject is "xyz123abc123" the match point is the fourth charac-
8071 Another case where implicit anchoring is not applied is when the lead-
8077 It matches "ab" in the subject "aab". The use of the backtracking con-
8087 is "tweedledee". However, if there are nested capture groups, the cor-
8100 to be re-evaluated to see if a different number of repeats allows the
8116 re-evaluated in this way.
8124 Perl 5.28 introduced an experimental alphabetic form starting with (*
8139 example can be thought of as a maximizing repeat that must swallow ev-
8141 the number of digits they match in order to make the rest of the pat-
8146 group is just a single repeated item, as in the example above, a sim-
8147 pler notation, called a "possessive quantifier" can be used. This con-
8158 Possessive quantifiers are always greedy; the setting of the PCRE2_UN-
8159 GREEDY option is ignored. They are a convenient notation for the sim-
8165 The possessive quantifier syntax is an extension to the Perl 5.8 syn-
8169 way into Perl at release 5.10.
8174 when B must follow. This feature can be disabled by the PCRE2_NO_AUTO-
8177 When a pattern contains an unlimited repeat inside a group that can it-
8184 matches an unlimited number of substrings that either consist of non-
8192 * repeat in a large number of ways, and all have to be tried. (The ex-
8194 PCRE2 and Perl have an optimization that allows for fast failure when a
8202 sequences of non-digits cannot be broken, and failure happens quickly.
8215 words, the group that is referenced need not be to the left of the ref-
8223 subsection entitled "Non-printing characters" above for further details
8224 of the handling of digits following a backslash. Other forms of back-
8237 An unsigned number specifies an absolute reference without the ambigu-
8242 (abc(def)ghi)\g{-1}
8244 The sequence \g{-1} is a reference to the most recently started capture
8245 group before \g, that is, it is equivalent to \2 in this example.
8246 Similarly, \g{-2} would be equivalent to \1. The use of relative references
8248 by joining together fragments that contain references within them-
8252 of forward reference can be useful in patterns that repeat. Perl does
8264 time of the backreference, the case of letters is relevant. For exam-
8273 capture groups. The .NET syntax \k{name} and the Perl syntax \k<name>
8274 or \k'name' are supported, as is the Python syntax (?P=name). Perl
8294 the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
8297 Because there may be many capture groups in a pattern, all digits fol-
8298 lowing a backslash are taken as part of a potential backreference num-
8308 However, such references can be useful inside repeated groups. For ex-
8313 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
8314 ation of the group, the backreference matches the character string cor-
8317 the backreference. This can be done using alternation, as in the exam-
8335 subject string, and those that look behind it, and in each case an as-
8338 group is matched in the normal way, and if it is true, matching contin-
8339 ues after it, but with the matching position in the subject string re-
8342 The Perl-compatible lookaround assertions are atomic. If an assertion
8343 is true, but there is a subsequent matching failure, there is no back-
8344 tracking into the assertion. However, there are some cases where non-
8345 atomic assertions can be useful. PCRE2 has some support for these, de-
8346 scribed in the section entitled "Non-atomic assertions" below, but they
8347 are not Perl-compatible.
8353 Assertion groups are not capture groups. If an assertion contains cap-
8355 the capture groups in the whole pattern. Within each branch of an as-
8357 way. For example, a sequence such as (.)\g{-1} can be used to check
8364 retained after a successful negative assertion. When an assertion con-
8367 For a positive assertion, internally captured substrings in the suc-
8368 cessful branch are retained, and matching continues with the next pat-
8377 Most assertion groups may be repeated; though it makes no sense to as-
8378 sert the same thing several times, the side effect of capturing in pos-
8389 to specify lookaround assertions. Perl 5.28 introduced some experimen-
8391 start with (* instead of (? and must be written using lower case let-
8399 For example, (*pla:foo) is the same assertion as (?=foo). In the fol-
8400 lowing sections, the various assertions are described using the origi-
8410 matches a word followed by a semicolon, but does not include the semi-
8426 most convenient way to do it is with (?!) because an empty string al-
8440 strings it matches must have a fixed length. However, if there are sev-
8441 eral top-level alternatives, they do not all have to have the same
8452 This is an extension compared with Perl, which requires all branches to
8457 is not permitted, because its single top-level branch can match two
8459 two top-level branches:
8464 of a lookbehind assertion to get round the fixed-length restriction.
8468 then try to match. If there are insufficient characters before the cur-
8471 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
8474 the lookbehind. The \X and \R escapes, which can match different num-
8478 lookbehinds, as long as the called capture group matches a fixed-length
8482 Perl does not support backreferences in lookbehinds. PCRE2 does support
8483 them, but only if certain conditions are met. The PCRE2_MATCH_UN-
8493 Possessive quantifiers can be used in conjunction with lookbehind as-
8494 sertions to specify efficient matching of fixed-length strings at the
8500 proceeds from left to right, PCRE2 will look for each "a" in the sub-
8515 quantifier; it can match only the entire string. The subsequent lookbe-
8530 three characters are not "999". This pattern does not match "foo" pre-
8532 three of which are not "999". For example, it doesn't match "123abc-
8554 NON-ATOMIC ASSERTIONS
8556 The traditional Perl-compatible lookaround assertions are atomic. That
8557 is, if an assertion is true, but there is a subsequent matching fail-
8559 some cases where non-atomic positive assertions can be useful. PCRE2
8565 Consider the problem of finding the right-most word in a string that
8574 and sets the "x" option, which causes white space (introduced for read-
8578 words, when the assertion first succeeds, it captures the right-most
8584 succeeds, we are done, but if the last word in the string does not oc-
8586 lookahead (?= or (*pla: had been used, the assertion could not be re-en-
8590 Using a non-atomic lookahead, however, means that when the last word
8592 find the second-last word, and so on, until either the match succeeds,
8595 Two conditions must be met for a non-atomic assertion to be useful: the
8600 using a non-atomic assertion just wastes resources.
8602 There is one exception to backtracking into a non-atomic assertion. If
8603 an (*ACCEPT) control verb is triggered, the assertion succeeds atomi-
8607 Non-atomic assertions are not supported by the alternative matching
8625 matches are not a script run. After a failure, normal backtracking oc-
8626 curs. Script runs can be used to detect spoofing attacks using charac-
8628 "paypal.com" is an infamous example, where the letters could be a mix-
8629 ture of Latin and Cyrillic. This pattern ensures that the matched char-
8630 acters in a sequence of non-spaces that follow white space are a script
8646 \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
8659 Support for script runs is not available if PCRE2 is compiled without
8660 Unicode support. A compile-time error is given if any of the above con-
8662 matching function, pcre2_dfa_match() because they use the same mecha-
8677 (?(condition)yes-pattern)
8678 (?(condition)yes-pattern|no-pattern)
8680 If the condition is satisfied, the yes-pattern is used; otherwise the
8681 no-pattern (if present) is used. An absent no-pattern is equivalent to
8682 an empty string (it always matches). If there are more than two alter-
8683 natives in the group, a compile-time error occurs. Each of the two al-
8684 ternatives may itself contain nested groups of any form, including con-
8692 There are five kinds of condition: references to capture groups, refer-
8693 ences to recursion, two pseudo-conditions called DEFINE and VERSION,
8702 is true if any of them have matched. An alternative notation is to pre-
8703 cede the digits with a plus or minus sign. In this case, the group num-
8705 group can be referenced by (?(-1), the next most recent by (?(-2), and
8708 (The value zero in any of these forms is not used; it provokes a com-
8709 pile-time error.)
8711 Consider the following pattern, which contains non-significant white
8718 character is present, sets it as the first captured substring. The sec-
8722 opening parenthesis, the condition is true, and so the yes-pattern is
8723 executed and a closing parenthesis is required. Otherwise, since no-
8725 words, this pattern matches a sequence of non-parentheses, optionally
8731 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
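
A group-reference condition of this kind can be tried out with the sketch below. It is written for Python's re module, which is only a stand-in here: Python supports the absolute form (?(1)...) but not the relative (?(-1)...) form, so the absolute number is used. The helper name balanced_or_bare is hypothetical.

```python
import re

# Group 1 optionally captures "("; the condition (?(1)\)) then demands
# a closing ")" only if group 1 actually took part in the match.
PATTERN = re.compile(r"(\()?[^()]+(?(1)\))")

def balanced_or_bare(text):
    """True for 'abc' or '(abc)', False for '(abc' or 'abc)'."""
    return PATTERN.fullmatch(text) is not None

results = {s: balanced_or_bare(s) for s in ["abc", "(abc)", "(abc", "abc)"]}
print(results)
```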
8738 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
8740 PCRE1, which had this facility before Perl, the syntax (?(name)...) is
8742 the letter R followed by digits are ambiguous (see the following sec-
8753 "Recursion" in this sense refers to any subroutine-like call from one
8754 part of the pattern to another, whether or not it is actually recur-
8759 the name R, the condition is true if matching is currently in a recur-
8769 name, the condition tests for its being set, as described in the sec-
8771 group with the name R1 by adding (?<R1>) to the above pattern com-
8792 be only one alternative in the rest of the conditional group. It is al-
8794 DEFINE is that it can be used to define subroutines that can be refer-
8799 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
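
Python's re module has no DEFINE condition, so the sketch below approximates the idea by interpolating the "byte" alternatives into a larger pattern as an ordinary string. This is an assumption-laden substitute for illustration, not how PCRE2 processes DEFINE; the names BYTE, IPV4, and is_ipv4 are hypothetical.

```python
import re

# The same alternatives as the "byte" group in the DEFINE example.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four byte components separated by dots; BYTE is reused by string
# interpolation where PCRE2 would use a subroutine call.
IPV4 = re.compile(BYTE + r"(?:\." + BYTE + r"){3}")

def is_ipv4(text):
    """True if the whole string is a dotted quad with no leading zeros."""
    return IPV4.fullmatch(text) is not None

checks = {s: is_ipv4(s) for s in
          ["192.168.1.1", "255.255.255.255", "256.1.1.1", "1.2.3", "01.2.3.4"]}
print(checks)
```

Interpolation copies the subpattern text at each use, whereas a DEFINE group is compiled once and then called by name.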
8807 to match the four dot-separated components of an IPv4 address, insist-
8812 Programs that link with a PCRE2 library can check the version by call-
8814 that do not have access to the underlying code cannot do this. A spe-
8830 or lookbehind assertion. However, it must be a traditional atomic as-
8831 sertion, not one of the PCRE2-specific non-atomic assertions.
8833 Consider this pattern, again containing non-significant white space,
8836 (?(?=[^a-z]*[a-z])
8837 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
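
Python's re module does not accept an assertion as a condition, so the following sketch emulates the pattern above with a positive lookahead on one alternative and the matching negative lookahead on the other. It is an approximation for experimenting with the idea, not PCRE2 syntax; the names LETTER_AHEAD, DATE, and is_date are hypothetical.

```python
import re

# The condition: an optional run of non-letters followed by a letter.
LETTER_AHEAD = r"[^a-z]*[a-z]"

# If a letter lies ahead, require dd-aaa-dd; otherwise require dd-dd-dd.
DATE = re.compile(
    r"(?=" + LETTER_AHEAD + r")\d{2}-[a-z]{3}-\d{2}"
    r"|"
    r"(?!" + LETTER_AHEAD + r")\d{2}-\d{2}-\d{2}"
)

def is_date(text):
    """True if the whole string has one of the two date forms."""
    return DATE.fullmatch(text) is not None

outcomes = {s: is_date(s) for s in ["12-jan-99", "12-01-99", "12-01-9a", "12-ja-99"]}
print(outcomes)
```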
8839 The condition is a positive lookahead assertion that matches an op-
8840 tional sequence of non-letters followed by a letter. In other words, it
8841 tests for the presence of at least one letter in the subject. If a let-
8844 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
8847 When an assertion that is a condition contains capture groups, any cap-
8849 both positive and negative assertions, because matching always contin-
8850 ues after the assertion, whether it succeeds or fails. (Compare non-
8851 conditional assertions, for which captures are retained only for posi-
8870 at the start of the pattern, as described in the section entitled "New-
8874 when PCRE2_EXTENDED is set, and the default newline convention (a sin-
8888 unlimited nested parentheses. Without the use of recursion, the best
8893 For some time, Perl has provided a facility that allows regular expres-
8895 Perl code in the expression at run time, and the code can refer to the
8896 expression itself. A Perl pattern using code interpolation to solve the
8901 The (?p{...}) item interpolates Perl code at run time, and in this case
8904 Obviously, PCRE2 cannot support the interpolation of Perl code. In-
8908 into Perl at release 5.10.
8913 group. (If not, it is a non-recursive subroutine call, which is de-
8914 scribed in the next section.) The special item (?R) or (?0) is a recur-
8923 substrings which can either be a sequence of non-parentheses, or a re-
8926 possessive quantifier to avoid backtracking into sequences of non-
8939 of (?1) in the pattern above you can write (?-2) to refer to the second
8948 (?|(a)|(b)) (c) (?-2)
8951 (c) is number 2. When the reference (?-2) is encountered, the second
8954 the same if an absolute reference (?1) was used. In other words, rela-
8960 are always non-recursive subroutine calls, as described in the next
8963 An alternative approach is to use named parentheses. The Perl syntax
8964 for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup-
8972 The example pattern that we have been looking at contains nested unlim-
8974 strings of non-parentheses is important when applying the pattern to
8986 callout function can be used (see below and the pcre2callout documenta-
8998 recursion. Consider this pattern, which matches text in angle brack-
9000 brackets (that is, when recursing), whereas any characters are permit-
9006 different alternatives for the recursive and non-recursive cases. The
9009 Differences in recursion processing between PCRE2 and Perl
9011 Some former differences between PCRE2 and Perl no longer exist.
9013 Before release 10.30, recursion processing in PCRE2 differed from Perl
9016 never re-entered, even if it contained untried alternatives and there
9018 recursion before Perl did.)
9021 treated as atomic. That is, they can be re-entered to try unused alter-
9023 now compatible with the way Perl works. If you want a subroutine call
9026 Supporting backtracking into recursions simplifies certain types of re-
9035 match fails. If you want to match typical palindromic phrases, the pat-
9036 tern has to ignore all non-word characters, which can be done like
9042 such as "A man, a plan, a canal: Panama!". Note the use of the posses-
9043 sive quantifier *+ to avoid backtracking into sequences of non-word
9044 characters. Without this, PCRE2 takes a great deal longer (ten times or
9045 more) to match typical phrases, and Perl takes so long that you think
9048 Another way in which PCRE2 and Perl used to differ in their recursion
9049 processing is in the handling of captured values. Formerly in Perl,
9061 to fail in Perl, but in later versions (I tried 5.024) it now works.
9070 to match at the current matching position. The called group may be de-
9071 fined before or after the reference. A numbered reference can be abso-
9075 (...(relative)...)...(?-1)...
9096 Processing options such as case-independence are fixed when a group is
9100 (abc)(?i:(?-1))
9112 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
9114 an alternative syntax for calling a group as a subroutine, possibly re-
9124 (abc)(?i:\g<-1>)
9126 Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
9133 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
9134 Perl code to be obeyed in the middle of matching a regular expression.
9135 This makes it possible, amongst other things, to extract different sub-
9136 strings that match the same pair of parentheses when there is a repeti-
9139 PCRE2 provides a similar feature, but of course it cannot obey arbi-
9140 trary Perl code. The feature is called "callout". The caller of PCRE2
9144 passed, or if the callout entry point is set to NULL, callouts are dis-
9154 in a similar way to Perl.
9156 During matching, when PCRE2 reaches a callout point, the external func-
9163 time, and one side-effect is that sometimes callouts are skipped. If
9179 They are all numbered 255. If there is a conditional group in the pat-
9191 A delimited string may be used instead of a number as a callout argu-
9193 ending delimiter is the same as the start, except for {, where the end-
9206 Perl's terminology) that modify the behaviour of backtracking during
9212 By default, for compatibility with Perl, a name is any sequence of
9216 PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati-
9222 and sequences such as \x{100} that define character code points. Char-
9228 names is skipped, and #-comments are recognized, exactly as in the rest
9232 The maximum length of a name is 255 in the 8-bit library and 65535 in
9233 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
9235 the colon were not there. Any number of these verbs may occur in a pat-
9239 them can be used only when the pattern is to be matched using the tra-
9256 course, be processed. You can suppress the start-of-match optimizations
9257 by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
9262 Experiments with Perl suggest that it too has similar optimizations,
9274 then continues at the outer level. If (*ACCEPT) is triggered in a posi-
9278 If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
9283 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
9286 (*ACCEPT) is the only backtracking verb that is allowed to be quanti-
9294 is triggered and the match succeeds. In both cases, all but C is cap-
9295 tured. Whereas (*COMMIT) (see below) means "fail on backtrack", a re-
9298 Warning: (*ACCEPT) should not be used within a script run group, be-
9306 read. The Perl documentation notes that it is probably useful only when
9307 combined with (?{}) or (??{}). Those are, of course, Perl features that
9308 are not present in PCRE2. The nearest equivalent is the callout fea-
9316 (*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*AC-
9322 There is one verb whose main purpose is to track how a match was ar-
9323 rived at, though it also has a secondary use in conjunction with ad-
9328 A name is always required with this verb. For all the other backtrack-
9331 When a match succeeds, the name of the last-encountered mark name on
9332 the matching path is passed back to the caller as described in the sec-
9333 tion entitled "Other information about the match" in the pcre2api docu-
9340 back. A verb without a NAME argument is ignored for this purpose. Here
9352 The (*MARK) name is tagged with "MK:" in this output, and in this exam-
9354 efficient way of obtaining this information than putting each alterna-
9358 true, the name is recorded and passed back if it is the last-encoun-
9380 The following verbs do nothing when they are encountered. Matching con-
9382 causing a backtrack to the verb, a failure is forced. That is, back-
9386 group has been matched, there is never any backtracking into it. Back-
9390 These verbs differ in exactly what kind of failure occurs when back-
9392 when the verb is not in a subroutine or an assertion. Subsequent sec-
9398 matching failure that causes backtracking to reach it. Even if the pat-
9401 verb that is encountered, once it has been passed pcre2_match() is com-
9410 The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM-
9411 MIT). It is like (*MARK:NAME) in that the name is remembered for pass-
9413 that are set with (*MARK), ignoring those set by any of the other back-
9421 Note that (*COMMIT) at the start of a pattern is not the same as an an-
9422 chor, unless PCRE2's start-of-match optimizations are turned off, as
9438 (*COMMIT) causes the match to fail without trying any other starting
9444 the subject if there is a later matching failure that causes backtrack-
9450 (*PRUNE) is just an alternative to an atomic group or possessive quan-
9462 This verb, when given without a name, is like (*PRUNE), except that if
9464 character, but to the position in the subject where (*SKIP) was encoun-
9473 skips on to start the next attempt at "c". Note that a possessive quan-
9475 suppress backtracking during the first match attempt, the second at-
9489 found, the "bumpalong" advance is to the subject position that corre-
9495 atomic groups or assertions, because they are never re-entered by back-
9513 backtracks, and this causes a new matching attempt to start at the sec-
9523 This verb causes a skip to the next innermost alternative when back-
9526 that it can be used for a pattern-based if-then-else block:
9532 skips to the second alternative and tries COND2, without backtracking
9533 into COND1. If that succeeds and BAR fails, COND3 is tried. If subse-
9534 quently BAZ fails, there are no more alternatives, so there is a back-
9535 track to whatever came before the entire group. If (*THEN) is not in-
9543 A group that does not contain a | character is just a part of the en-
9544 closing alternative; it is not a nested alternation with only one al-
9545 ternative. The effect of (*THEN) extends beyond such a group to the en-
9546 closing alternative. Consider this pattern, where A, B, etc. are com-
9559 The effect of (*THEN) is now confined to the inner group. After a fail-
9564 Note that a conditional group is not considered as having two alterna-
9571 If the subject is "ba", this pattern does not match. Because .*? is un-
9591 that is backtracked onto first acts. For example, consider this pat-
9599 is consistent, but is not always the same as Perl's. It means that if
9611 PCRE2 sometimes differs from Perl in its handling of backtracking verbs
9616 If the subject is "abac", Perl matches unless its optimizations are
9628 succeed without any further processing; captured strings and a mark
9629 name (if set) are retained. In a standalone negative assertion, (*AC-
9630 CEPT) causes the assertion to fail without any further processing; cap-
9638 reach them. This means that, for the Perl-compatible assertions, their
9639 effect is confined to the assertion, because Perl lookaround assertions
9645 PCRE2 now supports non-atomic positive assertions, as described in the
9646 section entitled "Non-atomic assertions" above. These assertions must
9647 be standalone (not used as conditions). They are not Perl-compatible.
9648 For these assertions, a later backtrack does jump back into the asser-
9649 tion, and therefore verbs such as (*COMMIT) can be triggered by back-
9657 in a standalone positive assertion. In a conditional positive asser-
9659 or (*PRUNE) causes the condition to be false. However, for both stand-
9661 (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9669 to succeed without any further processing. Matching then continues af-
9670 ter the subroutine call. Perl documents this behaviour. Perl's treat-
9677 when triggered by being backtracked to in a group called as a subrou-
9702 Copyright (c) 1997-2022 University of Cambridge.
9703 ------------------------------------------------------------------------------
9711 PCRE2 - Perl-compatible regular expressions (revised API)
9715 Two aspects of performance are discussed below: memory usage and pro-
9740 is not usually a problem. However, if the numbers are large, and par-
9746 uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
9748 limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
9749 libraries, and this is reached with the above pattern if the outer rep-
9755 of PCRE2's "subroutine" facility. Re-writing the above pattern as
9761 this kind of pattern is not always exactly equivalent, because any cap-
9764 process patterns that PCRE2 cannot otherwise handle. The matching per-
9766 same. (This applies from release 10.30 - things were different in ear-
9772 From release 10.30, the interpretive (non-JIT) version of pcre2_match()
9773 uses very little system stack at run time. In earlier releases recur-
9775 cause problems, but this usage has been eliminated. Backtracking posi-
9781 On a 64-bit system the frame size for a pattern with no captures is 128
9785 the system stack, but this still caused some issues for multi-thread
9790 block and re-used if that block is used for another match. It is freed
9802 function calls, but only for processing atomic groups, lookaround as-
9807 has been re-factored to use heap memory when necessary for internal
9818 Certain items in regular expression patterns are processed more effi-
9820 [aeiou] than a set of single-character alternatives such as
9824 expressions for efficient performance. This document contains a few ob-
9828 slow, because PCRE2 has to use a multi-stage table lookup whenever it
9838 pcre2_match(); the performance loss is less with a DFA matching func-
9841 When a pattern begins with .* not in atomic parentheses, nor in paren-
9845 multiple top-level branches, they must all be anchorable. The optimiza-
9846 tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au-
9849 If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, be-
9851 subject string contains newlines, the pattern may match from the char-
9862 If you are using such a pattern with subject strings that do not con-
9864 PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate ex-
9865 plicit anchoring. That saves PCRE2 from having to scan along the sub-
9879 in principle to try every possible variation, and this can take an ex-
9887 matching procedure, PCRE2 checks that there is a "b" later in the sub-
9888 ject string, and if there is not, it fails the match immediately. How-
9899 an atomic group or a possessive quantifier. This can often reduce mem-
9910 matched character. For a long string, a lot of memory is required. Con-
9916 This runs much faster, because sequences of characters that do not con-
9917 tain "<" are "swallowed" in one item inside the parentheses, and a pos-
9919 non-"<" characters. This version also uses a lot less memory because
9934 pcre2_match() or pcre2_dfa_match() is called. For details of these in-
9954 Copyright (c) 1997-2022 University of Cambridge.
9955 ------------------------------------------------------------------------------
9963 PCRE2 - Perl-compatible regular expressions (revised API)
9983 This set of functions provides a POSIX-style API for the PCRE2 regular
9984 expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
9985 16-bit and 32-bit libraries. See the pcre2api documentation for a de-
9986 scription of PCRE2's native API, which contains much additional func-
9991 header file, and they all have unique names starting with pcre2_. How-
9992 ever, the pcre2posix.h header also contains macro definitions that con-
9994 This means that a program can use the usual POSIX names without running
9998 On Unix-like systems the PCRE2 POSIX library is called libpcre2-posix,
9999 so can be accessed by adding -lpcre2-posix to the command for linking
10001 also necessary to add -lpcre2-8.
10005 regcomp() etc. These simply passed their arguments to the PCRE2 func-
10018 names start with "REG_"; these are used for setting options and identi-
10033 PCRE2-specific features via the POSIX calling interface or to add BSD
10037 POSIX-like in style. The syntax and semantics of the regular expres-
10038 sions themselves are still those of Perl, subject to the setting of
10039 various PCRE2 options, as described below. "POSIX-like in style" means
10041 POSIX-compatible, and in multi-unit encoding domains it is probably
10045 described above, the standard POSIX names (without the pcre2_ prefix)
10051 The function pcre2_regcomp() is called to compile a pattern into an in-
10052 ternal form. By default, the pattern is a C string terminated by a bi-
10076 the defined POSIX behaviour for REG_NEWLINE (see the following sec-
10082 for compilation to the native function. This disables all meta charac-
10091 pcre2_regexec() for matching, the nmatch and pmatch arguments are ig-
10092 nored, and no captured strings are returned. Versions of the PCRE2 li-
10093 brary prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile op-
10094 tion, but this no longer happens because it disables the use of back-
10101 the end of the pattern before calling pcre2_regcomp(). The pattern it-
10102 self may now contain binary zeros, which are treated as data charac-
10103 ters. Without REG_PEND, a binary zero terminates the pattern and the
10125 all data strings used for matching it to be treated as UTF-8 strings.
10129 function. This means that the regex is compiled with PCRE2 default se-
10131 subject string is the Perl way, not the POSIX way. Note that setting
10133 It does not affect the way newlines are matched by the dot metacharac-
10136 The yield of pcre2_regcomp() is zero on success, and non-zero other-
10139 number of capturing subpatterns in the regular expression. Various er-
10142 NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
10150 This area is not simple, because POSIX and Perl take different views of
10154 Perl and PCRE2:
10164 This is the equivalent table for a POSIX-compatible pattern matcher:
10175 API. By default, PCRE2's behaviour is the same as Perl's, except that
10176 there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
10177 and Perl, there is no way to stop newline from matching [^a].
10181 there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac-
10197 The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
10204 standard. However, setting this option can give more POSIX-like behav-
10209 The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
10216 point to the first character beyond the string. There may be binary ze-
10223 relative to string + pmatch[0].rm_so, but this differs from other im-
10228 intended to be portable to other systems. Note that a non-zero rm_so
10236 pcre2_regexec() are ignored (except possibly as input for REG_STAR-
10239 The value of nmatch may be zero, and the value pmatch may be NULL (un-
10251 Unused entries in the array have both structure members set to -1.
10253 A successful match yields a zero return; various error codes are de-
10254 fined in the header file, of which REG_NOMATCH is the "expected" fail-
10260 The pcre2_regerror() function maps a non-zero errorcode from either
10263 A message terminated by a binary zero is placed in errbuf. If the buf-
10264 fer is too short, only the first errbuf_size - 1 characters of the er-
10272 Compiling a regular expression causes memory to be allocated and asso-
10274 such memory, after which preg may no longer be used as a compiled ex-
10288 Copyright (c) 1997-2021 University of Cambridge.
10289 ------------------------------------------------------------------------------
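The compile, match, report, and free sequence described for the POSIX-style functions can be sketched as follows. This is only an illustration under stated assumptions: it includes the system <regex.h> so that it builds without PCRE2 installed, and the helper name posix_first_group() is hypothetical. Building against pcre2posix.h instead (linking with -lpcre2-posix and -lpcre2-8) should need only the header swapped, though the pattern would then be interpreted with Perl syntax and semantics.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: compile pattern, match it against subject, and copy
   the text of capture group 1 (if it was set) into out. Returns 0 on a
   match, REG_NOMATCH when nothing matches, or another non-zero code. */
int posix_first_group(const char *pattern, const char *subject,
                      char *out, size_t outsize)
{
    regex_t preg;
    regmatch_t pmatch[2];   /* [0] is the whole match, [1] is group 1 */
    char errbuf[128];
    int rc;

    assert(pattern != NULL && subject != NULL && out != NULL);
    if (outsize == 0) return REG_ESPACE;
    out[0] = '\0';

    rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc != 0) {
        /* After a failed compile only the error code is used;
           preg is neither matched against nor freed. */
        regerror(rc, &preg, errbuf, sizeof errbuf);
        fprintf(stderr, "compile failed: %s\n", errbuf);
        return rc;
    }

    rc = regexec(&preg, subject, 2, pmatch, 0);
    if (rc == 0 && pmatch[1].rm_so != -1) {   /* -1 marks an unused entry */
        size_t n = (size_t)(pmatch[1].rm_eo - pmatch[1].rm_so);
        if (n >= outsize) n = outsize - 1;    /* truncate to fit the buffer */
        memcpy(out, subject + pmatch[1].rm_so, n);
        out[n] = '\0';
    }

    regfree(&preg);   /* release the memory associated with preg */
    return rc;
}
```

A caller passes a buffer for the captured text and compares the return value with zero or REG_NOMATCH; any other value can be handed to regerror() for a message.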
10297 PCRE2 - Perl-compatible regular expressions (revised API)
10305 can save this listing to re-create the contents of pcre2demo.c.
10310 used. If matching succeeds, the program outputs the portion of the sub-
10311 ject that matched, together with the contents of any captured sub-
10314 If the -g option is given on the command line, the program then goes on
10316 subject string. The logic is a little bit tricky because of the possi-
10320 The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit
10321 library. It handles strings and characters that are stored in 8-bit
10324 treated as UTF-8 strings, where characters may occupy multiple code
10328 for your operating system, you should be able to compile the demonstra-
10331 cc -o pcre2demo pcre2demo.c -lpcre2-8
10334 to the command line. For example, on a Unix-like system that has PCRE2
10335 installed in /usr/local, you can compile the demonstration program us-
10338 cc -o pcre2demo -I/usr/local/include pcre2demo.c \
10339 -L/usr/local/lib -lpcre2-8
10345 ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
10348 pcre2test, which supports many more facilities for testing regular ex-
10349 pressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
10350 though not all three need be installed). The pcre2demo program is pro-
10357 ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
10360 This is caused by the way shared library support works on those sys-
10363 -R/usr/local/lib
10378 Copyright (c) 1997-2016 University of Cambridge.
10379 ------------------------------------------------------------------------------
10385 PCRE2 - Perl-compatible regular expressions (revised API)
10387 SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
10404 run. However, if you are using the just-in-time optimization feature,
10405 it is not possible to save and reload the JIT data, because it is posi-
10406 tion-dependent. The host on which the patterns are reloaded must be
10409 For example, patterns compiled on a 32-bit system using PCRE2's 16-bit
10410 library cannot be reloaded on a 64-bit system, nor can they be reloaded
10411 using the 8-bit library.
10418 linked with a fixed version of PCRE2 must be prepared to recompile pat-
10428 checking, not complete validation of what is being re-loaded. Corrupted
10440 in the byte stream (its size is 1088 bytes). For more details of char-
10441 acter tables, see the section on locale support in the pcre2api docu-
10447 the length of the vector. The third and fourth arguments point to vari-
10461 PCRE2_ERROR_BADMAGIC means either that a pattern's code has been cor-
10462 rupted, or that a slot in the vector does not point to a compiled pat-
10485 the 256 possible byte values. On systems that make a distinction be-
10486 tween binary and non-binary data, be sure that the file is opened for
10491 freed in the usual way by calling pcre2_code_free(). When you have fin-
10492 ished with the byte stream, it too must be freed by calling pcre2_seri-
10493 alize_free(). If this function is called with a NULL argument, it re-
10494 turns immediately without doing anything.
10497 RE-USING PRECOMPILED PATTERNS
10499 In order to re-use a set of saved patterns you must first make the se-
10501 from a file). The management of this memory block is up to the applica-
10503 find out how many compiled patterns are in the serialized data without
10512 and its length, and the third argument points to a byte stream. The fi-
10515 If this argument is NULL, malloc() and free() are used. After deserial-
10524 stream, it is filled with those that fit, and the remainder are ig-
10540 potential race issue if you are using multiple patterns that were de-
10541 coded from a single byte stream in a multithreaded application. A sin-
10543 and a reference count is used to arrange for its memory to be automati-
10550 If a pattern was processed by pcre2_jit_compile() before being serial-
10566 Copyright (c) 1997-2018 University of Cambridge.
10567 ------------------------------------------------------------------------------
10575 PCRE2 - Perl-compatible regular expressions (revised API)
10579 The full syntax and semantics of the regular expressions that are sup-
10581 document contains a quick-reference summary of the syntax.
10586 \x where x is non-alphanumeric is a literal x
10596 \cx "control-x", where x is any ASCII printing character
10617 read, but in ALT_BSUX mode \x must be followed by two hexadecimal dig-
10623 Note that \0dd is always an octal code. The treatment of backslash fol-
10624 lowed by a non-zero digit is complicated; for details see the section
10625 "Non-printing characters" in the pcre2pattern documentation, where de-
10643 \P{xx} a character without the xx property
10650 \W a "non-word" character
10654 middle of a UTF-8 or UTF-16 character. The application can lock out the
10658 By default, \d, \s, and \w match only ASCII characters, even in UTF-8
10659 mode or in the 16-bit and 32-bit libraries. However, if locale-specific
10661 points in the range 128-255. If the PCRE2_UCP option is set, the behav-
10665 Property descriptions in \p and \P are matched caselessly; hyphens, un-
10691 Mn Non-spacing mark
10723 Xsp Perl space: property Z or tab, NL, VT, FF, CR
10724 Xuc Universally-named character: one that can be
10726 Xwd Perl word: property Xan or underscore
10728 Perl and POSIX space are now the same. Perl added VT to its space char-
10739 pcre2test -LP
10744 Many script names and their 4-letter abbreviations are recognized in
10746 of course). You can obtain a list of these scripts by running this com-
10749 pcre2test -LS
10768 L left-to-right
10769 LRE left-to-right embedding
10770 LRI left-to-right isolate
10771 LRO left-to-right override
10772 NSM non-spacing mark
10776 R right-to-left
10777 RLE right-to-left embedding
10778 RLI right-to-left isolate
10779 RLO right-to-left override
10788 [x-y] range (can be used for hex characters)
10794 ascii 0-127
10853 From release 10.38 \K is not permitted by default in lookaround asser-
10854 tions, for compatibility with Perl. However, if the PCRE2_EXTRA_AL-
10855 LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled.
10856 When this option is set, \K is honoured in positive assertions, but ig-
10868 (?<name>...) named capture group (Perl)
10869 (?'name'...) named capture group (Perl)
10871 (?:...) non-capture group
10872 (?|...) non-capture group; reset group numbers for
10875 In non-UTF modes, names may contain underscores and ASCII letters and
10882 (?>...) atomic non-capture group
10883 (*atomic:...) atomic non-capture group
10903 (?-...) unset option(s)
10907 a mixture of setting and unsetting such as (?i-x) is allowed, but there
10909 for example (?^in). An option setting may appear at the start of a non-
10912 The following are recognized only at the very start of a pattern or af-
10921 (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
10924 (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
10938 These are recognized only at the very start of the pattern or after op-
10951 These are recognized only at the very start of the pattern or after op-
10976 Each top-level branch of a lookbehind must be of a fixed length.
10979 NON-ATOMIC LOOKAROUND ASSERTIONS
10981 These assertions are specific to PCRE2 and are not Perl-compatible.
11007 \g-n relative reference by number
11009 \g{-n} relative reference by number
11010 \k<name> reference by name (Perl)
11011 \k'name' reference by name (Perl)
11012 \g{name} reference by name (Perl)
11022 (?-n) call subroutine by relative number
11023 (?&name) call subroutine by name (Perl)
11031 \g<-n> call subroutine by relative number (PCRE2 extension)
11032 \g'-n' call subroutine by relative number (PCRE2 extension)
11037 (?(condition)yes-pattern)
11038 (?(condition)yes-pattern|no-pattern)
11042 (?(-n) relative reference condition
11043 (?(<name>) named reference condition (Perl)
11044 (?('name') named reference condition (Perl)
11070 The following act only when a subsequent match failure causes a back-
11072 what happens afterwards. Those that advance the start-of-match point do
11114 Copyright (c) 1997-2022 University of Cambridge.
11115 ------------------------------------------------------------------------------
11123 PCRE2 - Perl-compatible regular expressions (revised API)
11128 it, you can build it without, in which case the library will be
11130 properties and can process strings of text in UTF-8, UTF-16, and UTF-32
11135 There are two ways of telling PCRE2 to switch to UTF mode, where char-
11148 one-code-unit characters. There are also some other changes to the way
11155 \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
11157 that Perl supports. Currently they are limited to the general category
11158 properties such as Lu for an upper case letter or Nd for a decimal num-
11162 general, only the short names for properties are supported. For exam-
11164 supported. Furthermore, in Perl, many properties may optionally be pre-
11165 fixed by "Is", for compatibility with Perl 5.6. PCRE2 does not support
11178 allowed in non-UTF mode.
11191 multi-unit characters (see the description of \C in the pcre2pattern
11192 documentation). For this reason, there is a build-time option that dis-
11193 ables support for \C completely. There is also a less draconian com-
11194 pile-time option for locking out the use of \C when a pattern is com-
11198 pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
11200 modes provokes a match-time error. Also, the JIT optimization does not
11201 support \C in these modes. If JIT optimization is requested for a UTF-8
11202 or UTF-16 pattern that contains \C, it will not succeed, and so when
11203 pcre2_match() is called, the matching will be carried out by the inter-
11209 set as in non-UTF mode, all with code points less than 256. This re-
11214 you can use explicit Unicode property tests such as \p{Nd}. Alterna-
11215 tively, if you set the PCRE2_UCP option, the way that the character es-
11221 all low-valued characters, unless the PCRE2_UCP option is set.
11223 However, the special horizontal and vertical white space matching es-
11224 capes (\h, \H, \v, and \V) do match all the appropriate Unicode charac-
11228 UNICODE CASE-EQUIVALENCE
11232 are less than 128 and that have at most two case-equivalent values. For
11233 these, a direct table lookup is used for speed. A few Unicode charac-
11234 ters such as Greek sigma have more than two code points that are case-
11235 equivalent, and these are treated specially. Setting PCRE2_UCP without
11236 PCRE2_UTF allows Unicode-style case processing for non-UTF character
11237 encodings such as UCS-2.
11245 sequence of characters that are all from the same Unicode script. How-
11250 Every Unicode character has a Script property, mostly with a value cor-
11255 for the surrogate code points. In the PCRE2 32-bit library, characters
11257 which are accessible only in non-UTF mode, are assigned the Unknown
11261 include punctuation, emoji, mathematical, musical, and currency sym-
11264 "Inherited" is used for characters such as diacritical marks that mod-
11270 U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop-
11272 called Script Extension exists. Its value is a list of scripts that ap-
11276 also some Common characters that have a single, non-Common script in
11282 constraint for decimal digits. These are covered in subsequent sec-
11289 run. Longer strings are checked using only the Script Extensions prop-
11292 If a character's Script Extension property is the single value "Inher-
11296 at least one script in common in their Script Extension lists. In set-
11312 The first has the Script Extension list Arabic, Hanifi Rohingya, Syr-
11314 of them could appear in script runs of either Arabic or Hanifi Ro-
11322 Katakana scripts together with Han; Korean uses Hangul and Han; Tai-
11325 "virtual scripts". Thus, a script run may contain a mixture of Hira-
11328 Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical Stan-
11329 dard 39 ("Unicode Security Mechanisms", http://unicode.org/re-
11337 from the common ASCII digits. In addition to the script checking de-
11347 returned. The code unit offset to the offending character can be ex-
11352 and therefore want to skip these checks in order to improve perfor-
11354 scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
11371 UTF-16 and UTF-32 strings can indicate their endianness by special code
11372 known as a byte-order mark (BOM). The PCRE2 functions do not handle
11377 pcre2_dfa_match() calls with a non-zero starting offset, the check is
11385 that the sequences \b and \B are one-character lookbehinds.
11389 the surrogate area. The so-called "non-character" code points are not
11394 UTF-16, where they are used in pairs to encode code points with values
11395 greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
11396 are available independently in the UTF-8 and UTF-32 encodings. (In
11397 other words, the whole surrogate thing is a fudge for UTF-16 which un-
11398 fortunately messes up UTF-8 and UTF-32.)
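The way UTF-16 uses surrogates in pairs to encode code points greater than 0xFFFF can be shown with a little arithmetic. The sketch below is not taken from PCRE2; the function names are hypothetical, and it assumes the standard UTF-16 layout in which a high surrogate (0xd800 to 0xdbff) carries the upper ten bits and a low surrogate (0xdc00 to 0xdfff) carries the lower ten bits of the value minus 0x10000.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only. */
int is_high_surrogate(uint32_t u) { return u >= 0xd800u && u <= 0xdbffu; }
int is_low_surrogate(uint32_t u)  { return u >= 0xdc00u && u <= 0xdfffu; }

/* Combine a valid surrogate pair into the code point it encodes. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000u + (((uint32_t)high - 0xd800u) << 10)
                    + ((uint32_t)low - 0xdc00u);
}

/* Split a code point above 0xFFFF into a surrogate pair.
   Returns 0 on success, -1 if cp does not need (or cannot have) a pair. */
int split_code_point(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000u || cp > 0x10ffffu) return -1;
    cp -= 0x10000u;
    *high = (uint16_t)(0xd800u + (cp >> 10));     /* upper ten bits */
    *low  = (uint16_t)(0xdc00u + (cp & 0x3ffu));  /* lower ten bits */
    return 0;
}
```

Because the values 0xd800 to 0xdfff are consumed by this scheme, they cannot stand for characters of their own, which is why they are also excluded from UTF-8 and UTF-32.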
11403 such as \x{d800} (a surrogate code point) you can set the PCRE2_EX-
11405 only in UTF-8 and UTF-32 modes, because these values are not repre-
11406 sentable in UTF-16.
11408 Errors in UTF-8 strings
11410 The following negative error codes are given for invalid UTF-8 strings:
11418 The string ends with a truncated UTF-8 character; the code specifies
11419 how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
11420 characters to be no longer than 4 bytes, the encoding scheme (origi-
11442 A 4-byte character has a value greater than 0x10ffff; these code points
11447 A 3-byte character has a value in the range 0xd800 to 0xdfff; this
11448 range of code points is reserved by RFC 3629 for use with UTF-16, and
11449 so are excluded from UTF-8.
11457 A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
11459 For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
11465 binary value 0b10 (that is, the most significant bit is 1 and the sec-
11466 ond is 0). Such a byte can only validly occur as the second or subse-
11467 quent byte of a multi-byte character.
11472 can never occur in a valid UTF-8 string.
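The kinds of UTF-8 error described above (truncated characters, missing 0b10 continuation bits, values above 0x10ffff, surrogates, overlong forms, isolated continuation bytes, and bytes that can never occur) can be illustrated with a small checker. This is a minimal sketch, not PCRE2's validator: the enum names and utf8_check() are hypothetical, the codes do not correspond to PCRE2's numbered errors, and the order in which errors are reported is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; NOT PCRE2's error codes. */
enum utf8_status {
    UTF8_OK = 0,
    UTF8_TRUNCATED,             /* string ends inside a character */
    UTF8_BAD_CONTINUATION,      /* following byte lacks the 0b10 top bits */
    UTF8_TOO_LONG,              /* 5- or 6-byte form, excluded by RFC 3629 */
    UTF8_OUT_OF_RANGE,          /* value greater than 0x10ffff */
    UTF8_SURROGATE,             /* value in 0xd800..0xdfff */
    UTF8_OVERLONG,              /* value fits in fewer bytes */
    UTF8_ISOLATED_CONTINUATION, /* 0b10xxxxxx where a lead byte is expected */
    UTF8_INVALID_BYTE           /* 0xfe or 0xff */
};

/* Check len bytes at s. On error, *erroroffset is the offset of the first
   byte of the offending character. */
enum utf8_status utf8_check(const unsigned char *s, size_t len,
                            size_t *erroroffset)
{
    /* Smallest value that genuinely needs a 2-, 3-, or 4-byte encoding. */
    static const unsigned long min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t i = 0;

    assert((s != NULL || len == 0) && erroroffset != NULL);
    while (i < len) {
        unsigned char c = s[i];
        unsigned long value;
        size_t need, k;

        *erroroffset = i;
        if (c < 0x80) { i++; continue; }          /* one-byte character */
        if (c >= 0xfe) return UTF8_INVALID_BYTE;
        if ((c & 0xc0) == 0x80) return UTF8_ISOLATED_CONTINUATION;
        if (c >= 0xf8) return UTF8_TOO_LONG;

        if (c >= 0xf0)      { need = 4; value = c & 0x07; }
        else if (c >= 0xe0) { need = 3; value = c & 0x0f; }
        else                { need = 2; value = c & 0x1f; }

        if (len - i < need) return UTF8_TRUNCATED;
        for (k = 1; k < need; k++) {
            if ((s[i + k] & 0xc0) != 0x80) return UTF8_BAD_CONTINUATION;
            value = (value << 6) | (s[i + k] & 0x3f);
        }
        if (value > 0x10ffff) return UTF8_OUT_OF_RANGE;
        if (value >= 0xd800 && value <= 0xdfff) return UTF8_SURROGATE;
        if (value < min_value[need]) return UTF8_OVERLONG;
        i += need;
    }
    return UTF8_OK;
}
```

With this sketch, the two bytes 0xc0, 0xae decode to 0x2e and are rejected as overlong, since that value fits in a single byte.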
11474 Errors in UTF-16 strings
11476 The following negative error codes are given for invalid UTF-16
11484 Errors in UTF-32 strings
11486 The following negative error codes are given for invalid UTF-32
11496 UTF sequences if you call pcre2_compile() with the PCRE2_MATCH_IN-
11504 generate different code. If JIT is not used, the option affects the be-
11505 haviour of the interpretive code in pcre2_match(). When PCRE2_MATCH_IN-
11511 \p{Any}, it does not even match negative items such as [^X]. A lookbe-
11527 UTF-sequence, that sequence is skipped, and the match starts at the
11535 Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi-
11551 Copyright (c) 1997-2021 University of Cambridge.
11552 ------------------------------------------------------------------------------