pcre2.txt - OpenGrok cross reference for /external/pcre/doc/pcre2.txt

1 -----------------------------------------------------------------------------
8 -----------------------------------------------------------------------------
16        PCRE2 - Perl-compatible regular expressions (revised API)
22        pattern matching using the same syntax and semantics as Perl, with just
25        API is more extensible, and it was simplified by abolishing  the  sepa-
31        As well as Perl-style regular expression patterns, some  features  that
32        appeared  in  Python and the original PCRE before they appeared in Perl
35        requesting some minor changes that give better  ECMAScript  (aka  Java-
38        The  source code for PCRE2 can be compiled to support strings of 8-bit,
39        16-bit, or 32-bit code units, which means that up to three separate li-
42        64-bit  environment that also supports 32-bit applications, versions of
43        PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.
45        The original work to extend PCRE to 16-bit and 32-bit  code  units  was
48        unit, or as UTF-encoded Unicode, with support for Unicode general cate-
54          pcre2test -C
57        ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
63        In addition to the Perl-compatible matching function, PCRE2 contains an
64        alternative  function that matches the same compiled patterns in a dif-
69        Details of exactly which Perl regular expression features are  and  are
76        client  to  discover  which  features are available. The features them-
77        selves are described in the pcre2build page. Documentation about build-
79        NON-AUTOTOOLS_BUILD files in the source distribution.
92        If you are using PCRE2 in a non-UTF application that permits  users  to
95        For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
96        mode, which interprets patterns and subjects as strings of  UTF-8  code
97        units instead of individual 8-bit characters. This causes both the pat-
98        tern and any data against which it is matched to be checked  for  UTF-8
99        validity.  If the data string is very long, such a check might use suf-
100        ficiently many resources as to cause your application to  lose  perfor-
103        One  way  of guarding against this possibility is to use the pcre2_pat-
106        calling pcre2_compile(). This causes a compile time error if  the  pat-
107        tern contains a UTF-setting sequence.
110        be enabled from within the pattern, by specifying "(*UCP)".  This  fea-
118        The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
120        middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C  op-
122        compile-time error if it is encountered. It is also possible  to  build
127        Nested  unlimited repeats in a pattern are a common example. PCRE2 pro-
130        pcre2_set_depth_limit() that can be used to restrict the amount of mem-
136        The  user  documentation for PCRE2 comprises a number of different sec-
142        (which  is a program listing), and the short pages for individual func-
143        tions, are concatenated in pcre2.txt, for ease of searching.  The  sec-
147          pcre2-config       show PCRE2 installation configuration information
151          pcre2compat        discussion of Perl compatibility
154          pcre2grep          description of the pcre2grep command (8-bit only)
155          pcre2jit           discussion of just-in-time optimization support
162          pcre2posix         the POSIX-compatible C API for the 8-bit library
186        Copyright (c) 1997-2021 University of Cambridge.
187 ------------------------------------------------------------------------------
195        PCRE2 - Perl-compatible regular expressions (revised API)
200        contains a description of all its native functions. See the pcre2 docu-
461        These functions provide a way of  converting  non-PCRE2  patterns  into
462        patterns that can be processed by pcre2_compile(). This facility is ex-
468 PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
470        There are three PCRE2 libraries, supporting 8-bit, 16-bit,  and  32-bit
473        for all three libraries. One, two, or all three can be installed simul-
474        taneously. On Unix-like systems the libraries  are  called  libpcre2-8,
475        libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
498        macros are defined whose names are the generic forms such as pcre2_com-
500        PCRE2_CODE_UNIT_WIDTH  to generate the appropriate width-specific func-
518        single  library.   For example, if you want to run a match using a pat-
524        their generic names, without the _8, _16, or _32 suffix.
530        There are also some wrapper functions for the 8-bit library that corre-
542        program against a non-dll PCRE2 library, you must  define  PCRE2_STATIC
546        and matching regular expressions in a Perl-compatible manner. A  sample
553        passed as bits in an options argument. There are also some more compli-
554        cated parameters such as custom memory  management  functions  and  re-
559        Just-in-time  (JIT)  compiler  support  is an optional feature of PCRE2
561        speeds  up  the matching performance of many patterns. Programs can re-
568        pcre2_jit_stack_assign() in order to control the JIT code's memory  us-
574        less sanity checking. The JIT-specific functions are discussed  in  the
577        A  second  matching function, pcre2_dfa_match(), which is not Perl-com-
581        there  are lookaround assertions). However, this algorithm does not re-
600        pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
602        functions is called with a NULL argument, the function returns  immedi-
603        ately without doing anything.
612        Finally, there are functions for finding out information about  a  com-
627        ~(PCRE2_SIZE)0) is reserved as a special indicator for  zero-terminated
635        strings:  a  single  CR (carriage return) character, a single LF (line-
636        feed) character, the two-character sequence CRLF, any of the three pre-
645        Unix standard. However, the newline convention can be changed by an ap-
646        plication when calling pcre2_compile(), or it can be specified by  spe-
648        settings. See the pcre2pattern page for details of the special  charac-
654        dollar metacharacters, the handling of #-comments in /x mode, and, when
655        CRLF  is a recognized line ending sequence, the match position advance-
656        ment for a non-anchored pattern. There is more detail about this in the
666        In  a multithreaded application it is important to keep thread-specific
668        library  code  itself  is  thread-safe: it contains no static or global
669        variables. The API is designed to be fairly simple for non-threaded ap-
670        plications  while at the same time ensuring that multithreaded applica-
673        There are several different blocks of data that are used to pass infor-
681        is thread-safe, that is, the same compiled pattern can be used by  more
684        use  them.  However,  if the just-in-time (JIT) optimization feature is
695          Get a read-only (shared) lock (mutex) for pointer
707        The  reason  for checking the pointer a second time is as follows: Sev-
723          Get a read-only (shared) lock (mutex) for pointer
736        If JIT is being used, but the JIT compilation is not being done immedi-
741        pcre2_code_copy() or pcre2_code_copy_with_tables() can be used  to  ob-
742        tain  a  private  copy of the compiled code before calling the JIT com-
751        a PCRE2 function without using lots of arguments. The  parameters  that
755        In a multithreaded application, if the parameters in a context are val-
758        it must make its own thread-specific copy.
763        of a match. This includes details of what was matched, as well as addi-
772        memory management or non-standard character tables.  To  keep  function
776        that holds the parameter values.  Applications that do not need to  ad-
781        relevant  for  several  PCRE2 operations, a compile-time context, and a
782        match-time context.
786        At present, this context just contains pointers to (and data  for)  ex-
806        function  may be NULL, in which case the system memory management func-
809        might be.)  The private_malloc() function is used (if supplied) to  ob-
828        without doing anything.
832        A compile context is required if you want to provide an external  func-
834        values of any of the following compile-time parameters:
843        A compile context is also required if you are using custom memory  man-
844        agement.   If  none of these apply, just pass NULL as the context argu-
847        A compile context is created, copied, and freed by the following  func-
875        only argument is a general context. This function builds a set of char-
881        As  PCRE2  has developed, almost all the 32 option bits that are avail-
884        bits which are used for some newer, assumed rarer, options. This  func-
886        It does not modify any existing setting. The available options are  de-
896        largest  number  that  a  PCRE2_SIZE variable can hold, which is effec-
902        This specifies which characters or character sequences are to be recog-
905        two-character  sequence  CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
912        When a  pattern  is  compiled  with  the  PCRE2_EXTENDED  or  PCRE2_EX-
924        stops  rogue  patterns  using  up too much system stack when being com-
925        piled. The limit applies to parentheses of all kinds, not just  captur-
941        nesting,  and  the second is user data that is set up by the last argu-
943        should return zero if all is well, or non-zero to force an error.
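A guard function of the shape described above might look like the following sketch. It assumes the two-argument form mentioned in the text (the current nesting depth, then the user data), and the `guard_limit` struct and `nesting_guard` name are hypothetical, chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user data: the deepest parenthesis nesting we accept. */
typedef struct {
    uint32_t max_depth;
} guard_limit;

/* Returns zero if all is well, or non-zero to force a compile error. */
int nesting_guard(uint32_t depth, void *user_data)
{
    const guard_limit *limit = (const guard_limit *)user_data;

    if (limit == NULL)
        return 0;                         /* no limit configured: allow */
    return depth > limit->max_depth ? 1 : 0;
}
```

In a real program a function like this would presumably be registered on a compile context with pcre2_set_compile_recursion_guard(), passing a pointer to the limit struct as the last argument.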
959        A match context is created, copied, and freed by  the  following  func-
979        during a matching operation. Details are given in the pcre2callout doc-
986        This  sets up a callout function for PCRE2 to call after each substitu-
987        tion made by pcre2_substitute(). Details are given in the section enti-
993        The  offset_limit parameter limits how far an unanchored search can ad-
995        pcre2_match()  and  pcre2_dfa_match()  functions return PCRE2_ERROR_NO-
1005        When  using  this facility, you must set the PCRE2_USE_OFFSET_LIMIT op-
1007        code  can  be  compiled. If a match is started with a non-default match
1015        the first line and also within the offset limit. In other words, which-
1024        also applies to pcre2_dfa_match(), which may use the heap when process-
1026        atomic groups. This limit does not apply to matching with the JIT opti-
1038        where  ddd  is a decimal number. However, such a setting is ignored un-
1044        pcre2_match()  uses  the  heap are given in the pcre2perform documenta-
1047        For pcre2_dfa_match(), a vector on the system stack is used  when  pro-
1055        The match_limit parameter provides a means of preventing PCRE2 from us-
1069        When  pcre2_match() is called with a pattern that was successfully pro-
1076        The default value for the limit can be set when PCRE2 is built; the de-
1083        where  ddd  is a decimal number. However, such a setting is ignored un-
1110        If  the depth of internal recursive function calls is great enough, lo-
1111        cal workspace vectors are allocated on the heap from version 10.32  on-
1115        deal of memory. However, it is probably better to limit heap usage  di-
1127        where ddd is a decimal number. However, such a setting is  ignored  un-
1132 CHECKING BUILD-TIME OPTIONS
1142        required. The second argument is a pointer to memory into which the in-
1151        non-negative  on success, or the negative error code PCRE2_ERROR_BADOP-
1152        TION if the value in the first argument is not recognized. The  follow-
1159        PCRE2_BSR_UNICODE  means  that  \R  matches any Unicode line ending se-
1166        unit widths were selected when PCRE2 was  built.  The  1-bit  indicates
1167        8-bit  support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
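The bit assignments listed above can be decoded as in the following sketch. The `WIDTH_*` constants and the `has_width` helper are hypothetical names, and the `widths` value is assumed to be the uint32_t that pcre2_config() writes for this item.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as described in the text (names are illustrative only). */
enum { WIDTH_8 = 1, WIDTH_16 = 2, WIDTH_32 = 4 };

/* Report whether the library for the given code unit width was built. */
bool has_width(uint32_t widths, unsigned code_unit_bits)
{
    switch (code_unit_bits) {
    case 8:  return (widths & WIDTH_8)  != 0;
    case 16: return (widths & WIDTH_16) != 0;
    case 32: return (widths & WIDTH_32) != 0;
    default: return false;                /* not a PCRE2 code unit width */
    }
}
```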
1174        recursions, lookarounds, and atomic groups in  pcre2_dfa_match().  Fur-
1187        just-in-time compiling is available; otherwise it is set to zero.
1195        compiler  is  configured,  for  example "x86 32bit (little endian + un-
1206        the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
1207        when  the  32-bit  library  is compiled, internal linkages always use 4
1210        The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1246        The  output is a uint32_t integer that gives the maximum depth of nest-
1250        take into account the stack that may already be used by the calling ap-
1256        This parameter is obsolete and should not be used in new code. The out-
1261        The output is a uint32_t integer that gives the length of PCRE2's char-
1270        without  Unicode  support,  the buffer is filled with the text "Unicode
1286        PCRE2 version string, zero-terminated. The number of code units used is
1287        returned. This is the length of the string plus one unit for the termi-
1305        length  (in  code units). If the pattern is zero-terminated, the length
1307        pointer to a block of memory that contains the compiled pattern and re-
1310        If the compile context argument ccontext is NULL, memory for  the  com-
1311        piled  pattern  is  obtained  by calling malloc(). Otherwise, it is ob-
1312        tained from the same memory function that was used for the compile con-
1314        it is no longer needed.  If pcre2_code_free() is called with a NULL ar-
1315        gument, it returns immediately, without doing anything.
1319        However,  if  the  code has been processed by the JIT compiler (see be-
1320        low), the JIT information cannot be copied (because it is  position-de-
1321        pendent).   The  new copy can initially be used only for non-JIT match-
1326        a multithreaded application to acquire a private copy  of  shared  com-
1335        pointing  to the new tables. The memory for the new tables is automati-
1347        described  in  the section entitled "Option bits for pcre2_match()" be-
1351        that  affect the compilation. It should be zero if none of them are re-
1353        particular,  those  that  are  compatible with Perl, but some others as
1354        well) can also be set and unset from within the pattern  (see  the  de-
1357        For  those options that can be different in different parts of the pat-
1363        Some  additional  options and less frequently required compile-time pa-
1364        rameters (for example, the newline setting) can be provided in  a  com-
1367        If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1369        error code and an offset (number of code units) within the pattern, re-
1370        spectively, when pcre2_compile() returns NULL because a compilation er-
1373        There  are nearly 100 positive error codes that pcre2_compile() may re-
1375        error  codes that are used for invalid UTF strings when validity check-
1378        There is no separate documentation for the positive  error  codes,  be-
1380        pcre2_get_error_message() function (see "Obtaining a textual error mes-
1381        sage"  below)  should  be  self-explanatory.  Macro names starting with
1384        that returns the message "no error" if passed  to  pcre2_get_error_mes-
1387        The value returned in erroroffset is an indication of where in the pat-
1389        non-zero  value  is  not  necessarily the furthest point in the pattern
1392        assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of
1398        mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1401        This  code  fragment shows a typical straightforward call to pcre2_com-
1409            PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
1427        only way to do it in Perl.
1431        By  default, for compatibility with Perl, a closing square bracket that
1442        (1) \U matches an upper case "U" character; by default \U causes a com-
1443        pile time error (Perl uses \U to upper case subsequent characters).
1447        code  point  to match. By default, \u causes a compile time error (Perl
1452        code point to match. By default, as in Perl, a  hexadecimal  number  is
1457        using  the  PCRE2_EXTRA_ALT_BSUX  extra  option (see "Extra compile op-
1459        to  patterns.  Neither  of  these options affects the processing of re-
1468        Perl. If you want a multiline circumflex also to match after  a  termi-
1473        By  default, for compatibility with Perl, the name in any verb sequence
1474        such as (*MARK:NAME) is any sequence of characters that  does  not  in-
1476        it is not possible to include a closing parenthesis in the  name.  How-
1477        ever,  if  the PCRE2_ALT_VERBNAMES option is set, normal backslash pro-
1478        cessing is applied to verb names and only an unescaped  closing  paren-
1482        whitespace in verb names is skipped and #-comments are recognized,  ex-
1488        items, all with number 255, before each pattern  item,  except  immedi-
1489        ately  before  or after an explicit callout in the pattern. For discus-
1495        case  letters in the subject. It is equivalent to Perl's /i option, and
1500        characters, K and S, that, in addition to their lower case ASCII equiv-
1501        alents,  are case-equivalent with U+212A (Kelvin sign) and U+017F (long
1505        higher  code  points  (available  only  in  16-bit  or 32-bit mode) are
1511        at  the  end  of the subject string. Without this option, a dollar also
1515        Perl, and no way to set it within a pattern.
1521        ever matches one character, even if newlines are coded as CRLF. Without
1522        this option, a dot does not match when the current position in the sub-
1523        ject  is  at  a newline. This option is equivalent to Perl's /s option,
1524        and it can be changed within a pattern by a (?s) option setting. A neg-
1526        escape sequence always matches a non-newline character, independent  of
1543        patterns,  a  new  match is then tried at the next starting point. How-
1554        which is the only way to do it in Perl.
1558        matches,  which are necessarily substrings of the first one, must obvi-
1563        If this bit is set, most white space characters in the pattern are  to-
1567        {1,3}. Ignorable white space is permitted between an item and a follow-
1568        ing  quantifier  and  between a quantifier and a following + that indi-
1569        cates possessiveness. PCRE2_EXTENDED is equivalent to Perl's /x option,
1572        When  PCRE2  is compiled without Unicode support, PCRE2_EXTENDED recog-
1574        256 that are flagged as white space in its low-character table. The ta-
1581        When PCRE2 is compiled with Unicode support, in addition to these char-
1582        acters,  five  more Unicode "Pattern White Space" characters are recog-
1583        nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1584        right  mark), U+200F (right-to-left mark), U+2028 (line separator), and
1586        recognized  by  Perl's /x option. Note that the horizontal and vertical
1590        As  well as ignoring most white space, PCRE2_EXTENDED also causes char-
1597        Which characters are interpreted as newlines can be specified by a set-
1599        special sequence at the start of the pattern, as described in the  sec-
1605        This option has the effect of PCRE2_EXTENDED,  but,  in  addition,  un-
1606        escaped  space and horizontal tab characters are ignored inside a char-
1608        set  of pattern white space characters that are ignored outside a char-
1609        acter class. PCRE2_EXTENDED_MORE is equivalent to  Perl's  /xx  option,
1616        start  of  matching, though the matched text may continue over the new-
1617        line. If startoffset is non-zero, the limiting newline is not necessar-
1619        string is "abc\nxyz" (where \n represents a single-character newline) a
1628        If this option is set, all meta-characters in the pattern are disabled,
1631        you are doing a lot of literal matching and  are  worried  about  effi-
1636        PCRE2_UTF,  and  PCRE2_USE_OFFSET_LIMIT.  The  extra  options PCRE2_EX-
1644        sequences.   This  facility  is not supported for DFA matching. For de-
1651        alternative to fail).  A pattern such as (\1)(a) succeeds when this op-
1653        fails by default, for Perl compatibility.  Setting  this  option  makes
1663        string, or before a terminating newline (except  when  PCRE2_DOLLAR_EN-
1665        character" metacharacter (.) does not match at a newline.  This  behav-
1666        iour (for ^, $, and dot) is the same as Perl.
1671        start and end. This is equivalent to Perl's /m option, and  it  can  be
1674        subject,  for compatibility with Perl.  However, you can change this by
1681        This option locks out the use of \C in the pattern that is  being  com-
1682        piled.   This  escape  can  cause  unpredictable  behaviour in UTF-8 or
1683        UTF-16 modes, because it may leave the current matching  point  in  the
1684        middle of a multi-code-unit character. This option may be useful in ap-
1686        is also a build-time option that permanently locks out the use of \C.
1700        This  option  locks out interpretation of the pattern as UTF-8, UTF-16,
1701        or UTF-32, depending on which library is in use. In particular, it pre-
1703        by starting the pattern with (*UTF). This option may be useful  in  ap-
1709        If this option is set, it disables the use of numbered capturing paren-
1713        is the same as Perl's /n option.  Note that, when this option  is  set,
1720        If this option is set, it disables "auto-possessification", which is an
1723        are in use, auto-possessification means that some  callouts  are  never
1731        .* is the first significant item in a top-level branch  of  a  pattern,
1751        the matching code searches the subject for that value, and fails  imme-
1752        diately  if it cannot find it, without actually running the main match-
1756        items  are  in use, these "start-up" optimizations can cause them to be
1757        skipped if the pattern is never actually used. The  start-up  optimiza-
1758        tions  are  in effect a pre-scan of the subject that takes place before
1761        The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1774        start-up  optimization  scans along the subject, finds "A" and runs the
1775        first match attempt from there. The (*COMMIT) item means that the  pat-
1780        (*COMMIT)  prevents any further matches being tried, so the overall re-
1783        As another start-up optimization makes use of a minimum  length  for  a
1790        match  "BB", which is long enough. In the process, (*MARK:2) is encoun-
1792        found,  but  there is only one character left, so there are no more at-
1805        UTF-8 strings, UTF-16 strings, and UTF-32 strings in  the  pcre2unicode
1811        PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in-
1819        Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1820        able  the error that is given if an escape sequence for an invalid Uni-
1821        code code point is encountered in the pattern. In particular,  the  so-
1825        section entitled "Extra compile options" below.  However, this is  pos-
1826        sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1827        resentable in UTF-16.
1834        PCRE2_UCP is set, Unicode properties are used instead to classify char-
1839        The second effect of PCRE2_UCP is to force the use of  Unicode  proper-
1841        greater than 127, even when PCRE2_UTF is not set. This makes it  possi-
1842        ble, for example, to process strings in the 16-bit UCS-2 code. This op-
1850        not  compatible  with Perl. It can also be set by a (?U) option setting
1856        is  going  to be used to set a non-default offset limit in a match con-
1858        offset  limit is set without this option. For more details, see the de-
1866        instead  of  single-code-unit  strings.  It  is available when PCRE2 is
1868        support is not available, the use of this option provokes an error. De-
1881        assertions, following Perl's lead. This option is provided to re-enable
1887        This option applies when compiling a pattern in UTF-8 or  UTF-32  mode.
1888        It  is  forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
1890        in  UTF-16  to  encode  code points with values in the range 0x10000 to
1891        0x10ffff. The surrogates cannot therefore  be  represented  in  UTF-16.
1892        They can be represented in UTF-8 and UTF-32, but are defined as invalid
1893        code points, and cause errors if  encountered  in  a  UTF-8  or  UTF-32
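The relationship between surrogates and the range 0x10000 to 0x10ffff can be illustrated with a small sketch. The surrogate range 0xd800 to 0xdfff is taken from the Unicode standard rather than from the lines shown here, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True for a code point in the surrogate range 0xd800..0xdfff. */
bool is_surrogate(uint32_t cp)
{
    return cp >= 0xd800u && cp <= 0xdfffu;
}

/* Split a code point in 0x10000..0x10ffff into a UTF-16 surrogate pair. */
void to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000u && cp <= 0x10ffffu);

    uint32_t v = cp - 0x10000u;                  /* 20 significant bits */
    *high = (uint16_t)(0xd800u + (v >> 10));     /* top 10 bits */
    *low  = (uint16_t)(0xdc00u + (v & 0x3ffu));  /* bottom 10 bits */
}
```

Because every value in the surrogate range is consumed by this pairing scheme, a lone surrogate has no UTF-16 encoding of its own, which is the point the text makes.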
1898        when using PCRE2 to check for unwanted characters in UTF-8 strings, ex-
1900        PCRE2_NO_UTF_CHECK  option  does not disable the error that occurs, be-
1903        If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set,  surro-
1904        gate  code  point values in UTF-8 and UTF-32 patterns no longer provoke
1912        \x  in  the way that ECMAScript (aka JavaScript) does. Additional func-
1915        as a hexadecimal character code, where hhh.. is any number of hexadeci-
1921        escape such as \j or a malformed one such as \x{2z} causes  a  compile-
1922        time error when detected by pcre2_compile(). Perl is somewhat inconsis-
1924        "j",  and non-hexadecimal digits in \x{} are just ignored, though warn-
1925        ings are given in both cases if Perl's warning switch is enabled.  How-
1927        Perl.
1931        treated as single-character escapes. For example, \j is a  literal  "j"
1932        and  \x{2z}  is treated as the literal string "x{2z}". Setting this op-
1937        is not supported in a character class. To reiterate: this is a  danger-
1945        of a CR (carriage return) character. The option does not affect a  lit-
1951        This option is provided for use by  the  -x  option  of  pcre2grep.  It
1953        automatically inserting the code for "^(?:" at the start  of  the  com-
1955        the matched line may be in the middle of the subject string.  This  op-
1960        This  option  is  provided  for  use  by the -w option of pcre2grep. It
1968 JUST-IN-TIME (JIT) COMPILATION
1988        just-in-time  compiler  is available, further processes a compiled pat-
1994        for  patterns  to  be analyzed, and for one-off matches and simple pat-
2010        code  points  are  less than 256. By default, higher-valued code points
2016        \w  and friends to use Unicode property support instead of the built-in
2017        tables.  PCRE2_UCP also causes upper/lower casing operations on charac-
2025        PCRE2 contains a built-in set of character tables that are used by  de-
2026        fault.   These  are sufficient for many applications. Normally, the in-
2029        default "C" locale of the local system, which may cause them to be dif-
2032        The  built-in tables can be overridden by tables supplied by the appli-
2034        from  the  default.  As more and more applications change to using Uni-
2055        The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
2074        or  whether  the processor is 32-bit or 64-bit. A copy of the result of
2076        re-used  later, even in a different program or on another computer. The
2081        used stand-alone to create a file that contains a set of binary tables.
2091        The  first  argument  for pcre2_pattern_info() is a pointer to the com-
2093        is  required,  and the third argument is a pointer to a variable to re-
2097        the function is zero for success, or one of the following negative num-
2107        typical call of pcre2_pattern_info(), to obtain the length of the  com-
2125        to  a  uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the op-
2126        tions that were passed to  pcre2_compile(),  whereas  PCRE2_INFO_ALLOP-
2127        TIONS  returns  the compile options as modified by any top-level (*XXX)
2130        compile context by calling the pcre2_set_compile_extra_options()  func-
2133        For  example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EX-
2140        A pattern compiled without PCRE2_ANCHORED is automatically anchored  by
2141        PCRE2 if the first significant item in every top-level branch is one of
2147          .*    sometimes - see below
2159        For  patterns  that are auto-anchored, the PCRE2_ANCHORED bit is set in
2190        been set, the call to pcre2_pattern_info() returns the error  PCRE2_ER-
2197        In  the absence of a single first code unit for a non-anchored pattern,
2198        pcre2_compile() may construct a 256-bit table that defines a fixed  set
2202        means "any code unit of value 255 or above". If such a table  was  con-
2209        a  non-anchored  pattern. The third argument should point to a uint32_t
2221        The  third  argument  should point to a uint32_t variable. In the 8-bit
2222        library, the value is always less than 256. In the 16-bit  library  the
2223        value  can  be  up  to 0xffff. In the 32-bit library in UTF-32 mode the
2224        value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2231        without  the  use  of  JIT. The third argument should point to a size_t
2233        in the pattern. Each additional capture group adds two PCRE2_SIZE vari-
2246        \r or \n or one of the  equivalent  hexadecimal  or  octal  escape  se-
2252        (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2260        Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
2262        (?J) and (?-J) set and unset the local PCRE2_DUPNAMES  option,  respec-
2267        If  the  compiled  pattern was successfully processed by pcre2_jit_com-
2287        PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2293        third argument should point to a uint32_t variable. When a pattern con-
2295        whether or not it can match an empty string. PCRE2 takes a cautious ap-
2301        (*LIMIT_MATCH=nnnn)  at the start, the value is returned. The third ar-
2303        set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UN-
2305        less  than  the limit set or defaulted by the caller of the match func-
2311        code  units)  when  it starts to process each of its branches. This re-
2313        should point to a uint32_t integer. The simple assertions \b and \B re-
2314        quire a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND  to
2315        return  1  in  the absence of anything longer. \A also registers a one-
2319        Note that this information is useful for multi-segment matching only if
2321        (?<=a(?<=ba)c)  returns  a maximum lookbehind of 2, but when it is pro-
2323        character,  then  the  nested lookbehind also moves back by two charac-
2325        at  the start.  PCRE2_INFO_MAXLOOKBEHIND is really only useful as a de-
2327        multi-segment matching.
2344        PCRE2 supports the use of named as well as numbered capturing parenthe-
2345        ses.  The names are just an additional way of identifying the parenthe-
2347        pcre2_substring_get_byname()  are provided for extracting captured sub-
2351        do the conversion, you need to use the name-to-number map, which is de-
2354        The map consists of a number of  fixed-size  entries.  PCRE2_INFO_NAME-
2360        This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li-
2361        brary,  the first two bytes of each entry are the number of the captur-
2362        ing parenthesis, most significant byte first. In  the  16-bit  library,
2363        the  pointer  points  to 16-bit code units, the first of which contains
2364        the parenthesis number. In the 32-bit library, the  pointer  points  to
2365        32-bit  code units, the first of which contains the parenthesis number.
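The 8-bit entry layout described above can be sketched in plain C. This is a minimal illustration, not PCRE2's own code: the helper names and the sample table are hypothetical, and the padding bytes (undefined in a real table) are written as zeros here. In a real program the table pointer, entry count, and entry size would come from pcre2_pattern_info() with PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode the group number held in the first two bytes of one 8-bit
   name table entry, most significant byte first. */
static unsigned int entry_group_number(const uint8_t *entry)
{
    return ((unsigned int)entry[0] << 8) | (unsigned int)entry[1];
}

/* The zero-terminated group name follows the two-byte number. */
static const char *entry_group_name(const uint8_t *entry)
{
    return (const char *)(entry + 2);
}

/* Linear search over `count` fixed-size entries of `entry_size` bytes.
   Returns the group number, or 0 if the name is not in the table
   (named groups are never numbered 0). */
static unsigned int lookup_group(const uint8_t *table, size_t count,
                                 size_t entry_size, const char *name)
{
    assert(entry_size >= 3); /* two number bytes plus a terminator */
    for (size_t i = 0; i < count; i++) {
        const uint8_t *entry = table + i * entry_size;
        if (strcmp(entry_group_name(entry), name) == 0)
            return entry_group_number(entry);
    }
    return 0;
}

/* Hypothetical table: four entries of eight bytes each, for groups
   named date (1), day (5), month (4), and year (2). */
static const uint8_t sample_table[] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00,
};
```

Calling lookup_group(sample_table, 4, 8, "month") walks the entries in order and decodes the two leading bytes of the matching entry. The entry size is passed in rather than hard-coded because it depends on the longest name in the pattern.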
2369        capture groups with the same number, as described in the section on du-
2374        Duplicate names for capture groups with different numbers  are  permit-
2378        necessarily the case because later capture groups may have  lower  num-
2382        pattern after compilation by the 8-bit library  (assume  PCRE2_EXTENDED
2383        is set, so white space - including newlines - is ignored):
2385          (?<date> (?<year>(\d\d)?\d\d) -
2386          (?<month>\d\d) - (?<day>\d\d) )
2390        with non-printing bytes shown in hexadecimal, and undefined bytes shown
2399        name-to-number  map,  remember that the length of the entries is likely
2413        This identifies the character sequence that will be recognized as mean-
2418        Return  the  size  of  the compiled pattern in bytes (for all three li-
2422        pcre2_compile()  is  getting memory in which to place the compiled pat-
2423        tern may be slightly larger than the value returned by this option, be-
2425        over-estimate. Processing a pattern with the JIT compiler does not  al-
2441        which they appear. Its first argument is a pointer to a callout enumer-
2443        passed  to  pcre2_callout_enumerate(). The contents of the callout enu-
2453        PCRE2, with the same code unit width, and must also have the same endi-
2458        the  serialized form. They are described in the pcre2serialize documen-
2459        tation. Note that PCRE2 serialization does not  convert  compiled  pat-
2480        you must create a match data block by calling one of the creation func-
2487        to  record  the matched portion of the subject plus three captured sub-
2501        The second argument of pcre2_match_data_create() is a pointer to a gen-
2511        general  context, but in this case if NULL is passed, the memory is ob-
2517        after  a  match  operation  has  finished, using functions that are de-
2521        match  block  only  when  the  error  is PCRE2_ERROR_NOMATCH, PCRE2_ER-
2522        ROR_PARTIAL, or one of the error codes for an invalid UTF  string.  Ex-
2532        described in the section entitled "Option bits for  pcre2_match()"  be-
2537        NULL argument, it returns immediately, without doing anything.
2550        order to find multiple matches in the subject string or to  match  dif-
2553        This  function is the main matching facility of the library, and it op-
2554        erates in a Perl-like manner. For specialist use there is also  an  al-
2570        If the subject string is zero-terminated, the length can  be  given  as
2572        common matching parameters are to be changed. For details, see the sec-
2580        bytes for the 8-bit library, 16-bit code units for the 16-bit  library,
2581        and  32-bit  code units for the 32-bit library, whether or not UTF pro-
2583        zero,  the  subject is assumed to be an empty string. If length is non-
2589        by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2590        set must point to the start of a character, or to the end of  the  sub-
2591        ject  (in  UTF-32 mode, one code unit equals one character, so all off-
2592        sets are valid). Like the pattern string, the subject may  contain  bi-
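The rule that a UTF-8 starting offset must fall at the start of a character or at the end of the subject can be checked with a small helper. This is a hedged sketch only: the function names are hypothetical, it is not part of the PCRE2 API, and it assumes the subject is already valid UTF-8, so the only thing tested is whether the byte at the offset is a continuation byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_utf8_continuation(uint8_t byte)
{
    return (byte & 0xC0) == 0x80;
}

/* Returns 1 if `offset` is the end of the subject or the first byte
   of a character, 0 otherwise. Assumes `subject` holds valid UTF-8. */
static int is_valid_start_offset(const uint8_t *subject, size_t length,
                                 size_t offset)
{
    if (offset > length)
        return 0;               /* past the end of the subject */
    if (offset == length)
        return 1;               /* the end of the subject is allowed */
    return !is_utf8_continuation(subject[offset]);
}
```

For the four-byte subject 'a', 0xC3, 0xA9, 'z' (the letter e with an acute accent occupies bytes 1 and 2), offsets 0, 1, 3, and 4 are accepted and offset 2 is rejected.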
2595        A  non-zero  starting offset is useful when searching for another match
2615        match an empty string. It is possible to emulate Perl's /g behaviour by
2622        so, and the current character is CR followed by LF, advance the  start-
2625        If a non-zero starting offset is passed when the pattern is anchored, a
2626        single attempt to match at the given offset is made. This can only suc-
2628        the subject. In other words, the anchoring must be the result  of  set-
2636        PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL,  PCRE2_NO-
2641        Setting  PCRE2_ANCHORED  or PCRE2_ENDANCHORED at match time is not sup-
2642        ported by the just-in-time (JIT) compiler. If it is set,  JIT  matching
2660        must not be freed until all such operations are complete. For some  ap-
2669        also automatically freed if the match data block is re-used for another
2675        matches  must be right at the end of the subject string. Note that set-
2682        match before it. Setting this without  having  set  PCRE2_MULTILINE  at
2690        in multiline mode) a newline immediately before it. Setting this  with-
2692        match. This option affects only the behaviour of the dollar metacharac-
2714        subject is permitted.  If the pattern is anchored, such a match can oc-
2729        The latter special case is discussed in detail in the pcre2unicode doc-
2732        In the default case, if a non-zero starting offset is given, the  check
2740        that the sequences \b and \B are one-character lookbehinds.
2746        validity of UTF-8 strings, UTF-16 strings, and UTF-32  strings  in  the
2756        PCRE2_NO_UTF_CHECK is set at match time the effect of  passing  an  in-
2757        valid string as a subject, or an invalid value of startoffset, is unde-
2758        fined.  Your program may crash or loop indefinitely or give  wrong  re-
2764        These options turn on the partial matching feature. A partial match oc-
2766        there are not enough subject characters to complete the match. In addi-
2771        If this situation arises when PCRE2_PARTIAL_SOFT  (but  not  PCRE2_PAR-
2772        TIAL_HARD) is set, matching continues by testing any remaining alterna-
2774        returned  instead  of  PCRE2_ERROR_NOMATCH.  In other words, PCRE2_PAR-
2780        PCRE2_ERROR_PARTIAL,  without  considering  any  other alternatives. In
2781        other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
2784        There is a more detailed discussion of partial and multi-segment match-
2790        When PCRE2 is built, a default newline convention is set; this is  usu-
2795        pcre2pattern  page. During matching, the newline choice affects the be-
2808        expected. For example, if the pattern is .+A (and the PCRE2_DOTALL  op-
2811        However,  the  pattern  [\r\n]A does match that string, because it con-
2812        tains an explicit CR or LF reference, and so advances only by one char-
2818        not count, nor does \s, even though it includes CR and LF in the  char-
2836        phrase  "capture  group" (Perl terminology) is used for a fragment of a
2845        Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2849        pcre2_get_ovector_count() returns the number of pairs of values it con-
2852        Within the ovector, the first in each pair of values is set to the off-
2854        offset  of the first code unit after the end of a substring. These val-
2856        are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit li-
2857        brary, and 32-bit offsets in the 32-bit library.
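The pair layout described above (start offset, then the offset just past the end) can be illustrated without the library. This is a minimal sketch for the 8-bit case: copy_group is a hypothetical helper, size_t stands in for PCRE2_SIZE, and the group is assumed to be set. In a real program the vector would come from pcre2_get_ovector_pointer().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy capture group `group` out of `subject` using an ovector-style
   array: ovector[2*group] is the offset of the first code unit of the
   substring and ovector[2*group + 1] is the offset just past its end.
   Returns the substring length, or (size_t)-1 if `buf` is too small.
   Assumes the group is set (its offsets are not the "unset" value). */
static size_t copy_group(const char *subject, const size_t *ovector,
                         size_t group, char *buf, size_t bufsize)
{
    size_t start = ovector[2 * group];
    size_t end = ovector[2 * group + 1];
    assert(start <= end);
    size_t len = end - start;
    if (len + 1 > bufsize)
        return (size_t)-1;      /* no room for the text plus the zero */
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return len;
}
```

Because the second value of each pair is an end offset rather than a length, the substring length is always the difference of the two values.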
2865        the portion of the subject string that was matched by the  entire  pat-
2869        been captured, the returned value is 3. If there are no  captured  sub-
2878        If  a  capture group is matched repeatedly within a single match opera-
2879        tion, it is the last portion of the subject that it matched that is re-
2895        Offset values that correspond to unused groups at the end  of  the  ex-
2904        in the pattern are never changed. That is, if a pattern contains n cap-
2906        pcre2_match().  The  other  elements retain whatever values they previ-
2927        returns a pointer to the zero-terminated name, which is within the com-
2935        backtracking  verbs  without  names do not count. Thus, for example, if
2937        After a "no match" or a partial match, the last encountered name is re-
2947        Warning:  By  default, certain start-of-match optimizations are used to
2950        for the presence of "c" in the subject before running the matching  en-
2951        gine. This check fails for "bx", causing a match failure without seeing
2952        any marks. You can disable the start-of-match optimizations by  setting
2959        offset  of  the character at which the match started. For a non-partial
2972        If  pcre2_match() fails, it returns a negative number. This can be con-
2973        verted to a text string by calling the pcre2_get_error_message()  func-
2978        of  UTF-specific negative error codes is returned. Details are given in
2993        PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
3000        a  library  of a different code unit width, for example, a pattern com-
3001        piled by the 8-bit library is passed to  a  16-bit  or  32-bit  library
3041        This error is returned when a pattern that was successfully studied us-
3042        ing JIT is being matched, but the memory available for the just-in-time
3056        also  returned  if PCRE2_COPY_MATCHED_SUBJECT is set and memory alloca-
3066        within  the  pattern. Specifically, it means that either the whole pat-
3069        might do this are detected and faulted at compile time, but  more  com-
3080        match,  or  auxiliary)  can be obtained by calling pcre2_get_error_mes-
3087        The returned message is terminated with a trailing zero, and the  func-
3089        zero. If the error number is unknown, the negative error code PCRE2_ER-
3113        extracting  captured  substrings  as  new,  separate,   zero-terminated
3119        zero refers to the entire matched substring, with higher numbers refer-
3130        extracts a zero-length empty string.
3132        You  can  find the length in code units of a captured substring without
3139        The pcre2_substring_copy_bynumber() function  copies  a  captured  sub-
3142        function  that  was  used for the match data block. The first two argu-
3159        code is returned.  If a substring number greater than zero is used  af-
3182        pattern  is  (abc)|(def) and the subject is "def", and the ovector con-
3193        The pcre2_substring_list_get() function  extracts  all  available  sub-
3195        builds a second list that contains their lengths (in code  units),  ex-
3207        therefore need the lengths, you may supply NULL as the lengthsptr argu-
3209        function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the  mem-
3216        be distinguished from a genuine zero-length substring by inspecting the
3237        To  extract a substring by name, you first have to find the associated num-
3244        the name by calling pcre2_substring_number_from_name(). The first argu-
3253        the "bynumber" functions, the only difference being that the second ar-
3261        than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re-
3267        group numbers in the pcre2pattern page, you cannot use names to distin-
3286        can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.  As
3287        a  special  case,  if  replacement is NULL and rlength is zero, the re-
3288        placement is assumed to be an empty string. If rlength is non-zero,  an
3291        There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
3294        that requests multiple replacements  (see  PCRE2_SUBSTITUTE_GLOBAL  be-
3299        never  greater  than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A nega-
3304        error return. For global replacements, matches in which \K in a lookbe-
3309        pcre2_match(), except that the partial matching options are not permit-
3311        block  is obtained and freed within this function, using memory manage-
3318        will always be a no-match error. The contents of the ovector within the
3326        arguments. The data in the match_data block (return code,  offset  vec-
3328        pcre2_match() from within pcre2_substitute(). This allows  an  applica-
3329        tion to check for a match before choosing to substitute, without having
3333        changed   when   PCRE2_SUBSTITUTE_MATCHED   is  set.  If  PCRE2_SUBSTI-
3334        TUTE_GLOBAL is also set, pcre2_match() is called after the  first  sub-
3335        stitution  to  check for further matches, but this is done using an in-
3339        The  code  argument is not used for matching before the first substitu-
3341        even  when  PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in-
3342        formation such as the UTF setting and the number of capturing parenthe-
3346        subject string with matched substrings replaced. However, if PCRE2_SUB-
3352        The outlengthptr argument of pcre2_substitute() must point to  a  vari-
3358        If the function is not successful, the value set via  outlengthptr  de-
3360        string, the value is the offset in the replacement string where the er-
3361        ror  was  detected.  For  other errors, the value is PCRE2_UNSET by de-
3362        fault. This includes the case of the output buffer being too small, un-
3366        buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3368        continues to go through the motions of matching and substituting (with-
3369        out,  of course, writing anything) in order to compute the size of buf-
3371        variable,  with  the  result  of  the  function  still  being PCRE2_ER-
3376        that the entire operation is carried out twice. Depending on the appli-
3378        the  excess  afterwards,  instead   of   using   PCRE2_SUBSTITUTE_OVER-
3383        invalid UTF replacement string causes an immediate return with the rel-
3386        If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is  not  in-
3387        terpreted in any way. By default, however, a dollar character is an es-
3388        cape character that can specify the insertion of characters  from  cap-
3389        ture  groups  and names from (*MARK) or other control verbs in the pat-
3397        brackets  are  required only if the following character would be inter-
3417        takes  place in the original subject string (that is, previous replace-
3420        subject string. If an offset limit is set in the match context, search-
3424        the subject string by setting either or both of startoffset and an off-
3432        with zero length, an attempt to find a non-empty match at the same off-
3443        PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un-
3446        not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN-
3451        replacement  string.  Without this option, only the dollar character is
3457        particular  character codes, and backslash followed by any non-alphanu-
3464        current state: \U and \L change to upper or lower case forcing, respec-
3469        all  inserted  characters, including those from capture groups and let-
3474        Note that case forcing sequences such as \U...\E do not nest. For exam-
3476        \E has no effect. Note  also  that  the  PCRE2_ALT_BSUX  and  PCRE2_EX-
3483          ${<n>:-<string>}
3486        As  before,  <n> may be a group number or a name. The first form speci-
3507        substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does  cause  un-
3511        PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
3520        PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3523        PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3525        when the simple (non-extended) syntax is used and  PCRE2_SUBSTITUTE_UN-
3539        the replacement string, with more  particular  errors  being  PCRE2_ER-
3548        obtained  by  calling  the pcre2_get_error_message() function (see "Ob-
3565        callout block structure, which contains the following fields, not  nec-
3582        first callout, 2 for the second, and so on. The input and output point-
3597        If the value is zero, the replacement is accepted, and,  if  PCRE2_SUB-
3599        match. If the value is not zero, the current  replacement  is  not  ac-
3603        copied to the output and the call to pcre2_substitute() exits,  return-
3613        capture groups are not required to be unique. Duplicate names  are  al-
3619        match, only one of each set of identically-named  groups  participates.
3624        to  the given name that is set. Only if none are set is
3625        PCRE2_ERROR_UNSET returned. The  pcre2_substring_number_from_name()  function  re-
3637        point to the first and last entries in the name-to-number table for the
3650        The traditional matching function uses a  similar  algorithm  to  Perl,
3651        which  stops when it finds the first match at a given point in the sub-
3654        function (see below) instead. If you cannot use the  alternative  func-
3658        What you have to do is to insert a callout right at the end of the pat-
3659        tern.   When your callout function is called, extract and save the cur-
3677        different characteristics to the normal algorithm, and is not  compati-
3678        ble  with  Perl.  Some  of  the features of PCRE2 patterns are not sup-
3686        is used in a different way, and this is described below. The other com-
3715        PCRE2_COPY_MATCHED_SUBJECT,  PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
3732        matches, but there is still at least one matching possibility. The por-
3735        more detailed discussion of partial and  multi-segment  matching,  with
3741        stop as soon as it has found one match. Because of the way the alterna-
3757        When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3776        which  is  the  number  of  matched substrings. The offsets of the sub-
3782        Calls  to the convenience functions that extract substrings by name re-
3783        turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
3792        NOTE:  PCRE2's  "auto-possessification" optimization usually applies to
3795        matching, this means that only one possible match is found. If you  re-
3796        ally do want multiple matches in such cases, either use an ungreedy re-
3797        peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when  com-
3862        Copyright (c) 1997-2022 University of Cambridge.
3863 ------------------------------------------------------------------------------
3871        PCRE2 - Perl-compatible regular expressions (revised API)
3876        the library in Unix-like environments using the applications  known  as
3878        CMake instead of configure. The text file README contains  general  in-
3879        formation  about building with Autotools (some of which is repeated be-
3881        systems.  There  is a lot more information about building PCRE2 without
3883        "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
3885        non-Unix-like environment.
3888 PCRE2 BUILD-TIME OPTIONS
3892        configure  script,  where  the  optional features are selected or dese-
3893        lected by providing options to configure before running the  make  com-
3894        mand.  However,  the same options can be selected in both Unix-like and
3895        non-Unix-like environments if you are using CMake instead of  configure
3900        compiler, as described in NON-AUTOTOOLS-BUILD.
3903        ones such as the selection of the installation directory)  can  be  ob-
3906          ./configure --help
3909        names begin with --enable or --disable. Because of the way that config-
3910        ure  works, --enable and --disable always come in pairs, so the comple-
3913        with --with. At the end of a configure run, a summary of the configura-
3917 BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
3919        By  default, a library called libpcre2-8 is built, containing functions
3921        either  as single-byte characters, or UTF-8 strings. You can also build
3922        two other libraries, called libpcre2-16 and libpcre2-32, which  process
3923        strings  that  are contained in arrays of 16-bit and 32-bit code units,
3924        respectively. These can be interpreted either as single-unit characters
3925        or  UTF-16/UTF-32 strings. To build these additional libraries, add one
3928          --enable-pcre2-16
3929          --enable-pcre2-32
3931        If you do not want the 8-bit library, add
3933          --disable-pcre2-8
3936        the  POSIX wrapper is for the 8-bit library only, and that pcre2grep is
3937        an 8-bit program. Neither of these is built if  you  select  only  the
3938        16-bit or 32-bit libraries.
3947          --disable-shared
3948          --disable-static
3956        strings.  To build it without Unicode support, add
3958          --disable-unicode
3961        It  is  not  possible to build one library with Unicode support and an-
3962        other without in the same configuration.
3964        Of itself, Unicode support does not make PCRE2 treat strings as  UTF-8,
3965        UTF-16 or UTF-32. To do that, applications that use the library can set
3966        the PCRE2_UTF option when they call pcre2_compile() to compile  a  pat-
3974        and Nd, script names, and some bi-directional properties are supported.
3986        mode,  can  cause unpredictable behaviour because it may leave the cur-
3987        rent matching point in the middle of a multi-code-unit  character.  The
3988        application  can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op-
3989        tion when calling pcre2_compile(). There is also a build-time option
3991          --enable-never-backslash-C
3996 JUST-IN-TIME COMPILER SUPPORT
3998        Just-in-time (JIT) compiler support is included in the build by  speci-
4001          --enable-jit
4007          --enable-jit=auto
4014          --enable-jit-sealloc
4021          --disable-pcre2grep-jit
4029        the end of a line. This is the normal newline  character  on  Unix-like
4033          --enable-newline-is-cr
4035        to the configure command. There is also an  --enable-newline-is-lf  op-
4039        the two-character sequence CRLF (CR immediately followed by LF). If you
4042          --enable-newline-is-crlf
4046          --enable-newline-is-anycrlf
4051          --enable-newline-is-any
4054        newline sequences are the three just mentioned, plus the single charac-
4059          --enable-newline-is-nul
4061        which causes NUL (binary zero) to be set  as  the  default  line-ending
4075          --enable-bsr-anycrlf
4077        the  default  is changed so that \R matches only CR, LF, or CRLF. What-
4085        part to another (for example, from an opening parenthesis to an  alter-
4086        nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
4087        two-byte values are used for these offsets, leading to a  maximum  size
4088        for a compiled pattern of around 64 thousand code units. This is suffi-
4091        compile PCRE2 to use three-byte or four-byte offsets by adding  a  set-
4094          --with-link-size=3
4097        16-bit library, a value of 3 is rounded up to 4.  In  these  libraries,
4099        to load additional data when handling them. For the 32-bit library  the
4100        value  is  always 4 and cannot be overridden; the value of --with-link-
4113          --with-match-limit=500000
4127          --with-heap-limit=500
4137        for  --with-match-limit.  You  can set a lower default limit by adding,
4140          --with-match-limit-depth=10000
4147        This limit was more useful in versions before 10.30, where function re-
4152        for lookaround assertions, atomic groups,  and  recursion  within  pat-
4163          --enable-rebuild-chartables
4166        Instead, a program called pcre2_dftables is compiled and run. This out-
4168        your  C  run-time  system. This method of replacing the tables does not
4178          cc src/pcre2_dftables.c -o pcre2_dftables
4182        want to specify a locale, you must use the -L option:
4184          LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
4186        You can also specify -b (with or without -L). This causes the tables to
4188        can be loaded into memory by an application and  passed  to  pcre2_com-
4190        The tables are just a string of bytes, independent of hardware  charac-
4201        compiled to run in an 8-bit EBCDIC environment by adding
4203          --enable-ebcdic --disable-unicode
4205        to the configure command. This setting implies --enable-rebuild-charta-
4206        bles.  You should only use it if you know that you are in an EBCDIC en-
4209        It is not possible to support both EBCDIC and UTF-8 codes in  the  same
4210        version  of  the  library. Consequently, --enable-unicode and --enable-
4217          --enable-ebcdic-nl25
4219        as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
4221        0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
4224        The options that select newline behaviour, such as --enable-newline-is-
4225        cr, and equivalent run-time options, refer to these character values in
4232        within the patterns it is matching. There are two kinds: one that  gen-
4233        erates output using local code, and another that calls an external pro-
4234        gram or script.  If --disable-pcre2grep-callout-fork is  added  to  the
4236        --disable-pcre2grep-callout is used, all callouts  are  completely  ig-
4237        nored.  For more details of pcre2grep callouts, see the pcre2grep docu-
4247          --enable-pcre2grep-libz
4248          --enable-pcre2grep-libbz2
4250        to the configure command. These options naturally require that the rel-
4262        be processable is the notional buffer size. If a longer line is encoun-
4268          --with-pcre2grep-bufsize=51200
4269          --with-pcre2grep-max-bufsize=2097152
4272        values by using --buffer-size  and  --max-buffer-size  on  the  command
4280          --enable-pcre2test-libreadline
4281          --enable-pcre2test-libedit
4283        to  the configure command, pcre2test is linked with the libreadline or-
4285        it  reads  it using the readline() function. This provides line-editing
4286        and history facilities. Note that libreadline is  GPL-licensed,  so  if
4291        Setting  --enable-pcre2test-libreadline causes the -lreadline option to
4293        system-installed  readline  library this is sufficient. However, in some
4305          LIBS="-lncurses"
4314          --enable-debug
4324          --enable-valgrind
4327        certain  memory  regions as unaddressable. This allows it to detect in-
4337          --enable-coverage
4349        When --enable-coverage is used,  the  following  additional  targets  are
4355        equivalent to running "make coverage-reset", "make  coverage-baseline",
4356        "make check", and then "make coverage-report".
4358          make coverage-reset
4362          make coverage-baseline
4366          make coverage-report
4370          make coverage-clean-report
4372        This  removes the generated coverage report without cleaning the cover-
4375          make coverage-clean-data
4377        This removes the captured coverage data without removing  the  coverage
4380          make coverage-clean
4383        For more information about code coverage, see the gcov and  lcov  docu-
4397          --disable-percent-zt
4409          --enable-fuzz-support
4411        At present this applies only to the 8-bit library. If set, it causes an
4412        extra  library  called  libpcre2-fuzzsupport.a to be built, but not in-
4413        stalled. This contains a single  function  called  LLVMFuzzerTestOneIn-
4420        Setting  --enable-fuzz-support  also  causes  a binary called pcre2fuz-
4435          --disable-stack-for-recursion
4444        pcre2api(3), pcre2-config(3).
4457        Copyright (c) 1997-2022 University of Cambridge.
4458 ------------------------------------------------------------------------------
4466        PCRE2 - Perl-compatible regular expressions (revised API)
4481        PCRE2  provides  a feature called "callout", which is a means of tempo-
4487        When  using the pcre2_substitute() function, an additional callout fea-
4497        ending delimiter is the same as the start, except for {, where the end-
4517          A(\d{2}|--)
4521          (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4524        alternation bar. If the pattern contains a conditional group whose con-
4538        information when you are trying to optimize the performance of  a  par-
4548    Auto-possessification
4550        At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4556          --->aaaa
4564        the  auto-possessify  feature  by  passing   PCRE2_NO_AUTO_POSSESS   to
4568          --->aaaa
4585        beginning of the subject, and pcre2_compile() remembers this. If a pat-
4586        tern has more than one top-level branch, automatic anchoring occurs  if
4591        It  is  also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
4597          --->aa
4604        This  shows  that all match attempts start at the beginning of the sub-
4605        ject. In other words, the pattern is anchored. You can disable this op-
4607        starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the  out-
4610          --->aa
4620        This  shows more match attempts, starting at the second subject charac-
4637        string, and will immediately give a "no match" return without  actually
4641        You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4651        to normal, DFA, and JIT matching. The first argument to the call-
4677        version 1, and the callout_flags field for version 2. If you are  writ-
4686        contains  the  number  of  the callout, in the range 0-255. This is the
4693        callout_string points to the string that is contained within  the  com-
4701        delimiter as callout_string[-1] if you need it.
4717        For calls to pcre2_match(), the offset_vector field is not  (since  re-
4719        matching function in the match data block. Instead it points to an  in-
4725        The capture_last field contains the number of the  most  recently  cap-
4727        number of the highest numbered captured substring so far.  If  no  sub-
4733        The contents of ovector[2] to  ovector[<capture_top>*2-1]  can  be  in-
4742        was  passed  to the matching function in the match data block for call-
4753        at which the current match attempt started. However, if the escape  se-
4768        parenthesis, the length includes meta characters that follow the paren-
4771        the length is one, unless a closing parenthesis is followed by a  quan-
4774        was  that of the entire group, and before an alternation bar or a clos-
4780        are used by pcre2test to show the next item to be matched when display-
4784        zero-terminated  name of the most recently passed (*MARK), (*PRUNE), or
4786        Instances  of  (*PRUNE)  or  (*THEN) without a name do not obliterate a
4791        pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
4806        starting  position in the subject. Output from pcre2test does not indi-
4810        The information in the callout_flags field is provided so that applica-
4814        because  there is no backtracking in DFA matching, and there is no sup-
4828        Negative values should normally be chosen from  the  set  of  PCRE2_ER-
4846        which they appear. Its first argument is a pointer to a callout enumer-
4848        passed  to  pcre2_callout_enumerate(). The data block contains the fol-
4866        non-zero minimum or a fixed maximum, the group is replicated inside the
4872        The callback function should normally return zero. If it returns a non-
4887        Copyright (c) 1997-2019 University of Cambridge.
4888 ------------------------------------------------------------------------------
4896        PCRE2 - Perl-compatible regular expressions (revised API)
4898 DIFFERENCES BETWEEN PCRE2 AND PERL
4901        and Perl handle regular expressions. The differences described here are
4902        with  respect  to  Perl  version 5.34.0, but as both Perl and PCRE2 are
4905        1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier)  is  not  set,
4906        the behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.'
4907        matches the next character unless it is the  start  of  a  newline  se-
4910        and  NL  (either  0x15 or 0x25) when using EBCDIC. In Perl, '.' appears
4913        2. PCRE2 has only a subset of Perl's Unicode support. Details  of  what
4916        3.  Like  Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
4918        does not assert that the next three characters are not "a". It just as-
4920        PCRE2  optimizes this to run the assertion just once). Perl allows some
4923        on non-lookaround assertions.
4928        the  condition  is  false).   Perl may set such capture groups in other
4931        5. The following Perl escape sequences are not supported: \F,  \l,  \L,
4932        \u, \U, and \N when followed by a character name. \N on its own, match-
4933        ing a non-newline character, and \N{U+dd..}, matching  a  Unicode  code
4935        letters are implemented by Perl's general string-handling and  are  not
4941        6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
4946        PCRE2  and  Perl  support the Cs (surrogate) property, but in PCRE2 its
4948        long  synonyms  for  property names that Perl supports (such as \p{Let-
4954        from  Perl  in  that  $  and  @ are also handled as literals inside the
4955        quotes. In Perl, they cause variable interpolation (PCRE2 does not have
4956        variables). Also, Perl does "double-quotish backslash interpolation" on
4961            Pattern            PCRE2 matches     Perl matches
4971        classes by both PCRE2 and Perl.
4980        and backtracking into subroutine calls is now supported, as in Perl.
4984        their effect is confined to that group; it does not extend to the  sur-
4985        rounding  pattern.  This is not always the case in Perl. In particular,
4994        in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
4999        matching  "aba"  against the pattern /^(a(b)?)+$/ in Perl leaves $2 un-
5003        not  as  general as Perl's. This is a consequence of the fact that PCRE2
5004        works internally just with numbers, using an external table  to  trans-
5012        14. Perl used to recognize comments in some places that PCRE2 does not,
5014        modifier is set, Perl allowed white space between ( and  ?  though  the
5016        may still be some cases where Perl behaves differently.
5018        15. Perl, when in warning mode, gives warnings  for  character  classes
5019        such  as  [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
5024        not affected when case-independent matching is specified. For  example,
5025        \p{Lu} always matches an upper case letter. I think Perl has changed in
5030        17. From release 5.32.0, Perl locks out the use of \K in lookaround as-
5032        there is an option for re-enabling the previous  behaviour.  When  this
5036        18. PCRE2 provides some extensions to the Perl regular  expression  fa-
5037        cilities.   Perl  5.10  included  new features that were not in earlier
5038        versions of Perl, some of which (such as  named  parentheses)  were  in
5039        PCRE2 for some time before. This list is with respect to Perl 5.34:
5043        match  a  different  length of string. Perl used to require them all to
5047        (b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
5048        ported in lookbehinds, provided that there is no possibility of  refer-
5049        encing  a  non-unique  number or name. Perl does not support backrefer-
5053        $ meta-character matches only at the very end of the string.
5056        faulted. (Perl can be made to issue a warning.)
5058        (e) If PCRE2_UNGREEDY is set, the greediness of the repetition  quanti-
5059        fiers is inverted, that is, by default they are not greedy, but if fol-
5066        PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
5071        (i)  The  callout  facility is PCRE2-specific. Perl supports codeblocks
5074        (j) The partial matching facility is PCRE2-specific.
5077        different way and is not Perl-compatible.
5083        (m)  PCRE2  supports non-atomic positive lookaround assertions. This is
5084        an extension to the lookaround facilities. The default, Perl-compatible
5087        19.  The  Perl  /a modifier restricts \d numbers to pure ASCII, and the
5088        /aa modifier restricts /i case-insensitive matching to pure ASCII,  ig-
5092        20. Perl has different limits than PCRE2. See the pcre2limit documenta-
5093        tion for details. Perl went with 5.10 from recursion to iteration keep-
5095        not  fall into any stack-overflow limit. PCRE2 made a similar change at
5096        release 10.30, and also has many build-time and  run-time  customizable
5110        Copyright (c) 1997-2021 University of Cambridge.
5111 ------------------------------------------------------------------------------
5119        PCRE2 - Perl-compatible regular expressions (revised API)
5121 PCRE2 JUST-IN-TIME COMPILER SUPPORT
5123        Just-in-time  compiling  is a heavyweight optimization that can greatly
5124        speed up pattern matching. However, it comes at the cost of extra  pro-
5126        the same pattern is going to be matched many times. This does not  nec-
5128        anchored, matching attempts may take place many times at various  posi-
5130        string is very long, it may still pay  to  use  JIT  even  for  one-off
5131        matches.  JIT  support  is  available  for all of the 8-bit, 16-bit and
5132        32-bit PCRE2 libraries.
5134        JIT support applies only to the  traditional  Perl-compatible  matching
5142        --enable-jit  (or  equivalent  CMake  option) must be set when PCRE2 is
5146          ARM 32-bit (v5, v7, and Thumb2)
5147          ARM 64-bit
5149          Intel x86 32-bit and 64-bit
5150          MIPS 32-bit and 64-bit
5151          Power PC 32-bit and 64-bit
5152          SPARC 32-bit
5154        If --enable-jit is set on an unsupported platform, compilation fails.
5156        A  program  can  tell if JIT support is available by calling pcre2_con-
5160        falls  back  to the interpretive code if JIT is not available. For pro-
5162        path" API that is JIT-specific.
5171        second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
5182        the size of machine stack that it uses. The exact rules are  not  docu-
5183        mented because they may change at any time, in particular, when new op-
5187        PCRE2_JIT_COMPLETE  requests the JIT compiler to generate code for com-
5188        plete matches. If you want to run partial matches using the  PCRE2_PAR-
5193        pcre2_match()  is  called,  the appropriate code is run if it is avail-
5198        the option bits. For example, you can call it once with  PCRE2_JIT_COM-
5201        will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
5202        ing. If pcre2_jit_compile() is called with no option bits set, it imme-
5210        are described in the section entitled "Controlling the JIT  stack"  be-
5219        stack"  below,  even  if  you  do  not need to supply a non-default JIT
5221        be  obeyed.  If the match-time options are not right for JIT execution,
5224        If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
5226        pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op-
5227        tion.  A  non-zero  result means that JIT compilation was successful. A
5236        are  normally expected to be a valid sequence of UTF code units. By de-
5237        fault, this is checked at the start of matching and an error is  gener-
5245        PCRE2_MATCH_INVALID_UTF option has two effects:  it  tells  the  inter-
5246        preter  in pcre2_match() to support invalid UTF, and, if pcre2_jit_com-
5247        pile() is called, the compiled JIT code also supports invalid UTF.  De-
5252        PCRE2_JIT_INVALID_UTF, which currently exists only for backward compat-
5270        when running in a UTF mode, and a callout immediately before an  asser-
5279        that the memory used for the JIT stack was insufficient. See  "Control-
5293        large  or complicated patterns need more than this. The error PCRE2_ER-
5294        ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func-
5299        The  pcre2_jit_stack_create()  function  creates a JIT stack. Its argu-
5305        function  returns immediately, without doing anything. (For the techni-
5317        The first argument is a pointer to a match context. When this is subse-
5319        JIT stack is used. If this argument is NULL, the function returns imme-
5320        diately, without doing anything. There are three cases for  the  values
5339        is not obeyed when pcre2_match() is called with options that are incom-
5341        determine whether a match operation was executed by JIT or by  the  in-
5347        up non-sequential matches in one thread is to use callouts: if a  call-
5352        you  assign or pass back NULL from a callback, that is thread-safe, be-
5354        pass back a non-NULL JIT stack, this must be a different stack for each
5355        thread so that the application is thread-safe.
5357        Strictly speaking, even more is allowed. You can assign the  same  non-
5366        up non-default JIT stacks might operate:
5374          Use a one-line callback function
5385        PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
5387        child nodes.  Allocating real machine stack on some platforms is diffi-
5394        Modern  operating  systems have a nice feature: they can reserve an ad-
5396        pages inside this address space, so the stack could grow without moving
5397        memory data (this is important because of pointers). Thus we can  allo-
5417        You can free compiled patterns, contexts, and stacks in any order, any-
5430        this without keeping a list of patterns.
5436        Especially  on embedded systems, it might be a good idea to release mem-
5437        ory sometimes without freeing the stack. There is no API  for  this  at
5439        allocated memory for any stack and another which allows releasing  mem-
5453        The JIT executable allocator does not free all memory when it is possi-
5457        calling  pcre2_jit_free_unused_memory(). Its argument is a general con-
5458        text, for custom memory management, or NULL for standard memory manage-
5464        This  is  a  single-threaded example that specifies a JIT stack without
5501        The fast path function is called pcre2_jit_match(), and  it  takes  ex-
5503        must be specified with a  length;  PCRE2_ZERO_TERMINATED  is  not  sup-
5504        ported. Unsupported option bits (for example, PCRE2_ANCHORED, PCRE2_EN-
5507        pcre2_match(), plus PCRE2_ERROR_JIT_BADOPTION if a matching mode  (par-
5511        number of other sanity checks are performed on the arguments. For exam-
5512        ple,  if the subject pointer is NULL but the length is non-zero, an im-
5537        Copyright (c) 1997-2021 University of Cambridge.
5538 ------------------------------------------------------------------------------
5546        PCRE2 - Perl-compatible regular expressions (revised API)
5554        code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5555        the default internal linkage size, which  is  2  bytes  for  these  li-
5558        (when  building  the  16-bit  library,  3  is rounded up to 4). See the
5560        for  details.  In  these cases the limit is substantially larger.  How-
5561        ever, the speed of execution is slower. In the 32-bit library, the  in-
5569        the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un-
5571        is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-termi-
5590        (*THEN)  verb  is  255  code units for the 8-bit library and 65535 code
5591        units for the 16-bit and 32-bit libraries.
5594        number a 32-bit unsigned integer can hold.
5611        Copyright (c) 1997-2022 University of Cambridge.
5612 ------------------------------------------------------------------------------
5620        PCRE2 - Perl-compatible regular expressions (revised API)
5627        pcre2_match() function. This works in the same way as Perl's  matching
5628        function,  and provides a Perl-compatible matching operation. The just-
5629        in-time (JIT) optimization that is described in the pcre2jit documenta-
5633        it operates in a different way, and is not Perl-compatible. This alter-
5634        native  has advantages and disadvantages compared with the standard al-
5654        The set of strings that are matched by a regular expression can be rep-
5659        tree:  depth-first  and  breadth-first, and these correspond to the two
5665        In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
5667        depth-first search of the pattern tree. That is, it  proceeds  along  a
5669        required. When there is a mismatch, the algorithm  tries  any  alterna-
5678        that  point the algorithm stops. Thus, if there is more than one possi-
5681        on the way the alternations and the greedy or ungreedy repetition quan-
5684        Because  it  ends  up  with a single path through the tree, it is rela-
5685        tively straightforward for this algorithm to keep  track  of  the  sub-
5692        This algorithm conducts a breadth-first search of  the  tree.  Starting
5701        scans  the subject string only once, without backtracking, there is one
5703        following  or  preceding the current point have to be independently in-
5710        this algorithm finds all of them, and in particular, it finds the long-
5718        the match data block is therefore not advisable when doing  DFA  match-
5728        the fifth character of the subject. The algorithm  does  not  automati-
5731        PCRE2's "auto-possessification" optimization usually applies to charac-
5732        ter repeats at the end of a pattern (as well as internally). For  exam-
5737        either  use  an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
5741        not  supported  or behave differently in the alternative matching func-
5744        1. Because the algorithm finds all possible matches, the greedy or  un-
5746        affect auto-possessification,  as  just  described).  During  matching,
5755        a  non-possessive quantifier. Similarly, if an atomic group is present,
5763        algorithm does not attempt to do this. This means that no captured sub-
5766        3. Because no substrings are captured, backreferences within  the  pat-
5769        4.  For  the same reason, conditional expressions that use a backrefer-
5775        6. Because many paths through the tree may be active, the \K escape se-
5784        these modes, because the alternative algorithm moves through  the  sub-
5792        10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not  sup-
5806        matching and discusses multi-segment matching.
5836        Copyright (c) 1997-2021 University of Cambridge.
5837 ------------------------------------------------------------------------------
5845        PCRE2 - Perl-compatible regular expressions
5860        Another example is checking a user input string as it is typed, to  en-
5864        Partial matching is a PCRE2-specific feature; it is  not  Perl-compati-
5866        PCRE2_PARTIAL_SOFT options when calling a matching function.  The  dif-
5872        If you want to use partial matching with just-in-time  optimized  code,
5874        you must also call pcre2_jit_compile() with one or both  of  these  op-
5880        PCRE2_JIT_COMPLETE  should also be set if you are going to run non-par-
5885        Setting a partial matching option disables two of PCRE2's standard  op-
5886        timization  hints. PCRE2 remembers the last literal code unit in a pat-
5902        Example 1: if the pattern is /abc/ and the subject is "ab", more  char-
5908        what is matched. In this case, only PCRE2_PARTIAL_HARD returns  a  par-
5919        assertions  and the \K escape sequence provide ways of inspecting char-
5922        (2) The pattern contains one or more lookbehind assertions. This condi-
5923        tion  exists in case there is a lookbehind that inspects characters be-
5930        because adding more characters  might  result  in  a  non-empty  match,
5932        "there is going to be a match at this point, but until some more  char-
5933        acters are added, we do not know if it will be an empty string or some-
5943          A complete match has been found, starting and ending within this sub-
5955        the rest of the ovector are undefined. The appearance of \K in the pat-
5964        string "abc12", because all these characters are needed  for  a  subse-
5965        quent re-match with additional characters.
5972        If this is matched against the subject string "abc123dog", both  alter-
5975        and  9, identifying "123dog" as the first partial match. (In this exam-
5985        as  a partial match is found, without continuing to search for possible
5987        partial match over a later complete match. For this reason, the assump-
5989        true  end of the available data, which is why \z, \Z, \b, \B, and $ al-
5994        tried. If no complete match can be found,  PCRE2_ERROR_PARTIAL  is  re-
5997        items  in a pattern behave as if the subject string is potentially com-
5999        for \b and \B the end of the subject is treated as a non-alphanumeric.
6001        The  difference  between the two partial matching options can be illus-
6009        However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
6056 MULTI-SEGMENT MATCHING WITH pcre2_match()
6058        PCRE  was  not originally designed with multi-segment matching in mind.
6060        multi-segment matching possible have been added. A very long string can
6062        with the aim of achieving the same results that would happen if the en-
6074        When a partial match occurs, the next segment must be added to the cur-
6075        rent subject and the match re-run, using the  startoffset  argument  of
6091        buffer  is discarded, the second half is moved to the start of the buf-
6095        If there are memory constraints, you may want to discard text that pre-
6114        of characters that must be retained in order to get the right match re-
6119        use that to decide how much text to retain. The only lookbehind  infor-
6124        maximum number of characters (not code units) that any individual look-
6129        In a non-UTF or a 32-bit case, moving back is just a  subtraction,  but
6130        in  UTF-8  or  UTF-16  you  have  to count characters while moving back
6137        without  backtracking,  searching  for  all possible matches simultane-
6138        ously. If the end of the subject is reached before the end of the  pat-
6149        there  is no difference between greedy and ungreedy repetition, its be-
6160 MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
6164        and  calling  the function again with the same compiled regular expres-
6166        same working space as before, because this is where details of the pre-
6177        The first call has "23ja" as the subject, and requests  partial  match-
6180        last  part  is  shown;  PCRE2 does not retain the previously partially-
6194        match  at one point in the subject are remembered. Depending on the ap-
6199        complete match, as described for pcre2_match() above. Another possibil-
6216        Copyright (c) 1997-2019 University of Cambridge.
6217 ------------------------------------------------------------------------------
6225        PCRE2 - Perl-compatible regular expressions (revised API)
6230        by PCRE2 are described in detail below. There is a quick-reference syn-
6231        tax  summary  in the pcre2syntax page. PCRE2 tries to match Perl syntax
6232        and semantics as closely as it can.  PCRE2 also supports some  alterna-
6233        tive  regular  expression syntax (which does not conflict with the Perl
6237        Perl's  regular expressions are described in its own documentation, and
6239        of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex-
6244        This document discusses the regular expression patterns that  are  sup-
6248        not Perl-compatible. Some of  the  features  discussed  below  are  not
6250        of the alternative function, and how it differs from the  normal  func-
6254 SPECIAL START-OF-PATTERN ITEMS
6257        set by special items at the start of a pattern. These are not Perl-com-
6259        writers who are not able to change the program that processes the  pat-
6260        tern.  Any  number  of these items may appear, but they must all be to-
6266        In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
6267        as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
6268        can  be  specified  for the 32-bit library, in which case it constrains
6279        restrict  them  to  non-UTF  data  for   security   reasons.   If   the
6280        PCRE2_NEVER_UTF  option is passed to pcre2_compile(), (*UTF) is not al-
6287        causes sequences such as \d and \w to use Unicode properties to  deter-
6289        less than 256 via a lookup table. It also causes upper/lower casing op-
6302        to whichever matching function is subsequently called to match the pat-
6303        tern. These options lock out the matching of empty strings, either  en-
6306    Disabling auto-possessification
6314    Disabling start-up optimizations
6317        setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
6324        as  setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
6325        tions that apply to patterns whose top-level branches all start with .*
6345        These facilities are provided to catch runaway matches  that  are  pro-
6346        voked  by patterns with huge matching trees. A common example is a pat-
6356        where d is any number of decimal digits. However, the value of the set-
6379        strings: a single CR (carriage return) character, a  single  LF  (line-
6380        feed) character, the two-character sequence CRLF, any of the three pre-
6385        It is also possible to specify a newline convention by starting a  pat-
6395        These override the default and the options given to the compiling func-
6396        tion. For example, on a Unix system where LF is the default newline se-
6405        The  newline  convention affects where the circumflex and dollar asser-
6406        tions are true. It also affects the interpretation of the dot metachar-
6409        escape  sequence  matches.  By default, this is any Unicode newline se-
6410        quence, for Perl compatibility. However, this can be changed;  see  the
6420        starting a pattern with (*BSR_ANYCRLF).  For  completeness,  (*BSR_UNI-
6427        character code instead of ASCII or Unicode (typically a mainframe  sys-
6428        tem).  In  the  sections below, character code values are ASCII or Uni-
6446        their lower case ASCII equivalents, are  case-equivalent  with  Unicode
6456        There are two different sets of metacharacters: those that  are  recog-
6479          -      indicates character range
6487        or a # character as part of the pattern. If the PCRE2_EXTENDED_MORE op-
6488        tion is set, the same applies, but in addition unescaped space and hor-
6492        settings can be changed within a pattern; see the section entitled "In-
6508        always safe to precede a non-alphanumeric  with  backslash  to  specify
6509        that it stands for itself.  In particular, if you want to match a back-
6512        Only ASCII digits and letters have any special meaning  after  a  back-
6517        do so by putting them between \Q and \E. This is different from Perl in
6519        whereas  in Perl, $ and @ cause variable interpolation. Also, Perl does
6520        "double-quotish backslash interpolation" on any backslashes between  \Q
6522        PCRE2 treats a backslash between \Q and \E just like any other  charac-
6525          Pattern            PCRE2 matches   Perl matches
6542    Non-printing characters
6544        A second use of backslash provides a way of encoding non-printing char-
6546        appearance  of non-printing characters in a pattern, but when a pattern
6548        following  escape  sequences  instead of the binary character it repre-
6549        sents. In an ASCII or Unicode environment, these escapes  are  as  fol-
6553          \cx         "control-x", where x is any printable ASCII character
6566        By  default, after \x that is not followed by {, from zero to two hexa-
6568        number of hexadecimal digits may appear between \x{ and }. If a charac-
6573        of the two syntaxes for \x or by an octal sequence. There is no differ-
6578        Support  is  available  for some ECMAScript (aka JavaScript) escape se-
6579        quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se-
6581        two hexadecimal digits is it recognized as a character  escape.  Other-
6582        wise  it  is interpreted as a literal "x" character. In this mode, sup-
6587        PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in  ad-
6588        dition, \u{hhh..} is recognized as the character specified by hexadeci-
6592        The  \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper-
6593        ating in UTF mode. Perl also uses \N{name}  to  specify  characters  by
6595        followed by an opening brace (curly bracket) it has an entirely differ-
6598        There  are some legacy applications where the escape sequence \r is ex-
6608        32 or greater than 126, a compile-time error occurs.
6612        The \c escape is processed as specified for Perl in the perlebcdic doc-
6613        ument. The only characters that are allowed after \c are A-Z,  a-z,  or
6614        one  of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
6616        letters  (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
6617        \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?  be-
6626        but  because  127  is  not a control character in EBCDIC, Perl makes it
6629        FF), but in the one Perl calls POSIX-BC its value is 95  (hex  5F).  If
6630        certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6634        than  two  digits,  just  those that are present are used. Thus the se-
6641        recent addition to Perl; it provides a way of specifying character code
6646        a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
6647        cal character code points, and \g{} to specify backreferences. The fol-
6650        The handling of a backslash followed by a digit other than 0 is compli-
6651        cated, and Perl has changed over time, causing PCRE2 also to change.
6653        Outside a character class, PCRE2 reads the digit and any following dig-
6656        groups  in the expression, the entire sequence is taken as a backrefer-
6658        discussion  of parenthesized groups.  Otherwise, up to three octal dig-
6661        Inside a character class, PCRE2 handles \8 and \9 as the literal  char-
6662        acters  "8"  and "9", and otherwise reads up to three octal digits fol-
6663        lowing the backslash, using them to generate a data character. Any sub-
6690          8-bit non-UTF mode    no greater than 0xff
6691          16-bit non-UTF mode   no greater than 0xffff
6692          32-bit non-UTF mode   no greater than 0xffffffff
6696        (the so-called "surrogate" code points). The check  for  these  can  be
6699        UTF-8  and  UTF-32 modes, because these values are not representable in
6700        UTF-16.
6715        In  Perl,  the  sequences  \F, \l, \L, \u, and \U are recognized by its
6718        However, if either of the PCRE2_ALT_BSUX  or  PCRE2_EXTRA_ALT_BSUX  op-
6724        The sequence \g followed by a signed or unsigned number, optionally en-
6731        For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
6734        Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
6735        \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
6752          \W     any "non-word" character
6757        has a different meaning. See the section entitled "Non-printing charac-
6758        ters" above for details. Perl also uses \N{name} to specify  characters
6761        Each  pair of lower and upper case escape sequences partitions the com-
6770        (13), and space (32), which are defined as white space in the  "C"  lo-
6771        cale.  This  list may vary if locale-specific matching is taking place.
6772        For example, in some locales the "non-breaking space" character  (\xA0)
6776        or digit.  By default, the definition of letters  and  digits  is  con-
6777        trolled by PCRE2's low-valued character tables, and may vary if locale-
6779        page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
6786        be different for characters in the range 128-255  when  locale-specific
6788        meanings from before Unicode support was available,  mainly  for  effi-
6810          U+00A0     Non-break space
6817          U+2004     Three-per-em space
6818          U+2005     Four-per-em space
6819          U+2006     Six-per-em space
6824          U+202F     Narrow no-break space
6838        In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
6844        any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
6849        This is an example of an "atomic group", details of which are given be-
6850        low.  This particular group matches either the  two-character  sequence
6852        U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car-
6854        atomic group, the two-character sequence is treated as  a  single  unit
6858        than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
6864        PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation  for  "back-
6866        the case, the other behaviour can be requested via  the  PCRE2_BSR_UNI-
6873        These override the default and the options given to the compiling func-
6874        tion.  Note that these special settings, which are not Perl-compatible,
6877        used. They can be combined with a change of newline convention; for ex-
6883        Inside a character class, \R is treated as an unrecognized  escape  se-
6888        When  PCRE2  is  built  with Unicode support (the default), three addi-
6890        are available. They can be used in any mode, though in 8-bit and 16-bit
6891        non-UTF modes these sequences are of course limited to testing  charac-
6893        In 32-bit non-UTF mode, code points greater than 0x10ffff (the  Unicode
6894        limit)  may  be  encountered. These are all treated as being in the Un-
6898        to  do  a  multistage table lookup in order to find a character's prop-
6907          \P{xx}   a character without the xx property
6910        The  property names represented by xx above are not case-sensitive, and
6914        (including  newline),  Bidi_Class,  a number of binary (yes/no) proper-
6916        other  Perl  properties such as "InMusicalSymbols" are not supported by
6922        There are three different syntax forms for matching a script. Each Uni-
6929        sign is an alternative to the colon. If a script name is given  without
6930        a  property  type,  for example, \p{Adlam}, it is treated as \p{scx:Ad-
6931        lam}. Perl changed to this interpretation at  release  5.26  and  PCRE2
6934        Unassigned characters (and in non-UTF 32-bit mode, characters with code
6936        that  are not part of an identified script are lumped together as "Com-
6937        mon". The current list of recognized script names and their 4-character
6940          pcre2test -LS
6945        Each character has exactly one Unicode general category property, spec-
6946        ified by a two-letter abbreviation. For compatibility with Perl,  nega-
6951        If only one letter is specified with \p or \P, it includes all the gen-
6978          Mn    Non-spacing mark
7010        points  are in the range U+D800 to U+DFFF. These characters are no dif-
7012        16-bit  or  32-bit  library).   However,  they are not valid in Unicode
7013        strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid-
7017        The long synonyms for  property  names  that  Perl  supports  (such  as
7021        No character that is in the Unicode table has the Cn (unassigned) prop-
7027        different from the behaviour of current versions of Perl.
7036          pcre2test -LP
7055          L           left-to-right
7056          LRE         left-to-right embedding
7057          LRI         left-to-right isolate
7058          LRO         left-to-right override
7059          NSM         non-spacing mark
7063          R           right-to-left
7064          RLE         right-to-left embedding
7065          RLI         right-to-left isolate
7066          RLO         right-to-left override
7071        case-insensitive; only the short names listed above are recognized.
7082        properties  that had been used for emojis.  Instead it introduced vari-
7083        ous emoji-specific properties. PCRE2  uses  only  the  Extended  Picto-
7092        2.  Do not end between CR and LF; otherwise end after any control char-
7098        be  followed  by  a V or T character; an LVT or T character may be fol-
7102        "zero-width  joiner" character. Characters with the "mark" property al-
7109        property.  Extend and ZWJ characters are allowed  between  the  charac-
7112        7.  Do not break within emoji flag sequences. That is, do not break be-
7120        As  well as the standard Unicode properties described above, PCRE2 sup-
7121        ports four more that make it possible to convert traditional escape se-
7123        non-standard, non-Perl properties internally  when  PCRE2_UCP  is  set.
7128          Xsp   Any Perl space character
7129          Xwd   Any Perl "word" character
7131        Xan  matches  characters that have either the L (letter) or the N (num-
7134        (separator) property.  Xsp is the same as Xps; in PCRE1 it used to  ex-
7135        clude  vertical  tab,  for  Perl  compatibility,  but Perl changed. Xwd
7138        There is another non-standard property, Xuc, which matches any  charac-
7145        Note that the Xuc property does not match these sequences but the char-
7151        characters not to be included in the final matched sequence that is re-
7162        mode), though it again reports the matched string as "bar".  This  fea-
7166        does not interfere with the setting of captured substrings.  For  exam-
7173        From  version  5.32.0  Perl  forbids the use of \K in lookaround asser-
7176        pcre2_compile() to re-enable the previous behaviour. When  this  option
7193        The  final use of backslash is for certain simple assertions. An asser-
7195        a  match, without consuming any characters from the subject string. The
7216        changed by setting the PCRE2_UCP option. When this is done, it also af-
7217        fects  \b and \B. Neither PCRE2 nor Perl has a separate "start of word"
7225        set.  Thus,  they are independent of multiline mode. These three asser-
7227        which  affect only the behaviour of the circumflex and dollar metachar-
7228        acters. However, if the startoffset argument of pcre2_match()  is  non-
7235        the  start point of the matching process, as specified by the startoff-
7237        startoffset  is  non-zero. By calling pcre2_match() multiple times with
7238        appropriate arguments, you can mimic Perl's /g option,  and  it  is  in
7243        Perl's,  which  defines it as true at the end of the previous match. In
7244        Perl, these can be different when the  previously  matched  string  was
7255        The circumflex and dollar  metacharacters  are  zero-width  assertions.
7256        That  is,  they test for a particular condition being true without con-
7258        are  concerned  with matching the starts and ends of lines. If the new-
7259        line convention is set so that only the two-character sequence CRLF  is
7265        point is at the start of the subject string. If the  startoffset  argu-
7266        ment  of  pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
7268        character  class, circumflex has an entirely different meaning (see be-
7275        if the pattern is constrained to match only at the start  of  the  sub-
7280        matching  point is at the end of the subject string, or immediately be-
7281        fore a newline at the end of the string (by default), unless  PCRE2_NO-
7282        TEOL  is  set.  Note, however, that it does not actually match the new-
7285        branch in which it appears. Dollar has no special meaning in a  charac-
7297        a newline that ends the string, for compatibility with  Perl.  However,
7305        pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
7308        When  the  newline  convention (see "Newline conventions" below) recog-
7309        nizes the two-character sequence CRLF as a newline, this is  preferred,
7310        even  if  the  single  characters CR and LF are also recognized as new-
7324        Outside a character class, a dot in the pattern matches any one charac-
7325        ter  in  the subject string except (by default) a character that signi-
7329        Dot  never matches a single line-ending character. When the two-charac-
7338        PCRE2_DOTALL option is set, a dot matches any  one  character,  without
7339        exception.   If  the two-character sequence CRLF is present in the sub-
7342        The handling of dot is entirely independent of the handling of  circum-
7352        the section entitled "Non-printing characters" above for details.  Perl
7360        unit,  whether or not a UTF mode is set. In the 8-bit library, one code
7361        unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
7362        32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
7363        line-ending characters. The feature is provided in  Perl  in  order  to
7364        match individual bytes in UTF-8 mode, but it is unclear how it can use-
7368        one  unit  with  \C  in UTF-8 or UTF-16 mode means that the rest of the
7369        string may start with a malformed UTF character. This has undefined re-
7371        in a valid UTF string (by default it checks the subject string's valid-
7380        below)  in UTF-8 or UTF-16 modes, because this would make it impossible
7383        these UTF modes.  The former gives a match-time error; the latter fails
7386        In  the  32-bit  library, however, \C is always supported (when not ex-
7388        whether or not UTF-32 is specified.
7391        using it that avoids the problem of malformed UTF-8 or  UTF-16  charac-
7393        as in this pattern, which could be used with  a  UTF-8  string  (ignore
7396          (?| (?=[\x00-\x7f])(\C) |
7397              (?=[\x80-\x{7ff}])(\C)(\C) |
7398              (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
7399              (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
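The four branch ranges in the pattern above follow the UTF-8 rule that a code point's encoded length depends only on its value. The sketch below restates that rule in Python and cross-checks it against Python's own encoder. The function name utf8_length is hypothetical, and the code illustrates the encoding rule only, not PCRE2's \C handling.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses to encode the given code point."""
    if code_point <= 0x7F:
        return 1   # first branch:  \x00-\x7f
    if code_point <= 0x7FF:
        return 2   # second branch: \x80-\x{7ff}
    if code_point <= 0xFFFF:
        return 3   # third branch:  \x{800}-\x{ffff}
    return 4       # fourth branch: \x{10000} and above


# Cross-check one sample per branch against Python's UTF-8 encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```

Surrogate code points (U+D800 to U+DFFF) are left out of the cross-check because Python's encoder rejects them.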
7403        below). The assertions at the start of each branch check the next UTF-8
7404        character for values whose encoding uses 1, 2, 3, or 4  bytes,  respec-
7405        tively.  The  character's individual bytes are then captured by the ap-
7412        closing square bracket. A closing square bracket on its own is not spe-
7431        class that starts with a circumflex is not an assertion; it still  con-
7437        letters in a class represent both their upper case and lower case  ver-
7440        would.  Note that there are two ASCII characters, K and S, that, in ad-
7441        dition to their lower case ASCII equivalents, are case-equivalent  with
7442        Unicode  U+212A (Kelvin sign) and U+017F (long S) respectively when ei-
7446        special  way  when matching character classes, whatever line-ending se-
7454        matches  any  hexadecimal digit. In UTF modes, the PCRE2_UCP option af-
7459        backspace character. The sequences \B, \R, and \X are not  special  in-
7464        The  minus (hyphen) character can be used to specify a range of charac-
7465        ters in a character class. For example, [d-m] matches  any  letter  be-
7470        [b-d-z] matches letters in the range b to d, a hyphen character, or z.
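A quick way to see this reading of [b-d-z] is to test single characters against the class. The sketch below uses Python's re module rather than PCRE2, on the assumption (true for this simple class) that both parse it the same way: the range b to d, then a literal hyphen, then a literal z.

```python
import re

# Range b-d, followed by a literal '-', followed by a literal 'z'.
cls = re.compile(r"[b-d-z]")

# Keep only the characters that the class accepts.
matched = [ch for ch in "abcde-yz" if cls.fullmatch(ch)]
print(matched)  # ['b', 'c', 'd', '-', 'z']
```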
7472        Perl treats a hyphen as a literal if it appears before or after a POSIX
7475        class, Perl outputs a warning in its warning  mode,  as  this  is  most
7479        It is not possible to have the literal character "]" as the end charac-
7480        ter  of a range. A pattern such as [W-]46] is interpreted as a class of
7481        two characters ("W" and "-") followed by a literal string "46]", so  it
7482        would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
7483        backslash it is interpreted as the end of range, so [W-\]46] is  inter-
7488        Ranges normally include all code points between the start and end char-
7489        acters, inclusive. They can also be used for code points specified  nu-
7490        merically,  for  example [\000-\037]. Ranges can include any characters
7491        that are valid for the current mode. In any  UTF  mode,  the  so-called
7494        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  option  disables this check). How-
7495        ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7499        points are both specified as literal letters in the same case. For com-
7500        patibility  with Perl, EBCDIC code points within the range that are not
7501        letters are omitted. For example, [h-k] matches only  four  characters,
7504        [\x88-\x92] or [h-\x92], all code points are included.
7507        it matches the letters in either case. For example, [W-c] is equivalent
7508        to  [][\\^_`wxyzabc],  matched  caselessly,  and  in a non-UTF mode, if
7509        character tables for a French locale are in  use,  [\xc8-\xcb]  matches
7523        special compatibility feature - see the next  two  sections),  and  the
7524        terminating  closing  square  bracket.  However, escaping other non-al-
7530        Perl supports the POSIX notation for character classes. This uses names
7541          ascii    character codes 0 - 127
7555        CR (13), and space (32). If locale-specific matching is  taking  place,
7559        The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
7560        from  Perl  5.8. Another Perl extension is negation, which is indicated
7565        matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7570        the POSIX character classes, although this may be different for charac-
7571        ters in the range 128-255 when locale-specific matching  is  happening.
7574        This  is  achieved  by  replacing  certain POSIX classes with other se-
7591                  when printed. In Unicode property terms, it matches all char-
7596                    U+2066 - U+2069  Various "isolate"s
7603        [:punct:] This matches all characters that have the Unicode P (punctua-
7622        support is not compatible with Perl. It is provided to help  migrations
7624        that \b matches at the start and the end of a word (see "Simple  asser-
7625        tions"  above),  and in a Perl-style pattern the preceding or following
7626        character normally shows which is wanted, without the need for the  as-
7627        sertions  that are used above in order to give exactly the POSIX behav-
7650        can be changed from within the pattern by a  sequence  of  letters  en-
7651        closed  between  "(?"   and ")". These options are Perl-compatible, and
7652        are described in detail in the pcre2api documentation. The option  let-
7662        For example, (?im) sets caseless, multiline matching. It is also possi-
7663        ble to unset these options by preceding the relevant letters with a hy-
7664        phen,  for  example (?-im). The two "extended" options are not indepen-
7667        A  combined  setting  and  unsetting  such  as  (?im-sx),  which   sets
7671        the option is unset. An empty options setting "(?)" is  allowed.  Need-
7675        the above options to be unset. Thus, (?^) is equivalent  to  (?-imnsx).
7676        Letters  may  follow  the circumflex to cause some options to be re-in-
7679        The PCRE2-specific options PCRE2_DUPNAMES  and  PCRE2_UNGREEDY  can  be
7680        changed  in  the  same  way as the Perl-compatible options by using the
7683        When one of these option changes occurs at top level (that is, not  in-
7692        not  used).   By this means, options can be made to have different set-
7693        tings in different parts of the pattern. Any changes made in one alter-
7705        start of a non-capturing group (see the next section), the option  let-
7713        Note:  There  are  other  PCRE2-specific options, applying to the whole
7714        pattern, which can be set by the application when the  compiling  func-
7720        are equivalent to setting the PCRE2_UTF and PCRE2_UCP options,  respec-
7735        matches "cataract", "caterpillar", or "cat". Without  the  parentheses,
7738        2.  It  creates a "capture group". This means that, when the whole pat-
7750        the captured substrings are "red king", "red", and "king", and are num-
7754        helpful.   There are often times when grouping is required without cap-
7766        start  of  a non-capturing group, the option letters may appear between
7774        the group is reached, an option setting in one branch does affect  sub-
7775        sequent branches, so the above patterns match "SUNDAY" as well as "Sat-
7781        Perl 5.10 introduced a feature whereby each alternative in a group uses
7783        with (?| and is itself a non-capturing  group.  For  example,  consider
7788        Because  the two alternatives are inside a (?| group, both sets of cap-
7792        not all, of one of a number of alternatives. Inside a (?| group, paren-
7795        whole group start after the highest number used in any branch. The fol-
7796        lowing example is taken from the Perl documentation. The numbers under-
7799          # before  ---------------branch-reset----------- after
7814        A relative reference such as (?-1) is no different: it is just a conve-
7817        If a condition test for a group's having matched refers to a non-unique
7830        was not added to Perl until release 5.10. Python had the  feature  ear-
7832        PCRE2 supports both the Perl and the Python syntax.
7835        (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
7838        must start with a non-digit. When PCRE2_UTF is set, the syntax of group
7842          ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
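The non-UTF rule above can be applied directly as a validator. This is a minimal sketch in Python, not part of PCRE2; the helper name is_valid_group_name is hypothetical. Python's fullmatch() stands in for the ^ and \z anchors, and any length limit that PCRE2 places on names is not checked.

```python
import re

# Same character rule as the non-UTF pattern: a letter or underscore first,
# then any number of letters, digits, or underscores.
_NAME = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")


def is_valid_group_name(name: str) -> bool:
    """Return True if name fits the non-UTF group-name syntax."""
    return _NAME.fullmatch(name) is not None


print(is_valid_group_name("year"))  # True
print(is_valid_group_name("1st"))   # False: starts with a digit
```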
7850        if  the  names were not present. In both PCRE2 and Perl, capture groups
7853        complete name-to-number translation table from a compiled  pattern,  as
7857        Warning: When more than one capture group has the same number,  as  de-
7859        all of them. Perl allows identically numbered groups to have  different
7865        Perl allows this, with both names AA and BB  as  aliases  of  group  1.
7870        number to be associated with more than one name. The example above pro-
7871        vokes a compile-time error. However, there is still  scope  for  confu-
7880        By  default, a name must be unique within a pattern, except that dupli-
7885        The duplicate name constraint can be disabled by setting the PCRE2_DUP-
7891        of a weekday, either as a 3-letter abbreviation or as  the  full  name,
7909        If  you make a backreference to a non-unique named group from elsewhere
7918        If you make a subroutine call to a non-unique named group, the one that
7926        true.  This is the same behaviour as testing by number. For further de-
7947        The general repetition quantifier specifies a minimum and maximum  num-
7968        the  syntax of a quantifier, is taken as a literal character. For exam-
7973        of which is represented by a two-byte sequence in a UTF-8 string. Simi-
7979        the previous item and the quantifier were not present. This may be use-
7980        ful  for  capture  groups that are referenced as subroutines from else-
7981        where in the pattern (but see also the section entitled "Defining  cap-
7986        For  convenience, the three most common quantifiers have single-charac-
7999        Earlier versions of Perl and PCRE1 used to give  an  error  at  compile
8004        does  not  prevent  backtracking into any of the iterations if a subse-
8008        possible (up to the maximum number of permitted times), without causing
8010        gives  problems is in trying to match comments in C programs. These ap-
8011        pear between /* and */ and within the comment, individual * and / char-
8012        acters  may appear. An attempt to match C comments by applying the pat-
8040        Perl),  the  quantifiers are not greedy by default, but individual ones
8045        that is greater than 1 or with a limited maximum, more  memory  is  re-
8046        quired for the compiled pattern, in proportion to the size of the mini-
8050        (equivalent  to  Perl's /s) is set, thus allowing the dot to match new-
8053        so there is no point in retrying the overall match at any position  af-
8057        In cases where it is known that the subject  string  contains  no  new-
8058        lines,  it  is worth setting PCRE2_DOTALL in order to obtain this opti-
8068        If  the subject is "xyz123abc123" the match point is the fourth charac-
8071        Another case where implicit anchoring is not applied is when the  lead-
8077        It matches "ab" in the subject "aab". The use of the backtracking  con-
8087        is "tweedledee". However, if there are nested capture groups, the  cor-
8100        to be re-evaluated to see if a different number of repeats  allows  the
8116        re-evaluated in this way.
8124        Perl 5.28 introduced an experimental alphabetic form starting  with  (*
8139        example  can be thought of as a maximizing repeat that must swallow ev-
8141        the  number  of digits they match in order to make the rest of the pat-
8146        group is just a single repeated item, as in the example above,  a  sim-
8147        pler  notation, called a "possessive quantifier" can be used. This con-
8158        Possessive quantifiers are always greedy; the setting of the  PCRE2_UN-
8159        GREEDY  option  is ignored. They are a convenient notation for the sim-
8165        The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
8169        way into Perl at release 5.10.
8174        when B must follow.  This feature can be disabled by the PCRE2_NO_AUTO-
8177        When a pattern contains an unlimited repeat inside a group that can it-
8184        matches an unlimited number of substrings that either consist  of  non-
8192        * repeat in a large number of ways, and all have to be tried. (The  ex-
8194        PCRE2 and Perl have an optimization that allows for fast failure when a
8202        sequences of non-digits cannot be broken, and failure happens quickly.
8215        words, the group that is referenced need not be to the left of the ref-
8223        subsection entitled "Non-printing characters" above for further details
8224        of  the  handling of digits following a backslash. Other forms of back-
8237        An unsigned number specifies an absolute reference without the  ambigu-
8242          (abc(def)ghi)\g{-1}
8244        The sequence \g{-1} is a reference to the most recently started capture
8245        group before \g, that is, it is equivalent to \2 in this example. Simi-
8246        larly, \g{-2} would be equivalent to \1. The use of relative references

8248        by  joining  together  fragments  that  contain references within them-
8252        of  forward  reference can be useful in patterns that repeat. Perl does
8264        time  of  the backreference, the case of letters is relevant. For exam-
8273        capture groups. The .NET syntax \k{name} and the Perl  syntax  \k<name>
8274        or  \k'name'  are  supported,  as  is the Python syntax (?P=name). Perl
8294        the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
8297        Because  there may be many capture groups in a pattern, all digits fol-
8298        lowing a backslash are taken as part of a potential backreference  num-
8308        However, such references can be useful inside repeated groups. For  ex-
8313        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
8314        ation of the group, the backreference matches the character string cor-
8317        the  backreference. This can be done using alternation, as in the exam-
8335        subject string, and those that look behind it, and in each case an  as-
8338        group is matched in the normal way, and if it is true, matching contin-
8339        ues after it, but with the matching position in the subject string  re-
8342        The  Perl-compatible  lookaround assertions are atomic. If an assertion
8343        is true, but there is a subsequent matching failure, there is no  back-
8344        tracking  into  the assertion. However, there are some cases where non-
8345        atomic assertions can be useful. PCRE2 has some support for these,  de-
8346        scribed in the section entitled "Non-atomic assertions" below, but they
8347        are not Perl-compatible.
8353        Assertion groups are not capture groups. If an assertion contains  cap-
8355        the capture groups in the whole pattern. Within each branch of  an  as-
8357        way. For example, a sequence such as (.)\g{-1} can  be  used  to  check
8364        retained  after a successful negative assertion. When an assertion con-
8367        For a positive assertion, internally captured substrings  in  the  suc-
8368        cessful  branch are retained, and matching continues with the next pat-
8377        Most assertion groups may be repeated; though it makes no sense to  as-
8378        sert the same thing several times, the side effect of capturing in pos-
8389        to specify lookaround assertions. Perl 5.28 introduced some  experimen-
8391        start with (* instead of (? and must be written using lower  case  let-
8399        For  example,  (*pla:foo) is the same assertion as (?=foo). In the fol-
8400        lowing sections, the various assertions are described using the  origi-
8410        matches a word followed by a semicolon, but does not include the  semi-
8426        most convenient way to do it is with (?!) because an empty  string  al-
8440        strings it matches must have a fixed length. However, if there are sev-
8441        eral  top-level  alternatives,  they  do  not all have to have the same
8452        This is an extension compared with Perl, which requires all branches to
8457        is  not  permitted,  because  its single top-level branch can match two
8459        two top-level branches:
8464        of a lookbehind assertion to get round the fixed-length restriction.
8468        then try to match. If there are insufficient characters before the cur-
8471        In  UTF-8  and  UTF-16 modes, PCRE2 does not allow the \C escape (which
8474        the lookbehind. The \X and \R escapes, which can match  different  num-
8478        lookbehinds, as long as the called capture group matches a fixed-length
8482        Perl does not support backreferences in lookbehinds. PCRE2 does support
8483        them,  but  only  if  certain  conditions  are met. The PCRE2_MATCH_UN-
8493        Possessive  quantifiers  can be used in conjunction with lookbehind as-
8494        sertions to specify efficient matching of fixed-length strings  at  the
8500        proceeds from left to right, PCRE2 will look for each "a" in  the  sub-
8515        quantifier; it can match only the entire string. The subsequent lookbe-
8530        three characters are not "999".  This pattern does not match "foo" pre-
8532        three of which are not "999". For example, it  doesn't  match  "123abc-
8554 NON-ATOMIC ASSERTIONS
8556        The  traditional Perl-compatible lookaround assertions are atomic. That
8557        is, if an assertion is true, but there is a subsequent  matching  fail-
8559        some cases where non-atomic positive assertions can  be  useful.  PCRE2
8565        Consider  the  problem  of finding the right-most word in a string that
8574        and sets the "x" option, which causes white space (introduced for read-
8578        words, when the assertion first succeeds, it  captures  the  right-most
8584        succeeds, we are done, but if the last word in the string does not  oc-
8586        lookahead (?= or (*pla: had been used, the assertion could not be re-en-
8590        Using a non-atomic lookahead, however, means that when  the  last  word
8592        find the second-last word, and so on, until either the match  succeeds,
8595        Two conditions must be met for a non-atomic assertion to be useful: the
8600        using a non-atomic assertion just wastes resources.
8602        There is one exception to backtracking into a non-atomic assertion.  If
8603        an  (*ACCEPT)  control verb is triggered, the assertion succeeds atomi-
8607        Non-atomic  assertions  are  not  supported by the alternative matching
8625        matches are not a script run. After a failure, normal backtracking  oc-
8626        curs.  Script runs can be used to detect spoofing attacks using charac-
8628        "paypal.com"  is an infamous example, where the letters could be a mix-
8629        ture of Latin and Cyrillic. This pattern ensures that the matched char-
8630        acters in a sequence of non-spaces that follow white space are a script
8646          \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
8659        Support  for  script runs is not available if PCRE2 is compiled without
8660        Unicode support. A compile-time error is given if any of the above con-
8662        matching function, pcre2_dfa_match() because they use the  same  mecha-
8677          (?(condition)yes-pattern)
8678          (?(condition)yes-pattern|no-pattern)
8680        If the condition is satisfied, the yes-pattern is used;  otherwise  the
8681        no-pattern  (if present) is used. An absent no-pattern is equivalent to
8682        an empty string (it always matches). If there are more than two  alter-
8683        natives  in the group, a compile-time error occurs. Each of the two al-
8684        ternatives may itself contain nested groups of any form, including con-
8692        There are five kinds of condition: references to capture groups, refer-
8693        ences to recursion, two pseudo-conditions called  DEFINE  and  VERSION,
8702        is true if any of them have matched. An alternative notation is to pre-
8703        cede the digits with a plus or minus sign. In this case, the group num-
8705        group can be referenced by (?(-1), the next most recent by (?(-2),  and
8708        (The  value  zero in any of these forms is not used; it provokes a com-
8709        pile-time error.)
8711        Consider the following pattern, which  contains  non-significant  white
8718        character is present, sets it as the first captured substring. The sec-
8722        opening  parenthesis,  the condition is true, and so the yes-pattern is
8723        executed and a closing parenthesis is required.  Otherwise,  since  no-
8725        words, this pattern matches a sequence of  non-parentheses,  optionally
8731          ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
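The optional-parenthesis condition can be tried out with Python's re module, which supports conditional groups that test a capture group. This is a sketch rather than PCRE2 itself: Python accepts only an absolute group number or a name in the condition, so it uses (?(1) instead of the relative (?(-1) form. The helper name matches_whole is hypothetical.

```python
import re

# re.VERBOSE allows the non-significant white space inside the pattern.
# Group 1 captures an optional "("; the condition (?(1) ...) demands a
# closing ")" only if group 1 took part in the match.
pattern = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)


def matches_whole(text: str) -> bool:
    """Return True if the entire text fits the pattern."""
    return pattern.fullmatch(text) is not None


print(matches_whole("(abc)"))  # True: both parentheses present
print(matches_whole("abc"))    # True: no parentheses at all
print(matches_whole("(abc"))   # False: opening parenthesis never closed
```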
8738        Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
8740        PCRE1,  which had this facility before Perl, the syntax (?(name)...) is
8742        the  letter  R followed by digits are ambiguous (see the following sec-
8753        "Recursion" in this sense refers to any subroutine-like call  from  one
8754        part  of  the  pattern to another, whether or not it is actually recur-
8759        the name R, the condition is true if matching is currently in a  recur-
8769        name, the condition tests for its being set, as described in  the  sec-
8771        group with the name R1 by adding (?<R1>)  to  the  above  pattern  com-
8792        be only one alternative in the rest of the conditional group. It is al-
8794        DEFINE  is that it can be used to define subroutines that can be refer-
8799          (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
8807        to  match the four dot-separated components of an IPv4 address, insist-
8812        Programs that link with a PCRE2 library can check the version by  call-
8814        that do not have access to the underlying code cannot do this.  A  spe-
8830        or lookbehind assertion. However, it must be a traditional  atomic  as-
8831        sertion, not one of the PCRE2-specific non-atomic assertions.
8833        Consider  this  pattern,  again containing non-significant white space,
8836          (?(?=[^a-z]*[a-z])
8837          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
8839        The condition is a positive lookahead assertion  that  matches  an  op-
8840        tional sequence of non-letters followed by a letter. In other words, it
8841        tests for the presence of at least one letter in the subject. If a let-
8844        strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
8847        When an assertion that is a condition contains capture groups, any cap-
8849        both positive and negative assertions, because matching always  contin-
8850        ues  after  the  assertion, whether it succeeds or fails. (Compare non-
8851        conditional assertions, for which captures are retained only for  posi-
8870        at the start of the pattern, as described in the section entitled "New-
8874        when PCRE2_EXTENDED is set, and the default newline convention (a  sin-
8888        unlimited  nested  parentheses.  Without the use of recursion, the best
8893        For some time, Perl has provided a facility that allows regular expres-
8895        Perl code in the expression at run time, and the code can refer to  the
8896        expression itself. A Perl pattern using code interpolation to solve the
8901        The (?p{...}) item interpolates Perl code at run time, and in this case
8904        Obviously,  PCRE2  cannot  support  the interpolation of Perl code. In-
8908        into Perl at release 5.10.
8913        group. (If not, it is a non-recursive subroutine  call,  which  is  de-
8914        scribed in the next section.) The special item (?R) or (?0) is a recur-
8923        substrings which can either be a sequence of non-parentheses, or a  re-
8926        possessive  quantifier  to  avoid  backtracking  into sequences of non-
8939        of (?1) in the pattern above you can write (?-2) to refer to the second
8948          (?|(a)|(b)) (c) (?-2)
8951        (c) is number 2. When the reference (?-2) is  encountered,  the  second
8954        the  same if an absolute reference (?1) was used. In other words, rela-
8960        are  always  non-recursive  subroutine  calls, as described in the next
8963        An alternative approach is to use named parentheses.  The  Perl  syntax
8964        for  this  is  (?&name);  PCRE1's earlier syntax (?P>name) is also sup-
8972        The example pattern that we have been looking at contains nested unlim-
8974        strings  of  non-parentheses  is important when applying the pattern to
8986        callout function can be used (see below and the pcre2callout documenta-
8998        recursion.   Consider  this pattern, which matches text in angle brack-
9000        brackets  (that is, when recursing), whereas any characters are permit-
9006        different  alternatives  for the recursive and non-recursive cases. The
9009    Differences in recursion processing between PCRE2 and Perl
9011        Some former differences between PCRE2 and Perl no longer exist.
9013        Before release 10.30, recursion processing in PCRE2 differed from  Perl
9016        never  re-entered,  even if it contained untried alternatives and there
9018        recursion before Perl did.)
9021        treated as atomic. That is, they can be re-entered to try unused alter-
9023        now compatible with the way Perl works. If you want a  subroutine  call
9026        Supporting backtracking into recursions simplifies certain types of re-
9035        match fails. If you want to match typical palindromic phrases, the pat-
9036        tern  has  to  ignore  all  non-word characters, which can be done like
9042        such  as "A man, a plan, a canal: Panama!". Note the use of the posses-
9043        sive quantifier *+ to avoid backtracking  into  sequences  of  non-word
9044        characters. Without this, PCRE2 takes a great deal longer (ten times or
9045        more) to match typical phrases, and Perl takes so long that  you  think
9048        Another  way  in which PCRE2 and Perl used to differ in their recursion
9049        processing is in the handling of captured  values.  Formerly  in  Perl,
9061        to fail in Perl, but in later versions (I tried 5.024) it now works.
9070        to match at the current matching position. The called group may be  de-
9071        fined  before or after the reference. A numbered reference can be abso-
9075          (...(relative)...)...(?-1)...
9096        Processing options such as case-independence are fixed when a group  is
9100          (abc)(?i:(?-1))
9112        For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
9114        an alternative syntax for calling a group as a subroutine, possibly re-
9124          (abc)(?i:\g<-1>)
9126        Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
9133        Perl has a feature whereby using the sequence (?{...}) causes arbitrary
9134        Perl  code to be obeyed in the middle of matching a regular expression.
9135        This makes it possible, amongst other things, to extract different sub-
9136        strings that match the same pair of parentheses when there is a repeti-
9139        PCRE2 provides a similar feature, but of course it  cannot  obey  arbi-
9140        trary  Perl  code. The feature is called "callout". The caller of PCRE2
9144        passed, or if the callout entry point is set to NULL, callouts are dis-
9154        in a similar way to Perl.
9156        During matching, when PCRE2 reaches a callout point, the external func-
9163        time, and one side-effect is that sometimes callouts  are  skipped.  If
9179        They are all numbered 255. If there is a conditional group in the  pat-
9191        A delimited string may be used instead of a number as a  callout  argu-
9193        ending delimiter is the same as the start, except for {, where the end-
9206        Perl's terminology) that modify the behaviour  of  backtracking  during
9212        By default, for compatibility with Perl, a  name  is  any  sequence  of
9216        PCRE2_ALT_VERBNAMES option, but the result is no  longer  Perl-compati-
9222        and sequences such as \x{100} that define character code points.  Char-
9228        names is skipped, and #-comments are recognized, exactly as in the rest
9232        The maximum length of a name is 255 in the 8-bit library and  65535  in
9233        the  16-bit and 32-bit libraries. If the name is empty, that is, if the
9235        the colon were not there. Any number of these verbs may occur in a pat-
9239        them  can be used only when the pattern is to be matched using the tra-
9256        course, be processed. You can suppress the start-of-match optimizations
9257        by  setting  the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
9262        Experiments with Perl suggest that it too  has  similar  optimizations,
9274        then continues at the outer level. If (*ACCEPT) is triggered in a posi-
9278        If (*ACCEPT) is inside capturing parentheses, the data so far  is  cap-
9283        This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
9286        (*ACCEPT) is the only backtracking verb that is allowed to  be  quanti-
9294        is triggered and the match succeeds. In both cases, all but C  is  cap-
9295        tured.  Whereas  (*COMMIT) (see below) means "fail on backtrack", a re-
9298        Warning: (*ACCEPT) should not be used within a script  run  group,  be-
9306        read. The Perl documentation notes that it is probably useful only when
9307        combined with (?{}) or (??{}). Those are, of course, Perl features that
9308        are not present in PCRE2. The nearest equivalent is  the  callout  fea-
9316        (*ACCEPT:NAME) and (*FAIL:NAME) behave the  same  as  (*MARK:NAME)(*AC-
9322        There is one verb whose main purpose is to track how a  match  was  ar-
9323        rived  at,  though  it also has a secondary use in conjunction with ad-
9328        A name is always required with this verb. For all the other  backtrack-
9331        When  a  match  succeeds, the name of the last-encountered mark name on
9332        the matching path is passed back to the caller as described in the sec-
9333        tion entitled "Other information about the match" in the pcre2api docu-
9340        back. A verb without a NAME argument is ignored for this purpose.  Here
9352        The (*MARK) name is tagged with "MK:" in this output, and in this exam-
9354        efficient way of obtaining this information than putting each  alterna-
9358        true, the name is recorded and passed back if it  is  the  last-encoun-
9380        The following verbs do nothing when they are encountered. Matching con-
9382        causing a backtrack to the verb, a failure is forced.  That  is,  back-
9386        group has been matched, there is never any backtracking into it.  Back-
9390        These verbs differ in exactly what kind of failure  occurs  when  back-
9392        when the verb is not in a subroutine or an assertion.  Subsequent  sec-
9398        matching failure that causes backtracking to reach it. Even if the pat-
9401        verb that is encountered, once it has been passed pcre2_match() is com-
9410        The behaviour of (*COMMIT:NAME) is not the same  as  (*MARK:NAME)(*COM-
9411        MIT).  It is like (*MARK:NAME) in that the name is remembered for pass-
9413        that are set with (*MARK), ignoring those set by any of the other back-
9421        Note that (*COMMIT) at the start of a pattern is not the same as an an-
9422        chor, unless PCRE2's start-of-match optimizations are  turned  off,  as
9438        (*COMMIT) causes the match to fail without trying  any  other  starting
9444        the subject if there is a later matching failure that causes backtrack-
9450        (*PRUNE) is just an alternative to an atomic group or possessive  quan-
9462        This verb, when given without a name, is like (*PRUNE), except that  if
9464        character, but to the position in the subject where (*SKIP) was encoun-
9473        skips on to start the next attempt at "c". Note that a possessive quan-
9475        suppress  backtracking  during  the first match attempt, the second at-
9489        found,  the  "bumpalong" advance is to the subject position that corre-
9495        atomic groups or assertions, because they are never re-entered by back-
9513        backtracks, and this causes a new matching attempt to start at the sec-
9523        This  verb  causes  a skip to the next innermost alternative when back-
9526        that it can be used for a pattern-based if-then-else block:
9532        skips to the second alternative and tries COND2,  without  backtracking
9533        into  COND1.  If that succeeds and BAR fails, COND3 is tried. If subse-
9534        quently BAZ fails, there are no more alternatives, so there is a  back-
9535        track  to  whatever came before the entire group. If (*THEN) is not in-
9543        A  group  that does not contain a | character is just a part of the en-
9544        closing alternative; it is not a nested alternation with only  one  al-
9545        ternative. The effect of (*THEN) extends beyond such a group to the en-
9546        closing alternative.  Consider this pattern, where A, B, etc. are  com-
9559        The effect of (*THEN) is now confined to the inner group. After a fail-
9564        Note that a conditional group is not considered as having two  alterna-
9571        If the subject is "ba", this pattern does not match. Because .*? is un-
9591        that  is  backtracked  onto first acts. For example, consider this pat-
9599        is  consistent,  but is not always the same as Perl's. It means that if
9611        PCRE2 sometimes differs from Perl in its handling of backtracking verbs
9616        If the subject is "abac", Perl matches  unless  its  optimizations  are
9628        succeed without any further processing; captured  strings  and  a  mark
9629        name  (if  set) are retained. In a standalone negative assertion, (*AC-
9630        CEPT) causes the assertion to fail without any further processing; cap-
9638        reach them. This means that, for the Perl-compatible assertions,  their
9639        effect is confined to the assertion, because Perl lookaround assertions
9645        PCRE2  now supports non-atomic positive assertions, as described in the
9646        section entitled "Non-atomic assertions" above. These  assertions  must
9647        be  standalone  (not used as conditions). They are not Perl-compatible.
9648        For these assertions, a later backtrack does jump back into the  asser-
9649        tion,  and  therefore verbs such as (*COMMIT) can be triggered by back-
9657        in  a  standalone  positive assertion. In a conditional positive asser-
9659        or  (*PRUNE) causes the condition to be false. However, for both stand-
9661        (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9669        to  succeed without any further processing. Matching then continues af-
9670        ter the subroutine call. Perl documents this behaviour.  Perl's  treat-
9677        when  triggered  by being backtracked to in a group called as a subrou-
9702        Copyright (c) 1997-2022 University of Cambridge.
9703 ------------------------------------------------------------------------------
9711        PCRE2 - Perl-compatible regular expressions (revised API)
9715        Two  aspects  of performance are discussed below: memory usage and pro-
9740        is  not  usually a problem. However, if the numbers are large, and par-
9746        uses  over  50KiB  when compiled using the 8-bit library. When PCRE2 is
9748        limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
9749        libraries, and this is reached with the above pattern if the outer rep-
9755        of PCRE2's "subroutine" facility. Re-writing the above pattern as
9761        this kind of pattern is not always exactly equivalent, because any cap-
9764        process  patterns that PCRE2 cannot otherwise handle. The matching per-
9766        same.  (This applies from release 10.30 - things were different in ear-
9772        From release 10.30, the interpretive (non-JIT) version of pcre2_match()
9773        uses  very  little system stack at run time. In earlier releases recur-
9775        cause  problems, but this usage has been eliminated. Backtracking posi-
9781        On a 64-bit system the frame size for a pattern with no captures is 128
9785        the  system  stack,  but this still caused some issues for multi-thread
9790        block and re-used if that block is used for another match. It is  freed
9802        function calls, but only for processing atomic groups,  lookaround  as-
9807        has  been  re-factored  to  use heap memory when necessary for internal
9818        Certain  items  in regular expression patterns are processed more effi-
9820        [aeiou]   than   a   set   of  single-character  alternatives  such  as
9824        expressions for efficient performance. This document contains a few ob-
9828        slow,  because  PCRE2 has to use a multi-stage table lookup whenever it
9838        pcre2_match();  the  performance loss is less with a DFA matching func-
9841        When a pattern begins with .* not in atomic parentheses, nor in  paren-
9845        multiple top-level branches, they must all be anchorable. The optimiza-
9846        tion  can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au-
9849        If PCRE2_DOTALL is not set, PCRE2 cannot make  this  optimization,  be-
9851        subject string contains newlines, the pattern may match from the  char-
9862        If  you  are using such a pattern with subject strings that do not con-
9864        PCRE2_DOTALL,  or starting the pattern with ^.* or ^.*? to indicate ex-
9865        plicit anchoring. That saves PCRE2 from having to scan along  the  sub-
9879        in principle to try every possible variation, and this can take an  ex-
9887        matching procedure, PCRE2 checks that there is a "b" later in the  sub-
9888        ject  string, and if there is not, it fails the match immediately. How-
9899        an atomic group or a possessive quantifier. This can often reduce  mem-
9910        matched character. For a long string, a lot of memory is required. Con-
9916        This runs much faster, because sequences of characters that do not con-
9917        tain "<" are "swallowed" in one item inside the parentheses, and a pos-
9919        non-"<" characters. This version also uses a lot  less  memory  because
9934        pcre2_match()  or pcre2_dfa_match() is called. For details of these in-
9954        Copyright (c) 1997-2022 University of Cambridge.
9955 ------------------------------------------------------------------------------
9963        PCRE2 - Perl-compatible regular expressions (revised API)
9983        This  set of functions provides a POSIX-style API for the PCRE2 regular
9984        expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
9985        16-bit  and  32-bit libraries. See the pcre2api documentation for a de-
9986        scription of PCRE2's native API, which contains much  additional  func-
9991        header  file, and they all have unique names starting with pcre2_. How-
9992        ever, the pcre2posix.h header also contains macro definitions that con-
9994        This means that a program can use the usual POSIX names without running
9998        On Unix-like systems the PCRE2 POSIX library is called  libpcre2-posix,
9999        so  can  be accessed by adding -lpcre2-posix to the command for linking
10001        also necessary to add -lpcre2-8.
10005        regcomp()  etc.  These simply passed their arguments to the PCRE2 func-
10018        names start with "REG_"; these are used for setting options and identi-
10033        PCRE2-specific features via the POSIX calling interface or to  add  BSD
10037        POSIX-like in style. The syntax and semantics of  the  regular  expres-
10038        sions  themselves  are  still  those of Perl, subject to the setting of
10039        various PCRE2 options, as described below. "POSIX-like in style"  means
10041        POSIX-compatible, and in multi-unit encoding  domains  it  is  probably
10045        described above, the standard POSIX names (without the  pcre2_  prefix)
10051        The function pcre2_regcomp() is called to compile a pattern into an in-
10052        ternal form. By default, the pattern is a C string terminated by a  bi-
10076        the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
10082        for compilation to the native function. This disables all meta  charac-
10091        pcre2_regexec()  for  matching, the nmatch and pmatch arguments are ig-
10092        nored, and no captured strings are returned. Versions of the  PCRE2 li-
10093        brary  prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile op-
10094        tion, but this no longer happens because it disables the use  of  back-
10101        the  end of the pattern before calling pcre2_regcomp(). The pattern it-
10102        self may now contain binary zeros, which are treated  as  data  charac-
10103        ters.  Without  REG_PEND,  a binary zero terminates the pattern and the
10125        all  data  strings used for matching it to be treated as UTF-8 strings.
10129        function.  This means that the regex is compiled with PCRE2 default se-
10131        subject  string  is  the Perl way, not the POSIX way. Note that setting
10133        It  does not affect the way newlines are matched by the dot metacharac-
10136        The yield of pcre2_regcomp() is zero on success,  and  non-zero  other-
10139        number  of capturing subpatterns in the regular expression. Various er-
10142        NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
10150        This area is not simple, because POSIX and Perl take different views of
10154        Perl and PCRE2:
10164        This is the equivalent table for a POSIX-compatible pattern matcher:
10175        API. By default, PCRE2's behaviour is the same as Perl's,  except  that
10176        there  is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
10177        and Perl, there is no way to stop newline from matching [^a].
10181        there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac-
10197        The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
10204        standard. However, setting this option can give more POSIX-like  behav-
10209        The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
10216        point to the first character beyond the string. There may be binary ze-
10223        relative  to  string + pmatch[0].rm_so, but this differs from other im-
10228        intended to be portable to other systems. Note that  a  non-zero  rm_so
10236        pcre2_regexec() are ignored (except possibly  as  input  for  REG_STAR-
10239        The  value of nmatch may be zero, and the value pmatch may be NULL (un-
10251        Unused entries in the array have both structure members set to -1.
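The convention on line 10251, where an unset entry carries -1 in both members, has a close counterpart in Python's standard `re` module, whose Match.span() returns (-1, -1) for a group that did not take part in the match. The sketch below uses that as an analogy only. It does not call pcre2_regexec(), and the helper name pmatch_like is hypothetical.

```python
import re


def pmatch_like(match):
    """Return (start, end) for group 0 and every capture group.

    A group that did not participate in the match is reported as (-1, -1),
    loosely mirroring rm_so and rm_eo both being -1 in an unused entry.
    """
    return [match.span(i) for i in range(match.re.groups + 1)]


# Only the second alternative matches, so group 1 stays unset.
m = re.match(r"(a)|(b)", "b")
spans = pmatch_like(m)
```

After this runs, spans is [(0, 1), (-1, -1), (0, 1)]: the overall match, the unset first group, and the second group.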
10253        A  successful  match  yields a zero return; various error codes are de-
10254        fined in the header file, of which REG_NOMATCH is the "expected"  fail-
10260        The  pcre2_regerror()  function  maps  a non-zero errorcode from either
10263        A message terminated by a binary zero is placed in errbuf. If the  buf-
10264        fer  is too short, only the first errbuf_size - 1 characters of the er-
10272        Compiling a regular expression causes memory to be allocated and  asso-
10274        such memory, after which preg may no longer be used as a  compiled  ex-
10288        Copyright (c) 1997-2021 University of Cambridge.
10289 ------------------------------------------------------------------------------
10297        PCRE2 - Perl-compatible regular expressions (revised API)
10305        can save this listing to re-create the contents of pcre2demo.c.
10310        used. If matching succeeds, the program outputs the portion of the sub-
10311        ject  that  matched,  together  with  the contents of any captured sub-
10314        If the -g option is given on the command line, the program then goes on
10316        subject string. The logic is a little bit tricky because of the  possi-
10320        The code in pcre2demo.c is an 8-bit program that uses the  PCRE2  8-bit
10321        library.  It  handles  strings  and characters that are stored in 8-bit
10324        treated as UTF-8 strings, where characters  may  occupy  multiple  code
10328        for your operating system, you should be able to compile the demonstra-
10331          cc -o pcre2demo pcre2demo.c -lpcre2-8
10334        to the command line. For example, on a Unix-like system that has  PCRE2
10335        installed  in /usr/local, you can compile the demonstration program us-
10338          cc -o pcre2demo -I/usr/local/include pcre2demo.c \
10339             -L/usr/local/lib -lpcre2-8
10345          ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
10348        pcre2test, which supports many more facilities for testing regular  ex-
10349        pressions  using  all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
10350        though not all three need be installed). The pcre2demo program is  pro-
10357          ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
10360        This is caused by the way shared library support works  on  those  sys-
10363          -R/usr/local/lib
10378        Copyright (c) 1997-2016 University of Cambridge.
10379 ------------------------------------------------------------------------------
10385        PCRE2 - Perl-compatible regular expressions (revised API)
10387 SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
10404        run. However, if you are using the just-in-time  optimization  feature,
10405        it is not possible to save and reload the JIT data, because it is posi-
10406        tion-dependent. The host on which the patterns  are  reloaded  must  be
10409        For  example, patterns compiled on a 32-bit system using PCRE2's 16-bit
10410        library cannot be reloaded on a 64-bit system, nor can they be reloaded
10411        using the 8-bit library.
10418        linked with a fixed version of PCRE2 must be prepared to recompile pat-
10428        checking, not complete validation of what is being re-loaded. Corrupted
10440        in the byte stream (its size is 1088 bytes). For more details of  char-
10441        acter  tables,  see the section on locale support in the pcre2api docu-
10447        the length of the vector. The third and fourth arguments point to vari-
10461        PCRE2_ERROR_BADMAGIC  means  either that a pattern's code has been cor-
10462        rupted, or that a slot in the vector does not point to a compiled  pat-
10485        the 256 possible byte values. On systems that make  a  distinction  be-
10486        tween  binary  and non-binary data, be sure that the file is opened for
10491        freed in the usual way by calling pcre2_code_free(). When you have fin-
10492        ished with the byte stream, it too must be freed by calling pcre2_seri-
10493        alize_free(). If this function is called with a NULL argument,  it  re-
10494        turns immediately without doing anything.
10497 RE-USING PRECOMPILED PATTERNS
10499        In  order to re-use a set of saved patterns you must first make the se-
10501        from a file). The management of this memory block is up to the applica-
10503        find  out how many compiled patterns are in the serialized data without
10512        and its length, and the third argument points to a byte stream. The fi-
10515        If this argument is NULL, malloc() and free() are used. After deserial-
10524        stream, it is filled with those that fit, and  the  remainder  are  ig-
10540        potential  race  issue if you are using multiple patterns that were de-
10541        coded from a single byte stream in a multithreaded application. A  sin-
10543        and a reference count is used to arrange for its memory to be automati-
10550        If  a pattern was processed by pcre2_jit_compile() before being serial-
10566        Copyright (c) 1997-2018 University of Cambridge.
10567 ------------------------------------------------------------------------------
10575        PCRE2 - Perl-compatible regular expressions (revised API)
10579        The  full syntax and semantics of the regular expressions that are sup-
10581        document contains a quick-reference summary of the syntax.
10586          \x         where x is non-alphanumeric is a literal x
10596          \cx        "control-x", where x is any ASCII printing character
10617        read, but in ALT_BSUX mode \x must be followed by two hexadecimal  dig-
10623        Note that \0dd is always an octal code. The treatment of backslash fol-
10624        lowed  by  a non-zero digit is complicated; for details see the section
10625        "Non-printing characters" in the pcre2pattern documentation, where  de-
10643          \P{xx}     a character without the xx property
10650          \W         a "non-word" character
10654        middle of a UTF-8 or UTF-16 character. The application can lock out the
10658        By  default,  \d, \s, and \w match only ASCII characters, even in UTF-8
10659        mode or in the 16-bit and 32-bit libraries. However, if locale-specific
10661        points in the range 128-255. If the PCRE2_UCP option is set, the behav-
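The difference between the ASCII-only default and property-based matching can be seen by analogy in Python's standard `re` module. This is a sketch under an assumption, not a PCRE2 call: the re.ASCII flag is taken as roughly equivalent to PCRE2's default behaviour, and Python's default Unicode matching as roughly equivalent to setting PCRE2_UCP. All variable names are hypothetical.

```python
import re

# Roughly PCRE2's default: \d and \w match ASCII characters only.
ascii_digit = re.compile(r"\d", re.ASCII)
ascii_word = re.compile(r"\w", re.ASCII)

# Roughly PCRE2 with PCRE2_UCP set: Unicode properties decide what matches.
unicode_digit = re.compile(r"\d")
unicode_word = re.compile(r"\w")

ARABIC_INDIC_THREE = "\u0663"  # a decimal digit (category Nd) outside ASCII
E_ACUTE = "\u00e9"             # a letter whose code point is in 128-255
```

Under these assumptions ascii_digit rejects ARABIC_INDIC_THREE and unicode_digit accepts it. The same holds for E_ACUTE with the two \w patterns.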
10665        Property descriptions in \p and \P are matched caselessly; hyphens, un-
10691          Mn         Non-spacing mark
10723          Xsp        Perl space: property Z or tab, NL, VT, FF, CR
10724          Xuc        Universally-named character: one that can be
10726          Xwd        Perl word: property Xan or underscore
10728        Perl and POSIX space are now the same. Perl added VT to its space char-
10739          pcre2test -LP
10744        Many script names and their 4-letter abbreviations  are  recognized  in
10746        of course). You can obtain a list of these scripts by running this com-
10749          pcre2test -LS
10768          L           left-to-right
10769          LRE         left-to-right embedding
10770          LRI         left-to-right isolate
10771          LRO         left-to-right override
10772          NSM         non-spacing mark
10776          R           right-to-left
10777          RLE         right-to-left embedding
10778          RLI         right-to-left isolate
10779          RLO         right-to-left override
10788          [x-y]       range (can be used for hex characters)
10794          ascii       0-127
10853        From  release 10.38 \K is not permitted by default in lookaround asser-
10854        tions, for compatibility with Perl.  However,  if  the  PCRE2_EXTRA_AL-
10855        LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled.
10856        When this option is set, \K is honoured in positive assertions, but ig-
10868          (?<name>...)    named capture group (Perl)
10869          (?'name'...)    named capture group (Perl)
10871          (?:...)         non-capture group
10872          (?|...)         non-capture group; reset group numbers for
10875        In  non-UTF  modes, names may contain underscores and ASCII letters and
10882          (?>...)         atomic non-capture group
10883          (*atomic:...)   atomic non-capture group
10903          (?-...)         unset option(s)
10907        a mixture of setting and unsetting such as (?i-x) is allowed, but there
10909        for example (?^in). An option setting may appear at the start of a non-
10912        The following are recognized only at the very start of a pattern or af-
10921          (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
10924          (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
10938        These are recognized only at the very start of the pattern or after op-
10951        These are recognized only at the very start of the pattern or after op-
10976        Each top-level branch of a lookbehind must be of a fixed length.
10979 NON-ATOMIC LOOKAROUND ASSERTIONS
10981        These assertions are specific to PCRE2 and are not Perl-compatible.
11007          \g-n            relative reference by number
11009          \g{-n}          relative reference by number
11010          \k<name>        reference by name (Perl)
11011          \k'name'        reference by name (Perl)
11012          \g{name}        reference by name (Perl)
11022          (?-n)           call subroutine by relative number
11023          (?&name)        call subroutine by name (Perl)
11031          \g<-n>          call subroutine by relative number (PCRE2 extension)
11032          \g'-n'          call subroutine by relative number (PCRE2 extension)
11037          (?(condition)yes-pattern)
11038          (?(condition)yes-pattern|no-pattern)
11042          (?(-n)              relative reference condition
11043          (?(<name>)          named reference condition (Perl)
11044          (?('name')          named reference condition (Perl)
11070        The  following  act only when a subsequent match failure causes a back-
11072        what happens afterwards. Those that advance the start-of-match point do
11114        Copyright (c) 1997-2022 University of Cambridge.
11115 ------------------------------------------------------------------------------
11123        PCRE2 - Perl-compatible regular expressions (revised API)
11128        it, you can build it  without,  in  which  case  the  library  will  be
11130        properties and can process strings of text in UTF-8, UTF-16, and UTF-32
11135        There  are two ways of telling PCRE2 to switch to UTF mode, where char-
11148        one-code-unit characters. There are also some other changes to the  way
11155        \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
11157        that Perl supports. Currently they are limited to the general  category
11158        properties such as Lu for an upper case letter or Nd for a decimal num-
11162        general,  only the short names for properties are supported.  For exam-
11164        supported. Furthermore, in Perl, many properties may optionally be pre-
11165        fixed by "Is", for compatibility with Perl 5.6. PCRE2 does not  support
11178        allowed in non-UTF mode.
11191        multi-unit  characters  (see  the description of \C in the pcre2pattern
11192        documentation). For this reason, there is a build-time option that dis-
11193        ables  support  for  \C completely. There is also a less draconian com-
11194        pile-time option for locking out the use of \C when a pattern  is  com-
11198        pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
11200        modes provokes a match-time error. Also, the JIT optimization does  not
11201        support \C in these modes. If JIT optimization is requested for a UTF-8
11202        or UTF-16 pattern that contains \C, it will not succeed,  and  so  when
11203        pcre2_match() is called, the matching will be carried out by the inter-
11209        set  as  in  non-UTF mode, all with code points less than 256. This re-
11214        you can use explicit Unicode property tests such  as  \p{Nd}.  Alterna-
11215        tively, if you set the PCRE2_UCP option, the way that the character es-
11221        all low-valued characters, unless the PCRE2_UCP option is set.
11223        However,  the  special horizontal and vertical white space matching es-
11224        capes (\h, \H, \v, and \V) do match all the appropriate Unicode charac-
11228 UNICODE CASE-EQUIVALENCE
11232        are less than 128 and that have at most two case-equivalent values. For
11233        these, a direct table lookup is used for speed. A few  Unicode  charac-
11234        ters  such as Greek sigma have more than two code points that are case-
11235        equivalent, and these are treated specially. Setting PCRE2_UCP  without
11236        PCRE2_UTF  allows  Unicode-style  case processing for non-UTF character
11237        encodings such as UCS-2.
11245        sequence  of characters that are all from the same Unicode script. How-
11250        Every Unicode character has a Script property, mostly with a value cor-
11255        for  the surrogate code points. In the PCRE2 32-bit library, characters
11257        which  are  accessible  only  in non-UTF mode, are assigned the Unknown
11261        include  punctuation,  emoji,  mathematical, musical, and currency sym-
11264        "Inherited" is used for characters such as diacritical marks that  mod-
11270        U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop-
11272        called Script Extension exists. Its value is a list of scripts that ap-
11276        also some Common characters that have a single,  non-Common  script  in
11282        constraint for decimal digits. These are  covered  in  subsequent  sec-
11289        run.  Longer strings are checked using only the Script Extensions prop-
11292        If a character's Script Extension property is the single value  "Inher-
11296        at least one script in common in their Script Extension lists. In  set-
11312        The  first  has the Script Extension list Arabic, Hanifi Rohingya, Syr-
11314        of  them  could  appear  in  script runs of either Arabic or Hanifi Ro-
11322        Katakana  scripts  together  with Han; Korean uses Hangul and Han; Tai-
11325        "virtual scripts". Thus, a script run may contain a  mixture  of  Hira-
11328        Bopomofo  and  Han. PCRE2 (like Perl) follows Unicode's Technical Stan-
11329        dard  39   ("Unicode   Security   Mechanisms",   http://unicode.org/re-
11337        from the common ASCII digits. In addition to the  script  checking  de-
11347        returned.  The  code  unit offset to the offending character can be ex-
11352        and therefore want to skip these checks in  order  to  improve  perfor-
11354        scanned repeatedly.  If you set the PCRE2_NO_UTF_CHECK option  at  com-
11371        UTF-16 and UTF-32 strings can indicate their endianness by special code
11372        known as a byte-order mark (BOM). The PCRE2  functions  do  not  handle
11377        pcre2_dfa_match()  calls  with a non-zero starting offset, the check is
11385        that the sequences \b and \B are one-character lookbehinds.
11389        the  surrogate  area. The so-called "non-character" code points are not
11394        UTF-16, where they are used in pairs to encode code points with  values
11395        greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
11396        are available independently in the  UTF-8  and  UTF-32  encodings.  (In
11397        other  words, the whole surrogate thing is a fudge for UTF-16 which un-
11398        fortunately messes up UTF-8 and UTF-32.)
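
The surrogate mechanism described above can be sketched in a few lines of C. This is an illustration only and not part of the PCRE2 API: the helper names encode_surrogate_pair() and decode_surrogate_pair() are hypothetical. It shows how a code point greater than 0xFFFF is split into a high and a low surrogate, which is why the range 0xd800 to 0xdfff cannot stand for characters of its own.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): split a code point in the
   range 0x10000..0x10FFFF into a UTF-16 surrogate pair. */
static void encode_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp > 0xFFFF && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;                 /* leaves a 20-bit value */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}

/* Hypothetical helper: recombine a high/low surrogate pair. */
static uint32_t decode_surrogate_pair(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10) |
                      (uint32_t)(low - 0xDC00));
}
```

For example, U+1F600 becomes the pair 0xD83D, 0xDE00 under this scheme.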
11403        such  as  \x{d800}  (a  surrogate code point) you can set the PCRE2_EX-
11405        only  in  UTF-8  and  UTF-32 modes, because these values are not repre-
11406        sentable in UTF-16.
11408    Errors in UTF-8 strings
11410        The following negative error codes are given for invalid UTF-8 strings:
11418        The string ends with a truncated UTF-8 character;  the  code  specifies
11419        how  many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
11420        characters to be no longer than 4 bytes, the  encoding  scheme  (origi-
11442        A 4-byte character has a value greater than 0x10ffff; these code points
11447        A 3-byte character has a value in the  range  0xd800  to  0xdfff;  this
11448        range  of code points is reserved by RFC 3629 for use with UTF-16, and
11449        so is excluded from UTF-8.
11457        A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it  codes
11459        For example, the two bytes 0xc0, 0xae give the value 0x2e,  whose  cor-
11465        binary value 0b10 (that is, the most significant bit is 1 and the  sec-
11466        ond  is  0). Such a byte can only validly occur as the second or subse-
11467        quent byte of a multi-byte character.
11472        can never occur in a valid UTF-8 string.
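
The kinds of UTF-8 error listed above can be illustrated with a small stand-alone checker. This is a minimal sketch, not PCRE2's validation code: the function name check_utf8_char() and the enum values are hypothetical and do not correspond to PCRE2's error numbers. As a simplification it accepts only the 1- to 4-byte forms allowed by RFC 3629 and reports every other lead byte as invalid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; not PCRE2 error numbers. */
enum utf8_status {
    UTF8_OK, UTF8_TRUNCATED, UTF8_BAD_CONTINUATION, UTF8_OVERLONG,
    UTF8_SURROGATE, UTF8_TOO_LARGE, UTF8_ISOLATED_CONTINUATION,
    UTF8_INVALID_LEAD
};

/* Check the single UTF-8 character that starts at s (len bytes available). */
static enum utf8_status check_utf8_char(const unsigned char *s, size_t len)
{
    /* Smallest code point that legitimately needs 2, 3, or 4 bytes. */
    static const uint32_t min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need, i;
    uint32_t cp;

    assert(s != NULL);
    if (len == 0) return UTF8_TRUNCATED;
    if (s[0] < 0x80) return UTF8_OK;                      /* ASCII */
    if ((s[0] & 0xC0) == 0x80) return UTF8_ISOLATED_CONTINUATION;
    if ((s[0] & 0xE0) == 0xC0)      { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else return UTF8_INVALID_LEAD;  /* 5- and 6-byte leads, 0xfe, 0xff */

    if (len < need) return UTF8_TRUNCATED;
    for (i = 1; i < need; i++) {
        /* Each following byte must have the top two bits 0b10. */
        if ((s[i] & 0xC0) != 0x80) return UTF8_BAD_CONTINUATION;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min_value[need]) return UTF8_OVERLONG;
    if (cp >= 0xD800 && cp <= 0xDFFF) return UTF8_SURROGATE;
    if (cp > 0x10FFFF) return UTF8_TOO_LARGE;
    return UTF8_OK;
}
```

With this sketch, the two bytes 0xc0, 0xae are reported as overlong because they decode to 0x2e, which fits in a single byte.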
11474    Errors in UTF-16 strings
11476        The  following  negative  error  codes  are  given  for  invalid UTF-16
11484    Errors in UTF-32 strings
11486        The following  negative  error  codes  are  given  for  invalid  UTF-32
11496        UTF sequences if you  call  pcre2_compile()  with  the  PCRE2_MATCH_IN-
11504        generate different code. If JIT is not used, the option affects the be-
11505        haviour of the interpretive code in pcre2_match(). When PCRE2_MATCH_IN-
11511        \p{Any},  it does not even match negative items such as [^X]. A lookbe-
11527        UTF-sequence,  that  sequence  is  skipped, and the match starts at the
11535        Using  PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi-
11551        Copyright (c) 1997-2021 University of Cambridge.
11552 ------------------------------------------------------------------------------