pcre2.txt - OpenGrok cross reference for /external/pcre/dist2/doc/pcre2.txt

1 -----------------------------------------------------------------------------
8 -----------------------------------------------------------------------------
16        PCRE2 - Perl-compatible regular expressions (revised API)
22        pattern matching using the same syntax and semantics as Perl, with just
25        API is more extensible, and it was simplified by abolishing  the  sepa-
30        As  well  as Perl-style regular expression patterns, some features that
31        appeared in Python and the original PCRE before they appeared  in  Perl
37        The source code for PCRE2 can be compiled to support 8-bit, 16-bit,  or
38        32-bit  code units, which means that up to three separate libraries may
39        be installed.  The original work to extend PCRE to  16-bit  and  32-bit
40        code  units  was  done  by Zoltan Herczeg and Christian Persch, respec-
42        character  per  code  unit, or as UTF-encoded Unicode, with support for
45        code units must be enabled explicitly at run time. The version of  Uni-
48          pcre2test -C
51        ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
57        In addition to the Perl-compatible matching function, PCRE2 contains an
58        alternative  function that matches the same compiled patterns in a dif-
63        Details of exactly which Perl regular expression features are  and  are
70        client  to  discover  which  features are available. The features them-
71        selves are described in the pcre2build page. Documentation about build-
73        NON-AUTOTOOLS_BUILD files in the source distribution.
86        If you are using PCRE2 in a non-UTF application that permits  users  to
89        For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
90        mode, which interprets patterns and subjects as strings of  UTF-8  code
91        units instead of individual 8-bit characters. This causes both the pat-
92        tern and any data against which it is matched to be checked  for  UTF-8
93        validity.  If the data string is very long, such a check might use suf-
94        ficiently many resources as to cause your application to  lose  perfor-
97        One  way  of guarding against this possibility is to use the pcre2_pat-
100        calling pcre2_compile(). This causes a compile time error if  the  pat-
101        tern contains a UTF-setting sequence.
104        be enabled from within the pattern, by specifying "(*UCP)".  This  fea-
112        The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
114        middle of  a  multi-code-unit  character.  The  PCRE2_NEVER_BACKSLASH_C
116        a compile-time error if it is encountered. It is also possible to build
121        Nested  unlimited repeats in a pattern are a common example. PCRE2 pro-
124        pcre2_set_depth_limit() that can be used to restrict the amount of mem-
130        The  user  documentation for PCRE2 comprises a number of different sec-
136        (which  is a program listing), and the short pages for individual func-
137        tions, are concatenated in pcre2.txt, for ease of searching.  The  sec-
141          pcre2-config       show PCRE2 installation configuration information
145          pcre2compat        discussion of Perl compatibility
148          pcre2grep          description of the pcre2grep command (8-bit only)
149          pcre2jit           discussion of just-in-time optimization support
156          pcre2posix         the POSIX-compatible C API for the 8-bit library
181        Copyright (c) 1997-2018 University of Cambridge.
182 ------------------------------------------------------------------------------
190        PCRE2 - Perl-compatible regular expressions (revised API)
195        contains a description of all its native functions. See the pcre2 docu-
448        These functions provide a way of  converting  non-PCRE2  patterns  into
455 PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
457        There are three PCRE2 libraries, supporting 8-bit, 16-bit,  and  32-bit
460        for all three libraries. One, two, or all three can be installed simul-
461        taneously. On Unix-like systems the libraries  are  called  libpcre2-8,
462        libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
485        macros are defined whose names are the generic forms such as pcre2_com-
487        PCRE2_CODE_UNIT_WIDTH  to generate the appropriate width-specific func-
505        single  library.   For example, if you want to run a match using a pat-
511        their generic names, without the _8, _16, or _32 suffix.
517        There are also some wrapper functions for the 8-bit library that corre-
529        program against a non-dll PCRE2 library, you must  define  PCRE2_STATIC
533        and matching regular expressions in a Perl-compatible manner. A  sample
540        passed as bits in an options argument. There are also some more compli-
546        Just-in-time  (JIT)  compiler  support  is an optional feature of PCRE2
561        less sanity checking. The JIT-specific functions are discussed  in  the
564        A  second  matching function, pcre2_dfa_match(), which is not Perl-com-
569        return captured substrings. A description of  the  two  matching  algo-
588        pcre2_substring_free()  and  pcre2_substring_list_free()  are also pro-
590        functions  is called with a NULL argument, the function returns immedi-
591        ately without doing anything.
600        Finally,  there  are functions for finding out information about a com-
615        ~(PCRE2_SIZE)0)  is reserved as a special indicator for zero-terminated
623        strings: a single CR (carriage return) character, a  single  LF  (line-
624        feed) character, the two-character sequence CRLF, any of the three pre-
642        dollar metacharacters, the handling of #-comments in /x mode, and, when
643        CRLF is a recognized line ending sequence, the match position  advance-
644        ment for a non-anchored pattern. There is more detail about this in the
654        In a multithreaded application it is important to keep  thread-specific
656        library code itself is thread-safe: it contains  no  static  or  global
657        variables.  The  API  is  designed to be fairly simple for non-threaded
658        applications while at the same time ensuring that multithreaded  appli-
661        There are several different blocks of data that are used to pass infor-
669        is  thread-safe, that is, the same compiled pattern can be used by more
672        use them. However, if the just-in-time (JIT)  optimization  feature  is
682          Get a read-only (shared) lock (mutex) for pointer
694        If JIT is being used, but the JIT compilation is not being done immedi-
700        obtain  a private copy of the compiled code before calling the JIT com-
709        a PCRE2 function without using lots of arguments. The  parameters  that
713        In a multithreaded application, if the parameters in a context are val-
716        it must make its own thread-specific copy.
721        of a match. This includes details of what was matched, as well as addi-
730        memory management or non-standard character tables.  To  keep  function
739        relevant  for  several  PCRE2 operations, a compile-time context, and a
740        match-time context.
764        function  may be NULL, in which case the system memory management func-
786        without doing anything.
790        A compile context is required if you want to provide an external  func-
792        values of any of the following compile-time parameters:
801        A compile context is also required if you are using custom memory  man-
802        agement.   If  none of these apply, just pass NULL as the context argu-
805        A compile context is created, copied, and freed by the following  func-
833        only argument is a general context. This function builds a set of char-
839        As  PCRE2  has developed, almost all the 32 option bits that are avail-
842        bits which are used for some newer, assumed rarer, options. This  func-
854        largest  number  that  a  PCRE2_SIZE variable can hold, which is effec-
860        This specifies which characters or character sequences are to be recog-
863        two-character  sequence  CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
871        PCRE2_EXTENDED_MORE option, the newline convention affects the recogni-
883        limit applies to parentheses of all kinds, not just capturing parenthe-
899        nesting,  and  the second is user data that is set up by the last argu-
901        should return zero if all is well, or non-zero to force an error.
917        A match context is created, copied, and freed by  the  following  func-
937        during a matching operation. Details are given in the pcre2callout doc-
957        option when calling pcre2_compile() so that when JIT is in use, differ-
958        ent code can be compiled. If a match  is  started  with  a  non-default
959        offset limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is gener-
967        the first line and also within the offset limit. In other words, which-
976        also applies to pcre2_dfa_match(), which may use the heap when process-
978        atomic groups. This limit does not apply to matching with the JIT opti-
994        The  pcre2_match() function starts out using a 20KiB vector on the sys-
995        tem stack for recording backtracking points. The more nested backtrack-
998        too small. If the heap limit is set to a value less than 21 (in partic-
1000        that  do not have a lot of nested backtracking can be successfully pro-
1025        When  pcre2_match() is called with a pattern that was successfully pro-
1088 CHECKING BUILD-TIME OPTIONS
1107        non-negative  on success, or the negative error code PCRE2_ERROR_BADOP-
1108        TION if the value in the first argument is not recognized. The  follow-
1122        unit widths were selected when PCRE2 was  built.  The  1-bit  indicates
1123        8-bit  support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
1130        recursions, lookarounds, and atomic groups in  pcre2_dfa_match().  Fur-
1143        just-in-time compiling is available; otherwise it is set to zero.
1162        the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
1163        when  the  32-bit  library  is compiled, internal linkages always use 4
1166        The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1202        The  output is a uint32_t integer that gives the maximum depth of nest-
1212        This parameter is obsolete and should not be used in new code. The out-
1220        without  Unicode  support,  the buffer is filled with the text "Unicode
1236        PCRE2 version string, zero-terminated. The number of code units used is
1237        returned. This is the length of the string plus one unit for the termi-
1255        length  (in  code units). If the pattern is zero-terminated, the length
1260        If the compile context argument ccontext is NULL, memory for  the  com-
1265        NULL argument, it returns immediately, without doing anything.
1270        below), the JIT information cannot be copied (because it  is  position-
1271        dependent).  The new copy can initially be used only for non-JIT match-
1276        a multithreaded application to acquire a private copy  of  shared  com-
1285        pointing  to the new tables. The memory for the new tables is automati-
1293        After  running a match, you must not free a compiled pattern (or a sub-
1300        particular, those that are compatible with Perl,  but  some  others  as
1304        For those options that can be different in different parts of the  pat-
1310        Other, less frequently required compile-time parameters  (for  example,
1314        If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1318        error has occurred. The values are not defined when compilation is suc-
1319        cessful and pcre2_compile() returns a non-NULL value.
1322        return if it finds an error in the pattern. There are also  some  nega-
1328        "Obtaining a textual error message" below) should be  self-explanatory.
1332        The value returned in erroroffset is an indication of where in the pat-
1336        the failing assertion. For an invalid UTF-8 or UTF-16 string, the  off-
1342        mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1345        This  code  fragment shows a typical straightforward call to pcre2_com-
1353            PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
1368        only way to do it in Perl.
1372        By  default, for compatibility with Perl, a closing square bracket that
1383        (1) \U matches an upper case "U" character; by default \U causes a com-
1384        pile time error (Perl uses \U to upper case subsequent characters).
1388        code  point  to match. By default, \u causes a compile time error (Perl
1393        code point to match. By default, as in Perl, a  hexadecimal  number  is
1403        Perl. If you want a multiline circumflex also to match after  a  termi-
1408        By  default, for compatibility with Perl, the name in any verb sequence
1417        whitespace in verb names is  skipped  and  #-comments  are  recognized,
1423        items, all with number 255, before each pattern  item,  except  immedi-
1424        ately  before  or after an explicit callout in the pattern. For discus-
1430        case  letters in the subject. It is equivalent to Perl's /i option, and
1437        points (available only in 16-bit or 32-bit mode)  are  treated  as  not
1443        at the end of the subject string. Without this option,  a  dollar  also
1447        Perl, and no way to set it within a pattern.
1453        ever matches one character, even if newlines are coded as CRLF. Without
1454        this option, a dot does not match when the current position in the sub-
1455        ject is at a newline. This option is equivalent to  Perl's  /s  option,
1456        and it can be changed within a pattern by a (?s) option setting. A neg-
1458        escape  sequence always matches a non-newline character, independent of
1475        patterns, a new match is then tried at the next  starting  point.  How-
1486        which is the only way to do it in Perl.
1490        matches, which are necessarily substrings of the first one, must  obvi-
1496        totally ignored except when escaped or inside a character  class.  How-
1498        introduce various parenthesized subpatterns, nor within numerical quan-
1500        item and a following quantifier and between a quantifier and a  follow-
1502        Perl's /x option, and it can be changed within  a  pattern  by  a  (?x)
1505        When  PCRE2  is compiled without Unicode support, PCRE2_EXTENDED recog-
1507        256 that are flagged as white space in its low-character table. The ta-
1514        When PCRE2 is compiled with Unicode support, in addition to these char-
1515        acters,  five  more Unicode "Pattern White Space" characters are recog-
1516        nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1517        right  mark), U+200F (right-to-left mark), U+2028 (line separator), and
1519        recognized  by  Perl's /x option. Note that the horizontal and vertical
1523        As  well as ignoring most white space, PCRE2_EXTENDED also causes char-
1530        Which characters are interpreted as newlines can be specified by a set-
1532        special sequence at the start of the pattern, as described in the  sec-
1542        character  class.  PCRE2_EXTENDED_MORE  is  equivalent  to  Perl's  /xx
1543        option,  and  it can be changed within a pattern by a (?xx) option set-
1550        start of matching, though the matched text may continue over  the  new-
1551        line. If startoffset is non-zero, the limiting newline is not necessar-
1553        string is "abc\nxyz" (where \n represents a single-character newline) a
1562        If this option is set, all meta-characters in the pattern are disabled,
1565        you  are  doing  a  lot of literal matching and are worried about effi-
1580        fails by default, for Perl compatibility.  Setting  this  option  makes
1590        string,  or  before  a  terminating  newline  (except  when  PCRE2_DOL-
1593        behaviour (for ^, $, and dot) is the same as Perl.
1598        start and end. This is equivalent to Perl's /m option, and  it  can  be
1601        subject,  for compatibility with Perl.  However, you can change this by
1608        This option locks out the use of \C in the pattern that is  being  com-
1609        piled.   This  escape  can  cause  unpredictable  behaviour in UTF-8 or
1610        UTF-16 modes, because it may leave the current matching  point  in  the
1611        middle  of  a  multi-code-unit  character. This option may be useful in
1613        there is also a build-time option that permanently locks out the use of
1628        This option locks out interpretation of the pattern as  UTF-8,  UTF-16,
1629        or UTF-32, depending on which library is in use. In particular, it pre-
1632        applications that process patterns from external sources. The  combina-
1637        If this option is set, it disables the use of numbered capturing paren-
1641        is  the  same as Perl's /n option.  Note that, when this option is set,
1648        If this option is set, it disables "auto-possessification", which is an
1651        are  in  use,  auto-possessification means that some callouts are never
1659        .*  is  the  first significant item in a top-level branch of a pattern,
1662        atomic group or a capturing group that is the subject of  a  backrefer-
1663        ence,  or  if  the pattern contains (*PRUNE) or (*SKIP). When the opti-
1679        the  matching code searches the subject for that value, and fails imme-
1680        diately if it cannot find it, without actually running the main  match-
1684        items are in use, these "start-up" optimizations can cause them  to  be
1685        skipped  if  the pattern is never actually used. The start-up optimiza-
1686        tions are in effect a pre-scan of the subject that takes  place  before
1689        The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1702        start-up optimization scans along the subject, finds "A" and  runs  the
1703        first  match attempt from there. The (*COMMIT) item means that the pat-
1711        There  are  also  other  start-up optimizations. For example, a minimum
1728        UTF-8 strings, UTF-16 strings, and UTF-32 strings in  the  pcre2unicode
1742        Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1743        able  the error that is given if an escape sequence for an invalid Uni-
1744        code code point is encountered in the pattern. In particular,  the  so-
1748        section entitled "Extra compile options" below.  However, this is  pos-
1749        sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1750        resentable in UTF-16.
1760        option is available only if PCRE2 has been compiled with  Unicode  sup-
1767        not  compatible  with Perl. It can also be set by a (?U) option setting
1773        is  going  to be used to set a non-default offset limit in a match con-
1775        offset  limit  is  set  without  this option. For more details, see the
1783        instead  of  single-code-unit  strings.  It  is available when PCRE2 is
1792        Unlike the main compile-time options, the extra options are  not  saved
1799        This  option  applies when compiling a pattern in UTF-8 or UTF-32 mode.
1800        It is forbidden in UTF-16 mode, and ignored in non-UTF  modes.  Unicode
1802        in UTF-16 to encode code points with values in  the  range  0x10000  to
1803        0x10ffff.  The  surrogates  cannot  therefore be represented in UTF-16.
1804        They can be represented in UTF-8 and UTF-32, but are defined as invalid
1805        code  points,  and  cause  errors  if  encountered in a UTF-8 or UTF-32
1810        when using PCRE2 to check for unwanted  characters  in  UTF-8  strings,
1813        because  it applies only to the testing of input strings for UTF valid-
1816        If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set,  surro-
1817        gate  code  point values in UTF-8 and UTF-32 patterns no longer provoke
1825        escape  such  as \j or a malformed one such as \x{2z} causes a compile-
1826        time error when detected by pcre2_compile(). Perl is somewhat inconsis-
1828        "j", and non-hexadecimal digits in \x{} are just ignored, though  warn-
1829        ings  are given in both cases if Perl's warning switch is enabled. How-
1831        Perl.
1835        treated  as  single-character escapes. For example, \j is a literal "j"
1837        option  means  that  typos in patterns may go undetected and have unex-
1842        This option is provided for use by  the  -x  option  of  pcre2grep.  It
1844        automatically inserting the code for "^(?:" at the start  of  the  com-
1851        This  option  is  provided  for  use  by the -w option of pcre2grep. It
1859 JUST-IN-TIME (JIT) COMPILATION
1879        just-in-time  compiler  is available, further processes a compiled pat-
1885        for  patterns  to  be analyzed, and for one-off matches and simple pat-
1896        points  are  less than 256. By default, higher-valued code points never
1897        match escapes such as \w or \d.  However, if PCRE2 is built  with  Uni-
1898        code support, all characters can be tested with \p and \P, or, alterna-
1901        the built-in tables.
1911        default "C" locale of the local system, which may cause them to be dif-
1914        The  internal tables can be overridden by tables supplied by the appli-
1916        from  the  default.  As more and more applications change to using Uni-
1933        The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
1940        pcre2_match()  and pcre2_dfa_match(). Thus, for any single pattern, com-
1951        The  first  argument  for pcre2_pattern_info() is a pointer to the com-
1957        the function is zero for success, or one of the following negative num-
1967        typical call of pcre2_pattern_info(), to obtain the length of the  com-
1986        options that were passed to pcre2_compile(), whereas  PCRE2_INFO_ALLOP-
1987        TIONS  returns  the compile options as modified by any top-level (*XXX)
1990        compile context by calling the pcre2_set_compile_extra_options()  func-
1996        change within a pattern do not affect the result  of  PCRE2_INFO_ALLOP-
2000        A pattern compiled without PCRE2_ANCHORED is automatically anchored  by
2001        PCRE2 if the first significant item in every top-level branch is one of
2007          .*    sometimes - see below
2019        For  patterns  that are auto-anchored, the PCRE2_ANCHORED bit is set in
2028        characters of the given group, but in addition, the check that  a  cap-
2041        Return  the highest capturing subpattern number in the pattern. In pat-
2051        PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
2057        In  the absence of a single first code unit for a non-anchored pattern,
2058        pcre2_compile() may construct a 256-bit table that defines a fixed  set
2062        means "any code unit of value 255 or above". If such a table  was  con-
2069        a  non-anchored pattern. The third argument should point to a uint32_t
2081        The  third  argument should point to a uint32_t variable. In the 8-bit
2082        library, the value is always less than 256. In the 16-bit  library  the
2083        value  can  be  up  to 0xffff. In the 32-bit library in UTF-32 mode the
2084        value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2091        without  the  use  of  JIT. The third argument should point to a size_t
2112        (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2120        Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
2122        (?J) and (?-J) set and unset the local PCRE2_DUPNAMES  option,  respec-
2127        If  the  compiled  pattern was successfully processed by pcre2_jit_com-
2147        PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2154        contains recursive subroutine calls it is not always possible to deter-
2155        mine  whether  or  not it can match an empty string. PCRE2 takes a cau-
2164        PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
2170        Return the number of characters (not code units) in the longest lookbe-
2172        uint32_t integer. This information is useful when  doing  multi-segment
2174        assertions \b and \B require a one-character lookbehind. \A also regis-
2175        ters  a  one-character  lookbehind, though it does not actually inspect
2177        from  the old segment is retained when a new segment is processed. Oth-
2185        number  of characters, which in UTF mode may be different from the num-
2195        PCRE2 supports the use of named as well as numbered capturing parenthe-
2196        ses. The names are just an additional way of identifying the  parenthe-
2198        pcre2_substring_get_byname() are provided for extracting captured  sub-
2202        do the conversion, you need to use the  name-to-number  map,  which  is
2205        The  map  consists  of a number of fixed-size entries. PCRE2_INFO_NAME-
2211        This  is  a  PCRE2_SPTR  pointer to a block of code units. In the 8-bit
2212        library, the first two bytes of each entry are the number of  the  cap-
2213        turing parenthesis, most significant byte first. In the 16-bit library,
2214        the pointer points to 16-bit code units, the first  of  which  contains
2215        the  parenthesis  number.  In the 32-bit library, the pointer points to
2216        32-bit code units, the first of which contains the parenthesis  number.
2232        pattern after compilation by the 8-bit library  (assume  PCRE2_EXTENDED
2233        is set, so white space - including newlines - is ignored):
2235          (?<date> (?<year>(\d\d)?\d\d) -
2236          (?<month>\d\d) - (?<day>\d\d) )
2240        with non-printing bytes shown in hexadecimal, and undefined bytes shown
2249        name-to-number  map,  remember that the length of the entries is likely
2263        This identifies the character sequence that will be recognized as mean-
2272        pcre2_compile()  is  getting memory in which to place the compiled pat-
2275        over-estimate. Processing a pattern with  the  JIT  compiler  does  not
2291        which they appear. Its first argument is a pointer to a callout enumer-
2293        passed  to  pcre2_callout_enumerate(). The contents of the callout enu-
2303        PCRE2, with the same code unit width, and must also have the same endi-
2308        the  serialized form. They are described in the pcre2serialize documen-
2309        tation. Note that PCRE2 serialization does not  convert  compiled  pat-
2331        you must create a match data block by calling one of the creation func-
2338        pcre2_match_data_create(), so it is always possible to return the over-
2341        The second argument of pcre2_match_data_create() is a pointer to a gen-
2348        right size to hold all the substrings a pattern might capture. The sec-
2374        NULL argument, it returns immediately, without doing anything.
2387        order  to  find multiple matches in the subject string or to match dif-
2391        operates  in  a  Perl-like  manner. For specialist use there is also an
2407        If  the  subject  string is zero-terminated, the length can be given as
2409        common matching parameters are to be changed. For details, see the sec-
2417        bytes  for the 8-bit library, 16-bit code units for the 16-bit library,
2418        and 32-bit code units for the 32-bit library, whether or not  UTF  pro-
2424        by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2425        set  must  point to the start of a character, or to the end of the sub-
2426        ject (in UTF-32 mode, one code unit equals one character, so  all  off-
2430        A non-zero starting offset is useful when searching for  another  match
2445        string again, but with startoffset set to 4, it finds the second occur-
2450        match an empty string. It is possible to emulate Perl's /g behaviour by
2457        so,  and the current character is CR followed by LF, advance the start-
2460        If a non-zero starting offset is passed when the pattern is anchored, a
2461        single attempt to match at the given offset is made. This can only suc-
2463        the  subject.  In other words, the anchoring must be the result of set-
2472        PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and  PCRE2_PAR-
2475        Setting  PCRE2_ANCHORED  or PCRE2_ENDANCHORED at match time is not sup-
2476        ported by the just-in-time (JIT) compiler. If it is set,  JIT  matching
2492        matches must be right at the end of the subject string. Note that  set-
2499        match  before  it.  Setting  this without having set PCRE2_MULTILINE at
2507        in  multiline mode) a newline immediately before it. Setting this with-
2509        match. This option affects only the behaviour of the dollar metacharac-
2545        called.   If  a non-zero starting offset is given, the check is applied
2546        only to that part of the subject that could be inspected during  match-
2553        sequences \b and \B are one-character lookbehinds.
2559        validity  of  UTF-8  strings, UTF-16 strings, and UTF-32 strings in the
2582        the  caller  is prepared to handle a partial match, but only if no com-
2587        PCRE2_ERROR_PARTIAL, without considering  any  other  alternatives.  In
2588        other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
2591        There is a more detailed discussion of partial and multi-segment match-
2597        When  PCRE2 is built, a default newline convention is set; this is usu-
2602        pcre2pattern page. During matching, the newline choice affects the  be-
2618        However, the pattern [\r\n]A does match that string,  because  it  con-
2619        tains an explicit CR or LF reference, and so advances only by one char-
2625        not  count, nor does \s, even though it includes CR and LF in the char-
2643        phrase "capturing subpattern" or "capturing group" is used for a  frag-
2652        Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2656        pcre2_get_ovector_count() returns the number of pairs of values it con-
2659        Within the ovector, the first in each pair of values is set to the off-
2661        offset of the first code unit after the end of a substring. These  val-
2663        are byte offsets in the 8-bit library, 16-bit  offsets  in  the  16-bit
2664        library, and 32-bit offsets in the 32-bit library.
2672        the  portion  of the subject string that was matched by the entire pat-
2676        been  captured,  the returned value is 3. If there are no captured sub-
2699        2 is not. When this happens, both values in  the  offset  pairs  corre-
2705        are not matched.  The return from the function is 2, because the  high-
2706        est used capturing subpattern number is 1. The offsets for the sec-
2711        in the pattern are never changed. That is, if a pattern contains n cap-
2713        pcre2_match(). The other elements retain whatever  values  they  previ-
2733        returns a pointer to the zero-terminated name, which is within the com-
2741        through  the  pattern.  Instances of (*PRUNE) and (*THEN) without names
2754        Warning: By default, certain start-of-match optimizations are  used  to
2758        engine. This check fails for "bx", causing a match failure without see-
2759        ing any marks. You can disable the start-of-match optimizations by set-
2766        offset of the character at which the match started. For  a  non-partial
2779        If pcre2_match() fails, it returns a negative number. This can be  con-
2780        verted  to a text string by calling the pcre2_get_error_message() func-
2785        of UTF-specific negative error codes is returned. Details are given  in
2800        PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
2807        a library of a different code unit width, for example, a  pattern  com-
2808        piled  by  the  8-bit  library  is passed to a 16-bit or 32-bit library
2849        using JIT is being matched, but the memory available for  the  just-in-
2850        time  processing stack is not large enough. See the pcre2jit documenta-
2872        within  the  pattern. Specifically, it means that either the whole pat-
2875        might do this are detected and faulted at compile time, but  more  com-
2886        match,  or  auxiliary)  can be obtained by calling pcre2_get_error_mes-
2893        The returned message is terminated with a trailing zero, and the  func-
2896        PCRE2_ERROR_BADDATA  is  returned. If the buffer is too small, the mes-
2919        extracting  captured  substrings  as  new,  separate,   zero-terminated
2925        zero refers to the entire matched substring, with higher numbers refer-
2936        extracts a zero-length empty string.
2938        You  can  find the length in code units of a captured substring without
2945        The pcre2_substring_copy_bynumber() function  copies  a  captured  sub-
2948        function  that  was  used for the match data block. The first two argu-
2988        pattern  is  (abc)|(def) and the subject is "def", and the ovector con-
2999        The pcre2_substring_list_get() function  extracts  all  available  sub-
3013        therefore need the lengths, you may supply NULL as the lengthsptr argu-
3015        function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the  mem-
3022        This  can  be  distinguished  from  a  genuine zero-length substring by
3024        PCRE2_UNSET   for   unset   substrings,   or   by   calling  pcre2_sub-
3044        To extract a substring by name, you first have to find associated  num-
3051        the name by calling pcre2_substring_number_from_name(). The first argu-
3071        Warning: If the pattern uses the (?| feature to set up multiple subpat-
3092        given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
3100        pcre2_match(), except that the partial matching options are not permit-
3102        block is obtained and freed within this function, using memory  manage-
3112        length, in code units, of the output buffer. If the  function  is  suc-
3130        option is set, a dollar character is an escape character that can spec-
3140        brackets are required only if the following character would  be  inter-
3164        takes place in the original subject string (that is, previous  replace-
3167        subject string. If an offset limit is set in the match context, search-
3171        the subject string by setting either or both of startoffset and an off-
3179        with zero length, an attempt to find a non-empty match at the same off-
3186        buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3188        continues to go through the motions of matching and substituting (with-
3189        out,  of course, writing anything) in order to compute the size of buf-
3196        that the entire operation is carried out twice. Depending on the appli-
3198        the  excess  afterwards,  instead   of   using   PCRE2_SUBSTITUTE_OVER-
3215        replacement  string.  Without this option, only the dollar character is
3221        particular  character codes, and backslash followed by any non-alphanu-
3228        current state: \U and \L change to upper or lower case forcing, respec-
3233        all inserted  characters, including those from captured groups and let-
3236        Note that case forcing sequences such as \U...\E do not nest. For exam-
3244          ${<n>:-<string>}
3247        As before, <n> may be a group number or a name. The first  form  speci-
3279        PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3282        PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3284        when  the  simple  (non-extended)  syntax  is  used  and  PCRE2_SUBSTI-
3294        PCRE2_ERROR_BADREPESCAPE  (invalid  escape  sequence), PCRE2_ERROR_REP-
3295        MISSINGBRACE (closing curly bracket not found),  PCRE2_ERROR_BADSUBSTI-
3336        point to the first and last entries in the name-to-number table for the
3349        The  traditional  matching  function  uses a similar algorithm to Perl,
3350        which stops when it finds the first match at a given point in the  sub-
3353        function  (see  below) instead. If you cannot use the alternative func-
3357        What you have to do is to insert a callout right at the end of the pat-
3358        tern.  When your callout function is called, extract and save the  cur-
3375        not backtrack.  This has different characteristics to the normal  algo-
3376        rithm,  and  is not compatible with Perl. Some of the features of PCRE2
3379        algorithms, and a list of features that pcre2_dfa_match() does not sup-
3384        is used in a different way, and this is described below. The other com-
3412        zero. The only bits that may be set  are  PCRE2_ANCHORED,  PCRE2_ENDAN-
3430        matches, but there is still at least one matching possibility. The por-
3433        more detailed discussion of partial and  multi-segment  matching,  with
3439        stop as soon as it has found one match. Because of the way the alterna-
3455        When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3474        which  is  the  number  of  matched substrings. The offsets of the sub-
3477        any capturing groups that may exist in the pattern, because DFA  match-
3490        NOTE:  PCRE2's  "auto-possessification" optimization usually applies to
3553        Copyright (c) 1997-2018 University of Cambridge.
3554 ------------------------------------------------------------------------------
3562        PCRE2 - Perl-compatible regular expressions (revised API)
3567        the library in Unix-like environments using the applications  known  as
3572        systems.  There  is a lot more information about building PCRE2 without
3574        "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
3576        non-Unix-like environment.
3579 PCRE2 BUILD-TIME OPTIONS
3583        configure  script,  where  the  optional features are selected or dese-
3584        lected by providing options to configure before running the  make  com-
3585        mand.  However,  the same options can be selected in both Unix-like and
3586        non-Unix-like environments if you are using CMake instead of  configure
3591        compiler, as described in NON-AUTOTOOLS-BUILD.
3597          ./configure --help
3600        names begin with --enable or --disable. Because of the way that config-
3601        ure  works, --enable and --disable always come in pairs, so the comple-
3604        with --with. At the end of a configure run, a summary of the configura-
3608 BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
3610        By  default, a library called libpcre2-8 is built, containing functions
3612        either  as single-byte characters, or UTF-8 strings. You can also build
3613        two other libraries, called libpcre2-16 and libpcre2-32, which  process
3614        strings  that  are contained in arrays of 16-bit and 32-bit code units,
3615        respectively. These can be interpreted either as single-unit characters
3616        or  UTF-16/UTF-32 strings. To build these additional libraries, add one
3619          --enable-pcre2-16
3620          --enable-pcre2-32
3622        If you do not want the 8-bit library, add
3624          --disable-pcre2-8
3627        the  POSIX wrapper is for the 8-bit library only, and that pcre2grep is
3628        an 8-bit program. Neither of these is built if  you  select  only  the
3629        16-bit or 32-bit libraries.
3638          --disable-shared
3639          --disable-static
3647        strings.  To build it without Unicode support, add
3649          --disable-unicode
3653        another without, in the same configuration.
3655        Of itself, Unicode support does not make PCRE2 treat strings as  UTF-8,
3656        UTF-16 or UTF-32. To do that, applications that use the library can set
3657        the PCRE2_UTF option when they call pcre2_compile() to compile  a  pat-
3665        and Nd are supported. Details are given in the pcre2pattern  documenta-
3677        mode,  can  cause unpredictable behaviour because it may leave the cur-
3678        rent matching point in the middle of a multi-code-unit  character.  The
3680        option when calling pcre2_compile(). There is also a build-time option
3682          --enable-never-backslash-C
3687 JUST-IN-TIME COMPILER SUPPORT
3689        Just-in-time (JIT) compiler support is included in the build by  speci-
3692          --enable-jit
3698          --enable-jit=auto
3705          --enable-jit-sealloc
3712          --disable-pcre2grep-jit
3720        the end of a line. This is the normal newline  character  on  Unix-like
3724          --enable-newline-is-cr
3726        to the configure  command.  There  is  also  an  --enable-newline-is-lf
3730        the two-character sequence CRLF (CR immediately followed by LF). If you
3733          --enable-newline-is-crlf
3737          --enable-newline-is-anycrlf
3742          --enable-newline-is-any
3745        newline sequences are the three just mentioned, plus the single charac-
3750          --enable-newline-is-nul
3752        which causes NUL (binary zero) to be set  as  the  default  line-ending
3766          --enable-bsr-anycrlf
3768        the  default  is changed so that \R matches only CR, LF, or CRLF. What-
3776        part to another (for example, from an opening parenthesis to an  alter-
3777        nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
3778        two-byte values are used for these offsets, leading to a  maximum  size
3779        for a compiled pattern of around 64 thousand code units. This is suffi-
3782        compile PCRE2 to use three-byte or four-byte offsets by adding  a  set-
3785          --with-link-size=3
3788        16-bit library, a value of 3 is rounded up to 4.  In  these  libraries,
3790        to load additional data when handling them. For the 32-bit library  the
3791        value  is  always 4 and cannot be overridden; the value of --with-link-
3804          --with-match-limit=500000
3810        The  pcre2_match() function starts out using a 20KiB vector on the sys-
3819          --with-heap-limit=500
3829        for --with-match-limit. You can set a lower default  limit  by  adding,
3832          --with-match-limit-depth=10000
3844        for  lookaround  assertions,  atomic  groups, and recursion within pat-
3855          --enable-rebuild-chartables
3860        C run-time system. This method of replacing the tables does not work if
3871        compiled to run in an 8-bit EBCDIC environment by adding
3873          --enable-ebcdic --disable-unicode
3875        to the configure command. This setting implies --enable-rebuild-charta-
3879        It is not possible to support both EBCDIC and UTF-8 codes in  the  same
3880        version  of  the  library. Consequently, --enable-unicode and --enable-
3887          --enable-ebcdic-nl25
3889        as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
3891        0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
3894        The options that select newline behaviour, such as --enable-newline-is-
3895        cr, and equivalent run-time options, refer to these character values in
3901        By default, on non-Windows systems, pcre2grep supports the use of call-
3904        This support can be disabled by adding  --disable-pcre2grep-callout  to
3914          --enable-pcre2grep-libz
3915          --enable-pcre2grep-libbz2
3917        to the configure command. These options naturally require that the rel-
3929        be processable is the notional buffer size. If a longer line is encoun-
3935          --with-pcre2grep-bufsize=51200
3936          --with-pcre2grep-max-bufsize=2097152
3939        values  by  using  --buffer-size  and  --max-buffer-size on the command
3947          --enable-pcre2test-libreadline
3948          --enable-pcre2test-libedit
3952        it reads it using the readline() function. This  provides  line-editing
3953        and  history  facilities.  Note that libreadline is GPL-licensed, so if
3958        Setting --enable-pcre2test-libreadline causes the -lreadline option  to
3960        system-installed readline library this is sufficient. However,  in  some
3972          LIBS="-lncurses"
3981          --enable-debug
3991          --enable-valgrind
4005          --enable-coverage
4017        When --enable-coverage is used,  the  following  additional  targets  are
4023        equivalent to running "make coverage-reset", "make  coverage-baseline",
4024        "make check", and then "make coverage-report".
4026          make coverage-reset
4030          make coverage-baseline
4034          make coverage-report
4038          make coverage-clean-report
4040        This  removes the generated coverage report without cleaning the cover-
4043          make coverage-clean-data
4045        This removes the captured coverage data without removing  the  coverage
4048          make coverage-clean
4051        For more information about code coverage, see the gcov and  lcov  docu-
4060          --enable-fuzz-support
4062        At present this applies only to the 8-bit library. If set, it causes an
4063        extra  library  called  libpcre2-fuzzsupport.a  to  be  built,  but not
4064        installed. This contains a single function called  LLVMFuzzerTestOneIn-
4071        Setting  --enable-fuzz-support  also  causes  a binary called pcre2fuz-
4086          --disable-stack-for-recursion
4095        pcre2api(3), pcre2-config(3).
4108        Copyright (c) 1997-2018 University of Cambridge.
4109 ------------------------------------------------------------------------------
4117        PCRE2 - Perl-compatible regular expressions (revised API)
4132        PCRE2  provides  a feature called "callout", which is a means of tempo-
4143        ending delimiter is the same as the start, except for {, where the end-
4163          A(\d{2}|--)
4167          (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4170        alternation bar. If the pattern contains a conditional group whose con-
4184        information  when  you are trying to optimize the performance of a par-
4194    Auto-possessification
4196        At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4202          --->aaaa
4210        the   auto-possessify   feature  by  passing  PCRE2_NO_AUTO_POSSESS  to
4214          --->aaaa
4231        beginning of the subject, and pcre2_compile() remembers this. If a pat-
4232        tern  has more than one top-level branch, automatic anchoring occurs if
4237        It is also disabled if the pattern contains (*PRUNE) or  (*SKIP).  How-
4243          --->aa
4250        This shows that all match attempts start at the beginning of  the  sub-
4253        starting  the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
4256          --->aa
4266        This shows more match attempts, starting at the second subject  charac-
4283        string,  and will immediately give a "no match" return without actually
4287        You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4297        to normal, DFA, and JIT matching. The first argument to the  call-
4323        version  1, and the callout_flags field for version 2. If you are writ-
4332        contains the number of the callout, in the range  0-255.  This  is  the
4339        callout_string  points  to the string that is contained within the com-
4347        delimiter as callout_string[-1] if you need it.
4371        The  capture_last  field  contains the number of the most recently cap-
4373        number  of  the  highest numbered captured substring so far. If no sub-
4379        The   contents  of  ovector[2]  to  ovector[<capture_top>*2-1]  can  be
4388        was passed to the matching function in the match data block  for  call-
4414        parenthesis, the length includes meta characters that follow the paren-
4417        the  length is one, unless a closing parenthesis is followed by a quan-
4426        are used by pcre2test to show the next item to be matched when display-
4430        zero-terminated name of the most recently passed (*MARK), (*PRUNE),  or
4432        Instances of (*PRUNE) or (*THEN) without a name  do  not  obliterate  a
4437        pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
4452        starting position in the subject. Output from pcre2test does not  indi-
4456        The information in the callout_flags field is provided so that applica-
4460        because there is no backtracking in DFA matching, and there is no  sup-
4493        which they appear. Its first argument is a pointer to a callout enumer-
4495        passed  to  pcre2_callout_enumerate(). The data block contains the fol-
4513        non-zero minimum or a fixed maximum, the group is replicated inside the
4519        The callback function should normally return zero. If it returns a non-
4534        Copyright (c) 1997-2018 University of Cambridge.
4535 ------------------------------------------------------------------------------
4543        PCRE2 - Perl-compatible regular expressions (revised API)
4545 DIFFERENCES BETWEEN PCRE2 AND PERL
4547        This document describes the differences in the ways that PCRE2 and Perl
4549        respect  to Perl version 5.26, but as both Perl and PCRE2 are continu-
4552        1. PCRE2 has only a subset of Perl's Unicode support. Details  of  what
4555        2.  Like  Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
4559        PCRE2  optimizes this to run the assertion just once). Perl allows some
4563        3.  Capturing  subpatterns that occur inside negative lookaround asser-
4568        4. The following Perl escape sequences are not supported: \F,  \l,  \L,
4569        \u, \U, and \N when followed by a character name. \N on its own, match-
4570        ing a non-newline character, and \N{U+dd..}, matching  a  Unicode  code
4572        letters are implemented by Perl's general string-handling and  are  not
4577        5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
4582        which Perl does not; the Perl documentation says  "Because  Perl  hides
4583        the need for the user to understand the internal representation of Uni-
4584        code characters, there is no need to implement the somewhat messy  con-
4589        from  Perl  in  that  $  and  @ are also handled as literals inside the
4590        quotes. In Perl, they cause variable interpolation (but of course PCRE2
4591        does  not  have  variables).  Also, Perl does "double-quotish backslash
4592        interpolation" on any backslashes between \Q and \E which, its documen-
4597            Pattern            PCRE2 matches     Perl matches
4616        and backtracking into subroutine calls is now supported, as in Perl.
4620        effect  is  confined to that subpattern; it does not extend to the sur-
4621        rounding pattern. This is not always the case in Perl.  In  particular,
4630        in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
4638        matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
4641        13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
4642        pattern names is not as general as Perl's. This is a consequence of the
4648        distinguish  which  parentheses matched, because both names map to cap-
4652        14. Perl used to recognize comments in some places that PCRE2 does not,
4654        /x modifier is set, Perl allowed white space between ( and ? though the
4656        may still be some cases where Perl behaves differently.
4658        15.  Perl,  when  in warning mode, gives warnings for character classes
4659        such as [A-\d] or [a-[:digit:]]. It then treats the hyphens  as  liter-
4664        not  affected when case-independent matching is specified. For example,
4665        \p{Lu} always matches an upper case letter. I think Perl has changed in
4670        17.  PCRE2  provides  some  extensions  to  the Perl regular expression
4671        facilities.  Perl 5.10 includes new features that are  not  in  earlier
4672        versions  of  Perl,  some  of which (such as named parentheses) were in
4673        PCRE2 for some time before. This list is with respect to Perl 5.26:
4677        different length of string. Perl requires them all  to  have  the  same
4680        (b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
4681        ported in lookbehinds, provided that there is no possibility of  refer-
4682        encing  a  non-unique  number or name. Perl does not support backrefer-
4686        $ meta-character matches only at the very end of the string.
4689        faulted. (Perl can be made to issue a warning.)
4691        (e) If PCRE2_UNGREEDY is set, the greediness of the repetition  quanti-
4692        fiers is inverted, that is, by default they are not greedy, but if fol-
4699        PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
4704        (i)  The  callout  facility is PCRE2-specific. Perl supports codeblocks
4707        (j) The partial matching facility is PCRE2-specific.
4710        different way and is not Perl-compatible.
4716        18.  The  Perl  /a modifier restricts \d numbers to pure ASCII, and the
4717        /aa modifier restricts /i  case-insensitive  matching  to  pure  ASCII,
4721        19. Perl has different limits than PCRE2. See the pcre2limit documenta-
4722        tion for details. With 5.10, Perl went from recursion to iteration, keep-
4724        not  fall into any stack-overflow limit. PCRE2 made a similar change at
4725        release 10.30, and also has many build-time and  run-time  customizable
4739        Copyright (c) 1997-2018 University of Cambridge.
4740 ------------------------------------------------------------------------------
4748        PCRE2 - Perl-compatible regular expressions (revised API)
4750 PCRE2 JUST-IN-TIME COMPILER SUPPORT
4752        Just-in-time  compiling  is a heavyweight optimization that can greatly
4753        speed up pattern matching. However, it comes at the cost of extra  pro-
4755        the same pattern is going to be matched many times. This does not  nec-
4757        anchored, matching attempts may take place many times at various  posi-
4759        string is very long, it may still pay  to  use  JIT  even  for  one-off
4760        matches.  JIT  support  is  available  for all of the 8-bit, 16-bit and
4761        32-bit PCRE2 libraries.
4763        JIT support applies only to the  traditional  Perl-compatible  matching
4771        --enable-jit  (or  equivalent  CMake  option) must be set when PCRE2 is
4775          ARM 32-bit (v5, v7, and Thumb2)
4776          ARM 64-bit
4777          Intel x86 32-bit and 64-bit
4778          MIPS 32-bit and 64-bit
4779          Power PC 32-bit and 64-bit
4780          SPARC 32-bit
4782        If --enable-jit is set on an unsupported platform, compilation fails.
4784        A  program  can  tell if JIT support is available by calling pcre2_con-
4788        falls  back  to the interpretive code if JIT is not available. For pro-
4790        path" API that is JIT-specific.
4799        second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
4810        the size of machine stack that it uses. The exact rules are  not  docu-
4815        PCRE2_JIT_COMPLETE  requests the JIT compiler to generate code for com-
4816        plete matches. If you want to run partial matches using the  PCRE2_PAR-
4821        pcre2_match()  is  called,  the appropriate code is run if it is avail-
4826        the option bits. For example, you can call it once with  PCRE2_JIT_COM-
4829        will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
4830        ing. If pcre2_jit_compile() is called with no option bits set, it imme-
4847        stack"  below,  even  if  you  do  not need to supply a non-default JIT
4849        be  obeyed.  If the match-time options are not right for JIT execution,
4852        If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
4855        option.  A non-zero result means that JIT compilation was successful. A
4872        when  running in a UTF mode, and a callout immediately before an asser-
4881        that  the memory used for the JIT stack was insufficient. See "Control-
4901        The pcre2_jit_stack_create() function creates a JIT  stack.  Its  argu-
4907        function returns immediately, without doing anything. (For the  techni-
4919        The first argument is a pointer to a match context. When this is subse-
4921        JIT stack is used. If this argument is NULL, the function returns imme-
4922        diately,  without  doing anything. There are three cases for the values
4941        is not obeyed when pcre2_match() is called with options that are incom-
4949        up  non-sequential matches in one thread is to use callouts: if a call-
4954        you assign or pass back NULL from  a  callback,  that  is  thread-safe,
4956        or pass back a non-NULL JIT stack, this must be a different  stack  for
4957        each thread so that the application is thread-safe.
4959        Strictly  speaking,  even more is allowed. You can assign the same non-
4968        up non-default JIT stacks might operate:
4976          Use a one-line callback function
4987        PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
4989        child nodes.  Allocating real machine stack on some platforms is diffi-
4997        address space instead of allocating memory. We can safely allocate mem-
4998        ory pages inside this address space, so the stack  could  grow  without
5019        You can free compiled patterns, contexts, and stacks in any order, any-
5032        this without keeping a list of patterns.
5038        Especially on embedded systems, it might be a good idea to release  mem-
5039        ory  sometimes  without  freeing the stack. There is no API for this at
5041        allocated  memory for any stack and another which allows releasing mem-
5055        The JIT executable allocator does not free all memory when it is possi-
5059        calling pcre2_jit_free_unused_memory(). Its argument is a general  con-
5060        text, for custom memory management, or NULL for standard memory manage-
5066        This is a single-threaded example that specifies a  JIT  stack  without
5111        number of other sanity checks are performed on the arguments. For exam-
5136        Copyright (c) 1997-2018 University of Cambridge.
5137 ------------------------------------------------------------------------------
5145        PCRE2 - Perl-compatible regular expressions (revised API)
5153        code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5157        (when  building  the  16-bit  library,  3  is rounded up to 4). See the
5159        for  details.  In  these cases the limit is substantially larger.  How-
5160        ever, the speed of execution is slower.  In  the  32-bit  library,  the
5170        (that is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-
5190        (*THEN) verb is 255 code units for the 8-bit  library  and  65535  code
5191        units for the 16-bit and 32-bit libraries.
5194        number a 32-bit unsigned integer can hold.
5207        Copyright (c) 1997-2017 University of Cambridge.
5208 ------------------------------------------------------------------------------
5216        PCRE2 - Perl-compatible regular expressions (revised API)
5223        pcre2_match() function. This works in the same way as Perl's  matching
5224        function,  and  provides a Perl-compatible matching operation. The just-
5225        in-time (JIT) optimization that is described in the pcre2jit documenta-
5229        it operates in a different way, and is not Perl-compatible. This alter-
5250        The set of strings that are matched by a regular expression can be rep-
5255        tree:  depth-first  and  breadth-first, and these correspond to the two
5261        In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
5263        depth-first search of the pattern tree. That is, it  proceeds  along  a
5265        required. When there is a mismatch, the algorithm  tries  any  alterna-
5274        that  point the algorithm stops. Thus, if there is more than one possi-
5280        Because  it  ends  up  with a single path through the tree, it is rela-
5281        tively straightforward for this algorithm to keep  track  of  the  sub-
5288        This algorithm conducts a breadth-first search of  the  tree.  Starting
5297        scans  the subject string only once, without backtracking, there is one
5306        this algorithm finds all of them, and in particular, it finds the long-
5308        an option to stop the algorithm after the first match (which is  neces-
5318        the fifth character of the subject. The algorithm  does  not  automati-
5321        PCRE2's "auto-possessification" optimization usually applies to charac-
5322        ter repeats at the end of a pattern (as well as internally). For  exam-
5327        either  use  an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
5331        not  supported  by the alternative matching algorithm. They are as fol-
5336        may affect auto-possessification, as just described). During  matching,
5345        a  non-possessive quantifier. Similarly, if an atomic group is present,
5353        algorithm does not attempt to do this. This means that no captured sub-
5356        3. Because no substrings are captured, backreferences within  the  pat-
5359        4.  For  the same reason, conditional expressions that use a backrefer-
5373        these  modes,  because the alternative algorithm moves through the sub-
5384        Using  the alternative matching algorithm provides the following advan-
5387        1. All possible matches (at a single point in the subject) are automat-
5393        once, and never needs to backtrack (except for lookbehinds), it is pos-
5396        also  possible  to  do  multi-segment matching using the standard algo-
5397        rithm, by retaining partially matched substrings, it  is  more  compli-
5399        and discusses multi-segment matching.
5426        Copyright (c) 1997-2014 University of Cambridge.
5427 ------------------------------------------------------------------------------
5435        PCRE2 - Perl-compatible regular expressions
5441        the  entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
5454        reflecting the character that has been typed, for example. This immedi-
5467        If  you  want to use partial matching with just-in-time optimized code,
5473        PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par-
5483        shorter strings. This optimization is also disabled for partial  match-
5490        the subject string is reached successfully, but  matching  cannot  con-
5491        tinue because more characters are needed. However, at least one charac-
5495        of  a matched string. The requirement for inspecting at least one char-
5496        acter exists because an empty string can  always  be  matched;  without
5502        the rest of the ovector are undefined. The appearance of \K in the pat-
5511        string  "abc12",  because  all these characters are needed for a subse-
5512        quent re-match with additional characters.
5520        match, the partial match is remembered, but matching continues as  nor-
5525        This  option  is "soft" because it prefers a complete match over a par-
5529        of the subject is treated as a non-alphanumeric.
5536        If this is matched against the subject string "abc123dog", both  alter-
5546        returned as soon as a partial match is  found,  without  continuing  to
5557        The  difference  between the two partial matching options can be illus-
5565        However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
5586        without backtracking, searching for  all  possible  matches  simultane-
5587        ously.  If the end of the subject is reached before the end of the pat-
5600        behaviour is different from  the  standard  functions  when  PCRE2_PAR-
5614        boundaries,  partial matching with PCRE2_PARTIAL_SOFT can give counter-
5649        matched substrings. The remaining four strings do not  match  the  com-
5657 MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
5661        and  calling  the function again with the same compiled regular expres-
5663        same working space as before, because this is where details of the pre-
5672        The first call has "23ja" as the subject, and requests  partial  match-
5675        last  part  is  shown;  PCRE2 does not retain the previously partially-
5685        this may or may not be what you want.  The only way to allow for start-
5695 MULTI-SEGMENT MATCHING WITH pcre2_match()
5700        re-run,  starting from the point where the partial match occurred. Ear-
5719 ISSUES WITH MULTI-SEGMENT MATCHING
5721        Certain types of pattern may give problems with multi-segment matching,
5727        option, but in practice when doing multi-segment matching you should be
5730        2. If a pattern contains a lookbehind assertion, characters  that  pre-
5742        retained. In a non-UTF or a 32-bit situation, moving  back  is  just  a
5743        subtraction,  but in UTF-8 or UTF-16 you have to count characters while
5752        the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
5784        been  found,  continuation to a new subject segment is no longer possi-
5809        matching  multi-segment  data.  The  example above then behaves differ-
5849        re-running  the  entire  match  can  also be used with the DFA matching
5866        Copyright (c) 1997-2014 University of Cambridge.
5867 ------------------------------------------------------------------------------
5875        PCRE2 - Perl-compatible regular expressions (revised API)
5880        by PCRE2 are described in detail below. There is a quick-reference syn-
5881        tax  summary  in the pcre2syntax page. PCRE2 tries to match Perl syntax
5882        and semantics as closely as it can.  PCRE2 also supports some  alterna-
5883        tive  regular  expression syntax (which does not conflict with the Perl
5887        Perl's  regular expressions are described in its own documentation, and
5897        different  algorithm  that is not Perl-compatible. Some of the features
5898        discussed below are not available when DFA matching is used. The advan-
5903 SPECIAL START-OF-PATTERN ITEMS
5906        set by special items at the start of a pattern. These are not Perl-com-
5908        writers  who are not able to change the program that processes the pat-
5915        In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
5916        as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
5917        can be specified for the 32-bit library, in which  case  it  constrains
5928        restrict   them   to   non-UTF   data  for  security  reasons.  If  the
5936        causes  sequences such as \d and \w to use Unicode properties to deter-
5949        to whichever matching function is subsequently called to match the pat-
5953    Disabling auto-possessification
5961    Disabling start-up optimizations
5964        setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
5971        as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables  optimiza-
5972        tions that apply to patterns whose top-level branches all start with .*
5992        These  facilities  are  provided to catch runaway matches that are pro-
5993        voked by patterns with huge matching trees (a typical example is a pat-
6003        where d is any number of decimal digits. However, the value of the set-
6026        strings:  a  single  CR (carriage return) character, a single LF (line-
6027        feed) character, the two-character sequence CRLF, any of the three pre-
6032        It  is also possible to specify a newline convention by starting a pat-
6042        These override the default and the options given to the compiling func-
6052        The newline convention affects where the circumflex and  dollar  asser-
6053        tions are true. It also affects the interpretation of the dot metachar-
6057        sequence, for Perl compatibility. However, this can be changed; see the
6067        starting  a  pattern  with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
6074        character  code instead of ASCII or Unicode (typically a mainframe sys-
6075        tem). In the sections below, character code values are  ASCII  or  Uni-
6098        There are two different sets of metacharacters: those that  are  recog-
6124          -      indicates character range
6142        always safe to precede a non-alphanumeric  with  backslash  to  specify
6143        that it stands for itself.  In particular, if you want to match a back-
6156        If you want to remove the special meaning from a  sequence  of  charac-
6157        ters,  you can do so by putting them between \Q and \E. This is differ-
6158        ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
6159        sequences  in PCRE2, whereas in Perl, $ and @ cause variable interpola-
6160        tion. Also, Perl does "double-quotish backslash interpolation"  on  any
6165          Pattern            PCRE2 matches   Perl matches
6182    Non-printing characters
6184        A second use of backslash provides a way of encoding non-printing char-
6186        appearance  of non-printing characters in a pattern, but when a pattern
6192          \cx         "control-x", where x is any printable ASCII character
6207        option is set, that is, when PCRE2 is operating in a Unicode mode. Perl
6218        32 or greater than 126, a compile-time error occurs.
6222        The \c escape is processed as specified for Perl in the perlebcdic doc-
6223        ument.  The  only characters that are allowed after \c are A-Z, a-z, or
6224        one of @, [, \, ], ^, _, or ?. Any other character provokes a  compile-
6226        letters (in either case) encode characters 1-26 (hex 01 to hex 1A);  [,
6227        \,  ],  ^,  and  _  encode characters 27-31 (hex 1B to hex 1F), and \c?
6236        but because 127 is not a control character in  EBCDIC,  Perl  makes  it
6239        FF),  but  in  the one Perl calls POSIX-BC its value is 95 (hex 5F). If
6240        certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6251        recent  addition  to Perl; it provides a way of specifying character code
6256        a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
6257        cal character code points, and \g{} to specify backreferences. The fol-
6260        The handling of a backslash followed by a digit other than 0 is compli-
6261        cated, and Perl has changed over time, causing PCRE2 also to change.
6263        Outside a character class, PCRE2 reads the digit and any following dig-
6267        backreference.  A description of how this works is given later, follow-
6271        Inside  a character class, PCRE2 handles \8 and \9 as the literal char-
6272        acters "8" and "9", and otherwise reads up to three octal  digits  fol-
6273        lowing the backslash, using them to generate a data character. Any sub-
6295        By  default, after \x that is not followed by {, from zero to two hexa-
6297        number of hexadecimal digits may appear between \x{ and }. If a charac-
6302        just described only when it is followed by two hexadecimal digits. Oth-
6305        by  four hexadecimal digits; otherwise it matches a literal "u" charac-
6309        two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
6318          8-bit non-UTF mode    no greater than 0xff
6319          16-bit non-UTF mode   no greater than 0xffff
6320          32-bit non-UTF mode   no greater than 0xffffffff
6324        (the  so-called  "surrogate"  code  points). The check for these can be
6327        UTF-8 and UTF-32 modes, because these values are not  representable  in
6328        UTF-16.
6343        In Perl, the sequences \F, \l, \L, \u, and \U  are  recognized  by  its
6358        For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
6361        Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
6362        \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
6379          \W     any "non-word" character
6384        has a different meaning. See the section entitled "Non-printing charac-
6385        ters" above for details. Perl also uses \N{name} to specify  characters
6388        Each  pair of lower and upper case escape sequences partitions the com-
6398        locale. This list may vary if locale-specific matching is taking place.
6399        For example, in some locales the "non-breaking space" character  (\xA0)
6403        or digit.  By default, the definition of letters  and  digits  is  con-
6404        trolled by PCRE2's low-valued character tables, and may vary if locale-
6406        page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
6413        be different for characters in the range 128-255  when  locale-specific
6415        meanings from before Unicode support was available,  mainly  for  effi-
6437          U+00A0     Non-break space
6444          U+2004     Three-per-em space
6445          U+2005     Four-per-em space
6446          U+2006     Six-per-em space
6451          U+202F     Narrow no-break space
6465        In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
6471        any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
6477        below.  This particular group matches either the two-character sequence
6479        U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car-
6481        atomic group, the two-character sequence is treated as  a  single  unit
6485        than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
6491        PCRE2_BSR_ANYCRLF at compile time. (BSR is an  abbreviation  for  "back-
6493        the case, the other behaviour can be requested via  the  PCRE2_BSR_UNI-
6500        These override the default and the options given to the compiling func-
6501        tion.  Note that these special settings, which are not Perl-compatible,
6515        When  PCRE2  is  built  with Unicode support (the default), three addi-
6517        are  available.  In 8-bit non-UTF-8 mode, these sequences are of course
6519        they do work in this mode.  In 32-bit non-UTF mode, code points greater
6525          \P{xx}   a character without the xx property
6531        (described  in the next section).  Other Perl properties such as "InMu-
6545        Adlam, Ahom, Anatolian_Hieroglyphs, Arabic,  Armenian,  Avestan,  Bali-
6547        Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Caucasian_Alba-
6553        Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,  Kaithi,  Kan-
6555        Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,  Lydian,  Maha-
6559        Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki,  Old_Hungar-
6560        ian,  Old_Italic,  Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
6563        Psalter_Pahlavi, Rejang, Runic, Samaritan,  Saurashtra,  Sharada,  Sha-
6566        Tai_Viet,  Takri,  Tamil,  Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
6569        Each character has exactly one Unicode general category property, spec-
6570        ified  by a two-letter abbreviation. For compatibility with Perl, nega-
6575        If only one letter is specified with \p or \P, it includes all the gen-
6602          Mn    Non-spacing mark
6637        page). Perl does not support the Cs property.
6639        The  long  synonyms  for  property  names  that  Perl supports (such as
6643        No character that is in the Unicode table has the Cn (unassigned) prop-
6649        different from the behaviour of current versions of Perl.
6652        to do a multistage table lookup in order to find  a  character's  prop-
6667        properties  that had been used for emojis.  Instead it introduced vari-
6668        ous emoji-specific properties. PCRE2  uses  only  the  Extended  Picto-
6677        2.  Do not end between CR and LF; otherwise end after any control char-
6687        "zero-width  joiner"  character.  Characters  with  the "mark" property
6694        property.  Extend and ZWJ characters are allowed  between  the  charac-
6705        As  well as the standard Unicode properties described above, PCRE2 sup-
6708        non-standard, non-Perl properties internally  when  PCRE2_UCP  is  set.
6713          Xsp   Any Perl space character
6714          Xwd   Any Perl "word" character
6716        Xan  matches  characters that have either the L (letter) or the N (num-
6720        exclude  vertical  tab,  for  Perl compatibility, but Perl changed. Xwd
6723        There is another non-standard property, Xuc, which matches any  charac-
6730        Note that the Xuc property does not match these sequences but the char-
6747        mode), though it again reports the matched string as "bar".  This  fea-
6751        does not interfere with the setting of captured substrings.  For  exam-
6758        Perl  documents  that  the  use  of  \K  within assertions is "not well
6762        be  greater  than the end of the match. Using \K in a lookbehind asser-
6763        tion at the start of a pattern can also lead to odd effects. For  exam-
6775        The final use of backslash is for certain simple assertions. An  asser-
6777        a match, without consuming any characters from the subject string.  The
6799        PCRE2 nor Perl has a separate "start of word" or "end of word"  metase-
6806        set.  Thus,  they are independent of multiline mode. These three asser-
6808        which  affect only the behaviour of the circumflex and dollar metachar-
6809        acters. However, if the startoffset argument of pcre2_match()  is  non-
6816        the  start point of the matching process, as specified by the startoff-
6818        startoffset  is  non-zero. By calling pcre2_match() multiple times with
6819        appropriate arguments, you can mimic Perl's /g option,  and  it  is  in
6824        Perl's,  which  defines it as true at the end of the previous match. In
6825        Perl, these can be different when the  previously  matched  string  was
6836        The circumflex and dollar  metacharacters  are  zero-width  assertions.
6837        That  is,  they test for a particular condition being true without con-
6839        are  concerned  with matching the starts and ends of lines. If the new-
6840        line convention is set so that only the two-character sequence CRLF  is
6846        point is at the start of the subject string. If the  startoffset  argu-
6847        ment  of  pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
6856        if the pattern is constrained to match only at the start  of  the  sub-
6864        newline. Dollar need not be the last character of the pattern if a num-
6866        branch in which it appears. Dollar has no special meaning in a  charac-
6878        a newline that ends the string, for compatibility with  Perl.  However,
6886        pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
6889        When  the  newline  convention (see "Newline conventions" below) recog-
6890        nizes the two-character sequence CRLF as a newline, this is  preferred,
6891        even  if  the  single  characters CR and LF are also recognized as new-
6905        Outside a character class, a dot in the pattern matches any one charac-
6906        ter  in  the subject string except (by default) a character that signi-
6910        that  character; when the two-character sequence CRLF is used, dot does
6912        matches  all characters (including isolated CRs and LFs). When any Uni-
6917        PCRE2_DOTALL option is set, a dot matches any  one  character,  without
6918        exception.   If  the two-character sequence CRLF is present in the sub-
6921        The handling of dot is entirely independent of the handling of  circum-
6931        the section entitled "Non-printing characters" above for details.  Perl
6939        unit,  whether or not a UTF mode is set. In the 8-bit library, one code
6940        unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
6941        32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
6942        line-ending characters. The feature is provided in  Perl  in  order  to
6943        match individual bytes in UTF-8 mode, but it is unclear how it can use-
6947        one  unit  with  \C  in UTF-8 or UTF-16 mode means that the rest of the
6949        results, because PCRE2 assumes that it is matching character by charac-
6959        below)  in UTF-8 or UTF-16 modes, because this would make it impossible
6962        these UTF modes.  The former gives a match-time error; the latter fails
6965        In  the  32-bit  library,  however,  \C  is  always supported (when not
6967        whether or not UTF-32 is specified.
6970        using it that avoids the problem of malformed UTF-8 or  UTF-16  charac-
6972        as in this pattern, which could be used with  a  UTF-8  string  (ignore
6975          (?| (?=[\x00-\x7f])(\C) |
6976              (?=[\x80-\x{7ff}])(\C)(\C) |
6977              (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
6978              (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
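The lookahead ranges in the pattern above decide how many code units each branch consumes. As a rough illustration only, the same range-to-length mapping can be sketched in Python. This is not PCRE2 code, Python's re module has no \C escape, and utf8_unit_count is a hypothetical helper name introduced here.

```python
def utf8_unit_count(ch: str) -> int:
    """Return how many 8-bit code units the matching branch would capture for ch."""
    cp = ord(ch)
    if cp <= 0x7F:      # (?=[\x00-\x7f])(\C)
        return 1
    if cp <= 0x7FF:     # (?=[\x80-\x{7ff}])(\C)(\C)
        return 2
    if cp <= 0xFFFF:    # (?=[\x{800}-\x{ffff}])(\C)(\C)(\C)
        return 3
    return 4            # (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C)


# Cross-check the range table against Python's own UTF-8 encoder.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    assert utf8_unit_count(ch) == len(ch.encode("utf-8"))
```

The cross-check loop only confirms that the four ranges line up with UTF-8 encoding lengths; it says nothing about how PCRE2 implements the match.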
6981        parentheses numbers in each alternative (see "Duplicate Subpattern Num-
6983        UTF-8 character for values whose encoding uses 1, 2,  3,  or  4  bytes,
6991        closing square bracket. A closing square bracket on its own is not spe-
7010        class that starts with a circumflex is not an assertion; it still  con-
7016        letters in a class represent both their upper case and lower case  ver-
7022        special way  when  matching  character  classes,  whatever  line-ending
7037        sequences,  they  cause an error. The same is true for \N when not fol-
7040        The minus (hyphen) character can be used to specify a range of  charac-
7041        ters  in  a  character  class.  For  example,  [d-m] matches any letter
7046        example, [b-d-z] matches letters in the range b to d, a hyphen  charac-
7049        Perl treats a hyphen as a literal if it appears before or after a POSIX
7052        class, Perl outputs a warning in its warning  mode,  as  this  is  most
7056        It is not possible to have the literal character "]" as the end charac-
7057        ter  of a range. A pattern such as [W-]46] is interpreted as a class of
7058        two characters ("W" and "-") followed by a literal string "46]", so  it
7059        would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
7060        backslash it is interpreted as the end of range, so [W-\]46] is  inter-
7065        Ranges normally include all code points between the start and end char-
7067        numerically, for example [\000-\037]. Ranges can include any characters
7068        that are valid for the current mode. In any  UTF  mode,  the  so-called
7071        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  option  disables this check). How-
7072        ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7076        points are both specified as literal letters in the same case. For com-
7077        patibility  with Perl, EBCDIC code points within the range that are not
7078        letters are omitted. For example, [h-k] matches only  four  characters,
7081        [\x88-\x92] or [h-\x92], all code points are included.
7084        it matches the letters in either case. For example, [W-c] is equivalent
7085        to  [][\\^_`wxyzabc],  matched  caselessly,  and  in a non-UTF mode, if
7086        character tables for a French locale are in  use,  [\xc8-\xcb]  matches
7100        special compatibility feature - see the next  two  sections),  and  the
7101        terminating  closing  square  bracket.  However,  escaping  other  non-
7107        Perl supports the POSIX notation for character classes. This uses names
7118          ascii    character codes 0 - 127
7132        CR (13), and space (32). If locale-specific matching is  taking  place,
7136        The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
7137        from  Perl  5.8. Another Perl extension is negation, which is indicated
7142        matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7147        the POSIX character classes, although this may be different for charac-
7148        ters in the range 128-255 when locale-specific matching  is  happening.
7168                  when printed. In Unicode property terms, it matches all char-
7173                    U+2066 - U+2069  Various "isolate"s
7180        [:punct:] This matches all characters that have the Unicode P (punctua-
7199        support is not compatible with Perl. It is provided to help  migrations
7201        that \b matches at the start and the end of a word (see "Simple  asser-
7202        tions"  above),  and in a Perl-style pattern the preceding or following
7203        character normally shows which is wanted,  without  the  need  for  the
7204        assertions  that  are used above in order to give exactly the POSIX be-
7228        enclosed  between "(?"  and ")". These options are Perl-compatible, and
7229        are described in detail in the pcre2api documentation. The option  let-
7239        For example, (?im) sets caseless, multiline matching. It is also possi-
7241        hyphen, for example (?-im). The two "extended" options are not indepen-
7244        A  combined  setting  and  unsetting  such  as  (?im-sx),  which   sets
7248        the option is unset. An empty options setting "(?)" is  allowed.  Need-
7252        the above options to be unset. Thus, (?^) is equivalent  to  (?-imnsx).
7253        Letters  may  follow  the  circumflex  to  cause some options to be re-
7256        The PCRE2-specific options PCRE2_DUPNAMES  and  PCRE2_UNGREEDY  can  be
7257        changed  in  the  same  way as the Perl-compatible options by using the
7269        not  used).   By this means, options can be made to have different set-
7270        tings in different parts of the pattern. Any changes made in one alter-
7282        start of a non-capturing subpattern (see the next section), the  option
7290        Note:  There  are  other  PCRE2-specific options that can be set by the
7291        application when the compiling function is called. The pattern can con-
7311        matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
7327        the captured substrings are "red king", "red", and "king", and are num-
7332        without a capturing requirement. If an opening parenthesis is  followed
7333        by  a question mark and a colon, the subpattern does not do any captur-
7344        start of a non-capturing subpattern,  the  option  letters  may  appear
7359        Perl 5.10 introduced a feature whereby each alternative in a subpattern
7361        starts  with (?| and is itself a non-capturing subpattern. For example,
7366        Because the two alternatives are inside a (?| group, both sets of  cap-
7370        not all, of one of a number of alternatives. Inside a (?| group, paren-
7373        subpattern  start after the highest number used in any branch. The fol-
7374        lowing example is taken from the Perl documentation. The numbers under-
7377          # before  ---------------branch-reset----------- after
7393        A relative reference such as (?-1) is no different: it is just a conve-
7396        If a condition test for a subpattern's having matched refers to a  non-
7397        unique  number, the test is true if any of the subpatterns of that num-
7407        very hard to keep track of the numbers in  complicated  patterns.  Fur-
7409        with this difficulty, PCRE2 supports the naming  of  capturing  subpat-
7410        terns.  This  feature  was not added to Perl until release 5.10. Python
7412        the Python syntax. PCRE2 supports both the Perl and the Python syntax.
7415        (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
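A minimal sketch of the Python form of the syntax, run through Python's own re module rather than PCRE2 (so it shows only the (?P<name>...) spelling, not PCRE2's API). The group names year and month and the sample subject are made up for illustration.

```python
import re

# The (?P<name>...) spelling is the Python syntax for a named group.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = pattern.match("2014-07")

# In Python, a named group can be read by name or by its number.
by_name = (m.group("year"), m.group("month"))
by_number = (m.group(1), m.group(2))
```

Here both tuples hold the same two substrings, because in Python's re a named group is numbered like any other capturing group.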
7417        must start with a non-digit. References to capturing  parentheses  from
7418        other parts of the pattern, such as backreferences, recursion, and con-
7422        exactly  as if the names were not present. In both PCRE2 and Perl, cap-
7425        for extracting the complete name-to-number  translation  table  from  a
7426        compiled  pattern, as well as convenience functions for extracting cap-
7431        to all of them.  Perl allows identically numbered subpatterns  to  have
7437        Perl allows this, with both names AA and BB  as  aliases  of  group  1.
7442        number to be associated with more than one name. The example above pro-
7443        vokes a compile-time error. However, there is still  scope  for  confu-
7453        By default, a name must be unique within a pattern, except that  dupli-
7459        The duplicate name constraint can be disabled by setting the PCRE2_DUP-
7463        a weekday, either as a 3-letter abbreviation or as the full  name,  and
7478        problem is to use a "branch reset" subpattern, as described in the pre-
7481        If you make a backreference to a non-unique named subpattern from else-
7484        first one that is set is used for the reference. For example, this pat-
7490        If you make a subroutine call to a non-unique named subpattern, the one
7519        The  general repetition quantifier specifies a minimum and maximum num-
7540        the syntax of a quantifier, is taken as a literal character. For  exam-
7545        of which is represented by a two-byte sequence in a UTF-8 string. Simi-
7551        the previous item and the quantifier were not present. This may be use-
7557        For  convenience, the three most common quantifiers have single-charac-
7570        Earlier versions of Perl and PCRE1 used to give  an  error  at  compile
7573        subpattern  does in fact match no characters, the loop is forcibly bro-
7577        as  possible  (up  to  the  maximum number of permitted times), without
7594        and instead matches the minimum number of times possible, so  the  pat-
7611        Perl),  the  quantifiers are not greedy by default, but individual ones
7621        (equivalent  to  Perl's /s) is set, thus allowing the dot to match new-
7628        In cases where it is known that the subject  string  contains  no  new-
7629        lines,  it  is worth setting PCRE2_DOTALL in order to obtain this opti-
7639        If  the subject is "xyz123abc123" the match point is the fourth charac-
7642        Another case where implicit anchoring is not applied is when the  lead-
7648        It matches "ab" in the subject "aab". The use of the backtracking  con-
7652        When a capturing subpattern is repeated, the value captured is the sub-
7659        the  corresponding captured values may have been set in previous itera-
7671        to be re-evaluated to see if a different number of repeats  allows  the
7687        to be re-evaluated in this way.
7695        This kind of parenthesis "locks up" the  part of the  pattern  it  con-
7706        must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
7732        The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
7736        found its way into Perl at release 5.10.
7741        when B must follow.  This feature can be disabled by the PCRE2_NO_AUTO-
7751        matches an unlimited number of substrings that either consist  of  non-
7761        both PCRE2 and Perl have an optimization that allows for  fast  failure
7762        when  a single character is used. They remember the last single charac-
7769        sequences of non-digits cannot be broken, and failure happens quickly.
7775        0  (and possibly further digits) is a backreference to a capturing sub-
7781        there  are  not that many capturing left parentheses in the entire pat-
7783        to  the left of the reference for numbers less than 8. A "forward back-
7785        and  the  subpattern to the right has participated in an earlier itera-
7791        See the subsection entitled "Non-printing characters" above for further
7805        An  unsigned number specifies an absolute reference without the ambigu-
7810          (abc(def)ghi)\g{-1}
7812        The sequence \g{-1} is a reference to the most recently started
7813        capturing subpattern before \g, that is, it is equivalent to \2 in this
7814        example.  Similarly, \g{-2} would be equivalent to \1. The use of  relative
7821        Perl does not support the use of + in this way.
7823        A backreference matches whatever actually matched the capturing subpat-
7832        time of the backreference, the case of letters is relevant.  For  exam-
7841        subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
7842        \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
7856        subpattern  has not actually been used in a particular match, any back-
7862        the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
7865        Because there may be many capturing parentheses in a pattern, all  dig-
7866        its  following  a backslash are taken as part of a potential backrefer-
7877        matches.  However, such references can be useful inside  repeated  sub-
7882        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
7886        the  backreference. This can be done using alternation, as in the exam-
7915        referenced in the usual way.  For example, a sequence such as (.)\g{-1}
7922        retained after a successful negative assertion. When an assertion  con-
7925        For  a  positive  assertion, internally captured substrings in the suc-
7926        cessful branch are retained, and matching continues with the next  pat-
7935        For   compatibility  with  Perl,  most  assertion  subpatterns  may  be
7938        useful. However, an assertion that forms the  condition  for  a  condi-
7939        tional  subpattern may not be quantified. In practice, for other asser-
7948        tried with and without the assertion, the order depending on the greed-
7962        matches a word followed by a semicolon, but does not include the  semi-
7992        strings it matches must have a fixed length. However, if there are sev-
7993        eral  top-level  alternatives,  they  do  not all have to have the same
8004        This is an extension compared with Perl, which requires all branches to
8009        is  not  permitted,  because  its single top-level branch can match two
8011        two top-level branches:
8016        of a lookbehind assertion to get round the fixed-length restriction.
8020        then try to match. If there are insufficient characters before the cur-
8023        In  UTF-8  and  UTF-16 modes, PCRE2 does not allow the \C escape (which
8026        the lookbehind. The \X and \R escapes, which can match  different  num-
8030        lookbehinds, as long as the subpattern matches a  fixed-length  string.
8034        Perl does not support backreferences in lookbehinds. PCRE2 does support
8046        assertions to specify efficient matching of fixed-length strings at the
8052        proceeds from left to right, PCRE2 will look for each "a" in  the  sub-
8067        quantifier; it can match only the entire string. The subsequent lookbe-
8082        three characters are not "999".  This pattern does not match "foo" pre-
8084        three of which are not "999". For example, it  doesn't  match  "123abc-
8108        It  is possible to cause the matching process to obey a subpattern con-
8110        on  the result of an assertion, or whether a specific capturing subpat-
8114          (?(condition)yes-pattern)
8115          (?(condition)yes-pattern|no-pattern)
8117        If  the  condition is satisfied, the yes-pattern is used; otherwise the
8118        no-pattern (if present) is used. An absent no-pattern is equivalent  to
8119        an  empty string (it always matches). If there are more than two alter-
8120        natives in the subpattern, a compile-time error occurs. Each of the two
8121        alternatives may itself contain nested subpatterns of any form, includ-
8129        There are five kinds of condition: references  to  subpatterns,  refer-
8130        ences  to  recursion,  two pseudo-conditions called DEFINE and VERSION,
8136        the condition is true if a capturing subpattern of that number has pre-
8139        numbers), the condition is true if any of them have matched. An  alter-
8142        most  recently opened parentheses can be referenced by (?(-1), the next
8143        most recent by (?(-2), and so on. Inside loops it can also  make  sense
8146        is not used; it provokes a compile-time error.)
8148        Consider  the  following  pattern, which contains non-significant white
8155        character is present, sets it as the first captured substring. The sec-
8160        yes-pattern  is  executed and a closing parenthesis is required. Other-
8161        wise, since no-pattern is not present, the subpattern matches  nothing.
8162        In  other  words,  this  pattern matches a sequence of non-parentheses,
8168          ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
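
The conditional in the pattern above can be tried out with a short sketch. It uses Python's re module purely as a stand-in, which is an assumption made for illustration: Python accepts the absolute form (?(1)...) but not the relative form (?(-1)...), so the absolute form is shown. The names PATTERN and balanced_or_bare are hypothetical.

```python
import re

# Absolute-numbered form of the conditional. Python's re module is only a
# stand-in here; it does not accept the relative form (?(-1)...).
PATTERN = re.compile(
    r"""
    ( \( )?        # optional opening parenthesis, captured as group 1
    [^()]+         # one or more characters that are not parentheses
    (?(1) \) )     # closing parenthesis required only if group 1 was set
    """,
    re.VERBOSE,
)


def balanced_or_bare(text: str) -> bool:
    """Return True if text is a bare run of non-parentheses or one such
    run enclosed in a single pair of parentheses."""
    return PATTERN.fullmatch(text) is not None
```

With this sketch, "abc" and "(abc)" are accepted, while "(abc" and "abc)" are rejected because the closing parenthesis is demanded exactly when the opening one was captured.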
8175        Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
8177        PCRE1, which had this facility before Perl, the syntax (?(name)...)  is
8179        the letter R followed by digits are ambiguous (see the  following  sec-
8192        "Recursion"  in  this sense refers to any subroutine-like call from one
8193        part of the pattern to another, whether or not it  is  actually  recur-
8231        be only one alternative in the subpattern. It is always skipped if con-
8233        can be used to define subroutines that can  be  referenced  from  else-
8234        where. (The use of subroutines is described below.) For example, a pat-
8238          (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
8246        to match the four dot-separated components of an IPv4 address,  insist-
8251        Programs  that link with a PCRE2 library can check the version by call-
8253        that  do  not have access to the underlying code cannot do this. A spe-
8269        assertion.  Consider  this  pattern,  again  containing non-significant
8272          (?(?=[^a-z]*[a-z])
8273          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
8276        optional  sequence of non-letters followed by a letter. In other words,
8280        strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
8285        for both positive and negative assertions, because matching always con-
8286        tinues after the assertion, whether it succeeds or fails. (Compare non-
8306        at the start of the pattern, as described in the section entitled "New-
8310        when PCRE2_EXTENDED is set, and the default newline convention (a  sin-
8324        unlimited  nested  parentheses.  Without the use of recursion, the best
8329        For some time, Perl has provided a facility that allows regular expres-
8331        Perl code in the expression at run time, and the code can refer to  the
8332        expression itself. A Perl pattern using code interpolation to solve the
8337        The (?p{...}) item interpolates Perl code at run time, and in this case
8340        Obviously,  PCRE2  cannot  support  the  interpolation  of  Perl  code.
8341        Instead, it supports special syntax for recursion of  the  entire  pat-
8342        tern, and also for individual subpattern recursion. After its introduc-
8344        introduced into Perl at release 5.10.
8349        subpattern. (If not, it is a non-recursive subroutine  call,  which  is
8359        substrings which can either be a  sequence  of  non-parentheses,  or  a
8360        recursive  match  of the pattern itself (that is, a correctly parenthe-
8362        of a possessive quantifier to avoid backtracking into sequences of non-
8375        of (?1) in the pattern above you can write (?-2) to refer to the second
8380        Be aware however, that if duplicate subpattern numbers are in use, rel-
8384          (?|(a)|(b)) (c) (?-2)
8387        group (c) is number 2. When the reference  (?-2)  is  encountered,  the
8396        because  the  reference  is  not inside the parentheses that are refer-
8397        enced. They are always non-recursive subroutine calls, as described  in
8400        An  alternative  approach  is to use named parentheses. The Perl syntax
8401        for this is (?&name); PCRE1's earlier syntax  (?P>name)  is  also  sup-
8409        The example pattern that we have been looking at contains nested unlim-
8411        strings of non-parentheses is important when applying  the  pattern  to
8423        callout function can be used (see below and the pcre2callout documenta-
8429        which is the last value taken on at the top level. If a capturing  sub-
8435        recursion.  Consider this pattern, which matches text in  angle  brack-
8437        brackets (that is, when recursing), whereas any characters are  permit-
8443        two different alternatives for the recursive and  non-recursive  cases.
8446    Differences in recursion processing between PCRE2 and Perl
8448        Some former differences between PCRE2 and Perl no longer exist.
8450        Before  release 10.30, recursion processing in PCRE2 differed from Perl
8453        never re-entered, even if it contained untried alternatives  and  there
8455        recursion before Perl did.)
8458        treated as atomic. That is, they can be re-entered to try unused alter-
8460        now  compatible  with the way Perl works. If you want a subroutine call
8473        match fails. If you want to match typical palindromic phrases, the pat-
8474        tern  has  to  ignore  all  non-word characters, which can be done like
8480        such  as "A man, a plan, a canal: Panama!". Note the use of the posses-
8481        sive quantifier *+ to avoid backtracking  into  sequences  of  non-word
8482        characters. Without this, PCRE2 takes a great deal longer (ten times or
8483        more) to match typical phrases, and Perl takes so long that  you  think
8486        Another  way  in which PCRE2 and Perl used to differ in their recursion
8487        processing is in the handling of captured  values.  Formerly  in  Perl,
8489        next section), it had no access to any values that were  captured  out-
8499        to fail in Perl, but in later versions (I tried 5.024) it now works.
8513          (...(relative)...)...(?-1)...
8534        Processing options such as case-independence are fixed when  a  subpat-
8538          (abc)(?i:(?-1))
8550        For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
8553        possibly recursively. Here are two of the examples used above,  rewrit-
8562          (abc)(?i:\g<-1>)
8564        Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
8571        Perl has a feature whereby using the sequence (?{...}) causes arbitrary
8572        Perl  code to be obeyed in the middle of matching a regular expression.
8573        This makes it possible, amongst other things, to extract different sub-
8574        strings that match the same pair of parentheses when there is a repeti-
8577        PCRE2 provides a similar feature, but of course it  cannot  obey  arbi-
8578        trary  Perl  code. The feature is called "callout". The caller of PCRE2
8582        passed, or if the callout entry point is set to NULL, callouts are dis-
8592        in a similar way to Perl.
8594        During matching, when PCRE2 reaches a callout point, the external func-
8601        time, and one side-effect is that sometimes callouts  are  skipped.  If
8617        They are all numbered 255. If there is a conditional group in the  pat-
8629        A delimited string may be used instead of a number as a  callout  argu-
8631        ending delimiter is the same as the start, except for {, where the end-
8644        Perl's terminology) that modify the behaviour  of  backtracking  during
8649        By  default,  for  compatibility  with  Perl, a name is any sequence of
8653        PCRE2_ALT_VERBNAMES  option,  but the result is no longer Perl-compati-
8659        and  sequences such as \x{100} that define character code points. Char-
8665        names is skipped, and #-comments are recognized, exactly as in the rest
8669        The  maximum  length of a name is 255 in the 8-bit library and 65535 in
8670        the 16-bit and 32-bit libraries. If the name is empty, that is, if  the
8672        the colon were not there. Any number of these verbs may occur in a pat-
8676        them can be used only when the pattern is to be matched using the  tra-
8683        subpatterns called as subroutines (whether or not recursively) is docu-
8693        course, be processed. You can suppress the start-of-match optimizations
8694        by setting the PCRE2_NO_START_OPTIMIZE option when  calling  pcre2_com-
8699        Experiments  with  Perl  suggest that it too has similar optimizations,
8711        then continues at the outer level. If (*ACCEPT) is triggered in a posi-
8715        If  (*ACCEPT)  is inside capturing parentheses, the data so far is cap-
8720        This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
8727        read. The Perl documentation notes that it is probably useful only when
8728        combined with (?{}) or (??{}). Those are, of course, Perl features that
8729        are  not  present  in PCRE2. The nearest equivalent is the callout fea-
8752        When a match succeeds, the name of the last-encountered (*MARK:NAME) on
8753        the matching path is passed back to the caller as described in the sec-
8754        tion entitled "Other information about the match" in the pcre2api docu-
8775        The (*MARK) name is tagged with "MK:" in this output, and in this exam-
8777        efficient  way of obtaining this information than putting each alterna-
8781        true,  the  name  is recorded and passed back if it is the last-encoun-
8803        The following verbs do nothing when they are encountered. Matching con-
8805        causing  a  backtrack  to the verb, a failure is forced. That is, back-
8809        group  has been matched, there is never any backtracking into it. Back-
8813        These  verbs  differ  in exactly what kind of failure occurs when back-
8815        when  the  verb is not in a subroutine or an assertion. Subsequent sec-
8821        matching failure that causes backtracking to reach it. Even if the pat-
8824        verb that is encountered, once it has been passed pcre2_match() is com-
8833        The  behaviour  of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM-
8834        MIT). It is like (*MARK:NAME) in that the name is remembered for  pass-
8845        anchor,  unless PCRE2's start-of-match optimizations are turned off, as
8861        (*COMMIT)  causes  the  match to fail without trying any other starting
8867        the subject if there is a later matching failure that causes backtrack-
8873        (*PRUNE)  is just an alternative to an atomic group or possessive quan-
8885        This  verb, when given without a name, is like (*PRUNE), except that if
8887        character, but to the position in the subject where (*SKIP) was encoun-
8896        skips on to start the next attempt at "c". Note that a possessive quan-
8907        found,  the  "bumpalong" advance is to the subject position that corre-
8913        atomic groups or assertions, because they are never re-entered by back-
8931        backtracks, and this causes a new matching attempt to start at the sec-
8942        This verb causes a skip to the next innermost  alternative  when  back-
8945        that it can be used for a pattern-based if-then-else block:
8951        skips  to  the second alternative and tries COND2, without backtracking
8952        into COND1. If that succeeds and BAR fails, COND3 is tried.  If  subse-
8953        quently  BAZ fails, there are no more alternatives, so there is a back-
8979        failure in C, matching moves to (*FAIL), which causes the whole subpat-
9010        that is backtracked onto first acts. For example,  consider  this  pat-
9018        is consistent, but is not always the same as Perl's. It means  that  if
9030        PCRE2 sometimes differs from Perl in its handling of backtracking verbs
9035        If  the  subject  is  "abac", Perl matches unless its optimizations are
9047        succeed  without any further processing; captured strings and a (*MARK)
9049        (*ACCEPT)  causes the assertion to fail without any further processing;
9068        in a standalone positive assertion. In a  conditional  positive  asser-
9070        or (*PRUNE) causes the condition to be false. However, for both  stand-
9072        (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9077        These  behaviours  occur whether or not the subpattern is called recur-
9081        match  to succeed without any further processing. Matching then contin-
9082        ues after the subroutine call. Perl documents  this  behaviour.  Perl's
9089        when triggered by being backtracked to in a subpattern called as a sub-
9115        Copyright (c) 1997-2018 University of Cambridge.
9116 ------------------------------------------------------------------------------
9124        PCRE2 - Perl-compatible regular expressions (revised API)
9128        Two  aspects  of performance are discussed below: memory usage and pro-
9139        subpattern has a quantifier with a minimum greater than 1 and/or a lim-
9153        is  not  usually a problem. However, if the numbers are large, and par-
9159        uses  over  50KiB  when compiled using the 8-bit library. When PCRE2 is
9161        limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
9162        libraries, and this is reached with the above pattern if the outer rep-
9168        of PCRE2's "subroutine" facility. Re-writing the above pattern as
9174        this kind of pattern is not always exactly equivalent, because any cap-
9177        process  patterns that PCRE2 cannot otherwise handle. The matching per-
9179        same.  (This applies from release 10.30 - things were different in ear-
9185        From release 10.30, the interpretive (non-JIT) version of pcre2_match()
9186        uses  very  little system stack at run time. In earlier releases recur-
9188        cause  problems, but this usage has been eliminated. Backtracking posi-
9194        used.  Rewriting patterns to be time-efficient, as described below, may
9203        has been re-factored to use heap memory  when  necessary  for  internal
9214        Certain items in regular expression patterns are processed  more  effi-
9216        [aeiou]  than  a  set  of   single-character   alternatives   such   as
9224        slow, because PCRE2 has to use a multi-stage table lookup  whenever  it
9234        pcre2_match(); the performance loss is less with a DFA  matching  func-
9237        When  a pattern begins with .* not in atomic parentheses, nor in paren-
9241        multiple top-level branches, they must all be anchorable. The optimiza-
9247        subject  string contains newlines, the pattern may match from the char-
9258        If you are using such a pattern with subject strings that do  not  con-
9261        explicit anchoring. That saves PCRE2 from having to scan along the sub-
9283        matching  procedure, PCRE2 checks that there is a "b" later in the sub-
9284        ject string, and if there is not, it fails the match immediately.  How-
9295        an  atomic group or a possessive quantifier. This can often reduce mem-
9306        matched character. For a long string, a lot of memory is required. Con-
9312        This runs much faster, because sequences of characters that do not con-
9313        tain "<" are "swallowed" in one item inside the parentheses, and a pos-
9315        non-"<"  characters.  This  version also uses a lot less memory because
9350        Copyright (c) 1997-2018 University of Cambridge.
9351 ------------------------------------------------------------------------------
9359        PCRE2 - Perl-compatible regular expressions (revised API)
9379        This  set of functions provides a POSIX-style API for the PCRE2 regular
9380        expression 8-bit library. See the pcre2api documentation for a descrip-
9381        tion  of PCRE2's native API, which contains much additional functional-
9382        ity. There are no POSIX-style wrappers for PCRE2's  16-bit  and  32-bit
9388        called libpcre2-posix.a, so can be accessed by adding -lpcre2-posix  to
9391        -lpcre2-8.
9402        PCRE2-specific features via the POSIX calling interface or to  add  BSD
9406        POSIX-like in style. The syntax and semantics of  the  regular  expres-
9407        sions  themselves  are  still  those of Perl, subject to the setting of
9408        various PCRE2 options, as described below. "POSIX-like in style"  means
9410        POSIX-compatible, and in multi-unit encoding  domains  it  is  probably
9416        two structure types, regex_t for  compiled  internal  forms,  and  reg-
9417        match_t  for  returning  captured substrings. It also defines some con-
9427        structure that is used as a base for storing information about the com-
9449        the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
9455        for compilation to the native function. This disables all meta  charac-
9464        for  matching, the nmatch and pmatch arguments are ignored, and no cap-
9474        now contain binary zeros, which are treated as data characters. Without
9496        all  data  strings used for matching it to be treated as UTF-8 strings.
9502        subject  string  is  the Perl way, not the POSIX way. Note that setting
9504        It  does not affect the way newlines are matched by the dot metacharac-
9507        The yield of regcomp() is zero on success, and non-zero otherwise.  The
9513        NOTE: If the yield of regcomp() is non-zero, you must  not  attempt  to
9520        This area is not simple, because POSIX and Perl take different views of
9524        Perl and PCRE2:
9534        This is the equivalent table for a POSIX-compatible pattern matcher:
9545        API.  By  default, PCRE2's behaviour is the same as Perl's, except that
9546        there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both  PCRE2
9547        and Perl, there is no way to stop newline from matching [^a].
9552        action. When using the POSIX API, passing REG_NEWLINE to  PCRE2's  reg-
9554        and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass  PCRE2_DOL-
9567        The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
9574        standard.  However, setting this option can give more POSIX-like behav-
9579        The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
9598        intended  to  be  portable to other systems. Note that a non-zero rm_so
9620        Unused entries in the array have both structure members set to -1.
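
The "-1 for unused entries" convention can be illustrated with a small sketch. It does not call the POSIX wrapper; it uses Python's re module, which happens to follow the same convention for groups that took no part in a match. That substitution is an assumption made for illustration, and the helper name group_offsets is hypothetical.

```python
import re


def group_offsets(pattern: str, subject: str):
    """Return (start, end) offsets for the whole match and for every
    capturing group, or None if the pattern does not match.

    Groups that did not take part in the match are reported as (-1, -1),
    mirroring the way unused regmatch_t entries are filled in.
    """
    match = re.search(pattern, subject)
    if match is None:
        return None
    return [match.span(i) for i in range(match.re.groups + 1)]
```

For the pattern (a)|(b) applied to "b", the first group is unused, so its entry holds -1 for both offsets, while the second group and the overall match report real offsets.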
9629        The regerror() function maps a non-zero errorcode from either regcomp()
9633        the first errbuf_size - 1 characters of the error message are used. The
9641        Compiling a regular expression causes memory to be allocated and  asso-
9643        memory, after which preg may no longer be used as  a  compiled  expres-
9657        Copyright (c) 1997-2017 University of Cambridge.
9658 ------------------------------------------------------------------------------
9666        PCRE2 - Perl-compatible regular expressions (revised API)
9674        can save this listing to re-create the contents of pcre2demo.c.
9679        used. If matching succeeds, the program outputs the portion of the sub-
9680        ject  that  matched,  together  with  the contents of any captured sub-
9683        If the -g option is given on the command line, the program then goes on
9685        subject string. The logic is a little bit tricky because of the  possi-
9689        The code in pcre2demo.c is an 8-bit program that uses the  PCRE2  8-bit
9690        library.  It  handles  strings  and characters that are stored in 8-bit
9693        treated as UTF-8 strings, where characters  may  occupy  multiple  code
9697        for your operating system, you should be able to compile the demonstra-
9700          cc -o pcre2demo pcre2demo.c -lpcre2-8
9703        to the command line. For example, on a Unix-like system that has  PCRE2
9707          cc -o pcre2demo -I/usr/local/include pcre2demo.c \
9708             -L/usr/local/lib -lpcre2-8
9714          ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
9718        expressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
9719        though not all three need be installed). The pcre2demo program is  pro-
9726          ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
9729        This is caused by the way shared library support works  on  those  sys-
9732          -R/usr/local/lib
9747        Copyright (c) 1997-2016 University of Cambridge.
9748 ------------------------------------------------------------------------------
9754        PCRE2 - Perl-compatible regular expressions (revised API)
9756 SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
9773        run. However, if you are using the just-in-time  optimization  feature,
9774        it is not possible to save and reload the JIT data, because it is posi-
9775        tion-dependent. The host on which the patterns  are  reloaded  must  be
9778        For  example, patterns compiled on a 32-bit system using PCRE2's 16-bit
9779        library cannot be reloaded on a 64-bit system, nor can they be reloaded
9780        using the 8-bit library.
9787        linked with a fixed version of PCRE2 must be prepared to recompile pat-
9797        checking, not complete validation of what is being re-loaded. Corrupted
9809        in the byte stream (its size is 1088 bytes). For more details of  char-
9810        acter  tables,  see the section on locale support in the pcre2api docu-
9816        the length of the vector. The third and fourth arguments point to vari-
9830        PCRE2_ERROR_BADMAGIC  means  either that a pattern's code has been cor-
9831        rupted, or that a slot in the vector does not point to a compiled  pat-
9855        between binary and non-binary data, be sure that the file is opened for
9860        freed in the usual way by calling pcre2_code_free(). When you have fin-
9861        ished with the byte stream, it too must be freed by calling pcre2_seri-
9863        returns immediately without doing anything.
9866 RE-USING PRECOMPILED PATTERNS
9868        In  order  to  re-use  a  set of saved patterns you must first make the
9869        serialized byte stream available in main memory (for example, by  read-
9873        data without actually decoding the patterns:
9884        If this argument is NULL, malloc() and free() are used. After deserial-
9913        and a reference count is used to arrange for its memory to be automati-
9920        If  a pattern was processed by pcre2_jit_compile() before being serial-
9936        Copyright (c) 1997-2018 University of Cambridge.
9937 ------------------------------------------------------------------------------
9945        PCRE2 - Perl-compatible regular expressions (revised API)
9949        The  full syntax and semantics of the regular expressions that are sup-
9951        document contains a quick-reference summary of the syntax.
9956          \x         where x is non-alphanumeric is a literal x
9965          \cx        "control-x", where x is any ASCII printing character
9980        Note that \0dd is always an octal code. The treatment of backslash fol-
9981        lowed by a non-zero digit is complicated; for details see  the  section
9982        "Non-printing  characters"  in  the  pcre2pattern  documentation, where
9989        read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
9991        matches a literal "x".  Likewise, if \u (in ALT_BSUX mode) is not  fol-
10006          \P{xx}     a character without the xx property
10013          \W         a "non-word" character
10017        middle of a UTF-8 or UTF-16 character. The application can lock out the
10021        By default, \d, \s, and \w match only ASCII characters, even  in  UTF-8
10022        mode or in the 16-bit and 32-bit libraries. However, if locale-specific
10024        points in the range 128-255. If the PCRE2_UCP option is set, the behav-
10049          Mn         Non-spacing mark
10081          Xsp        Perl space: property Z or tab, NL, VT, FF, CR
10082          Xuc        Universally-named character: one that can be
10084          Xwd        Perl word: property Xan or underscore
10086        Perl and POSIX space are now the same. Perl added VT to its space char-
10092        Adlam, Ahom, Anatolian_Hieroglyphs, Arabic,  Armenian,  Avestan,  Bali-
10094        Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Caucasian_Alba-
10100        Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,  Kaithi,  Kan-
10102        Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,  Lydian,  Maha-
10106        Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki,  Old_Hungar-
10107        ian,  Old_Italic,  Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
10110        Psalter_Pahlavi, Rejang, Runic, Samaritan,  Saurashtra,  Sharada,  Sha-
10113        Tai_Viet,  Takri,  Tamil,  Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
10121          [x-y]       range (can be used for hex characters)
10127          ascii       0-127
10197          (?<name>...)    named capturing group (Perl)
10198          (?'name'...)    named capturing group (Perl)
10200          (?:...)         non-capturing group
10201          (?|...)         non-capturing group; reset group numbers for
10207          (?>...)         atomic, non-capturing group
10227          (?-...)         unset option(s)
10231        a mixture of setting and unsetting such as (?i-x) is allowed, but there
10233        for example (?^in). An option setting may appear at the start of a non-
10245          (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
10248          (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
10289        Each top-level branch of a lookbehind must be of a fixed length.
10298          \g-n            relative reference by number
10300          \g{-n}          relative reference by number
10301          \k<name>        reference by name (Perl)
10302          \k'name'        reference by name (Perl)
10303          \g{name}        reference by name (Perl)
10313          (?-n)           call subpattern by relative number
10314          (?&name)        call subpattern by name (Perl)
10322          \g<-n>          call subpattern by relative number (PCRE2 extension)
10323          \g'-n'          call subpattern by relative number (PCRE2 extension)
10328          (?(condition)yes-pattern)
10329          (?(condition)yes-pattern|no-pattern)
10333          (?(-n)              relative reference condition
10334          (?(<name>)          named reference condition (Perl)
10335          (?('name')          named reference condition (Perl)
10361        The following act only when a subsequent match failure causes  a  back-
10363        what happens afterwards. Those that advance the start-of-match point do
10405        Copyright (c) 1997-2018 University of Cambridge.
10406 ------------------------------------------------------------------------------
10414        PCRE2 - Perl-compatible regular expressions (revised API)
10420        in  UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
10425        (*UTF). When either of these is the case, both the pattern and any sub-
10427        instead of strings of individual one-code-unit  characters.  There  are
10428        also  some  other  changes  to the way characters are handled, as docu-
10431        If you do not need Unicode support you can build PCRE2 without  it,  in
10443        names  for  properties are supported. For example, \p{L} matches a let-
10444        ter. Its Perl synonym, \p{Letter}, is not supported.   Furthermore,  in
10445        Perl,  many properties may optionally be prefixed by "Is", for compati-
10446        bility with Perl 5.6. PCRE2 does not support this.
10458        allowed in non-UTF modes.
10468        multi-unit  characters  (see  the description of \C in the pcre2pattern
10472        pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
10474        modes  provokes a match-time error. Also, the JIT optimization does not
10475        support \C in these modes. If JIT optimization is requested for a UTF-8
10476        or  UTF-16  pattern  that contains \C, it will not succeed, and so when
10483        set as in non-UTF mode, all  with  code  points  less  than  256.  This
10489        Alternatively, if you set the PCRE2_UCP option, the way that the  char-
10495        all low-valued characters, unless the PCRE2_UCP option is set.
10498        escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
10502 CASE-EQUIVALENCE IN UTF MODES
10504        Case-insensitive matching in a UTF mode makes use of Unicode properties
10506        at most two case-equivalent values. For these, a direct table lookup is
10508        than two code points that are case-equivalent, and these are treated as
10521        UTF-16 and UTF-32 strings can indicate their endianness by special code
10522        known  as  a  byte-order  mark (BOM). The PCRE2 functions do not handle
10526        case  of  pcre2_match()  and  pcre2_dfa_match()  calls  with a non-zero
10530        end  of  the subject. If there are no lookbehind assertions in the pat-
10534        the  starting offset. Note that the sequences \b and \B are one-charac-
10539        the surrogate area. The so-called "non-character" code points  are  not
10544        UTF-16,  where they are used in pairs to encode code points with values
10545        greater than 0xFFFF. The code points that are encoded by  UTF-16  pairs
10546        are  available  independently  in  the  UTF-8 and UTF-32 encodings. (In
10547        other words, the whole surrogate thing is  a  fudge  for  UTF-16  which
10548        unfortunately messes up UTF-8 and UTF-32.)
10551        and therefore want to skip these checks in  order  to  improve  perfor-
10553        scanned repeatedly.  If you set the PCRE2_NO_UTF_CHECK option  at  com-
10569        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
10570        sible only in UTF-8 and UTF-32 modes, because these values are not rep-
10571        resentable in UTF-16.
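
The surrogate arithmetic described above can be sketched in C. This is an illustration only, not part of the PCRE2 API: the function names is_surrogate() and to_surrogate_pair() are hypothetical, and the sketch assumes the standard UTF-16 encoding rule for code points greater than 0xffff.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true if cp lies in the surrogate area
   0xd800..0xdfff, which is not valid as a standalone code point. */
static int is_surrogate(uint32_t cp)
{
    return cp >= 0xd800 && cp <= 0xdfff;
}

/* Hypothetical helper: split a code point greater than 0xffff into the
   pair of UTF-16 code units (high surrogate, low surrogate). */
static void to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp > 0xffff && cp <= 0x10ffff);
    uint32_t v = cp - 0x10000;                 /* 20-bit offset */
    *high = (uint16_t)(0xd800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xdc00 + (v & 0x3ff));  /* bottom 10 bits */
}
```

Because every value in 0xd800..0xdfff is consumed by this pairing scheme, a UTF-16 string has no way to represent a lone surrogate as a character, which is consistent with the restriction noted above.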
10573    Errors in UTF-8 strings
10575        The following negative error codes are given for invalid UTF-8 strings:
10583        The string ends with a truncated UTF-8 character;  the  code  specifies
10584        how  many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
10585        characters to be no longer than 4 bytes, the  encoding  scheme  (origi-
10607        A  4-byte character has a value greater than 0x10ffff; these code points
10612        A 3-byte character has a value in the  range  0xd800  to  0xdfff;  the
10613        code points in this range are reserved by RFC 3629 for use with UTF-16,
10614        and so are excluded from UTF-8.
10622        A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it  codes
10624        For example, the two bytes 0xc0, 0xae give the value 0x2e,  whose  cor-
10630        binary value 0b10 (that is, the most significant bit is 1 and the  sec-
10631        ond  is  0). Such a byte can only validly occur as the second or subse-
10632        quent byte of a multi-byte character.
10637        can never occur in a valid UTF-8 string.
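
Two of the checks described above, the "overlong" test and the 0b10 continuation-byte test, can be sketched in C as follows. This is a minimal illustration under stated assumptions, not PCRE2's own validation code: the helper names are hypothetical, and only the 2-byte case is handled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true if the two most significant bits are 0b10,
   that is, the byte may only appear as a second or subsequent byte. */
static int is_continuation_byte(uint8_t b)
{
    return (b & 0xc0) == 0x80;
}

/* Hypothetical helper: decode a 2-byte sequence (110xxxxx 10xxxxxx)
   into its code point without checking for overlong encodings. */
static uint32_t decode_2byte(uint8_t b0, uint8_t b1)
{
    assert((b0 & 0xe0) == 0xc0);        /* lead byte of a 2-byte form */
    assert(is_continuation_byte(b1));
    return ((uint32_t)(b0 & 0x1f) << 6) | (uint32_t)(b1 & 0x3f);
}

/* Hypothetical helper: a 2-byte sequence is overlong if its value
   would fit in a single byte (less than 0x80). */
static int is_overlong_2byte(uint8_t b0, uint8_t b1)
{
    return decode_2byte(b0, b1) < 0x80;
}
```

With these helpers, the bytes 0xc0, 0xae decode to 0x2e and are reported as overlong, matching the example in the text, whereas 0xc3, 0xa9 decode to 0xe9 and are accepted.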
10639    Errors in UTF-16 strings
10641        The  following  negative  error  codes  are  given  for  invalid UTF-16
10649    Errors in UTF-32 strings
10651        The following  negative  error  codes  are  given  for  invalid  UTF-32
10668        Copyright (c) 1997-2018 University of Cambridge.
10669 ------------------------------------------------------------------------------