pcre2.txt - OpenGrok cross reference for /external/pcre/dist2/doc/pcre2.txt

1 -----------------------------------------------------------------------------
8 -----------------------------------------------------------------------------
16        PCRE2 - Perl-compatible regular expressions (revised API)
25        API is more extensible, and it was simplified by abolishing  the  sepa-
30        As  well  as Perl-style regular expression patterns, some features that
37        The source code for PCRE2 can be compiled to support 8-bit, 16-bit,  or
38        32-bit  code units, which means that up to three separate libraries may
39        be installed.  The original work to extend PCRE to  16-bit  and  32-bit
40        code  units  was  done  by Zoltan Herczeg and Christian Persch, respec-
42        character  per  code  unit, or as UTF-encoded Unicode, with support for
45        code units must be enabled explicitly at run time. The version of  Uni-
48          pcre2test -C
51        ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
57        In addition to the Perl-compatible matching function, PCRE2 contains an
58        alternative  function that matches the same compiled patterns in a dif-
70        client  to  discover  which  features are available. The features them-
71        selves are described in the pcre2build page. Documentation about build-
73        NON-AUTOTOOLS_BUILD files in the source distribution.
86        If you are using PCRE2 in a non-UTF application that permits  users  to
89        For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
90        mode, which interprets patterns and subjects as strings of  UTF-8  code
91        units instead of individual 8-bit characters. This causes both the pat-
92        tern and any data against which it is matched to be checked  for  UTF-8
93        validity.  If the data string is very long, such a check might use suf-
94        ficiently many resources as to cause your application to  lose  perfor-
97        One  way  of guarding against this possibility is to use the pcre2_pat-
100        calling pcre2_compile(). This causes a compile time error if  the  pat-
101        tern contains a UTF-setting sequence.
104        be enabled from within the pattern, by specifying "(*UCP)".  This  fea-
112        The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
114        middle of  a  multi-code-unit  character.  The  PCRE2_NEVER_BACKSLASH_C
116        a compile-time error if it is encountered. It is also possible to build
121        Nested  unlimited repeats in a pattern are a common example. PCRE2 pro-
124        pcre2_set_depth_limit() that can be used to restrict the amount of mem-
130        The  user  documentation for PCRE2 comprises a number of different sec-
136        (which  is a program listing), and the short pages for individual func-
137        tions, are concatenated in pcre2.txt, for ease of searching.  The  sec-
141          pcre2-config       show PCRE2 installation configuration information
148          pcre2grep          description of the pcre2grep command (8-bit only)
149          pcre2jit           discussion of just-in-time optimization support
156          pcre2posix         the POSIX-compatible C API for the 8-bit library
170        University Computing Service
181        Copyright (c) 1997-2018 University of Cambridge.
182 ------------------------------------------------------------------------------
190        PCRE2 - Perl-compatible regular expressions (revised API)
195        contains a description of all its native functions. See the pcre2 docu-
448        These functions provide a way of  converting  non-PCRE2  patterns  into
455 PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
457        There are three PCRE2 libraries, supporting 8-bit, 16-bit,  and  32-bit
460        for all three libraries. One, two, or all three can be installed simul-
461        taneously. On Unix-like systems the libraries  are  called  libpcre2-8,
462        libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
485        macros are defined whose names are the generic forms such as pcre2_com-
487        PCRE2_CODE_UNIT_WIDTH  to generate the appropriate width-specific func-
505        single  library.   For example, if you want to run a match using a pat-
517        There are also some wrapper functions for the 8-bit library that corre-
529        program against a non-dll PCRE2 library, you must  define  PCRE2_STATIC
533        and matching regular expressions in a Perl-compatible manner. A  sample
540        passed as bits in an options argument. There are also some more compli-
543        blocks, described below). Simple applications do not need to  make  use
546        Just-in-time  (JIT)  compiler  support  is an optional feature of PCRE2
561        less sanity checking. The JIT-specific functions are discussed  in  the
564        A  second  matching function, pcre2_dfa_match(), which is not Perl-com-
569        return captured substrings. A description of  the  two  matching  algo-
588        pcre2_substring_free()  and  pcre2_substring_list_free()  are also pro-
590        functions  is called with a NULL argument, the function returns immedi-
600        Finally,  there  are functions for finding out information about a com-
615        ~(PCRE2_SIZE)0)  is reserved as a special indicator for zero-terminated
623        strings: a single CR (carriage return) character, a  single  LF  (line-
624        feed) character, the two-character sequence CRLF, any of the three pre-
642        dollar metacharacters, the handling of #-comments in /x mode, and, when
643        CRLF is a recognized line ending sequence, the match position  advance-
644        ment for a non-anchored pattern. There is more detail about this in the
654        In a multithreaded application it is important to keep  thread-specific
656        library code itself is thread-safe: it contains  no  static  or  global
657        variables.  The  API  is  designed to be fairly simple for non-threaded
658        applications while at the same time ensuring that multithreaded  appli-
661        There are several different blocks of data that are used to pass infor-
669        is  thread-safe, that is, the same compiled pattern can be used by more
672        use them. However, if the just-in-time (JIT)  optimization  feature  is
682          Get a read-only (shared) lock (mutex) for pointer
694        If JIT is being used, but the JIT compilation is not being done immedi-
700        obtain  a private copy of the compiled code before calling the JIT com-
713        In a multithreaded application, if the parameters in a context are val-
716        it must make its own thread-specific copy.
721        of a match. This includes details of what was matched, as well as addi-
730        memory management or non-standard character tables.  To  keep  function
739        relevant  for  several  PCRE2 operations, a compile-time context, and a
740        match-time context.
764        function  may be NULL, in which case the system memory management func-
790        A compile context is required if you want to provide an external  func-
792        values of any of the following compile-time parameters:
801        A compile context is also required if you are using custom memory  man-
802        agement.   If  none of these apply, just pass NULL as the context argu-
805        A compile context is created, copied, and freed by the following  func-
833        only argument is a general context. This function builds a set of char-
839        As  PCRE2  has developed, almost all the 32 option bits that are avail-
842        bits which are used for some newer, assumed rarer, options. This  func-
854        largest  number  that  a  PCRE2_SIZE variable can hold, which is effec-
860        This specifies which characters or character sequences are to be recog-
863        two-character  sequence  CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
871        PCRE2_EXTENDED_MORE option, the newline convention affects the recogni-
883        limit applies to parentheses of all kinds, not just capturing parenthe-
899        nesting,  and  the second is user data that is set up by the last argu-
901        should return zero if all is well, or non-zero to force an error.
917        A match context is created, copied, and freed by  the  following  func-
937        during a matching operation. Details are given in the pcre2callout doc-
957        option when calling pcre2_compile() so that when JIT is in use, differ-
958        ent code can be compiled. If a match  is  started  with  a  non-default
959        match  limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is gener-
967        the first line and also within the offset limit. In other words, which-
976        also applies to pcre2_dfa_match(), which may use the heap when process-
978        atomic groups. This limit does not apply to matching with the JIT opti-
994        The  pcre2_match() function starts out using a 20KiB vector on the sys-
995        tem stack for recording backtracking points. The more nested backtrack-
998        too small. If the heap limit is set to a value less than 21 (in partic-
1000        that  do not have a lot of nested backtracking can be successfully pro-
1025        When  pcre2_match() is called with a pattern that was successfully pro-
1088 CHECKING BUILD-TIME OPTIONS
1107        non-negative  on success, or the negative error code PCRE2_ERROR_BADOP-
1108        TION if the value in the first argument is not recognized. The  follow-
1122        unit widths were selected when PCRE2 was  built.  The  1-bit  indicates
1123        8-bit  support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
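The bit layout described above can be decoded with a few mask tests. The following is a minimal sketch, not part of PCRE2: the helper name width_supported() is hypothetical, and it assumes the mask has already been obtained (for example from pcre2_config() with the PCRE2_CONFIG_COMPILED_WIDTHS request).

```c
#include <stdint.h>

/* Hypothetical helper, not a PCRE2 function. Only the bit meanings
   (1 = 8-bit, 2 = 16-bit, 4 = 32-bit) are taken from the text above. */
int width_supported(uint32_t mask, int width)
{
    switch (width) {
    case 8:  return (mask & 1u) != 0;   /* the 1-bit: 8-bit library  */
    case 16: return (mask & 2u) != 0;   /* the 2-bit: 16-bit library */
    case 32: return (mask & 4u) != 0;   /* the 4-bit: 32-bit library */
    default: return 0;                  /* no other widths exist     */
    }
}
```

With a mask of 5, for instance, this reports the 8-bit and 32-bit libraries as built and the 16-bit library as absent.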
1130        recursions, lookarounds, and atomic groups in  pcre2_dfa_match().  Fur-
1143        just-in-time compiling is available; otherwise it is set to zero.
1162        the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
1163        when  the  32-bit  library  is compiled, internal linkages always use 4
1166        The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1202        The  output is a uint32_t integer that gives the maximum depth of nest-
1212        This parameter is obsolete and should not be used in new code. The out-
1236        PCRE2 version string, zero-terminated. The number of code units used is
1237        returned. This is the length of the string plus one unit for the termi-
1255        length  (in  code units). If the pattern is zero-terminated, the length
1260        If the compile context argument ccontext is NULL, memory for  the  com-
1270        below), the JIT information cannot be copied (because it  is  position-
1271        dependent).  The new copy can initially be used only for non-JIT match-
1276        a multithreaded application to acquire a private copy  of  shared  com-
1285        pointing  to the new tables. The memory for the new tables is automati-
1293        After  running a match, you must not free a compiled pattern (or a sub-
1304        For those options that can be different in different parts of the  pat-
1310        Other, less frequently required compile-time parameters  (for  example,
1314        If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1318        error has occurred. The values are not defined when compilation is suc-
1319        cessful and pcre2_compile() returns a non-NULL value.
1322        return if it finds an error in the pattern. There are also  some  nega-
1328        "Obtaining a textual error message" below) should be  self-explanatory.
1332        The value returned in erroroffset is an indication of where in the pat-
1336        the failing assertion. For an invalid UTF-8 or UTF-16 string, the  off-
1342        mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1345        This  code  fragment shows a typical straightforward call to pcre2_com-
1353            PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
1383        (1) \U matches an upper case "U" character; by default \U causes a com-
1403        Perl. If you want a multiline circumflex also to match after  a  termi-
1417        whitespace in verb names is  skipped  and  #-comments  are  recognized,
1423        items, all with number 255, before each pattern  item,  except  immedi-
1424        ately  before  or after an explicit callout in the pattern. For discus-
1437        points (available only in 16-bit or 32-bit mode)  are  treated  as  not
1454        this option, a dot does not match when the current position in the sub-
1456        and it can be changed within a pattern by a (?s) option setting. A neg-
1458        escape  sequence always matches a non-newline character, independent of
1475        patterns, a new match is then tried at the next  starting  point.  How-
1490        matches, which are necessarily substrings of the first one, must  obvi-
1496        totally ignored except when escaped or inside a character  class.  How-
1498        introduce various parenthesized subpatterns, nor within numerical quan-
1500        item and a following quantifier and between a quantifier and a  follow-
1505        When  PCRE2  is compiled without Unicode support, PCRE2_EXTENDED recog-
1507        256 that are flagged as white space in its low-character table. The ta-
1514        When PCRE2 is compiled with Unicode support, in addition to these char-
1515        acters,  five  more Unicode "Pattern White Space" characters are recog-
1516        nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1517        right  mark), U+200F (right-to-left mark), U+2028 (line separator), and
1523        As  well as ignoring most white space, PCRE2_EXTENDED also causes char-
1530        Which characters are interpreted as newlines can be specified by a set-
1532        special sequence at the start of the pattern, as described in the  sec-
1543        option,  and  it can be changed within a pattern by a (?xx) option set-
1550        start of matching, though the matched text may continue over  the  new-
1551        line. If startoffset is non-zero, the limiting newline is not necessar-
1553        string is "abc\nxyz" (where \n represents a single-character newline) a
1562        If this option is set, all meta-characters in the pattern are disabled,
1565        you  are  doing  a  lot of literal matching and are worried about effi-
1590        string,  or  before  a  terminating  newline  (except  when  PCRE2_DOL-
1608        This option locks out the use of \C in the pattern that is  being  com-
1609        piled.   This  escape  can  cause  unpredictable  behaviour in UTF-8 or
1610        UTF-16 modes, because it may leave the current matching  point  in  the
1611        middle  of  a  multi-code-unit  character. This option may be useful in
1613        there is also a build-time option that permanently locks out the use of
1628        This option locks out interpretation of the pattern as  UTF-8,  UTF-16,
1629        or UTF-32, depending on which library is in use. In particular, it pre-
1632        applications that process patterns from external sources. The  combina-
1637        If this option is set, it disables the use of numbered capturing paren-
1648        If this option is set, it disables "auto-possessification", which is an
1651        are  in  use,  auto-possessification means that some callouts are never
1659        .*  is  the  first significant item in a top-level branch of a pattern,
1662        atomic group or a capturing group that is the subject of  a  backrefer-
1663        ence,  or  if  the pattern contains (*PRUNE) or (*SKIP). When the opti-
1679        the  matching code searches the subject for that value, and fails imme-
1680        diately if it cannot find it, without actually running the main  match-
1684        items are in use, these "start-up" optimizations can cause them  to  be
1685        skipped  if  the pattern is never actually used. The start-up optimiza-
1686        tions are in effect a pre-scan of the subject that takes  place  before
1689        The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1702        start-up optimization scans along the subject, finds "A" and  runs  the
1703        first  match attempt from there. The (*COMMIT) item means that the pat-
1711        There  are  also  other  start-up optimizations. For example, a minimum
1728        UTF-8 strings, UTF-16 strings, and UTF-32 strings in  the  pcre2unicode
1742        Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1743        able  the error that is given if an escape sequence for an invalid Uni-
1744        code code point is encountered in the pattern. In particular,  the  so-
1748        section entitled "Extra compile options" below.  However, this is  pos-
1749        sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1750        resentable in UTF-16.
1760        option is available only if PCRE2 has been compiled with  Unicode  sup-
1773        is  going  to be used to set a non-default offset limit in a match con-
1783        instead  of  single-code-unit  strings.  It  is available when PCRE2 is
1792        Unlike the main compile-time options, the extra options are  not  saved
1799        This  option  applies when compiling a pattern in UTF-8 or UTF-32 mode.
1800        It is forbidden in UTF-16 mode, and ignored in non-UTF  modes.  Unicode
1802        in UTF-16 to encode code points with values in  the  range  0x10000  to
1803        0x10ffff.  The  surrogates  cannot  therefore be represented in UTF-16.
1804        They can be represented in UTF-8 and UTF-32, but are defined as invalid
1805        code  points,  and  cause  errors  if  encountered in a UTF-8 or UTF-32
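To make the role of surrogates concrete, here is a small sketch of the standard UTF-16 arithmetic. It is independent of PCRE2, and the function names are hypothetical. It assumes the usual surrogate range 0xd800 to 0xdfff, with high surrogates starting at 0xd800 and low surrogates at 0xdc00.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if cp lies in the surrogate range. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xd800 && cp <= 0xdfff;
}

/* Hypothetical helper: encodes a code point in the range 0x10000 to
   0x10ffff as a UTF-16 surrogate pair. Returns 1 on success, or 0 if
   cp is outside that range (in which case no pair is needed or valid). */
int encode_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10ffff) return 0;
    cp -= 0x10000;                            /* 20 significant bits remain */
    *high = (uint16_t)(0xd800 + (cp >> 10));  /* top 10 bits    */
    *low  = (uint16_t)(0xdc00 + (cp & 0x3ff)); /* bottom 10 bits */
    return 1;
}
```

Because every value in 0xd800 to 0xdfff is consumed by this pairing scheme, a lone surrogate has no UTF-16 encoding of its own, which is why the option discussed here cannot apply in UTF-16 mode.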
1810        when using PCRE2 to check for unwanted  characters  in  UTF-8  strings,
1813        because  it applies only to the testing of input strings for UTF valid-
1816        If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set,  surro-
1817        gate  code  point values in UTF-8 and UTF-32 patterns no longer provoke
1825        escape  such  as \j or a malformed one such as \x{2z} causes a compile-
1826        time error when detected by pcre2_compile(). Perl is somewhat inconsis-
1828        "j", and non-hexadecimal digits in \x{} are just ignored, though  warn-
1829        ings  are given in both cases if Perl's warning switch is enabled. How-
1835        treated  as  single-character escapes. For example, \j is a literal "j"
1837        option  means  that  typos in patterns may go undetected and have unex-
1842        This option is provided for use by  the  -x  option  of  pcre2grep.  It
1844        automatically inserting the code for "^(?:" at the start  of  the  com-
1851        This  option  is  provided  for  use  by the -w option of pcre2grep. It
1859 JUST-IN-TIME (JIT) COMPILATION
1879        just-in-time  compiler  is available, further processes a compiled pat-
1885        for  patterns  to  be analyzed, and for one-off matches and simple pat-
1896        points  are  less than 256. By default, higher-valued code points never
1897        match escapes such as \w or \d.  However, if PCRE2 is built  with  Uni-
1898        code support, all characters can be tested with \p and \P, or, alterna-
1901        the built-in tables.
1911        default "C" locale of the local system, which may cause them to be dif-
1914        The  internal tables can be overridden by tables supplied by the appli-
1916        from  the  default.  As more and more applications change to using Uni-
1933        The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
1940        pcre2_match()  and pcre2_dfa_match(). Thus, for any single pattern, com-
1951        The  first  argument  for pcre2_pattern_info() is a pointer to the com-
1957        the function is zero for success, or one of the following negative num-
1966        a simple check against passing an arbitrary memory pointer. Here is a
1967        typical call of pcre2_pattern_info(), to obtain the length of the  com-
1986        options that were passed to pcre2_compile(), whereas  PCRE2_INFO_ALLOP-
1987        TIONS  returns  the compile options as modified by any top-level (*XXX)
1990        compile context by calling the pcre2_set_compile_extra_options()  func-
1996        change within a pattern do not affect the result  of  PCRE2_INFO_ALLOP-
2001        PCRE2 if the first significant item in every top-level branch is one of
2007          .*    sometimes - see below
2019        For  patterns  that are auto-anchored, the PCRE2_ANCHORED bit is set in
2028        characters of the given group, but in addition, the check that  a  cap-
2041        Return  the highest capturing subpattern number in the pattern. In pat-
2051        PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
2057        In  the absence of a single first code unit for a non-anchored pattern,
2058        pcre2_compile() may construct a 256-bit table that defines a fixed  set
2062        means "any code unit of value 255 or above". If such a table  was  con-
2069        a  non-anchored pattern. The third argument should point to a uint32_t
2081        The  third  argument should point to a uint32_t variable. In the 8-bit
2082        library, the value is always less than 256. In the 16-bit  library  the
2083        value  can  be  up  to 0xffff. In the 32-bit library in UTF-32 mode the
2084        value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2112        (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2120        Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
2122        (?J) and (?-J) set and unset the local PCRE2_DUPNAMES  option,  respec-
2127        If  the  compiled  pattern was successfully processed by pcre2_jit_com-
2147        PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2154        contains recursive subroutine calls it is not always possible to deter-
2155        mine  whether  or  not it can match an empty string. PCRE2 takes a cau-
2164        PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
2170        Return the number of characters (not code units) in the longest lookbe-
2172        uint32_t integer. This information is useful when  doing  multi-segment
2173        matching  using  the  partial matching facilities. Note that the simple
2174        assertions \b and \B require a one-character lookbehind. \A also regis-
2175        ters  a  one-character  lookbehind, though it does not actually inspect
2177        from  the old segment is retained when a new segment is processed. Oth-
2185        number  of characters, which in UTF mode may be different from the num-
2195        PCRE2 supports the use of named as well as numbered capturing parenthe-
2196        ses. The names are just an additional way of identifying the  parenthe-
2198        pcre2_substring_get_byname() are provided for extracting captured  sub-
2202        do the conversion, you need to use the  name-to-number  map,  which  is
2205        The  map  consists  of a number of fixed-size entries. PCRE2_INFO_NAME-
2211        This  is  a  PCRE2_SPTR  pointer to a block of code units. In the 8-bit
2212        library, the first two bytes of each entry are the number of  the  cap-
2213        turing parenthesis, most significant byte first. In the 16-bit library,
2214        the pointer points to 16-bit code units, the first  of  which  contains
2215        the  parenthesis  number.  In the 32-bit library, the pointer points to
2216        32-bit code units, the first of which contains the parenthesis  number.
2231        As  a  simple  example of the name/number table, consider the following
2232        pattern after compilation by the 8-bit library  (assume  PCRE2_EXTENDED
2233        is set, so white space - including newlines - is ignored):
2235          (?<date> (?<year>(\d\d)?\d\d) -
2236          (?<month>\d\d) - (?<day>\d\d) )
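For the pattern above, a name lookup in the 8-bit table layout might be sketched as follows. This is an illustration only: the table contents, the entry size of 8, and the function name are assumptions made for the example (a real program would obtain the table, the entry count, and the entry size from pcre2_pattern_info()). Padding bytes that are undefined in a real table are shown here as zero. The group numbers follow from counting opening parentheses left to right, so the unnamed (\d\d) group is number 3.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative table: four entries of eight bytes each. Each entry is a
   two-byte group number (most significant byte first) followed by the
   zero-terminated name. Trailing padding is zero here by assumption. */
const uint8_t example_table[] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00,
};

/* Hypothetical helper: returns the group number for name, or -1 if the
   name is not present in the table. */
int lookup_group_number(const uint8_t *table, size_t count,
                        size_t entry_size, const char *name)
{
    for (size_t i = 0; i < count; i++) {
        const uint8_t *entry = table + i * entry_size;
        /* The name starts after the two-byte group number. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

A call such as lookup_group_number(example_table, 4, 8, "month") walks the entries in order and combines the two leading bytes of the matching entry into the group number.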
2240        with non-printing bytes shown in hexadecimal, and undefined bytes shown
2249        name-to-number  map,  remember that the length of the entries is likely
2263        This identifies the character sequence that will be recognized as mean-
2272        pcre2_compile()  is  getting memory in which to place the compiled pat-
2275        over-estimate. Processing a pattern with  the  JIT  compiler  does  not
2291        which they appear. Its first argument is a pointer to a callout enumer-
2293        passed  to  pcre2_callout_enumerate(). The contents of the callout enu-
2303        PCRE2, with the same code unit width, and must also have the same endi-
2308        the  serialized form. They are described in the pcre2serialize documen-
2309        tation. Note that PCRE2 serialization does not  convert  compiled  pat-
2331        you must create a match data block by calling one of the creation func-
2338        pcre2_match_data_create(), so it is always possible to return the over-
2341        The second argument of pcre2_match_data_create() is a pointer to a gen-
2348        right size to hold all the substrings a pattern might capture. The sec-
2387        order  to  find multiple matches in the subject string or to match dif-
2391        operates  in  a  Perl-like  manner. For specialist use there is also an
2395        Here is an example of a simple call to pcre2_match():
2407        If  the  subject  string is zero-terminated, the length can be given as
2409        common matching parameters are to be changed. For details, see the sec-
2417        bytes  for the 8-bit library, 16-bit code units for the 16-bit library,
2418        and 32-bit code units for the 32-bit library, whether or not  UTF  pro-
2424        by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2425        set  must  point to the start of a character, or to the end of the sub-
2426        ject (in UTF-32 mode, one code unit equals one character, so  all  off-
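The requirement that a UTF-8 starting offset falls on a character boundary can be checked before a match is attempted. The sketch below is not a PCRE2 function and its name is hypothetical. It relies only on the UTF-8 property that continuation bytes have the bit pattern 10xxxxxx, and it assumes the subject is already valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if offset is the start of a character
   in the UTF-8 subject, or the end of the subject; 0 otherwise. */
int is_utf8_char_start(const uint8_t *subject, size_t length, size_t offset)
{
    if (offset > length) return 0;    /* beyond the subject */
    if (offset == length) return 1;   /* end of subject is acceptable */
    /* Continuation bytes are 10xxxxxx; anything else starts a character. */
    return (subject[offset] & 0xc0) != 0x80;
}
```

In the four-byte subject 0x61 0xc3 0xa9 0x7a ("a", a two-byte character, "z"), offsets 0, 1, 3, and 4 are acceptable, while offset 2 points into the middle of the two-byte character.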
2430        A non-zero starting offset is useful when searching for  another  match
2445        string again, but with startoffset set to 4, it finds the second occur-
2457        so,  and the current character is CR followed by LF, advance the start-
2460        If a non-zero starting offset is passed when the pattern is anchored, a
2461        single attempt to match at the given offset is made. This can only suc-
2463        the  subject.  In other words, the anchoring must be the result of set-
2472        PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and  PCRE2_PAR-
2475        Setting  PCRE2_ANCHORED  or PCRE2_ENDANCHORED at match time is not sup-
2476        ported by the just-in-time (JIT) compiler. If it is set,  JIT  matching
2492        matches must be right at the end of the subject string. Note that  set-
2507        in  multiline mode) a newline immediately before it. Setting this with-
2509        match. This option affects only the behaviour of the dollar metacharac-
2545        called.   If  a non-zero starting offset is given, the check is applied
2546        only to that part of the subject that could be inspected during  match-
2553        sequences \b and \B are one-character lookbehinds.
2559        validity  of  UTF-8  strings, UTF-16 strings, and UTF-32 strings in the
2582        the  caller  is prepared to handle a partial match, but only if no com-
2588        other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
2591        There is a more detailed discussion of partial and multi-segment match-
2597        When  PCRE2 is built, a default newline convention is set; this is usu-
2602        pcre2pattern page. During matching, the newline choice affects the  be-
2618        However, the pattern [\r\n]A does match that string,  because  it  con-
2619        tains an explicit CR or LF reference, and so advances only by one char-
2625        not  count, nor does \s, even though it includes CR and LF in the char-
2643        phrase "capturing subpattern" or "capturing group" is used for a  frag-
2652        Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2656        pcre2_get_ovector_count() returns the number of pairs of values it con-
2659        Within the ovector, the first in each pair of values is set to the off-
2661        offset of the first code unit after the end of a substring. These  val-
2663        are byte offsets in the 8-bit library, 16-bit  offsets  in  the  16-bit
2664        library, and 32-bit offsets in the 32-bit library.
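The pairing of offsets can be illustrated with a helper that copies one captured substring out of the subject. This is a sketch for the 8-bit case, not the library's own extraction code: the function name and EXAMPLE_UNSET are hypothetical, with EXAMPLE_UNSET standing in for PCRE2_UNSET on the assumption that it is the all-ones size value.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET (assumed to be all ones). */
#define EXAMPLE_UNSET (~(size_t)0)

/* Hypothetical helper: copies the substring for pair n into buf as a
   zero-terminated string. Returns its length, or -1 if the group is
   unset, the pair is inconsistent, or buf is too small. */
long copy_group(const char *subject, const size_t *ovector, size_t n,
                char *buf, size_t buflen)
{
    size_t start = ovector[2 * n];      /* offset of first code unit    */
    size_t end   = ovector[2 * n + 1];  /* offset just past the last one */

    if (start == EXAMPLE_UNSET || end == EXAMPLE_UNSET) return -1;
    if (end < start) return -1;         /* treated as an error in this sketch */

    size_t len = end - start;
    if (len + 1 > buflen) return -1;    /* need room for the terminator */

    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (long)len;
}
```

For the subject "id=42;" and an offset vector of {0, 5, 3, 5}, pair 0 yields "id=42" and pair 1 yields "42". In real code, pcre2_substring_copy_bynumber() performs this job and should be preferred.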
2672        the  portion  of the subject string that was matched by the entire pat-
2676        been  captured,  the returned value is 3. If there are no captured sub-
2699        2 is not. When this happens, both values in  the  offset  pairs  corre-
2705        are not matched.  The return from the function is 2, because the  high-
2706        est used capturing subpattern number is 1. The offsets for the sec-
2711        in the pattern are never changed. That is, if a pattern contains n cap-
2713        pcre2_match(). The other elements retain whatever  values  they  previ-
2733        returns a pointer to the zero-terminated name, which is within the com-
2754        Warning: By default, certain start-of-match optimizations are  used  to
2758        engine. This check fails for "bx", causing a match failure without see-
2759        ing any marks. You can disable the start-of-match optimizations by set-
2766        offset of the character at which the match started. For  a  non-partial
2779        If pcre2_match() fails, it returns a negative number. This can be  con-
2780        verted  to a text string by calling the pcre2_get_error_message() func-
2785        of UTF-specific negative error codes is returned. Details are given  in
2800        PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
2807        a library of a different code unit width, for example, a  pattern  com-
2808        piled  by  the  8-bit  library  is passed to a 16-bit or 32-bit library
2849        using JIT is being matched, but the memory available for  the  just-in-
2850        time  processing stack is not large enough. See the pcre2jit documenta-
2872        within  the  pattern. Specifically, it means that either the whole pat-
2874        the  same  position  in  the  subject string. Some simple patterns that
2875        might do this are detected and faulted at compile time, but  more  com-
2886        match,  or  auxiliary)  can be obtained by calling pcre2_get_error_mes-
2893        The returned message is terminated with a trailing zero, and the  func-
2896        PCRE2_ERROR_BADDATA  is  returned. If the buffer is too small, the mes-
2919        extracting  captured  substrings  as  new,  separate,   zero-terminated
2925        zero refers to the entire matched substring, with higher numbers refer-
2936        extracts a zero-length empty string.
2945        The pcre2_substring_copy_bynumber() function  copies  a  captured  sub-
2948        function  that  was  used for the match data block. The first two argu-
2988        pattern  is  (abc)|(def) and the subject is "def", and the ovector con-
2999        The pcre2_substring_list_get() function  extracts  all  available  sub-
3013        therefore need the lengths, you may supply NULL as the lengthsptr argu-
3015        function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the  mem-
3022        This  can  be  distinguished  from  a  genuine zero-length substring by
3024        PCRE2_UNSET   for   unset   substrings,   or   by   calling  pcre2_sub-
3044        To extract a substring by name, you first have to find associated  num-
3051        the name by calling pcre2_substring_number_from_name(). The first argu-
3071        Warning: If the pattern uses the (?| feature to set up multiple subpat-
3092        given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
3100        pcre2_match(), except that the partial matching options are not permit-
3102        block is obtained and freed within this function, using memory  manage-
3112        length, in code units, of the output buffer. If the  function  is  suc-
3130        option is set, a dollar character is an escape character that can spec-
3140        brackets are required only if the following character would  be  inter-
3151        used  to  perform  simple simultaneous substitutions, as this pcre2test
3164        takes place in the original subject string (that is, previous  replace-
3167        subject string. If an offset limit is set in the match context, search-
3171        the subject string by setting either or both of startoffset and an off-
3179        with zero length, an attempt to find a non-empty match at the same off-
3186        buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3188        continues to go through the motions of matching and substituting (with-
3189        out,  of course, writing anything) in order to compute the size of buf-
3196        that the entire operation is carried out twice. Depending on the appli-
3198        the  excess  afterwards,  instead   of   using   PCRE2_SUBSTITUTE_OVER-
3221        particular  character codes, and backslash followed by any non-alphanu-
3228        current state: \U and \L change to upper or lower case forcing, respec-
3233        all inserted  characters, including those from captured groups and let-
3236        Note that case forcing sequences such as \U...\E do not nest. For exam-
3244          ${<n>:-<string>}
3247        As before, <n> may be a group number or a name. The first  form  speci-
3279        PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3282        PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3284        when  the  simple  (non-extended)  syntax  is  used  and  PCRE2_SUBSTI-
3294        PCRE2_ERROR_BADREPESCAPE  (invalid  escape  sequence), PCRE2_ERROR_REP-
3295        MISSINGBRACE (closing curly bracket not found),  PCRE2_ERROR_BADSUBSTI-
3336        point to the first and last entries in the name-to-number table for the
3350        which stops when it finds the first match at a given point in the  sub-
3353        function  (see  below) instead. If you cannot use the alternative func-
3357        What you have to do is to insert a callout right at the end of the pat-
3358        tern.  When your callout function is called, extract and save the  cur-
3375        not backtrack.  This has different characteristics to the normal  algo-
3379        algorithms, and a list of features that pcre2_dfa_match() does not sup-
3384        is used in a different way, and this is described below. The other com-
3394        Here is an example of a simple call to pcre2_dfa_match():
3412        zero. The only bits that may be set  are  PCRE2_ANCHORED,  PCRE2_ENDAN-
3430        matches, but there is still at least one matching possibility. The por-
3433        more detailed discussion of partial and  multi-segment  matching,  with
3439        stop as soon as it has found one match. Because of the way the alterna-
3455        When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3474        which  is  the  number  of  matched substrings. The offsets of the sub-
3477        any capturing groups that may exist in the pattern, because DFA  match-
3490        NOTE:  PCRE2's  "auto-possessification" optimization usually applies to
3546        University Computing Service
3553        Copyright (c) 1997-2018 University of Cambridge.
3554 ------------------------------------------------------------------------------
3562        PCRE2 - Perl-compatible regular expressions (revised API)
3567        the library in Unix-like environments using the applications  known  as
3574        "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
3576        non-Unix-like environment.
3579 PCRE2 BUILD-TIME OPTIONS
3583        configure  script,  where  the  optional features are selected or dese-
3584        lected by providing options to configure before running the  make  com-
3585        mand.  However,  the same options can be selected in both Unix-like and
3586        non-Unix-like environments if you are using CMake instead of  configure
3590        by editing the config.h file, or by passing parameter settings  to  the
3591        compiler, as described in NON-AUTOTOOLS-BUILD.
3597          ./configure --help
3600        names begin with --enable or --disable. Because of the way that config-
3601        ure  works, --enable and --disable always come in pairs, so the comple-
3604        with --with. At the end of a configure run, a summary of the configura-
3608 BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
3610        By  default, a library called libpcre2-8 is built, containing functions
3612        either  as single-byte characters, or UTF-8 strings. You can also build
3613        two other libraries, called libpcre2-16 and libpcre2-32, which  process
3614        strings  that  are contained in arrays of 16-bit and 32-bit code units,
3615        respectively. These can be interpreted either as single-unit characters
3616        or  UTF-16/UTF-32 strings. To build these additional libraries, add one
3619          --enable-pcre2-16
3620          --enable-pcre2-32
3622        If you do not want the 8-bit library, add
3624          --disable-pcre2-8
3627        the  POSIX wrapper is for the 8-bit library only, and that pcre2grep is
3628        an 8-bit program. Neither of these is built if  you  select  only  the
3629        16-bit or 32-bit libraries.
3638          --disable-shared
3639          --disable-static
3649          --disable-unicode
3655        Of itself, Unicode support does not make PCRE2 treat strings as  UTF-8,
3656        UTF-16 or UTF-32. To do that, applications that use the library can set
3657        the PCRE2_UTF option when they call pcre2_compile() to compile  a  pat-
3665        and Nd are supported. Details are given in the pcre2pattern  documenta-
3677        mode,  can  cause unpredictable behaviour because it may leave the cur-
3678        rent matching point in the middle of a multi-code-unit  character.  The
3680        option when calling pcre2_compile(). There is also a build-time option
3682          --enable-never-backslash-C
3687 JUST-IN-TIME COMPILER SUPPORT
3689        Just-in-time (JIT) compiler support is included in the build by  speci-
3692          --enable-jit
3698          --enable-jit=auto
3705          --enable-jit-sealloc
3712          --disable-pcre2grep-jit
3720        the end of a line. This is the normal newline  character  on  Unix-like
3724          --enable-newline-is-cr
3726        to the configure  command.  There  is  also  an  --enable-newline-is-lf
3730        the two-character sequence CRLF (CR immediately followed by LF). If you
3733          --enable-newline-is-crlf
3737          --enable-newline-is-anycrlf
3742          --enable-newline-is-any
3745        newline sequences are the three just mentioned, plus the single charac-
3750          --enable-newline-is-nul
3752        which causes NUL (binary zero) to be set  as  the  default  line-ending
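The default newline conventions selected by the options above can be pictured with a small stand-alone sketch. It is not PCRE2 code: the enum and the function newline_length() are hypothetical names invented for illustration, and the "any" convention is left out because it also involves the other Unicode line endings. The function reports how many code units form a newline at a given position under each convention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only; these are not PCRE2 identifiers. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_NUL } newline_kind;

/* Return the length (0, 1 or 2) of the newline that starts at s[pos]
   under the given convention; 0 means "no newline here". */
size_t newline_length(const char *s, size_t len, size_t pos, newline_kind kind)
{
    assert(s != NULL);
    if (pos >= len)
        return 0;

    int has_next_lf = (pos + 1 < len) && s[pos + 1] == '\n';

    switch (kind) {
    case NL_CR:
        return s[pos] == '\r' ? 1 : 0;
    case NL_LF:
        return s[pos] == '\n' ? 1 : 0;
    case NL_CRLF:
        /* Only the two-character sequence counts. */
        return (s[pos] == '\r' && has_next_lf) ? 2 : 0;
    case NL_ANYCRLF:
        /* CR, LF, or CRLF; prefer the longer CRLF when present. */
        if (s[pos] == '\r')
            return has_next_lf ? 2 : 1;
        return s[pos] == '\n' ? 1 : 0;
    case NL_NUL:
        return s[pos] == '\0' ? 1 : 0;
    }
    return 0;
}
```

For the four-byte subject "a\r\nb", position 1 is a two-unit newline under NL_CRLF and NL_ANYCRLF, a one-unit newline under NL_CR, and not a newline at all under NL_LF.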
3766          --enable-bsr-anycrlf
3768        the  default  is changed so that \R matches only CR, LF, or CRLF. What-
3776        part to another (for example, from an opening parenthesis to an  alter-
3777        nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
3778        two-byte values are used for these offsets, leading to a  maximum  size
3779        for a compiled pattern of around 64 thousand code units. This is suffi-
3782        compile PCRE2 to use three-byte or four-byte offsets by adding  a  set-
3785          --with-link-size=3
3788        16-bit library, a value of 3 is rounded up to 4.  In  these  libraries,
3790        to load additional data when handling them. For the 32-bit library  the
3791        value  is  always 4 and cannot be overridden; the value of --with-link-
3804          --with-match-limit=500000
3810        The  pcre2_match() function starts out using a 20KiB vector on the sys-
3819          --with-heap-limit=500
3829        for --with-match-limit. You can set a lower default  limit  by  adding,
3832          --with-match-limit-depth=10000
3844        for  lookaround  assertions,  atomic  groups, and recursion within pat-
3855          --enable-rebuild-chartables
3860        C run-time system. This method of replacing the tables does not work if
3871        compiled to run in an 8-bit EBCDIC environment by adding
3873          --enable-ebcdic --disable-unicode
3875        to the configure command. This setting implies --enable-rebuild-charta-
3879        It is not possible to support both EBCDIC and UTF-8 codes in  the  same
3880        version  of  the  library. Consequently, --enable-unicode and --enable-
3887          --enable-ebcdic-nl25
3889        as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
3891        0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
3894        The options that select newline behaviour, such as --enable-newline-is-
3895        cr, and equivalent run-time options, refer to these character values in
3901        By default, on non-Windows systems, pcre2grep supports the use of call-
3904        This support can be disabled by adding  --disable-pcre2grep-callout  to
3914          --enable-pcre2grep-libz
3915          --enable-pcre2grep-libbz2
3917        to the configure command. These options naturally require that the rel-
3929        be processable is the notional buffer size. If a longer line is encoun-
3935          --with-pcre2grep-bufsize=51200
3936          --with-pcre2grep-max-bufsize=2097152
3939        values  by  using  --buffer-size  and  --max-buffer-size on the command
3947          --enable-pcre2test-libreadline
3948          --enable-pcre2test-libedit
3952        it reads it using the readline() function. This  provides  line-editing
3953        and  history  facilities.  Note that libreadline is GPL-licensed, so if
3958        Setting --enable-pcre2test-libreadline causes the -lreadline option  to
3960        system-installed readline library this is sufficient. However,  in  some
3972          LIBS="-lncurses"
3981          --enable-debug
3991          --enable-valgrind
4005          --enable-coverage
4017        When --enable-coverage is used,  the  following  additional  targets  are
4023        equivalent to running "make coverage-reset", "make  coverage-baseline",
4024        "make check", and then "make coverage-report".
4026          make coverage-reset
4030          make coverage-baseline
4034          make coverage-report
4038          make coverage-clean-report
4040        This  removes the generated coverage report without cleaning the cover-
4043          make coverage-clean-data
4048          make coverage-clean
4051        For more information about code coverage, see the gcov and  lcov  docu-
4060          --enable-fuzz-support
4062        At present this applies only to the 8-bit library. If set, it causes an
4063        extra  library  called  libpcre2-fuzzsupport.a  to  be  built,  but not
4064        installed. This contains a single function called  LLVMFuzzerTestOneIn-
4071        Setting  --enable-fuzz-support  also  causes  a binary called pcre2fuz-
4086          --disable-stack-for-recursion
4095        pcre2api(3), pcre2-config(3).
4101        University Computing Service
4108        Copyright (c) 1997-2018 University of Cambridge.
4109 ------------------------------------------------------------------------------
4117        PCRE2 - Perl-compatible regular expressions (revised API)
4132        PCRE2  provides  a feature called "callout", which is a means of tempo-
4143        ending delimiter is the same as the start, except for {, where the end-
4163          A(\d{2}|--)
4167          (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4170        alternation bar. If the pattern contains a conditional group whose con-
4184        information  when  you are trying to optimize the performance of a par-
4194    Auto-possessification
4196        At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4202          --->aaaa
4210        the   auto-possessify   feature  by  passing  PCRE2_NO_AUTO_POSSESS  to
4214          --->aaaa
4231        beginning of the subject, and pcre2_compile() remembers this. If a pat-
4232        tern  has more than one top-level branch, automatic anchoring occurs if
4237        It is also disabled if the pattern contains (*PRUNE) or  (*SKIP).  How-
4243          --->aa
4250        This shows that all match attempts start at the beginning of  the  sub-
4253        starting  the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
4256          --->aa
4266        This shows more match attempts, starting at the second subject  charac-
4287        You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4297        to normal, DFA, and JIT matching. The first argument to the  call-
4323        version  1, and the callout_flags field for version 2. If you are writ-
4332        contains the number of the callout, in the range  0-255.  This  is  the
4339        callout_string  points  to the string that is contained within the com-
4347        delimiter as callout_string[-1] if you need it.
4371        The  capture_last  field  contains the number of the most recently cap-
4373        number  of  the  highest numbered captured substring so far. If no sub-
4379        The   contents  of  ovector[2]  to  ovector[<capture_top>*2-1]  can  be
4388        was passed to the matching function in the match data block  for  call-
4414        parenthesis, the length includes meta characters that follow the paren-
4417        the  length is one, unless a closing parenthesis is followed by a quan-
4426        are used by pcre2test to show the next item to be matched when display-
4430        zero-terminated name of the most recently passed (*MARK), (*PRUNE),  or
4452        starting position in the subject. Output from pcre2test does not  indi-
4456        The information in the callout_flags field is provided so that applica-
4460        because there is no backtracking in DFA matching, and there is no  sup-
4493        which they appear. Its first argument is a pointer to a callout enumer-
4495        passed  to  pcre2_callout_enumerate(). The data block contains the fol-
4513        non-zero minimum or a fixed maximum, the group is replicated inside the
4519        The callback function should normally return zero. If it returns a non-
4527        University Computing Service
4534        Copyright (c) 1997-2018 University of Cambridge.
4535 ------------------------------------------------------------------------------
4543        PCRE2 - Perl-compatible regular expressions (revised API)
4549        respect  to Perl versions 5.26, but as both Perl and PCRE2 are continu-
4555        2.  Like  Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
4563        3.  Capturing  subpatterns that occur inside negative lookaround asser-
4569        \u, \U, and \N when followed by a character name. \N on its own, match-
4570        ing a non-newline character, and \N{U+dd..}, matching  a  Unicode  code
4572        letters are implemented by Perl's general string-handling and  are  not
4583        the need for the user to understand the internal representation of Uni-
4584        code characters, there is no need to implement the somewhat messy  con-
4591        does  not  have  variables).  Also, Perl does "double-quotish backslash
4592        interpolation" on any backslashes between \Q and \E which, its documen-
4620        effect  is  confined to that subpattern; it does not extend to the sur-
4641        13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
4648        distinguish  which  parentheses matched, because both names map to cap-
4659        such as [A-\d] or [a-[:digit:]]. It then treats the hyphens  as  liter-
4664        not  affected when case-independent matching is specified. For example,
4680        (b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
4681        ported in lookbehinds, provided that there is no possibility of  refer-
4682        encing  a  non-unique  number or name. Perl does not support backrefer-
4686        $ meta-character matches only at the very end of the string.
4691        (e) If PCRE2_UNGREEDY is set, the greediness of the repetition  quanti-
4692        fiers is inverted, that is, by default they are not greedy, but if fol-
4704        (i)  The  callout  facility is PCRE2-specific. Perl supports codeblocks
4707        (j) The partial matching facility is PCRE2-specific.
4710        different way and is not Perl-compatible.
4717        /aa modifier restricts /i  case-insensitive  matching  to  pure  ASCII,
4721        19. Perl has different limits than PCRE2. See the pcre2limit documenta-
4722        tion for details. In version 5.10, Perl moved from recursion to iteration, keep-
4724        not  fall into any stack-overflow limit. PCRE2 made a similar change at
4725        release 10.30, and also has many build-time and  run-time  customizable
4732        University Computing Service
4739        Copyright (c) 1997-2018 University of Cambridge.
4740 ------------------------------------------------------------------------------
4748        PCRE2 - Perl-compatible regular expressions (revised API)
4750 PCRE2 JUST-IN-TIME COMPILER SUPPORT
4752        Just-in-time  compiling  is a heavyweight optimization that can greatly
4753        speed up pattern matching. However, it comes at the cost of extra  pro-
4755        the same pattern is going to be matched many times. This does not  nec-
4757        anchored, matching attempts may take place many times at various  posi-
4759        string is very long, it may still pay  to  use  JIT  even  for  one-off
4760        matches.  JIT  support  is  available  for all of the 8-bit, 16-bit and
4761        32-bit PCRE2 libraries.
4763        JIT support applies only to the  traditional  Perl-compatible  matching
4771        --enable-jit  (or  equivalent  CMake  option) must be set when PCRE2 is
4775          ARM 32-bit (v5, v7, and Thumb2)
4776          ARM 64-bit
4777          Intel x86 32-bit and 64-bit
4778          MIPS 32-bit and 64-bit
4779          Power PC 32-bit and 64-bit
4780          SPARC 32-bit
4782        If --enable-jit is set on an unsupported platform, compilation fails.
4784        A  program  can  tell if JIT support is available by calling pcre2_con-
4786        available,  and 0 otherwise. However, a simple program does not need to
4788        falls  back  to the interpretive code if JIT is not available. For pro-
4790        path" API that is JIT-specific.
4793 SIMPLE USE OF JIT
4799        second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
4810        the size of machine stack that it uses. The exact rules are  not  docu-
4815        PCRE2_JIT_COMPLETE  requests the JIT compiler to generate code for com-
4816        plete matches. If you want to run partial matches using the  PCRE2_PAR-
4821        pcre2_match()  is  called,  the appropriate code is run if it is avail-
4826        the option bits. For example, you can call it once with  PCRE2_JIT_COM-
4829        will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
4830        ing. If pcre2_jit_compile() is called with no option bits set, it imme-
4847        stack"  below,  even  if  you  do  not need to supply a non-default JIT
4849        be  obeyed.  If the match-time options are not right for JIT execution,
4852        If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
4855        option.  A non-zero result means that JIT compilation was successful. A
4872        when  running in a UTF mode, and a callout immediately before an asser-
4881        that  the memory used for the JIT stack was insufficient. See "Control-
4901        The pcre2_jit_stack_create() function creates a JIT  stack.  Its  argu-
4907        function returns immediately, without doing anything. (For the  techni-
4919        The first argument is a pointer to a match context. When this is subse-
4921        JIT stack is used. If this argument is NULL, the function returns imme-
4941        is not obeyed when pcre2_match() is called with options that are incom-
4949        up  non-sequential matches in one thread is to use callouts: if a call-
4954        you assign or pass back NULL from  a  callback,  that  is  thread-safe,
4956        or pass back a non-NULL JIT stack, this must be a different  stack  for
4957        each thread so that the application is thread-safe.
4959        Strictly  speaking,  even more is allowed. You can assign the same non-
4968        up non-default JIT stacks might operate:
4976          Use a one-line callback function
4987        PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
4989        child nodes.  Allocating real machine stack on some platforms is diffi-
4997        address space instead of allocating memory. We can safely allocate mem-
5019        You can free compiled patterns, contexts, and stacks in any order, any-
5038        Especially on embedded systems, it might be a good idea to release  mem-
5041        allocated  memory for any stack and another which allows releasing mem-
5055        The JIT executable allocator does not free all memory when it is possi-
5059        calling pcre2_jit_free_unused_memory(). Its argument is a general  con-
5060        text, for custom memory management, or NULL for standard memory manage-
5066        This is a single-threaded example that specifies a  JIT  stack  without
5111        number of other sanity checks are performed on the arguments. For exam-
5129        University Computing Service
5136        Copyright (c) 1997-2018 University of Cambridge.
5137 ------------------------------------------------------------------------------
5145        PCRE2 - Perl-compatible regular expressions (revised API)
5153        code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5157        (when  building  the  16-bit  library,  3  is rounded up to 4). See the
5159        for  details.  In  these cases the limit is substantially larger.  How-
5160        ever, the speed of execution is slower.  In  the  32-bit  library,  the
5170        (that is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-
5190        (*THEN) verb is 255 code units for the 8-bit  library  and  65535  code
5191        units for the 16-bit and 32-bit libraries.
5194        number a 32-bit unsigned integer can hold.
5200        University Computing Service
5207        Copyright (c) 1997-2017 University of Cambridge.
5208 ------------------------------------------------------------------------------
5216        PCRE2 - Perl-compatible regular expressions (revised API)
5224        function,  and  provide a Perl-compatible matching operation. The just-
5225        in-time (JIT) optimization that is described in the pcre2jit documenta-
5229        it operates in a different way, and is not Perl-compatible. This alter-
5250        The set of strings that are matched by a regular expression can be rep-
5255        tree:  depth-first  and  breadth-first, and these correspond to the two
5261        In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
5263        depth-first search of the pattern tree. That is, it  proceeds  along  a
5265        required. When there is a mismatch, the algorithm  tries  any  alterna-
5274        that  point the algorithm stops. Thus, if there is more than one possi-
5280        Because  it  ends  up  with a single path through the tree, it is rela-
5281        tively straightforward for this algorithm to keep  track  of  the  sub-
5288        This algorithm conducts a breadth-first search of  the  tree.  Starting
5306        this algorithm finds all of them, and in particular, it finds the long-
5308        an option to stop the algorithm after the first match (which is  neces-
5318        the fifth character of the subject. The algorithm  does  not  automati-
5321        PCRE2's "auto-possessification" optimization usually applies to charac-
5322        ter repeats at the end of a pattern (as well as internally). For  exam-
5327        either  use  an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
5331        not  supported  by the alternative matching algorithm. They are as fol-
5336        may affect auto-possessification, as just described). During  matching,
5345        a  non-possessive quantifier. Similarly, if an atomic group is present,
5353        algorithm does not attempt to do this. This means that no captured sub-
5356        3. Because no substrings are captured, backreferences within  the  pat-
5359        4.  For  the same reason, conditional expressions that use a backrefer-
5373        these  modes,  because the alternative algorithm moves through the sub-
5384        Using  the alternative matching algorithm provides the following advan-
5387        1. All possible matches (at a single point in the subject) are automat-
5393        once, and never needs to backtrack (except for lookbehinds), it is pos-
5396        also  possible  to  do  multi-segment matching using the standard algo-
5397        rithm, by retaining partially matched substrings, it  is  more  compli-
5399        and discusses multi-segment matching.
5419        University Computing Service
5426        Copyright (c) 1997-2014 University of Cambridge.
5427 ------------------------------------------------------------------------------
5435        PCRE2 - Perl-compatible regular expressions
5441        the  entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
5454        reflecting the character that has been typed, for example. This immedi-
5467        If  you  want to use partial matching with just-in-time optimized code,
5473        PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par-
5483        shorter strings. This optimization is also disabled for partial  match-
5490        the subject string is reached successfully, but  matching  cannot  con-
5491        tinue because more characters are needed. However, at least one charac-
5495        of  a matched string. The requirement for inspecting at least one char-
5502        the rest of the ovector are undefined. The appearance of \K in the pat-
5511        string  "abc12",  because  all these characters are needed for a subse-
5512        quent re-match with additional characters.
5520        match, the partial match is remembered, but matching continues as  nor-
5525        This  option  is "soft" because it prefers a complete match over a par-
5529        of the subject is treated as a non-alphanumeric.
5536        If this is matched against the subject string "abc123dog", both  alter-
5557        The  difference  between the two partial matching options can be illus-
5565        However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
5586        without backtracking, searching for  all  possible  matches  simultane-
5587        ously.  If the end of the subject is reached before the end of the pat-
5600        behaviour is different from  the  standard  functions  when  PCRE2_PAR-
5614        boundaries,  partial matching with PCRE2_PARTIAL_SOFT can give counter-
5649        matched substrings. The remaining four strings do not  match  the  com-
5657 MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
5661        and  calling  the function again with the same compiled regular expres-
5663        same working space as before, because this is where details of the pre-
5672        The first call has "23ja" as the subject, and requests  partial  match-
5675        last  part  is  shown;  PCRE2 does not retain the previously partially-
5685        this may or may not be what you want.  The only way to allow for start-
5695 MULTI-SEGMENT MATCHING WITH pcre2_match()
5700        re-run,  starting from the point where the partial match occurred. Ear-
5719 ISSUES WITH MULTI-SEGMENT MATCHING
5721        Certain types of pattern may give problems with multi-segment matching,
5727        option, but in practice when doing multi-segment matching you should be
5730        2. If a pattern contains a lookbehind assertion, characters  that  pre-
5742        retained. In a non-UTF or a 32-bit situation, moving  back  is  just  a
5743        subtraction,  but in UTF-8 or UTF-16 you have to count characters while
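The point about moving back in a UTF-8 subject can be sketched as follows. This is a minimal illustration, not PCRE2 code: utf8_step_back() is a hypothetical helper, and it assumes the buffer already holds valid UTF-8. It moves an offset back by a number of characters by skipping continuation bytes (those of the form 10xxxxxx), which is why a plain subtraction is not enough.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE2 API.
   Move 'offset' back by up to 'nchars' characters in a valid UTF-8
   buffer and return the new offset (never less than zero). */
size_t utf8_step_back(const unsigned char *buf, size_t offset, size_t nchars)
{
    assert(buf != NULL);
    while (nchars > 0 && offset > 0) {
        /* Step onto the last byte of the previous character ... */
        offset--;
        /* ... then skip any continuation bytes (10xxxxxx) to reach
           the first byte of that character. */
        while (offset > 0 && (buf[offset] & 0xC0) == 0x80)
            offset--;
        nchars--;
    }
    return offset;
}
```

When a partial match starts at some offset and the pattern has a maximum lookbehind of n characters, calling this helper with that offset and n gives the first code unit that would have to be retained for the next segment, under the assumptions stated above.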
5752        the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
5784        been  found,  continuation to a new subject segment is no longer possi-
5809        matching  multi-segment  data.  The  example above then behaves differ-
5849        re-running  the  entire  match  can  also be used with the DFA matching
5859        University Computing Service
5866        Copyright (c) 1997-2014 University of Cambridge.
5867 ------------------------------------------------------------------------------
5875        PCRE2 - Perl-compatible regular expressions (revised API)
5880        by PCRE2 are described in detail below. There is a quick-reference syn-
5882        and semantics as closely as it can.  PCRE2 also supports some  alterna-
5897        different  algorithm  that is not Perl-compatible. Some of the features
5898        discussed below are not available when DFA matching is used. The advan-
5903 SPECIAL START-OF-PATTERN ITEMS
5906        set by special items at the start of a pattern. These are not Perl-com-
5908        writers  who are not able to change the program that processes the pat-
5915        In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
5916        as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
5917        can be specified for the 32-bit library, in which  case  it  constrains
5928        restrict   them   to   non-UTF   data  for  security  reasons.  If  the
5936        causes  sequences such as \d and \w to use Unicode properties to deter-
5949        to whichever matching function is subsequently called to match the pat-
5953    Disabling auto-possessification
5961    Disabling start-up optimizations
5964        setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
5971        as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables  optimiza-
5972        tions that apply to patterns whose top-level branches all start with .*
5992        These  facilities  are  provided to catch runaway matches that are pro-
5993        voked by patterns with huge matching trees (a typical example is a pat-
6003        where d is any number of decimal digits. However, the value of the set-
6026        strings:  a  single  CR (carriage return) character, a single LF (line-
6027        feed) character, the two-character sequence CRLF, any of the three pre-
6032        It  is also possible to specify a newline convention by starting a pat-
6042        These override the default and the options given to the compiling func-
6052        The newline convention affects where the circumflex and  dollar  asser-
6053        tions are true. It also affects the interpretation of the dot metachar-
6067        starting  a  pattern  with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
6074        character  code instead of ASCII or Unicode (typically a mainframe sys-
6075        tem). In the sections below, character code values are  ASCII  or  Uni-
6098        There are two different sets of metacharacters: those that  are  recog-
6124          -      indicates character range
6142        always safe to precede a non-alphanumeric  with  backslash  to  specify
6143        that it stands for itself.  In particular, if you want to match a back-
6156        If you want to remove the special meaning from a  sequence  of  charac-
6157        ters,  you can do so by putting them between \Q and \E. This is differ-
6159        sequences  in PCRE2, whereas in Perl, $ and @ cause variable interpola-
6160        tion. Also, Perl does "double-quotish backslash interpolation"  on  any
6182    Non-printing characters
6184        A second use of backslash provides a way of encoding non-printing char-
6186        appearance  of non-printing characters in a pattern, but when a pattern
6192          \cx         "control-x", where x is any printable ASCII character
6218        32 or greater than 126, a compile-time error occurs.
6222        The \c escape is processed as specified for Perl in the perlebcdic doc-
6223        ument.  The  only characters that are allowed after \c are A-Z, a-z, or
6224        one of @, [, \, ], ^, _, or ?. Any other character provokes a  compile-
6226        letters (in either case) encode characters 1-26 (hex 01 to hex 1A);  [,
6227        \,  ],  ^,  and  _  encode characters 27-31 (hex 1B to hex 1F), and \c?
6239        FF),  but  in  the one Perl calls POSIX-BC its value is 95 (hex 5F). If
6240        certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6256        a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
6257        cal character code points, and \g{} to specify backreferences. The fol-
6260        The handling of a backslash followed by a digit other than 0 is compli-
6263        Outside a character class, PCRE2 reads the digit and any following dig-
6267        backreference.  A description of how this works is given later, follow-
6271        Inside  a character class, PCRE2 handles \8 and \9 as the literal char-
6272        acters "8" and "9", and otherwise reads up to three octal  digits  fol-
6273        lowing the backslash, using them to generate a data character. Any sub-
6295        By  default, after \x that is not followed by {, from zero to two hexa-
6297        number of hexadecimal digits may appear between \x{ and }. If a charac-
6302        just described only when it is followed by two hexadecimal digits. Oth-
6305        by  four hexadecimal digits; otherwise it matches a literal "u" charac-
6309        two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
6318          8-bit non-UTF mode    no greater than 0xff
6319          16-bit non-UTF mode   no greater than 0xffff
6320          32-bit non-UTF mode   no greater than 0xffffffff
6324        (the  so-called  "surrogate"  code  points). The check for these can be
6327        UTF-8 and UTF-32 modes, because these values are not  representable  in
6328        UTF-16.
6358        For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
6362        \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
6379          \W     any "non-word" character
6384        has a different meaning. See the section entitled "Non-printing charac-
6388        Each  pair of lower and upper case escape sequences partitions the com-
6398        locale. This list may vary if locale-specific matching is taking place.
6399        For example, in some locales the "non-breaking space" character  (\xA0)
6403        or digit.  By default, the definition of letters  and  digits  is  con-
6404        trolled by PCRE2's low-valued character tables, and may vary if locale-
6406        page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
6413        be different for characters in the range 128-255  when  locale-specific
6415        meanings from before Unicode support was available,  mainly  for  effi-
6437          U+00A0     Non-break space
6444          U+2004     Three-per-em space
6445          U+2005     Four-per-em space
6446          U+2006     Six-per-em space
6451          U+202F     Narrow no-break space
6465        In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
6471        any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
6477        below.  This particular group matches either the two-character sequence
6479        U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car-
6481        atomic group, the two-character sequence is treated as  a  single  unit
6485        than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
6491        PCRE2_BSR_ANYCRLF at compile time. (BSR is an  abbreviation  for  "back-
6493        the case, the other behaviour can be requested via  the  PCRE2_BSR_UNI-
6500        These override the default and the options given to the compiling func-
6501        tion.  Note that these special settings, which are not Perl-compatible,
6515        When  PCRE2  is  built  with Unicode support (the default), three addi-
6517        are  available.  In 8-bit non-UTF-8 mode, these sequences are of course
6519        they do work in this mode.  In 32-bit non-UTF mode, code points greater
6531        (described  in the next section).  Other Perl properties such as "InMu-
6545        Adlam, Ahom, Anatolian_Hieroglyphs, Arabic,  Armenian,  Avestan,  Bali-
6547        Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Caucasian_Alba-
6553        Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,  Kaithi,  Kan-
6555        Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,  Lydian,  Maha-
6559        Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki,  Old_Hungar-
6560        ian,  Old_Italic,  Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
6563        Psalter_Pahlavi, Rejang, Runic, Samaritan,  Saurashtra,  Sharada,  Sha-
6566        Tai_Viet,  Takri,  Tamil,  Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
6569        Each character has exactly one Unicode general category property, spec-
6570        ified  by a two-letter abbreviation. For compatibility with Perl, nega-
6575        If only one letter is specified with \p or \P, it includes all the gen-
6602          Mn    Non-spacing mark
6643        No character that is in the Unicode table has the Cn (unassigned) prop-
6652        to do a multistage table lookup in order to find  a  character's  prop-
6667        properties  that had been used for emojis.  Instead it introduced vari-
6668        ous emoji-specific properties. PCRE2  uses  only  the  Extended  Picto-
6677        2.  Do not end between CR and LF; otherwise end after any control char-
6687        "zero-width  joiner"  character.  Characters  with  the "mark" property
6694        property.  Extend and ZWJ characters are allowed  between  the  charac-
6705        As  well as the standard Unicode properties described above, PCRE2 sup-
6708        non-standard, non-Perl properties internally  when  PCRE2_UCP  is  set.
6716        Xan  matches  characters that have either the L (letter) or the N (num-
6723        There is another non-standard property, Xuc, which matches any  charac-
6730        Note that the Xuc property does not match these sequences but the char-
6747        mode), though it again reports the matched string as "bar".  This  fea-
6751        does not interfere with the setting of captured substrings.  For  exam-
6762        be  greater  than the end of the match. Using \K in a lookbehind asser-
6763        tion at the start of a pattern can also lead to odd effects. For  exam-
6773    Simple assertions
6775        The final use of backslash is for certain simple assertions. An  asser-
6799        PCRE2 nor Perl has a separate "start of word" or "end of word"  metase-
6806        set.  Thus,  they are independent of multiline mode. These three asser-
6808        which  affect only the behaviour of the circumflex and dollar metachar-
6809        acters. However, if the startoffset argument of pcre2_match()  is  non-
6816        the  start point of the matching process, as specified by the startoff-
6818        startoffset  is  non-zero. By calling pcre2_match() multiple times with
6836        The circumflex and dollar  metacharacters  are  zero-width  assertions.
6837        That  is,  they test for a particular condition being true without con-
6839        are  concerned  with matching the starts and ends of lines. If the new-
6840        line convention is set so that only the two-character sequence CRLF  is
6846        point is at the start of the subject string. If the  startoffset  argu-
6847        ment  of  pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
6856        if the pattern is constrained to match only at the start  of  the  sub-
6864        newline. Dollar need not be the last character of the pattern if a num-
6866        branch in which it appears. Dollar has no special meaning in a  charac-
6886        pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
6889        When  the  newline  convention (see "Newline conventions" below) recog-
6890        nizes the two-character sequence CRLF as a newline, this is  preferred,
6891        even  if  the  single  characters CR and LF are also recognized as new-
6905        Outside a character class, a dot in the pattern matches any one charac-
6906        ter  in  the subject string except (by default) a character that signi-
6910        that  character; when the two-character sequence CRLF is used, dot does
6912        matches  all characters (including isolated CRs and LFs). When any Uni-
6918        exception.   If  the two-character sequence CRLF is present in the sub-
6921        The handling of dot is entirely independent of the handling of  circum-
6931        the section entitled "Non-printing characters" above for details.  Perl
6939        unit,  whether or not a UTF mode is set. In the 8-bit library, one code
6940        unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
6941        32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
6942        line-ending characters. The feature is provided in  Perl  in  order  to
6943        match individual bytes in UTF-8 mode, but it is unclear how it can use-
6947        one  unit  with  \C  in UTF-8 or UTF-16 mode means that the rest of the
6949        results, because PCRE2 assumes that it is matching character by charac-
6959        below)  in UTF-8 or UTF-16 modes, because this would make it impossible
6962        these UTF modes.  The former gives a match-time error; the latter fails
6965        In  the  32-bit  library,  however,  \C  is  always supported (when not
6967        whether or not UTF-32 is specified.
6970        using it that avoids the problem of malformed UTF-8 or  UTF-16  charac-
6972        as in this pattern, which could be used with  a  UTF-8  string  (ignore
6975          (?| (?=[\x00-\x7f])(\C) |
6976              (?=[\x80-\x{7ff}])(\C)(\C) |
6977              (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
6978              (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
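The four lookahead ranges in the pattern above correspond to the UTF-8 encoded lengths of 1, 2, 3, and 4 bytes. The following is a minimal Python sketch of that mapping, not PCRE2 code: the table RANGES and the helper utf8_length are hypothetical names introduced only for this illustration, and the result is cross-checked against Python's own UTF-8 encoder. Note that the last range in the pattern extends to 0x1fffff, whereas valid Unicode code points stop at U+10FFFF.

```python
# Code point ranges copied from the lookaheads in the pattern above.
# RANGES and utf8_length are hypothetical names used only for this sketch.
RANGES = [
    (0x00, 0x7F, 1),          # one \C capture
    (0x80, 0x7FF, 2),         # two \C captures
    (0x800, 0xFFFF, 3),       # three \C captures
    (0x10000, 0x1FFFFF, 4),   # four \C captures
]


def utf8_length(code_point: int) -> int:
    """Return how many code units the matching alternative would capture."""
    for low, high, length in RANGES:
        if low <= code_point <= high:
            return length
    raise ValueError("code point outside the ranges in the pattern")


# Cross-check against Python's UTF-8 encoder for a few valid characters.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
```

The check passes only because each range boundary coincides with a change in UTF-8 encoded length, which is what lets each alternative use a fixed number of \C items.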
6981        parentheses numbers in each alternative (see "Duplicate Subpattern Num-
6983        UTF-8 character for values whose encoding uses 1, 2,  3,  or  4  bytes,
6991        closing square bracket. A closing square bracket on its own is not spe-
7010        class that starts with a circumflex is not an assertion; it still  con-
7016        letters in a class represent both their upper case and lower case  ver-
7022        special way  when  matching  character  classes,  whatever  line-ending
7037        sequences,  they  cause an error. The same is true for \N when not fol-
7040        The minus (hyphen) character can be used to specify a range of  charac-
7041        ters  in  a  character  class.  For  example,  [d-m] matches any letter
7046        example, [b-d-z] matches letters in the range b to d, a hyphen  charac-
7056        It is not possible to have the literal character "]" as the end charac-
7057        ter  of a range. A pattern such as [W-]46] is interpreted as a class of
7058        two characters ("W" and "-") followed by a literal string "46]", so  it
7059        would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
7060        backslash it is interpreted as the end of range, so [W-\]46] is  inter-
7065        Ranges normally include all code points between the start and end char-
7067        numerically, for example [\000-\037]. Ranges can include any characters
7068        that are valid for the current mode. In any  UTF  mode,  the  so-called
7071        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  option  disables this check). How-
7072        ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7076        points are both specified as literal letters in the same case. For com-
7078        letters are omitted. For example, [h-k] matches only  four  characters,
7081        [\x88-\x92] or [h-\x92], all code points are included.
7084        it matches the letters in either case. For example, [W-c] is equivalent
7085        to  [][\\^_`wxyzabc],  matched  caselessly,  and  in a non-UTF mode, if
7086        character tables for a French locale are in  use,  [\xc8-\xcb]  matches
7100        special compatibility feature - see the next  two  sections),  and  the
7101        terminating  closing  square  bracket.  However,  escaping  other  non-
7118          ascii    character codes 0 - 127
7132        CR (13), and space (32). If locale-specific matching is  taking  place,
7142        matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7147        the POSIX character classes, although this may be different for charac-
7148        ters in the range 128-255 when locale-specific matching  is  happening.
7168                  when printed. In Unicode property terms, it matches all char-
7173                    U+2066 - U+2069  Various "isolate"s
7180        [:punct:] This matches all characters that have the Unicode P (punctua-
7201        that \b matches at the start and the end of a word (see "Simple  asser-
7202        tions"  above),  and in a Perl-style pattern the preceding or following
7204        assertions  that  are used above in order to give exactly the POSIX be-
7228        enclosed  between "(?"  and ")". These options are Perl-compatible, and
7229        are described in detail in the pcre2api documentation. The option  let-
7239        For example, (?im) sets caseless, multiline matching. It is also possi-
7241        hyphen, for example (?-im). The two "extended" options are not indepen-
7244        A  combined  setting  and  unsetting  such  as  (?im-sx),  which   sets
7248        the option is unset. An empty options setting "(?)" is  allowed.  Need-
7252        the above options to be unset. Thus, (?^) is equivalent  to  (?-imnsx).
7253        Letters  may  follow  the  circumflex  to  cause some options to be re-
7256        The PCRE2-specific options PCRE2_DUPNAMES  and  PCRE2_UNGREEDY  can  be
7257        changed  in  the  same  way as the Perl-compatible options by using the
7269        not  used).   By this means, options can be made to have different set-
7270        tings in different parts of the pattern. Any changes made in one alter-
7282        start of a non-capturing subpattern (see the next section), the  option
7290        Note:  There  are  other  PCRE2-specific options that can be set by the
7291        application when the compiling function is called. The pattern can con-
7327        the captured substrings are "red king", "red", and "king", and are num-
7333        by  a question mark and a colon, the subpattern does not do any captur-
7344        start of a non-capturing subpattern,  the  option  letters  may  appear
7361        starts  with (?| and is itself a non-capturing subpattern. For example,
7366        Because the two alternatives are inside a (?| group, both sets of  cap-
7370        not all, of one of a number of alternatives. Inside a (?| group, paren-
7373        subpattern  start after the highest number used in any branch. The fol-
7374        lowing example is taken from the Perl documentation. The numbers under-
7377          # before  ---------------branch-reset----------- after
7393        A relative reference such as (?-1) is no different: it is just a conve-
7396        If a condition test for a subpattern's having matched refers to a  non-
7397        unique  number, the test is true if any of the subpatterns of that num-
7406        Identifying  capturing  parentheses  by number is simple, but it can be
7407        very hard to keep track of the numbers in  complicated  patterns.  Fur-
7409        with this difficulty, PCRE2 supports the naming  of  capturing  subpat-
7417        must start with a non-digit. References to capturing  parentheses  from
7418        other parts of the pattern, such as backreferences, recursion, and con-
7422        exactly  as if the names were not present. In both PCRE2 and Perl, cap-
7425        for extracting the complete name-to-number  translation  table  from  a
7426        compiled  pattern, as well as convenience functions for extracting cap-
7442        number to be associated with more than one name. The example above pro-
7443        vokes a compile-time error. However, there is still  scope  for  confu-
7453        By default, a name must be unique within a pattern, except that  dupli-
7459        The duplicate name constraint can be disabled by setting the PCRE2_DUP-
7463        a weekday, either as a 3-letter abbreviation or as the full  name,  and
7478        problem is to use a "branch reset" subpattern, as described in the pre-
7481        If you make a backreference to a non-unique named subpattern from else-
7484        first one that is set is used for the reference. For example, this pat-
7490        If you make a subroutine call to a non-unique named subpattern, the one
7519        The  general repetition quantifier specifies a minimum and maximum num-
7540        the syntax of a quantifier, is taken as a literal character. For  exam-
7545        of which is represented by a two-byte sequence in a UTF-8 string. Simi-
7551        the previous item and the quantifier were not present. This may be use-
7557        For  convenience, the three most common quantifiers have single-charac-
7573        subpattern  does in fact match no characters, the loop is forcibly bro-
7594        and instead matches the minimum number of times possible, so  the  pat-
7621        (equivalent  to  Perl's /s) is set, thus allowing the dot to match new-
7628        In cases where it is known that the subject  string  contains  no  new-
7629        lines,  it  is worth setting PCRE2_DOTALL in order to obtain this opti-
7639        If  the subject is "xyz123abc123" the match point is the fourth charac-
7642        Another case where implicit anchoring is not applied is when the  lead-
7648        It matches "ab" in the subject "aab". The use of the backtracking  con-
7652        When a capturing subpattern is repeated, the value captured is the sub-
7659        the  corresponding captured values may have been set in previous itera-
7671        to be re-evaluated to see if a different number of repeats  allows  the
7687        to be re-evaluated in this way.
7695        This kind of parenthesis "locks up" the  part of the  pattern  it  con-
7704        Atomic grouping subpatterns are not capturing subpatterns. Simple cases
7706        must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
7732        The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
7739        simple pattern constructs. For example, the sequence A+B is treated  as
7741        when B must follow.  This feature can be disabled by the PCRE2_NO_AUTO-
7751        matches an unlimited number of substrings that either consist  of  non-
7762        when  a single character is used. They remember the last single charac-
7769        sequences of non-digits cannot be broken, and failure happens quickly.
7775        0  (and possibly further digits) is a backreference to a capturing sub-
7781        there  are  not that many capturing left parentheses in the entire pat-
7783        to  the left of the reference for numbers less than 8. A "forward back-
7785        and  the  subpattern to the right has participated in an earlier itera-
7791        See the subsection entitled "Non-printing characters" above for further
7805        An  unsigned number specifies an absolute reference without the ambigu-
7810          (abc(def)ghi)\g{-1}
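As a rough cross-check of the pattern above, the sketch below uses Python's re module rather than PCRE2. Python's re does not accept \g{-1} inside a pattern, so the relative reference is replaced by its absolute equivalent \2; the subject strings are made up for the illustration.

```python
import re

# \2 stands in for \g{-1}: group 2, (def), is the most recently
# started capturing group before the reference.
pattern = re.compile(r"(abc(def)ghi)\2")

# The reference must repeat what group 2 matched ("def") ...
matches_def = pattern.fullmatch("abcdefghidef") is not None

# ... and not what group 1 matched ("abcdefghi").
matches_outer = pattern.fullmatch("abcdefghiabcdefghi") is not None

print(matches_def, matches_outer)
```

The advantage of the relative form is that it keeps working if more groups are later added to the left of the fragment, whereas the absolute number would have to be changed.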
7812        The sequence \g{-1} is a reference to the most recently started captur-
7813        ing subpattern before \g, that is, it is equivalent to \2 in this exam-
7814        ple.  Similarly, \g{-2} would be equivalent to \1. The use of  relative
7823        A backreference matches whatever actually matched the capturing subpat-
7832        time of the backreference, the case of letters is relevant.  For  exam-
7856        subpattern  has not actually been used in a particular match, any back-
7862        the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
7865        Because there may be many capturing parentheses in a pattern, all  dig-
7866        its  following  a backslash are taken as part of a potential backrefer-
7877        matches.  However, such references can be useful inside  repeated  sub-
7882        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
7886        the  backreference. This can be done using alternation, as in the exam-
7898        current matching point that does not consume any characters. The simple
7915        referenced in the usual way.  For example, a sequence such as (.)\g{-1}
7922        retained after a successful negative assertion. When an assertion  con-
7925        For  a  positive  assertion, internally captured substrings in the suc-
7926        cessful branch are retained, and matching continues with the next  pat-
7938        useful. However, an assertion that forms the  condition  for  a  condi-
7939        tional  subpattern may not be quantified. In practice, for other asser-
7948        tried with and without the assertion, the order depending on the greed-
7962        matches a word followed by a semicolon, but does not include the  semi-
7992        strings it matches must have a fixed length. However, if there are sev-
7993        eral  top-level  alternatives,  they  do  not all have to have the same
8009        is  not  permitted,  because  its single top-level branch can match two
8011        two top-level branches:
8016        of a lookbehind assertion to get round the fixed-length restriction.
8020        then try to match. If there are insufficient characters before the cur-
8023        In  UTF-8  and  UTF-16 modes, PCRE2 does not allow the \C escape (which
8026        the lookbehind. The \X and \R escapes, which can match  different  num-
8030        lookbehinds, as long as the subpattern matches a  fixed-length  string.
8046        assertions to specify efficient matching of fixed-length strings at the
8047        end of subject strings. Consider a simple pattern such as
8052        proceeds from left to right, PCRE2 will look for each "a" in  the  sub-
8067        quantifier; it can match only the entire string. The subsequent lookbe-
8082        three characters are not "999".  This pattern does not match "foo" pre-
8084        three of which are not "999". For example, it  doesn't  match  "123abc-
8108        It  is possible to cause the matching process to obey a subpattern con-
8110        on  the result of an assertion, or whether a specific capturing subpat-
8114          (?(condition)yes-pattern)
8115          (?(condition)yes-pattern|no-pattern)
8117        If  the  condition is satisfied, the yes-pattern is used; otherwise the
8118        no-pattern (if present) is used. An absent no-pattern is equivalent  to
8119        an  empty string (it always matches). If there are more than two alter-
8120        natives in the subpattern, a compile-time error occurs. Each of the two
8121        alternatives may itself contain nested subpatterns of any form, includ-
8129        There are five kinds of condition: references  to  subpatterns,  refer-
8130        ences  to  recursion,  two pseudo-conditions called DEFINE and VERSION,
8136        the condition is true if a capturing subpattern of that number has pre-
8139        numbers), the condition is true if any of them have matched. An  alter-
8142        most  recently opened parentheses can be referenced by (?(-1), the next
8143        most recent by (?(-2), and so on. Inside loops it can also  make  sense
8146        is not used; it provokes a compile-time error.)
8148        Consider  the  following  pattern, which contains non-significant white
8155        character is present, sets it as the first captured substring. The sec-
8160        yes-pattern  is  executed and a closing parenthesis is required. Other-
8161        wise, since no-pattern is not present, the subpattern matches  nothing.
8162        In  other  words,  this  pattern matches a sequence of non-parentheses,
8168          ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
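The optional-parenthesis condition can be tried out with Python's re module, which supports conditions that refer to a group by absolute number. This is a sketch under that assumption, not PCRE2 code: the relative form (?(-1) is not available in Python, so (?(1) is used instead, and the helper is_match is a hypothetical name.

```python
import re

# Absolute-number version of the condition. The white space shown in the
# documentation is removed because re.VERBOSE is not used here.
balanced = re.compile(r"(\()?[^()]+(?(1)\))")


def is_match(text: str) -> bool:
    """True if the whole text is a run of non-parentheses, optionally wrapped in ()."""
    return balanced.fullmatch(text) is not None


print(is_match("abc"), is_match("(abc)"), is_match("(abc"), is_match("abc)"))
```

A closing parenthesis is demanded only when the opening one was captured, so the unbalanced subjects are rejected.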
8179        the letter R followed by digits are ambiguous (see the  following  sec-
8192        "Recursion"  in  this sense refers to any subroutine-like call from one
8193        part of the pattern to another, whether or not it  is  actually  recur-
8231        be only one alternative in the subpattern. It is always skipped if con-
8233        can be used to define subroutines that can  be  referenced  from  else-
8234        where. (The use of subroutines is described below.) For example, a pat-
8235        tern to match an IPv4 address such as "192.168.23.245" could be written
8238          (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
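Python's re module has neither (?(DEFINE)...) nor subroutine calls, so the sketch below approximates the idea by interpolating the "byte" alternatives into a larger pattern four times. The names BYTE and IPV4 are hypothetical, and the white space in the documented pattern is removed because re.VERBOSE is not used.

```python
import re

# The alternatives of the "byte" group from the pattern above, as a
# non-capturing group: 200-249, 250-255, 100-199, then 0-99.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four components separated by dots; string interpolation stands in
# for the PCRE2 subroutine calls.
IPV4 = re.compile(rf"{BYTE}(?:\.{BYTE}){{3}}")

for candidate in ("192.168.23.245", "256.1.1.1", "1.2.3"):
    print(candidate, IPV4.fullmatch(candidate) is not None)
```

Unlike a DEFINE group, interpolation copies the sub-pattern text into every place it is used, so the resulting pattern is longer, but it matches the same strings for this example.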
8243        an  IPv4  address  (a number less than 256). When matching takes place,
8246        to match the four dot-separated components of an IPv4 address,  insist-
8251        Programs  that link with a PCRE2 library can check the version by call-
8253        that  do  not have access to the underlying code cannot do this. A spe-
8269        assertion.  Consider  this  pattern,  again  containing non-significant
8272          (?(?=[^a-z]*[a-z])
8273          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
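Python's re module does not support an assertion as a condition, so the following sketch emulates the pattern above with a plain alternation: the first branch is guarded by the positive lookahead and the second by its negation. This is an emulation under that assumption, not the PCRE2 construct itself, and the name date is hypothetical.

```python
import re

date = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead: dd-aaa-dd
      | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter ahead:     dd-dd-dd
    )
    """,
    re.VERBOSE,
)

for subject in ("12-abc-34", "12-34-56", "12-34-56 x"):
    print(repr(subject), date.match(subject) is not None)
```

The last subject shows the effect of the condition: a letter does appear later in the string, so only the dd-aaa-dd form is tried and the match fails even though the text starts with dd-dd-dd.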
8276        optional  sequence of non-letters followed by a letter. In other words,
8280        strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
8285        for both positive and negative assertions, because matching always con-
8286        tinues after the assertion, whether it succeeds or fails. (Compare non-
8306        at the start of the pattern, as described in the section entitled "New-
8310        when PCRE2_EXTENDED is set, and the default newline convention (a  sin-
8329        For some time, Perl has provided a facility that allows regular expres-
8341        Instead, it supports special syntax for recursion of  the  entire  pat-
8342        tern, and also for individual subpattern recursion. After its introduc-
8349        subpattern. (If not, it is a non-recursive subroutine  call,  which  is
8359        substrings which can either be a  sequence  of  non-parentheses,  or  a
8360        recursive  match  of the pattern itself (that is, a correctly parenthe-
8362        of a possessive quantifier to avoid backtracking into sequences of non-
8375        of (?1) in the pattern above you can write (?-2) to refer to the second
8380        Be aware however, that if duplicate subpattern numbers are in use, rel-
8384          (?|(a)|(b)) (c) (?-2)
8387        group (c) is number 2. When the reference  (?-2)  is  encountered,  the
8396        because  the  reference  is  not inside the parentheses that are refer-
8397        enced. They are always non-recursive subroutine calls, as described  in
8401        for this is (?&name); PCRE1's earlier syntax  (?P>name)  is  also  sup-
8409        The example pattern that we have been looking at contains nested unlim-
8411        strings of non-parentheses is important when applying  the  pattern  to
8423        callout function can be used (see below and the pcre2callout documenta-
8429        which is the last value taken on at the top level. If a capturing  sub-
8435        recursion.  Consider this pattern, which matches text in  angle  brack-
8437        brackets (that is, when recursing), whereas any characters are  permit-
8443        two different alternatives for the recursive and  non-recursive  cases.
8453        never re-entered, even if it contained untried alternatives  and  there
8458        treated as atomic. That is, they can be re-entered to try unused alter-
8473        match fails. If you want to match typical palindromic phrases, the pat-
8474        tern  has  to  ignore  all  non-word characters, which can be done like
8480        such  as "A man, a plan, a canal: Panama!". Note the use of the posses-
8481        sive quantifier *+ to avoid backtracking  into  sequences  of  non-word
8489        next section), it had no access to any values that were  captured  out-
8513          (...(relative)...)...(?-1)...
8534        Processing options such as case-independence are fixed when  a  subpat-
8538          (abc)(?i:(?-1))
8550        For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
8553        possibly recursively. Here are two of the examples used above,  rewrit-
8562          (abc)(?i:\g<-1>)
8573        This makes it possible, amongst other things, to extract different sub-
8574        strings that match the same pair of parentheses when there is a repeti-
8577        PCRE2 provides a similar feature, but of course it  cannot  obey  arbi-
8582        passed, or if the callout entry point is set to NULL, callouts are dis-
8594        During matching, when PCRE2 reaches a callout point, the external func-
8601        time, and one side-effect is that sometimes callouts  are  skipped.  If
8617        They are all numbered 255. If there is a conditional group in the  pat-
8629        A delimited string may be used instead of a number as a  callout  argu-
8631        ending delimiter is the same as the start, except for {, where the end-
8653        PCRE2_ALT_VERBNAMES  option,  but the result is no longer Perl-compati-
8659        and  sequences such as \x{100} that define character code points. Char-
8665        names is skipped, and #-comments are recognized, exactly as in the rest
8669        The  maximum  length of a name is 255 in the 8-bit library and 65535 in
8670        the 16-bit and 32-bit libraries. If the name is empty, that is, if  the
8672        the colon were not there. Any number of these verbs may occur in a pat-
8676        them can be used only when the pattern is to be matched using the  tra-
8683        subpatterns called as subroutines (whether or not recursively) is docu-
8693        course, be processed. You can suppress the start-of-match optimizations
8694        by setting the PCRE2_NO_START_OPTIMIZE option when  calling  pcre2_com-
8711        then continues at the outer level. If (*ACCEPT) is triggered in a posi-
8715        If  (*ACCEPT)  is inside capturing parentheses, the data so far is cap-
8720        This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
8729        are  not  present  in PCRE2. The nearest equivalent is the callout fea-
8752        When a match succeeds, the name of the last-encountered (*MARK:NAME) on
8753        the matching path is passed back to the caller as described in the sec-
8754        tion entitled "Other information about the match" in the pcre2api docu-
8775        The (*MARK) name is tagged with "MK:" in this output, and in this exam-
8777        efficient  way of obtaining this information than putting each alterna-
8781        true,  the  name  is recorded and passed back if it is the last-encoun-
8803        The following verbs do nothing when they are encountered. Matching con-
8805        causing  a  backtrack  to the verb, a failure is forced. That is, back-
8809        group  has been matched, there is never any backtracking into it. Back-
8813        These  verbs  differ  in exactly what kind of failure occurs when back-
8815        when  the  verb is not in a subroutine or an assertion. Subsequent sec-
8821        matching failure that causes backtracking to reach it. Even if the pat-
8824        verb that is encountered, once it has been passed pcre2_match() is com-
8833        The  behaviour  of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM-
8834        MIT). It is like (*MARK:NAME) in that the name is remembered for  pass-
8845        anchor,  unless PCRE2's start-of-match optimizations are turned off, as
8867        the subject if there is a later matching failure that causes backtrack-
8872        right, backtracking cannot cross (*PRUNE). In simple cases, the use  of
8873        (*PRUNE)  is just an alternative to an atomic group or possessive quan-
8887        character, but to the position in the subject where (*SKIP) was encoun-
8896        skips on to start the next attempt at "c". Note that a possessive quan-
8907        found,  the  "bumpalong" advance is to the subject position that corre-
8913        atomic groups or assertions, because they are never re-entered by back-
8931        backtracks, and this causes a new matching attempt to start at the sec-
8942        This verb causes a skip to the next innermost  alternative  when  back-
8945        that it can be used for a pattern-based if-then-else block:
8952        into COND1. If that succeeds and BAR fails, COND3 is tried.  If  subse-
8953        quently  BAZ fails, there are no more alternatives, so there is a back-
8979        failure in C, matching moves to (*FAIL), which causes the whole subpat-
9010        that is backtracked onto first acts. For example,  consider  this  pat-
9068        in a standalone positive assertion. In a  conditional  positive  asser-
9070        or (*PRUNE) causes the condition to be false. However, for both  stand-
9072        (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9077        These  behaviours  occur whether or not the subpattern is called recur-
9081        match  to succeed without any further processing. Matching then contin-
9089        when triggered by being backtracked to in a subpattern called as a sub-
9108        University Computing Service
9115        Copyright (c) 1997-2018 University of Cambridge.
9116 ------------------------------------------------------------------------------
9124        PCRE2 - Perl-compatible regular expressions (revised API)
9128        Two  aspects  of performance are discussed below: memory usage and pro-
9136        code, so that most simple patterns do not use much memory  for  storing
9139        subpattern has a quantifier with a minimum greater than 1 and/or a lim-
9153        is  not  usually a problem. However, if the numbers are large, and par-
9155        an embarrassment. For example, the very simple pattern
9159        uses  over  50KiB  when compiled using the 8-bit library. When PCRE2 is
9161        limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
9162        libraries, and this is reached with the above pattern if the outer rep-
9168        of PCRE2's "subroutine" facility. Re-writing the above pattern as
9174        this kind of pattern is not always exactly equivalent, because any cap-
9177        process  patterns that PCRE2 cannot otherwise handle. The matching per-
9179        same.  (This applies from release 10.30 - things were different in ear-
9185        From release 10.30, the interpretive (non-JIT) version of pcre2_match()
9186        uses  very  little system stack at run time. In earlier releases recur-
9188        cause  problems, but this usage has been eliminated. Backtracking posi-
9194        used.  Rewriting patterns to be time-efficient, as described below, may
9203        has been re-factored to use heap memory  when  necessary  for  internal
9214        Certain items in regular expression patterns are processed  more  effi-
9216        [aeiou]  than  a  set  of   single-character   alternatives   such   as
9224        slow, because PCRE2 has to use a multi-stage table lookup  whenever  it
9234        pcre2_match(); the performance loss is less with a DFA  matching  func-
9237        When  a pattern begins with .* not in atomic parentheses, nor in paren-
9241        multiple top-level branches, they must all be anchorable. The optimiza-
9247        subject  string contains newlines, the pattern may match from the char-
9258        If you are using such a pattern with subject strings that do  not  con-
9261        explicit anchoring. That saves PCRE2 from having to scan along the sub-
9278        An optimization catches some of the more simple cases such as
9283        matching  procedure, PCRE2 checks that there is a "b" later in the sub-
9284        ject string, and if there is not, it fails the match immediately.  How-
9295        an  atomic group or a possessive quantifier. This can often reduce mem-
9306        matched character. For a long string, a lot of memory is required. Con-
9312        This runs much faster, because sequences of characters that do not con-
9313        tain "<" are "swallowed" in one item inside the parentheses, and a pos-
9315        non-"<"  characters.  This  version also uses a lot less memory because
9343        University Computing Service
9350        Copyright (c) 1997-2018 University of Cambridge.
9351 ------------------------------------------------------------------------------
9359        PCRE2 - Perl-compatible regular expressions (revised API)
9379        This  set of functions provides a POSIX-style API for the PCRE2 regular
9380        expression 8-bit library. See the pcre2api documentation for a descrip-
9381        tion  of PCRE2's native API, which contains much additional functional-
9382        ity. There are no POSIX-style wrappers for PCRE2's  16-bit  and  32-bit
9388        called libpcre2-posix.a, so can be accessed by adding -lpcre2-posix  to
9391        -lpcre2-8.
9402        PCRE2-specific features via the POSIX calling interface or to  add  BSD
9406        POSIX-like in style. The syntax and semantics of  the  regular  expres-
9408        various PCRE2 options, as described below. "POSIX-like in style"  means
9410        POSIX-compatible, and in multi-unit encoding  domains  it  is  probably
9416        two structure types, regex_t for  compiled  internal  forms,  and  reg-
9417        match_t  for  returning  captured substrings. It also defines some con-
9427        structure that is used as a base for storing information about the com-
9449        the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
9455        for compilation to the native function. This disables all meta  charac-
9464        for  matching, the nmatch and pmatch arguments are ignored, and no cap-
9496        all  data  strings used for matching it to be treated as UTF-8 strings.
9504        It  does not affect the way newlines are matched by the dot metacharac-
9507        The yield of regcomp() is zero on success, and non-zero otherwise.  The
9513        NOTE: If the yield of regcomp() is non-zero, you must  not  attempt  to
9520        This area is not simple, because POSIX and Perl take different views of
9534        This is the equivalent table for a POSIX-compatible pattern matcher:
9552        action. When using the POSIX API, passing REG_NEWLINE to  PCRE2's  reg-
9554        and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass  PCRE2_DOL-
9567        The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
9574        standard.  However, setting this option can give more POSIX-like behav-
9579        The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
9598        intended  to  be  portable to other systems. Note that a non-zero rm_so
9620        Unused entries in the array have both structure members set to -1.
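
The -1 convention for unused entries can be observed with a short C sketch. This is only an illustration under stated assumptions: it includes the system <regex.h> so that it builds without PCRE2 installed, and the helper name match_alternatives and the pattern "(a)|(b)" are hypothetical. With PCRE2's wrapper you would presumably include pcre2posix.h and link with -lpcre2-posix instead.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: matches "(a)|(b)" against subject and fills
   pmatch[0..nmatch-1]. Returns 0 on a match, REG_NOMATCH if there is
   no match, and -1 if the pattern could not be compiled. */
int match_alternatives(const char *subject, regmatch_t *pmatch, size_t nmatch)
{
    regex_t preg;
    int rc;

    assert(subject != NULL && pmatch != NULL && nmatch > 0);
    if (regcomp(&preg, "(a)|(b)", REG_EXTENDED) != 0)
        return -1;                      /* compilation failed */
    rc = regexec(&preg, subject, nmatch, pmatch, 0);
    regfree(&preg);                     /* release memory tied to preg */
    return rc;
}
```

When the subject is "b", the first group takes no part in the match, so both of its offsets should be -1, as should those of any entry beyond the last group in the pattern.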
9629        The regerror() function maps a non-zero errorcode from either regcomp()
9633        the first errbuf_size - 1 characters of the error message are used. The
9641        Compiling a regular expression causes memory to be allocated and  asso-
9643        memory, after which preg may no longer be used as  a  compiled  expres-
9650        University Computing Service
9657        Copyright (c) 1997-2017 University of Cambridge.
9658 ------------------------------------------------------------------------------
9666        PCRE2 - Perl-compatible regular expressions (revised API)
9670        A  simple, complete demonstration program to get you started with using
9674        can save this listing to re-create the contents of pcre2demo.c.
9679        used. If matching succeeds, the program outputs the portion of the sub-
9680        ject  that  matched,  together  with  the contents of any captured sub-
9683        If the -g option is given on the command line, the program then goes on
9685        subject string. The logic is a little bit tricky because of the  possi-
9689        The code in pcre2demo.c is an 8-bit program that uses the  PCRE2  8-bit
9690        library.  It  handles  strings  and characters that are stored in 8-bit
9693        treated as UTF-8 strings, where characters  may  occupy  multiple  code
9697        for your operating system, you should be able to compile the demonstra-
9700          cc -o pcre2demo pcre2demo.c -lpcre2-8
9703        to the command line. For example, on a Unix-like system that has  PCRE2
9707          cc -o pcre2demo -I/usr/local/include pcre2demo.c \
9708             -L/usr/local/lib -lpcre2-8
9710        Once you have built the demonstration program, you can run simple tests
9714          ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
9718        expressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
9719        though not all three need be installed). The pcre2demo program is  pro-
9720        vided as a relatively simple coding example.
9726          ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
9729        This is caused by the way shared library support works  on  those  sys-
9732          -R/usr/local/lib
9740        University Computing Service
9747        Copyright (c) 1997-2016 University of Cambridge.
9748 ------------------------------------------------------------------------------
9754        PCRE2 - Perl-compatible regular expressions (revised API)
9756 SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
9773        run. However, if you are using the just-in-time  optimization  feature,
9774        it is not possible to save and reload the JIT data, because it is posi-
9775        tion-dependent. The host on which the patterns  are  reloaded  must  be
9778        For  example, patterns compiled on a 32-bit system using PCRE2's 16-bit
9779        library cannot be reloaded on a 64-bit system, nor can they be reloaded
9780        using the 8-bit library.
9787        linked with a fixed version of PCRE2 must be prepared to recompile pat-
9796        arbitrary external sources.  There  is  only  some  simple  consistency
9797        checking, not complete validation of what is being re-loaded. Corrupted
9809        in the byte stream (its size is 1088 bytes). For more details of  char-
9810        acter  tables,  see the section on locale support in the pcre2api docu-
9816        the length of the vector. The third and fourth arguments point to vari-
9830        PCRE2_ERROR_BADMAGIC  means  either that a pattern's code has been cor-
9831        rupted, or that a slot in the vector does not point to a compiled  pat-
9855        between binary and non-binary data, be sure that the file is opened for
9860        freed in the usual way by calling pcre2_code_free(). When you have fin-
9861        ished with the byte stream, it too must be freed by calling pcre2_seri-
9866 RE-USING PRECOMPILED PATTERNS
9868        In  order  to  re-use  a  set of saved patterns you must first make the
9869        serialized byte stream available in main memory (for example, by  read-
9884        If this argument is NULL, malloc() and free() are used. After deserial-
9913        and a reference count is used to arrange for its memory to be automati-
9920        If  a pattern was processed by pcre2_jit_compile() before being serial-
9929        University Computing Service
9936        Copyright (c) 1997-2018 University of Cambridge.
9937 ------------------------------------------------------------------------------
9945        PCRE2 - Perl-compatible regular expressions (revised API)
9949        The  full syntax and semantics of the regular expressions that are sup-
9951        document contains a quick-reference summary of the syntax.
9956          \x         where x is non-alphanumeric is a literal x
9965          \cx        "control-x", where x is any ASCII printing character
9980        Note that \0dd is always an octal code. The treatment of backslash fol-
9981        lowed by a non-zero digit is complicated; for details see  the  section
9982        "Non-printing  characters"  in  the  pcre2pattern  documentation, where
9989        read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
9991        matches a literal "x".  Likewise, if \u (in ALT_BSUX mode) is not  fol-
10013          \W         a "non-word" character
10017        middle of a UTF-8 or UTF-16 character. The application can lock out the
10021        By default, \d, \s, and \w match only ASCII characters, even  in  UTF-8
10022        mode or in the 16-bit and 32-bit libraries. However, if locale-specific
10024        points in the range 128-255. If the PCRE2_UCP option is set, the behav-
10049          Mn         Non-spacing mark
10082          Xuc        Universally-named character: one that can be
10086        Perl and POSIX space are now the same. Perl added VT to its space char-
10092        Adlam, Ahom, Anatolian_Hieroglyphs, Arabic,  Armenian,  Avestan,  Bali-
10094        Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Caucasian_Alba-
10100        Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,  Kaithi,  Kan-
10102        Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,  Lydian,  Maha-
10106        Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki,  Old_Hungar-
10107        ian,  Old_Italic,  Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
10110        Psalter_Pahlavi, Rejang, Runic, Samaritan,  Saurashtra,  Sharada,  Sha-
10113        Tai_Viet,  Takri,  Tamil,  Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
10121          [x-y]       range (can be used for hex characters)
10127          ascii       0-127
10165 ANCHORS AND SIMPLE ASSERTIONS
10200          (?:...)         non-capturing group
10201          (?|...)         non-capturing group; reset group numbers for
10207          (?>...)         atomic, non-capturing group
10227          (?-...)         unset option(s)
10231        a mixture of setting and unsetting such as (?i-x) is allowed, but there
10233        for example (?^in). An option setting may appear at the start of a non-
10245          (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
10248          (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
10289        Each top-level branch of a look behind must be of a fixed length.
10298          \g-n            relative reference by number
10300          \g{-n}          relative reference by number
10313          (?-n)           call subpattern by relative number
10322          \g<-n>          call subpattern by relative number (PCRE2 extension)
10323          \g'-n'          call subpattern by relative number (PCRE2 extension)
10328          (?(condition)yes-pattern)
10329          (?(condition)yes-pattern|no-pattern)
10333          (?(-n)              relative reference condition
10361        The following act only when a subsequent match failure causes  a  back-
10363        what happens afterwards. Those that advance the start-of-match point do
10398        University Computing Service
10405        Copyright (c) 1997-2018 University of Cambridge.
10406 ------------------------------------------------------------------------------
10414        PCRE2 - Perl-compatible regular expressions (revised API)
10420        in  UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
10425        (*UTF). When either of these is the case, both the pattern and any sub-
10427        instead of strings of individual one-code-unit  characters.  There  are
10428        also  some  other  changes  to the way characters are handled, as docu-
10443        names  for  properties are supported. For example, \p{L} matches a let-
10445        Perl,  many properties may optionally be prefixed by "Is", for compati-
10458        allowed in non-UTF modes.
10468        multi-unit  characters  (see  the description of \C in the pcre2pattern
10472        pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
10474        modes  provokes a match-time error. Also, the JIT optimization does not
10475        support \C in these modes. If JIT optimization is requested for a UTF-8
10476        or  UTF-16  pattern  that contains \C, it will not succeed, and so when
10483        set as in non-UTF mode, all  with  code  points  less  than  256.  This
10489        Alternatively, if you set the PCRE2_UCP option, the way that the  char-
10495        all low-valued characters, unless the PCRE2_UCP option is set.
10498        escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
10502 CASE-EQUIVALENCE IN UTF MODES
10504        Case-insensitive matching in a UTF mode makes use of Unicode properties
10506        at most two case-equivalent values. For these, a direct table lookup is
10508        than two code points that are case-equivalent, and these are treated as
10521        UTF-16 and UTF-32 strings can indicate their endianness by special code
10522        known  as  a  byte-order  mark (BOM). The PCRE2 functions do not handle
10526        case  of  pcre2_match()  and  pcre2_dfa_match()  calls  with a non-zero
10530        end  of  the subject. If there are no lookbehind assertions in the pat-
10534        the  starting offset. Note that the sequences \b and \B are one-charac-
10539        the surrogate area. The so-called "non-character" code points  are  not
10544        UTF-16,  where they are used in pairs to encode code points with values
10545        greater than 0xFFFF. The code points that are encoded by  UTF-16  pairs
10546        are  available  independently  in  the  UTF-8 and UTF-32 encodings. (In
10547        other words, the whole surrogate thing is  a  fudge  for  UTF-16  which
10548        unfortunately messes up UTF-8 and UTF-32.)
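
The pairing arithmetic behind UTF-16 surrogates can be sketched in C as follows. This is standard Unicode arithmetic, not code taken from PCRE2, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Splits a code point greater than 0xFFFF into a UTF-16 surrogate pair. */
void to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp > 0xFFFF && cp <= 0x10FFFF);
    cp -= 0x10000;                              /* 20 significant bits remain */
    *high = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits */
}

/* Returns 1 if cp lies in the surrogate area 0xD800 to 0xDFFF. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}
```

For example, the code point 0x1F600 splits into the pair 0xD83D, 0xDE00. Because every value in the surrogate area is consumed by this scheme, those values cannot stand for characters of their own in any of the encodings.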
10551        and therefore want to skip these checks in  order  to  improve  perfor-
10553        scanned repeatedly.  If you set the PCRE2_NO_UTF_CHECK option  at  com-
10569        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
10570        sible only in UTF-8 and UTF-32 modes, because these values are not rep-
10571        resentable in UTF-16.
10573    Errors in UTF-8 strings
10575        The following negative error codes are given for invalid UTF-8 strings:
10583        The string ends with a truncated UTF-8 character;  the  code  specifies
10584        how  many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
10585        characters to be no longer than 4 bytes, the  encoding  scheme  (origi-
10607        A  4-byte character has a value greater than 0x10ffff; these code points
10612        A 3-byte character has a value in the  range  0xd800  to  0xdfff;  this
10613        range  of code points is reserved by RFC 3629 for use with UTF-16, and
10614        so are excluded from UTF-8.
10622        A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it  codes
10624        For example, the two bytes 0xc0, 0xae give the value 0x2e,  whose  cor-
10630        binary value 0b10 (that is, the most significant bit is 1 and the  sec-
10631        ond  is  0). Such a byte can only validly occur as the second or subse-
10632        quent byte of a multi-byte character.
10637        can never occur in a valid UTF-8 string.
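
Several of the UTF-8 conditions listed above come down to small bit tests, sketched here in C. This is a minimal illustration with hypothetical function names. It is not PCRE2's validity checker, and it covers only the 2-byte overlong case, the 0b10 continuation-byte test, and the excluded code point ranges.

```c
#include <assert.h>
#include <stdint.h>

/* Decodes a 2-byte UTF-8 sequence without checking for overlong forms. */
uint32_t decode_2byte(uint8_t b0, uint8_t b1)
{
    assert((b0 & 0xE0) == 0xC0 && (b1 & 0xC0) == 0x80);
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}

/* A 2-byte sequence is overlong if its value would fit in one byte. */
int is_overlong_2byte(uint8_t b0, uint8_t b1)
{
    return decode_2byte(b0, b1) < 0x80;
}

/* Returns 1 if the two most significant bits of b are 0b10. */
int is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/* Returns 1 if cp is a surrogate or is greater than 0x10ffff. */
int is_excluded(uint32_t cp)
{
    return (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF;
}
```

With the bytes 0xc0, 0xae from the overlong example, decode_2byte() yields 0x2e, which is less than 0x80, so is_overlong_2byte() reports the sequence as overlong.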
10639    Errors in UTF-16 strings
10641        The  following  negative  error  codes  are  given  for  invalid UTF-16
10649    Errors in UTF-32 strings
10651        The following  negative  error  codes  are  given  for  invalid  UTF-32
10661        University Computing Service
10668        Copyright (c) 1997-2018 University of Cambridge.
10669 ------------------------------------------------------------------------------