Lines Matching full:is

7 PCRE2 is a new API for PCRE. This document contains a description of all its
257 units, respectively. However, there is just one header file, \fBpcre2.h\fP.
277 example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR types are
278 constant pointers to the equivalent UCHAR types, that is, they are pointers to
285 PCRE2_CODE_UNIT_WIDTH is not defined by default. An application must define it
291 including \fBpcre2.h\fP, and then use the real function names. Any code that is
292 to be included in an environment where the value of PCRE2_CODE_UNIT_WIDTH is
293 unknown should also use the real function names. (Unfortunately, it is not
296 If PCRE2_CODE_UNIT_WIDTH is not defined before including \fBpcre2.h\fP, a
313 PCRE2 has its own native API, which is described in this document. There are
334 sample program that demonstrates the simplest way of using them is provided in
336 of this program is given in the
346 Just-in-time compiler support is an optional feature of PCRE2 that can be built
351 support is not available.
357 JIT matching is automatically used by \fBpcre2_match()\fP if it is available,
358 unless the PCRE2_NO_JIT option is set. There is also a direct interface for JIT
366 A second matching function, \fBpcre2_dfa_match()\fP, which is not
367 Perl-compatible, is also provided. This uses a different algorithm for the
372 and disadvantages is given in the
376 documentation. There is no JIT support for \fBpcre2_dfa_match()\fP.
407 blocks of various sorts. In all cases, if one of these functions is called with
415 several places. These values are always of type PCRE2_SIZE, which is an
417 value that can be stored in such a type (that is ~(PCRE2_SIZE)0) is reserved
419 Therefore, the longest string that can be handled is one less than this
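The reserved all-ones value described in the lines above can be sketched in plain C. This is only an illustration: `size_t` stands in for PCRE2_SIZE (an assumption, so that the fragment builds without pcre2.h), and every `SKETCH_`/`sketch_` name is hypothetical rather than part of the PCRE2 API.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: size_t stands in for PCRE2_SIZE so that this sketch builds
   without pcre2.h. All SKETCH_/sketch_ names are hypothetical. */
typedef size_t SKETCH_SIZE;

/* The all-ones value, ~(SKETCH_SIZE)0, plays the role of the reserved value. */
#define SKETCH_RESERVED (~(SKETCH_SIZE)0)

/* The longest usable length is therefore one less than the reserved value. */
static SKETCH_SIZE sketch_max_length(void)
{
  return SKETCH_RESERVED - 1;
}
```

Because the reserved value is the largest one the type can hold, a real length can never collide with it.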
435 Each of the first three conventions is used by at least one operating system as
436 its standard newline sequence. When PCRE2 is built, a default can be specified.
437 The default default is LF, which is the Unix standard. However, the newline
446 In the PCRE2 documentation the word "newline" is used to mean "the character or
449 metacharacters, the handling of #-comments in /x mode, and, when CRLF is a
451 non-anchored pattern. There is more detail about this in the
466 In a multithreaded application it is important to keep thread-specific data
468 itself is thread-safe: it contains no static or global variables. The API is
479 A pointer to the compiled form of a pattern is returned to the user when
480 \fBpcre2_compile()\fP is successful. The data in the compiled pattern is fixed,
481 and does not change when the pattern is matched. Therefore, it is thread-safe,
482 that is, the same compiled pattern can be used by more than one thread
485 just-in-time optimization feature is being used, it needs separate memory stack
508 If JIT is being used, but the JIT compilation is not being done immediately,
509 (perhaps waiting to see if the pattern is used often enough) similar logic is
520 functions are called. A context is nothing more than a collection of parameters
522 in a context is a convenient way of passing them to a PCRE2 function without
550 directly. A context is just a block of memory that holds the parameter values.
552 NULL when a context pointer is required.
554 There are three different types of context: a general context that is relevant
563 library. The context is named `general' rather than specifically `memory'
566 general context. A general context is created by:
580 Whenever code in PCRE2 calls these functions, the final argument is the value
583 \fImalloc()\fP and \fIfree()\fP are used. (This is not currently useful, as
585 The \fIprivate_malloc()\fP function is used (if supplied) to obtain memory for
590 used. When the time comes to free the block, this function is called.
611 A compile context is required if you want to change the default values of any
621 A compile context is also required if you are using custom memory management.
625 A compile context is created, copied, and freed by the following functions:
637 A compile context is created with default values for its parameters. These can
639 PCRE2_ERROR_BADDATA if invalid data is detected.
648 ending sequence. The value is used by the JIT compiler and by the two
658 argument is a general context. This function builds a set of character tables
666 This sets a maximum length, in code units, for the pattern string that is to be
667 compiled. If the pattern is longer, an error is generated. This facility is
669 limit their size. The default is the largest number that a PCRE2_SIZE variable
670 can hold, which is effectively unlimited.
683 When a pattern is compiled with the PCRE2_EXTENDED option, the value of this
685 comments starting with #. The value is saved with the compiled pattern for
694 This parameter adjusts the limit, set when PCRE2 is built (default 250), on the
703 There is at least one application that runs PCRE2 in threads with very limited
704 system stack, where running out of stack is to be avoided at all costs. The
705 parenthesis limit above cannot take account of how much stack is actually
706 available. For a finer control, you can supply a function that is called
712 nesting, and the second is user data that is set up by the last argument of
714 zero if all is well, or non-zero to force an error.
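A guard callback of the kind described above might look like the following sketch. The callback shape (a `uint32_t` nesting depth plus a user data pointer, returning zero to continue or non-zero to force an error) follows the text; the `depth_budget` structure and the name `depth_guard` are hypothetical, and a fixed depth budget is only one possible policy. An application worried about real stack usage would measure that instead.

```c
#include <stdint.h>

/* Hypothetical user data: the deepest nesting the caller will accept. */
typedef struct {
  uint32_t max_depth;
} depth_budget;

/* Guard callback sketch. The first argument is the current parenthesis
   nesting depth and the second is the user data pointer. Returning zero
   lets compilation continue; non-zero forces an error. */
static int depth_guard(uint32_t depth, void *user_data)
{
  const depth_budget *budget = (const depth_budget *)user_data;
  return (depth > budget->max_depth) ? 1 : 0;
}
```

With the library available, such a function would be registered on a compile context through \fBpcre2_set_compile_recursion_guard()\fP, passing a pointer to the budget as the last argument.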
721 A match context is required if you want to change the default values of any
729 A match context is also required if you are using custom memory management.
733 A match context is created, copied, and freed by the following functions:
745 A match context is created with default values for its parameters. These can
747 PCRE2_ERROR_BADDATA if invalid data is detected.
768 advance in the subject string. The default value is PCRE2_UNSET. The
771 offset is not found. For example, if the pattern /abc/ is matched against
772 "123abc" with an offset limit less than 3, the result is PCRE2_ERROR_NOMATCH.
774 \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP is greater than the offset
778 \fBpcre2_compile()\fP so that when JIT is in use, different code can be
779 compiled. If a match is started with a non-default offset limit when
780 PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
784 start within the first line of the subject. If this is set with an offset
786 In other words, whichever limit comes first is used.
796 classic example is a pattern that uses nested unlimited repeats.
799 calls repeatedly (sometimes recursively). The limit set by \fImatch_limit\fP is
800 imposed on the number of times this function is called during a match, which
803 in the subject string. This limit is not relevant to \fBpcre2_dfa_match()\fP,
806 When \fBpcre2_match()\fP is called with a pattern that was successfully
807 processed by \fBpcre2_jit_compile()\fP, the way in which matching is executed
808 is entirely different. However, there is still the possibility of runaway
813 The default value for the limit can be set when PCRE2 is built; the default
814 default is 10 million, which handles all but the most extreme cases. If the
815 limit is exceeded, \fBpcre2_match()\fP returns PCRE2_ERROR_MATCHLIMIT. A value
821 where ddd is a decimal number. However, such a setting is ignored unless ddd is
823 limit is set, less than the default.
830 The \fIrecursion_limit\fP parameter is similar to \fImatch_limit\fP, but
831 instead of limiting the total number of times that \fBmatch()\fP is called, it
832 limits the depth of recursion. The recursion depth is a smaller number than the
834 This limit is of use only if it is set smaller than \fImatch_limit\fP.
838 stack, the amount of heap memory that can be used. This limit is not relevant,
839 and is ignored, when matching is done using JIT compiled code or by the
842 The default value for \fIrecursion_limit\fP can be set when PCRE2 is built; the
843 default default is the same value as the default for \fImatch_limit\fP. If the
844 limit is exceeded, \fBpcre2_match()\fP returns PCRE2_ERROR_RECURSIONLIMIT. A
850 where ddd is a decimal number. However, such a setting is ignored unless ddd is
852 limit is set, less than the default.
862 by \fBpcre2_match()\fP when PCRE2 is compiled to use the heap for remembering
864 stack. There is a discussion about PCRE2's stack usage in the
874 Using the heap for recursion is a non-standard way of building PCRE2, for use
879 same size. The blocks are retained by \fBpcre2_match()\fP until it is about to
897 The first argument for \fBpcre2_config()\fP specifies which information is
898 required. The second argument is a pointer to memory into which the information
899 is placed. If NULL is passed, the function returns the amount of memory that is
901 the value is in bytes; when requesting these values, \fIwhere\fP should point
903 length is given in code units, not counting the terminating zero.
905 When requesting information, the returned value from \fBpcre2_config()\fP is
907 the value in the first argument is not recognized. The following information is
912 The output is a uint32_t integer whose value indicates what character
916 default can be overridden when a pattern is compiled.
920 The output is a uint32_t integer that is set to one if support for just-in-time
921 compiling is available; otherwise it is set to zero.
925 The \fIwhere\fP argument should point to a buffer that is at least 48 code
927 \fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with a
928 string that contains the name of the architecture for which the JIT compiler is
930 is not available, PCRE2_ERROR_BADOPTION is returned, otherwise the number of
931 code units used is returned. This is the length of the string, plus one unit
936 The output is a uint32_t integer that contains the number of bytes used for
937 internal linkage in compiled regular expressions. When PCRE2 is configured, the
938 value can be set to 2, 3, or 4, with the default being 2. This is the value
939 that is returned by \fBpcre2_config()\fP. However, when the 16-bit library is
940 compiled, a value of 3 is rounded up to 4, and when the 32-bit library is
941 compiled, internal linkages always use 4 bytes, so the configured value is not
944 The default value of 2 for the 8-bit and 16-bit libraries is sufficient for all
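The rounding rules for the link size can be written out as a small helper. This is a sketch of the rules as stated in the text, not library code: `effective_link_size` is a hypothetical name, and it models the size actually used internally, whereas \fBpcre2_config()\fP reports the configured value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper. code_unit_width is 8, 16, or 32; configured is the
   build-time link size (2, 3, or 4). Returns the number of bytes actually
   used for internal linkage. */
static uint32_t effective_link_size(uint32_t code_unit_width,
                                    uint32_t configured)
{
  assert(configured >= 2 && configured <= 4);
  if (code_unit_width == 32) return 4;                    /* always 4 bytes */
  if (code_unit_width == 16 && configured == 3) return 4; /* rounded up */
  return configured;                                      /* used as given */
}
```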
951 The output is a uint32_t integer that gives the default limit for the number of
957 The output is a uint32_t integer whose value specifies the default character
958 sequence that is recognized as meaning "newline". The values are:
971 The output is a uint32_t integer that gives the maximum depth of nesting
972 of parentheses (of any kind) in a pattern. This limit is imposed to cap the
973 amount of system stack used when a pattern is compiled. It is specified when
974 PCRE2 is built; the default is 250. This limit does not take into account the
980 The output is a uint32_t integer that gives the default limit for the depth of
986 The output is a uint32_t integer that is set to one if internal recursion when
987 running \fBpcre2_match()\fP is implemented by recursive function calls that use
988 the system stack to remember their state. This is the usual way that PCRE2 is
989 compiled. The output is zero if PCRE2 was compiled to use blocks of data on the
994 The \fIwhere\fP argument should point to a buffer that is at least 24 code
997 without Unicode support, the buffer is filled with the text "Unicode not
998 supported". Otherwise, the Unicode version string (for example, "8.0.0") is
999 inserted. The number of code units used is returned. This is the length of the
1004 The output is a uint32_t integer that is set to one if Unicode support is
1005 available; otherwise it is set to zero. Unicode support implies UTF support.
1009 The \fIwhere\fP argument should point to a buffer that is at least 12 code
1011 \fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with
1012 the PCRE2 version string, zero-terminated. The number of code units used is
1013 returned. This is the length of the string plus one unit for the terminating
1032 The pattern is defined by a pointer to a string of code units and a length. If
1033 the pattern is zero-terminated, the length can be specified as
1037 If the compile context argument \fIccontext\fP is NULL, memory for the compiled
1038 pattern is obtained by calling \fBmalloc()\fP. Otherwise, it is obtained from
1040 free the memory by calling \fBpcre2_code_free()\fP when it is no longer needed.
1049 the JIT information cannot be copied (because it is position-dependent).
1055 NOTE: When one of the matching functions is called, pointers to the compiled
1089 If \fIerrorcode\fP or \fIerroroffset\fP is NULL, \fBpcre2_compile()\fP returns
1093 error has occurred. The values are not defined when compilation is successful
1104 UTF-8 or UTF-16 string, the offset is that of the first code unit of the
1108 cases, the offset passed back is the length of the pattern. Note that the
1109 offset is in code units, not characters, even in a UTF mode. It may sometimes
1120 PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */
1131 If this bit is set, the pattern is forced to be "anchored", that is, it is
1132 constrained to match only at the first matching point in the string that is
1134 appropriate constructs in the pattern itself, which is the only way to do it in
1140 immediately follows an opening one is treated as a data character for the
1141 class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the class, which
1147 makes PCRE2's behaviour more like ECMAScript (aka JavaScript). When it is set:
1152 (2) \eu matches a lower case "u" character unless it is followed by four
1157 (3) \ex matches a lower case "x" character unless it is followed by two
1159 to match. By default, as in Perl, a hexadecimal number is always expected after
1165 In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter
1166 matches at the start of the subject (unless PCRE2_NOTBOL is set), and also
1175 (*MARK:NAME) is any sequence of characters that does not include a closing
1176 parenthesis. The name is not processed in any way, and it is not possible to
1178 option is set, normal backslash processing is applied to verb names and only an
1181 option is set, unescaped whitespace in verb names is skipped and #-comments are
1186 If this bit is set, \fBpcre2_compile()\fP automatically inserts callout items,
1196 If this bit is set, letters in the pattern match both upper and lower case
1197 letters in the subject. It is equivalent to Perl's /i option, and it can be
1202 If this bit is set, a dollar metacharacter in the pattern matches only at the
1205 newlines). The PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is
1206 set. There is no equivalent to this option in Perl, and no way to set it within
1211 If this bit is set, a dot metacharacter in the pattern matches any character,
1214 not match when the current position in the subject is at a newline. This option
1221 If this bit is set, names used to identify capturing subpatterns need not be
1222 unique. This can be helpful for certain types of pattern when it is known that
1232 If this bit is set, most white space characters in the pattern are totally
1236 Ignorable white space is permitted between an item and a following quantifier
1242 this type of comment is a literal newline sequence in the pattern; escape
1243 sequences that happen to represent a newline do not count. PCRE2_EXTENDED is
1248 the compile context that is passed to \fBpcre2_compile()\fP or by a special
1254 in the \fBpcre2pattern\fP documentation. A default is defined when PCRE2 is
1259 If this option is set, an unanchored pattern is required to match before or at
1262 general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit, a
1264 words, whichever limit comes first is used.
1268 If this option is set, a back reference to an unset subpattern group matches an
1270 A pattern such as (\e1)(a) succeeds when this option is set (assuming it can
1282 (except when PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless
1283 PCRE2_DOTALL is set, the "any character" metacharacter (.) does not match at a
1284 newline. This behaviour (for ^, $, and dot) is the same as Perl.
1286 When PCRE2_MULTILINE is set, the "start of line" and "end of line"
1298 This option locks out the use of \eC in the pattern that is being compiled.
1302 external sources. Note that there is also a build-time option that permanently
1317 UTF-32, depending on which library is in use. In particular, it prevents the
1325 If this option is set, it disables the use of numbered capturing parentheses in
1326 the pattern. Any opening parenthesis that is not followed by ? behaves as if it
1328 they acquire numbers in the usual way). There is no equivalent of this option
1329 in Perl. Note that, if this option is set, references to capturing groups (back
1335 If this option is set, it disables "auto-possessification", which is an
1340 search and run all the callouts, but it is mainly provided for testing
1345 If this option is set, it disables an optimization that is applied when .* is
1347 other branches also start with .* or with \eA or \eG or ^. The optimization is
1348 automatically disabled for .* if it is inside an atomic group or a capturing
1349 group that is the subject of a back reference, or if the pattern contains
1350 (*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
1351 automatically anchored if PCRE2_DOTALL is set for all the .* items and
1352 PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
1353 must start either at the start of the subject or following a newline is
1358 This is an option whose main effect is at matching time. It does not change
1363 order to speed up the process. For example, if it is known that an unanchored
1367 such as (*COMMIT) at the start of a pattern is not considered until after a
1370 skipped if the pattern is never actually used. The start-up optimizations are
1371 in effect a pre-scan of the subject that takes place before the pattern is run.
1375 result is "no match", the callouts do occur, and that items such as (*COMMIT)
1384 When this is compiled, PCRE2 records the fact that a match must start with the
1385 character "A". Suppose the subject string is "DEFABC". The start-up
1389 match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
1390 subject string does not happen. The first match attempt is run starting from
1392 the overall result is "no match". There are also other start-up optimizations.
1398 The minimum length for a match is one character. If the subject is "ABC", there
1401 the subject is now too short, and so the (*MARK) is never encountered. In this
1402 case, the optimization does not affect the overall match result, which is still
1403 "no match", but it does affect the auxiliary information that is returned.
1407 When PCRE2_UTF is set, the validity of the pattern as a UTF string is
1427 If an invalid UTF sequence is found, \fBpcre2_compile()\fP returns a negative
1430 If you know that your pattern is valid, and you want to skip this check for
1431 performance reasons, you can set the PCRE2_NO_UTF_CHECK option. When it is set,
1432 the effect of passing an invalid UTF string as a pattern is undefined. It may
1441 are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
1452 longer. The option is available only if PCRE2 has been compiled with Unicode
1458 greedy by default, but become greedy if followed by "?". It is not compatible
1464 \fBpcre2_set_offset_limit()\fP is going to be used to set a non-default offset
1465 limit in a match context for matches that use this pattern. An error is
1466 generated if an offset limit is set without this option. For more details, see
1479 single-code-unit strings. It is available when PCRE2 is built to include
1480 Unicode support (which is the default). If Unicode support is not available,
1533 compiler is available, further processes a compiled pattern into machine code
1541 JIT compilation is a heavyweight optimization. It can take some time for
1555 However, if PCRE2 is built with UTF support, all characters can be tested with
1560 The use of locales with Unicode is discouraged. If you are handling characters
1566 recognize only ASCII characters. However, when PCRE2 is built, it is possible
1573 support is expected to die away.
1589 The locale name "fr_FR" is used on Linux and other Unix-like systems; if you
1590 are using Windows, the name for the French locale is "french". It is the
1592 available for as long as it is needed.
1594 The pointer that is passed (via the compile context) to \fBpcre2_compile()\fP
1615 The first argument for \fBpcre2_pattern_info()\fP is a pointer to the compiled
1616 pattern. The second argument specifies which piece of information is required,
1617 and the third argument is a pointer to a variable to receive the data. If the
1618 third argument is NULL, the first argument is ignored, and the function returns
1619 the size in bytes of the variable that is required for the information
1620 requested. Otherwise, the yield of the function is zero for success, or one of
1626 PCRE2_ERROR_UNSET the requested field is not set
1628 The "magic number" is placed at the start of each compiled pattern as a simple
1629 check against passing an arbitrary memory pointer. Here is a typical call of
1636 PCRE2_INFO_SIZE, /* what is required */
1651 For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
1652 option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED and PCRE2_UTF.
1657 A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
1658 the first significant item in every top-level branch is one of the following:
1660 ^ unless PCRE2_MULTILINE is set
1665 When .* is the first significant item, anchoring is possible only when all the
1668 .* is not in an atomic group
1670 .* is not in a capturing group that is the subject
1672 PCRE2_DOTALL is in force for .*
1674 PCRE2_NO_DOTSTAR_ANCHOR is not set.
1676 For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
1685 given group, but in addition, the check that a capturing group is set in a
1686 conditional subpattern such as (?(3)a|b) is also a back reference. Zero is
1691 The output is a uint32_t whose value indicates what character sequences the \eR
1699 where (?| is not used, this is also the total number of capturing subpatterns.
1709 value 255 or above". If such a table was constructed, a pointer to it is
1710 returned. Otherwise NULL is returned. The third argument should point to an
1717 variable. If there is a fixed first value, for example, the letter "c" from a
1718 pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
1719 retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
1720 it is known that a match can occur only at the start of the subject or
1721 following a newline in the subject, 2 is returned. Otherwise, and for anchored
1722 patterns, 0 is returned.
1729 value is always less than 256. In the 16-bit library the value can be up to
1742 explicit match is either a literal CR or LF character, or \er or \en.
1746 Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
1758 Returns 1 if there is a rightmost literal code unit that must exist in any
1760 \fBuint32_t\fP variable. If there is no such value, 0 is returned. When 1 is
1762 PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
1764 pattern /^a\ed+z\ed+/ the returned value is 1 (with "z" returned from
1765 PCRE2_INFO_LASTCODEUNIT), but for /^a\edz\ed/ the returned value is 0.
1771 third argument should point to a \fBuint32_t\fP variable. If there is no such
1772 value, 0 is returned.
1778 recursive subroutine calls it is not always possible to determine whether or
1785 (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
1793 integer. This information is useful when doing multi-segment matching using the
1796 lookbehind, though it does not actually inspect the previous character. This is
1797 to ensure that at least one character from the old segment is retained when a
1798 new segment is processed. Otherwise, if there are no lookbehinds in the
1803 If a minimum length for matching subject strings was computed, its value is
1804 returned. Otherwise the returned value is 0. The value is a number of
1806 The third argument should point to a \fBuint32_t\fP variable. The value is a
1808 of that length that do actually match, but every string that does match is at
1819 substrings by name. It is also possible to extract the data directly, by first
1822 you need to use the name-to-number map, which is described by these three
1830 PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
1836 the parenthesis number. The rest of the entry is the corresponding name, zero
1839 The names are in alphabetical order. If (?| is used to create multiple groups
1849 page, the groups may be given the same name, but there is only one entry in the
1853 if PCRE2_DUPNAMES is set. They appear in the table in the order in which they
1854 were found in the pattern. In the absence of (?| this is the order of
1855 increasing number; when (?| is used this is not necessarily the case because
1859 after compilation by the 8-bit library (assume PCRE2_EXTENDED is set, so white
1860 space - including newlines - is ignored):
1867 in the table is eight bytes long. The table is as follows, with non-printing
1876 name-to-number map, remember that the length of the entries is likely to be
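Walking the name table by hand can be sketched as follows, assuming the 8-bit layout in which each entry starts with the group number in two bytes, most significant byte first, followed by the zero-terminated name. The function `lookup_group` and the hand-built `sample_table` are hypothetical illustrations; with the real library the table pointer, entry size, and entry count would come from \fBpcre2_pattern_info()\fP.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Look up a group number by name in an 8-bit style name table.
   table       points to the first entry
   entry_size  is the size of each entry in bytes
   count       is the number of entries
   Returns the group number, or 0 if the name is not present. */
static uint32_t lookup_group(const uint8_t *table, size_t entry_size,
                             size_t count, const char *name)
{
  for (size_t i = 0; i < count; i++) {
    const uint8_t *entry = table + i * entry_size;
    /* First two bytes: group number, most significant byte first. */
    uint32_t number = ((uint32_t)entry[0] << 8) | entry[1];
    /* Remaining bytes: the zero-terminated name. */
    if (strcmp((const char *)(entry + 2), name) == 0) return number;
  }
  return 0;
}

/* Hand-built sample table with eight-byte entries (illustrative data only). */
static const uint8_t sample_table[] = {
  0, 1, 'd', 'a', 't', 'e', 0, 0,
  0, 5, 'd', 'a', 'y', 0, 0, 0,
  0, 4, 'm', 'o', 'n', 't', 'h', 0,
  0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

The sketch steps through the table using the entry size it is given instead of a hard-coded stride, since the size depends on the longest name. A linear scan keeps it short; because the names are in alphabetical order, a binary search would also work.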
1881 The output is a \fBuint32_t\fP with one of the following values:
1895 (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The third
1904 pattern itself. The value that is used when \fBpcre2_compile()\fP is getting
1923 be done by calling \fBpcre2_callout_enumerate()\fP. The first argument is a
1925 the third is arbitrary user data. The callback function is called for every
1926 callout in the pattern in the order in which they appear. Its first argument is
1927 a pointer to a callout enumeration block, and its second argument is the
1939 It is possible to save compiled patterns on disc or elsewhere, and reload them
1963 Information about a successful or unsuccessful match is placed in a match
1964 data block, which is an opaque structure that is accessed by function calls. In
1967 captured. This is known as the \fIovector\fP.
1972 argument is the number of pairs of offsets in the \fIovector\fP. One pair of
1973 offsets is required to identify the string that matched the whole pattern, with
1976 substrings. A minimum of 1 pair is imposed by
1977 \fBpcre2_match_data_create()\fP, so it is always possible to return the overall
1980 The second argument of \fBpcre2_match_data_create()\fP is a pointer to a
1985 For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a
1986 pointer to a compiled pattern. The ovector is created to be exactly the right
1987 size to hold all the substrings a pattern might capture. The second argument is
1988 again a pointer to a general context, but in this case if NULL is passed, the
1989 memory is obtained using the same allocator that was used for the compiled
2006 When a call of \fBpcre2_match()\fP fails, valid data is available in the match
2007 block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ERROR_PARTIAL, or one
2008 of the error codes for an invalid UTF string. Exactly what is available depends
2009 on the error, and is detailed below.
2011 When one of the matching functions is called, pointers to the compiled pattern
2017 When a match data block itself is no longer needed, it should be freed by
2031 The function \fBpcre2_match()\fP is called to match a subject string against a
2032 compiled pattern, which is passed in the \fIcode\fP argument. You can call
2037 This function is the main matching facility of the library, and it operates in
2038 a Perl-like manner. For specialist use there is also an alternative matching
2039 function, which is described
2046 Here is an example of a simple call to \fBpcre2_match()\fP:
2058 If the subject string is zero-terminated, the length can be given as
2071 The subject string is passed to \fBpcre2_match()\fP as a pointer in
2074 That is, they are in bytes for the 8-bit library, 16-bit code units for the
2076 UTF processing is enabled.
2078 If \fIstartoffset\fP is greater than the length of the subject,
2079 \fBpcre2_match()\fP returns PCRE2_ERROR_BADOFFSET. When the starting offset is
2086 A non-zero starting offset is useful when searching for another match in the
2095 the current position in the subject is not a word boundary.) When applied to
2097 occurrence. If \fBpcre2_match()\fP is called again with just the remainder of
2098 the subject, namely "issipi", it does not match, because \eB is always false at
2099 the start of the subject, which is deemed to be a word boundary. However, if
2100 \fBpcre2_match()\fP is passed the entire string again, but with
2102 is able to look behind the starting point to discover that it is preceded by a
2105 Finding all the matches in a subject is tricky when the pattern can match an
2106 empty string. It is possible to emulate Perl's /g behaviour by first trying the
2109 and trying an ordinary match again. There is some code that demonstrates how to
2116 character is CR followed by LF, advance the starting offset by two characters
2119 If a non-zero starting offset is passed when the pattern is anchored, one
2120 attempt to match at the given offset is made. This can only succeed if the
2131 PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is
2134 Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
2135 compiler. If it is set, JIT matching is disabled and the normal interpretive
2136 code in \fBpcre2_match()\fP is run. Apart from PCRE2_NO_JIT (obviously), the
2149 This option specifies that the first character of the subject string is not the
2157 This option specifies that the end of the subject string is not the end of a
2166 An empty string is not considered to be a valid match if this option is set. If
2173 string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
2179 This is like PCRE2_NOTEMPTY, except that it locks out an empty string match
2180 only at the first matching position, that is, at the start of the subject plus
2181 the starting offset. An empty string match later in the subject is permitted.
2182 If the pattern is anchored, such a match can occur only if the pattern contains
2188 \fBpcre2_jit_compile()\fP, JIT is automatically used when \fBpcre2_match()\fP
2194 When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
2195 string is checked by default when \fBpcre2_match()\fP is subsequently called.
2196 If a non-zero starting offset is given, the check is applied only to that part
2197 of the subject that could be inspected during matching, and there is a check
2205 The check is carried out before any other processing takes place, and a
2206 negative error code is returned if the check fails. There are several UTF error
2228 If you know that your subject is valid, and you want to skip these checks for
2234 NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid string
2235 as a subject, or an invalid value of \fIstartoffset\fP, is undefined. Your
2242 the end of the subject string is reached successfully, but there are not enough
2244 PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
2245 testing any remaining alternatives. Only if no complete match can be found is
2247 PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
2250 If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
2251 a partial match is found, \fBpcre2_match()\fP immediately returns
2253 words, when PCRE2_PARTIAL_HARD is set, a partial match is considered to be more
2256 There is a more detailed discussion of partial and multi-segment matching, with
2268 When PCRE2 is built, a default newline convention is set; this is usually the
2287 starting position is advanced after a match failure for an unanchored pattern.
2289 When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as
2291 when the current starting position is at a CRLF sequence, and the pattern
2292 contains no explicit matches for CR or LF characters, the match position is
2295 The above rule is a compromise that makes the most common cases work as
2296 expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is
2302 An explicit match for CR or LF is either a literal appearance of one of those
2307 Notwithstanding the above, anomalous effects may still occur when CRLF is a
2324 book, this is called "capturing" in what follows, and the phrase "capturing
2325 subpattern" or "capturing group" is used for a fragment of a pattern that picks
2344 called the \fBovector\fP, which contains the offsets of captured strings. It is
2354 Within the ovector, the first in each pair of values is set to the offset of
2355 the first code unit of a substring, and the second is set to the offset of the
2357 offsets, not character offsets. That is, they are byte offsets in the 8-bit
2362 of offsets (that is, \fIovector[0]\fP and \fIovector[1]\fP) are set. They
2370 the subject string that was matched by the entire pattern. The next pair is
2372 \fBpcre2_match()\fP is one more than the highest numbered pair that has been
2373 set. For example, if two substrings have been captured, the returned value is
2375 match is 1, indicating that just the first pair of offsets has been set.
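The pair layout described above can be sketched in plain C without linking against the library. Everything named demo_* or DEMO_* here is hypothetical and only for illustration: DEMO_UNSET stands in for PCRE2_UNSET, size_t stands in for PCRE2_SIZE, and the caller supplies the offsets that a successful match would have left in the ovector.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PCRE2_UNSET: the largest value of the offset type. */
#define DEMO_UNSET (~(size_t)0)

/*
 * Copy capture group n out of subject into buf, adding a terminating zero.
 * ovector holds (start, end) pairs measured in code units; end is the offset
 * of the first code unit after the substring. rc is the value a successful
 * match returned, that is, one more than the highest numbered pair set.
 * Returns the substring length, or -1 if the group is not available or
 * buf is too small.
 */
long demo_copy_group(const char *subject, const size_t *ovector,
                     int rc, int n, char *buf, size_t buflen)
{
    if (n < 0 || n >= rc)
        return -1;                  /* pair n lies beyond the pairs that were set */

    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];

    if (start == DEMO_UNSET || end == DEMO_UNSET)
        return -1;                  /* group did not participate in the match */
    if (end < start)
        return -1;                  /* start after end, e.g. \K inside an assertion */

    size_t len = end - start;
    if (len + 1 > buflen)
        return -1;                  /* no room for the substring plus the zero */

    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (long)len;
}
```

As a usage sketch, with the subject "tel 555-1234" and the offsets {4, 12, 4, 7, 8, 12} (which is what a two-group pattern matching "555-1234" would produce, with a return value of 3), group 0 yields "555-1234", group 1 yields "555", and group 2 yields "1234". Asking for group 3 fails because it is not below the returned value.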
2379 For example, if the pattern (?=ab\eK) is matched against "ab", the start and
2382 If a capturing group is matched repeatedly within a single match
2383 operation, it is the last portion of the subject that it matched that is
2386 If the ovector is too small to hold all the captured substring offsets, as much
2387 as possible is filled in, and the function returns a value of zero. If captured
2389 data block whose ovector is of minimum length (that is, one pair). However, if
2390 the pattern contains back references and the \fIovector\fP is not big enough to
2392 during matching. Thus it is usually advisable to set up a match data block
2395 It is possible for capturing subpattern number \fIn+1\fP to match some part of
2397 the string "abc" is matched against the pattern (a|(z))(bc) the return from the
2398 function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this
2403 expression are also set to PCRE2_UNSET. For example, if the string "abc" is
2405 The return from the function is 2, because the highest used capturing
2406 subpattern number is 1. The offsets for the second and third capturing
2407 subpatterns (assuming the vector is large enough, of course) are set to
2411 pattern are never changed. That is, if a pattern contains \fIn\fP capturing
2427 As well as the offsets in the ovector, other information about a match is
2429 appropriate circumstances. If they are called at other times, the result is
2435 zero-terminated name, which is within the compiled pattern. Otherwise NULL is
2436 returned. The length of the (*MARK) name (excluding the terminating zero) is
2441 After a successful match, the (*MARK) name that is returned is the
2443 match" or a partial match, the last encountered (*MARK) name is returned. For
2448 When it matches "bc", the returned mark is A. The B mark is "seen" in the first
2449 branch of the group, but it is not on the matching path. On the other hand,
2450 when this pattern fails to match "bx", the returned mark is B.
2457 escape sequence. After a partial match, however, this value is always the same
2480 with them. The codes are given names in the header file. If UTF checking is in
2481 force and an invalid UTF subject string is detected, one of a number of
2482 UTF-specific negative error codes is returned. Details are given in the
2504 catch the case when it is passed a junk pointer. This is the error that is
2505 returned when the magic number is not present.
2509 This error is given when a pattern that was compiled by the 8-bit library is
2529 This error is never generated by \fBpcre2_match()\fP itself. It is provided for
2544 This error is returned when a pattern that was successfully studied using JIT
2546 correspond to any JIT compilation mode. When the JIT fast path function is
2555 This error is returned when a pattern that was successfully studied using JIT
2557 stack is not large enough. See the
2569 If a pattern contains back references, but the ovector is not big enough to
2572 extra memory is needed during matching. This error is given when memory cannot
2582 This error is returned when \fBpcre2_match()\fP detects a recursion loop within
2607 code unit buffer and its length, into which the text message is placed. Note
2608 that the message is returned in code units of the appropriate width for the
2609 library that is being used.
2611 The returned message is terminated with a trailing zero, and the function
2613 error number is unknown, the negative error code PCRE2_ERROR_BADDATA is
2614 returned. If the buffer is too small, the message is truncated (but still with
2615 a trailing zero), and the negative error code PCRE2_ERROR_NOMEMORY is returned.
2616 None of the messages are very long; a buffer size of 120 code units is ample.
2645 a binary zero is correctly extracted and has a further zero added on the end,
2646 but the result is not, of course, a C string.
2651 substring zero is available. An attempt to extract any other substring gives
2657 For example, if the pattern (?=ab\eK) is matched against "ab", the start and
2663 argument is a pointer to the match data block, the second is the group number,
2664 and the third is a pointer to a variable into which the length is placed. If
2676 This is updated to contain the actual number of code units used for the
2682 zero. When the substring is no longer needed, the memory should be freed by
2685 The return value from all these functions is zero for success, or a negative
2686 error code. If the pattern match failed, the match failure code is returned.
2687 If a substring number greater than zero is used after a partial match,
2688 PCRE2_ERROR_PARTIAL is returned. Other possible error codes are:
2697 There is no substring with that number in the pattern, that is, the number is
2703 pattern, is greater than the number of slots in the ovector, so the substring
2708 The substring did not participate in the match. For example, if the pattern is
2709 (abc)|(def) and the subject is "def", and the ovector contains at least two
2710 capturing slots, substring number 1 is unset.
2726 that is added to each of them. All this is done in a single block of memory
2727 that is obtained using the same memory allocation function that was used to get
2731 partial match, the error code PCRE2_ERROR_PARTIAL is returned.
2733 The address of the memory block is returned via \fIlistptr\fP, which is also
2734 the start of the list of string pointers. The end of the list is marked by a
2735 NULL pointer. The address of the list of lengths is returned via
2739 function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the memory block
2740 could not be obtained. When the list is no longer needed, it should be freed by
2743 If this function encounters a substring that is unset, which can happen when
2776 the number of the subpattern called "xxx" is 2. If the name is known to be
2778 calling \fBpcre2_substring_number_from_name()\fP. The first argument is the
2779 compiled pattern, and the second is the name. The yield of the function is the
2780 subpattern number, PCRE2_ERROR_NOSUBSTRING if there is no subpattern of that
2781 name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one subpattern of
2786 "bynumber" functions, the only difference being that the second argument is a
2787 name instead of a number. If PCRE2_DUPNAMES is set and there are duplicate
2789 first named string that is set.
2791 If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
2793 number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is returned. If there
2794 is at least one group with a slot in the ovector, but no group is found to be
2795 set, PCRE2_ERROR_UNSET is returned.
2827 \fIreplacement\fP string, whose length is supplied in \fBrlength\fP. This can
2835 data block is obtained and freed within this function, using memory management
2840 length, in code units, of the output buffer. If the function is successful, the
2841 value is updated to contain the length of the new string, excluding the
2842 trailing zero that is automatically added.
2844 If the function is not successful, the value set via \fIoutlengthptr\fP depends
2845 on the type of error. For syntax errors in the replacement string, the value is
2847 errors, the value is PCRE2_UNSET by default. This includes the case of the
2848 output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set
2849 (see below), in which case the value is the minimum length needed, including
2853 length is in code units, not bytes.
2855 In the replacement string, which is interpreted as a UTF string in UTF mode,
2856 and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
2857 dollar character is an escape character that can specify the insertion of
2868 For example, if the pattern a(b)c is matched with "=abc=" and the replacement
2869 string "+$1$0$1+", the result is "=+babcb+=".
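To see how the result in the example above is assembled, here is a minimal hand-rolled sketch of simple $n insertion for a single match. It is not \fBpcre2_substitute()\fP and does not call the library: the function names are hypothetical, only a dollar followed by one digit is recognised, and the caller passes in the offsets that the match would have produced.

```c
#include <stddef.h>
#include <string.h>

/* Append len bytes of src to out at *pos, keeping room for a trailing zero. */
static int demo_append(char *out, size_t outlen, size_t *pos,
                       const char *src, size_t len)
{
    if (*pos + len + 1 > outlen)
        return -1;
    memcpy(out + *pos, src, len);
    *pos += len;
    return 0;
}

/*
 * Replace the one match described by ovector (pairs of start/end offsets,
 * rc = one more than the highest pair set) with the expansion of
 * replacement. Returns the length of the new string, excluding the
 * trailing zero, or -1 on any error.
 */
long demo_substitute_once(const char *subject, const size_t *ovector, int rc,
                          const char *replacement, char *out, size_t outlen)
{
    size_t pos = 0;
    size_t slen = strlen(subject);

    if (rc < 1 || ovector[0] > ovector[1] || ovector[1] > slen)
        return -1;

    /* Text before the match is copied unchanged. */
    if (demo_append(out, outlen, &pos, subject, ovector[0]) != 0)
        return -1;

    for (const char *p = replacement; *p != '\0'; p++) {
        if (p[0] == '$' && p[1] >= '0' && p[1] <= '9') {
            int n = p[1] - '0';
            if (n >= rc)
                return -1;          /* group was not set by the match */
            size_t start = ovector[2 * n];
            size_t end = ovector[2 * n + 1];
            if (start > end || end > slen)
                return -1;          /* unset or inconsistent offsets */
            if (demo_append(out, outlen, &pos, subject + start, end - start) != 0)
                return -1;
            p++;                    /* skip the digit as well as the dollar */
        } else if (demo_append(out, outlen, &pos, p, 1) != 0) {
            return -1;              /* literal character did not fit */
        }
    }

    /* Text after the match is copied unchanged. */
    if (demo_append(out, outlen, &pos, subject + ovector[1], slen - ovector[1]) != 0)
        return -1;

    out[pos] = '\0';
    return (long)pos;
}
```

For the subject "=abc=" and the pattern a(b)c, the match occupies offsets 1 to 4 and group 1 occupies offsets 2 to 3, so calling the sketch with the offsets {1, 4, 2, 3}, a return value of 2, and the replacement "+$1$0$1+" copies "=", then "+", "b", "abc", "b", "+", then "=", giving the nine-character string "=+babcb+=".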
2882 replacing every matching substring. If this is not set, only the first matching
2883 substring is replaced. If any matched substring has zero length, after the
2885 position is performed. If this is not successful, the current position is
2886 advanced by one character except when CRLF is a valid newline sequence and the
2887 next two characters are CR, LF. In this case, the current position is advanced
2890 PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
2891 too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
2892 this option is set, however, \fBpcre2_substitute()\fP continues to go through
2894 in order to compute the size of buffer that is needed. This value is passed
2898 Passing a buffer size of zero is a permitted way of finding out how much memory
2900 operation is carried out twice. Depending on the application, it may be more
2910 groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty
2911 strings when inserted as described above. If this option is not set, an attempt
2916 replacement string. Without this option, only the dollar character is special,
2918 PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
2920 Firstly, backslash in a replacement string is interpreted as an escape
2931 \eu and \el force the next character (if it is a letter) to upper or lower
2937 the result of processing "\eUaa\eLBB\eEcc\eE" is "AAbbcc"; the final \eE has no
2940 The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
2941 flexibility to group substitution. The syntax is similar to that used by Bash:
2947 default value. If group <n> is set, its value is inserted; if not, <string> is
2949 expanded and inserted when group <n> is set or unset, respectively. The first
2950 form is just a convenient shorthand for
2969 were made. This may be zero if no matches were found, and is never greater than
2970 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
2972 In the event of an error, a negative error code is returned. Except for
2973 PCRE2_ERROR_NOMATCH (which is never returned), errors from \fBpcre2_match()\fP
2976 PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring insertion,
2977 unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
2979 PCRE2_ERROR_UNSET is returned for an unset substring insertion (including an
2980 unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple
2981 (non-extended) syntax is used and PCRE2_SUBSTITUTE_UNSET_EMPTY is not set.
2983 PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the
2984 PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is
2985 needed is returned via \fIoutlengthptr\fP. Note that this does not happen by
2988 PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
2993 started, which can happen if \eK is used in an assertion).
3012 When a pattern is compiled with the PCRE2_DUPNAMES option, names for
3018 one of the named subpatterns participates. An example is shown in the
3026 the given name that is set. Only if none are set is PCRE2_ERROR_UNSET
3032 argument is the compiled pattern, and the second is the name. If the third and
3040 PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
3042 The format of the name table is described
3060 callout facility, which is described in the
3066 What you have to do is to insert a callout right at the end of the pattern.
3067 When your callout function is called, extract and save the current matched
3085 The function \fBpcre2_dfa_match()\fP is called to match a subject string
3088 the normal algorithm, and is not compatible with Perl. Some of the features of
3099 is used in a different way, and this is described below. The other common
3101 description is not repeated here.
3104 vector should contain at least 20 elements. It is used for keeping track of
3105 multiple paths through the pattern tree. More workspace is needed for patterns
3108 Here is an example of a simple call to \fBpcre2_dfa_match()\fP:
3131 \fBpcre2_match()\fP, so their description is not repeated here.
3137 details are slightly different. When PCRE2_PARTIAL_HARD is set for
3139 subject is reached and there is still at least one matching possibility that
3141 already been found. When PCRE2_PARTIAL_SOFT is set, the return code
3142 PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL if the end of the
3143 subject is reached, there have been no complete matches, but there is still at
3145 when the longest partial match was found is set as the first matching string in
3146 both cases. There is a more detailed discussion of partial and multi-segment
3157 works, this is necessarily the shortest possible match at the first possible
3162 When \fBpcre2_dfa_match()\fP returns a partial match, it is possible to call it
3164 match. The PCRE2_DFA_RESTART option requests this action; when it is set, the
3166 before because data about the match so far is left in them after a partial
3167 match. There is more discussion of this facility in the
3186 This is <something> <something else> <something further> no more
3194 On success, the yield of the function is a number greater than zero, which is
3209 The ovector is not big enough to include a slot for the given substring number.
3213 There is a slot in the ovector for this substring, but there were insufficient
3217 is, the longest matching string is first. If there were too many matches to fit
3218 into the ovector, the yield of the function is zero, and the vector is filled
3223 pattern "a\ed+" is compiled as if it were "a\ed++". For DFA matching, this
3224 means that only one possible match is found. If you really do want multiple
3243 This return is given if \fBpcre2_dfa_match()\fP encounters an item in the
3249 This return is given if \fBpcre2_dfa_match()\fP encounters a condition item
3255 This return is given if \fBpcre2_dfa_match()\fP runs out of space in the
3260 When a recursive subpattern is processed, the matching function calls itself
3262 error is given if the internal ovector is not large enough. This should be
3263 extremely rare, as a vector of size 1000 is used.
3267 When \fBpcre2_dfa_match()\fP is called with the \fBPCRE2_DFA_RESTART\fP option,
3270 fail, this error is given.