Lines Matching full:is
7 PCRE2 is a new API for PCRE, starting at release 10.0. This document contains a
270 backward compatibility. They should not be used in new code. The first is
271 replaced by \fBpcre2_set_depth_limit()\fP; the second is no longer needed and
301 patterns that can be processed by \fBpcre2_compile()\fP. This facility is
314 units, respectively. However, there is just one header file, \fBpcre2.h\fP.
334 example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR types are
335 constant pointers to the equivalent UCHAR types, that is, they are pointers to
342 PCRE2_CODE_UNIT_WIDTH is not defined by default. An application must define it
348 including \fBpcre2.h\fP, and then use the real function names. Any code that is
349 to be included in an environment where the value of PCRE2_CODE_UNIT_WIDTH is
350 unknown should also use the real function names. (Unfortunately, it is not
353 If PCRE2_CODE_UNIT_WIDTH is not defined before including \fBpcre2.h\fP, a
370 PCRE2 has its own native API, which is described in this document. There are
391 sample program that demonstrates the simplest way of using them is provided in
393 of this program is given in the
409 Just-in-time (JIT) compiler support is an optional feature of PCRE2 that can be
414 support is not available.
420 JIT matching is automatically used by \fBpcre2_match()\fP if it is available,
421 unless the PCRE2_NO_JIT option is set. There is also a direct interface for JIT
429 A second matching function, \fBpcre2_dfa_match()\fP, which is not
430 Perl-compatible, is also provided. This uses a different algorithm for the
435 and disadvantages is given in the
439 documentation. There is no JIT support for \fBpcre2_dfa_match()\fP.
457 functions is called with a NULL argument, the function returns immediately
472 blocks of various sorts. In all cases, if one of these functions is called with
480 several places. These values are always of type PCRE2_SIZE, which is an
482 value that can be stored in such a type (that is ~(PCRE2_SIZE)0) is reserved
484 Therefore, the longest string that can be handled is one less than this
500 Each of the first three conventions is used by at least one operating system as
501 its standard newline sequence. When PCRE2 is built, a default can be specified.
502 If it is not, the default is set to LF, which is the Unix standard. However,
511 In the PCRE2 documentation the word "newline" is used to mean "the character or
514 metacharacters, the handling of #-comments in /x mode, and, when CRLF is a
516 non-anchored pattern. There is more detail about this in the
531 In a multithreaded application it is important to keep thread-specific data
533 itself is thread-safe: it contains no static or global variables. The API is
544 A pointer to the compiled form of a pattern is returned to the user when
545 \fBpcre2_compile()\fP is successful. The data in the compiled pattern is fixed,
546 and does not change when the pattern is matched. Therefore, it is thread-safe,
547 that is, the same compiled pattern can be used by more than one thread
550 just-in-time (JIT) optimization feature is being used, it needs separate memory
573 If JIT is being used, but the JIT compilation is not being done immediately
574 (perhaps waiting to see if the pattern is used often enough) similar logic is
586 functions are called. A context is nothing more than a collection of parameters
588 in a context is a convenient way of passing them to a PCRE2 function without
616 directly. A context is just a block of memory that holds the parameter values.
618 NULL when a context pointer is required.
620 There are three different types of context: a general context that is relevant
629 library. The context is named `general' rather than specifically `memory'
632 general context. A general context is created by:
646 Whenever code in PCRE2 calls these functions, the final argument is the value
649 \fImalloc()\fP and \fIfree()\fP are used. (This is not currently useful, as
651 The \fIprivate_malloc()\fP function is used (if supplied) to obtain memory for
656 used. When the time comes to free the block, this function is called.
671 If this function is passed a NULL argument, it returns immediately without
679 A compile context is required if you want to provide an external function for
690 A compile context is also required if you are using custom memory management.
694 A compile context is created, copied, and freed by the following functions:
706 A compile context is created with default values for its parameters. These can
708 PCRE2_ERROR_BADDATA if invalid data is detected.
717 ending sequence. The value is used by the JIT compiler and by the two
727 argument is a general context. This function builds a set of character tables
752 This sets a maximum length, in code units, for any pattern string that is
753 compiled with this context. If the pattern is longer, an error is generated.
754 This facility is provided so that applications that accept patterns from
755 external sources can limit their size. The default is the largest number that a
756 PCRE2_SIZE variable can hold, which is effectively unlimited.
768 NUL character, that is, a binary zero).
777 When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
779 comments starting with #. The value is saved with the compiled pattern for
788 This parameter adjusts the limit, set when PCRE2 is built (default 250), on the
798 There is at least one application that runs PCRE2 in threads with very limited
799 system stack, where running out of stack is to be avoided at all costs. The
800 parenthesis limit above cannot take account of how much stack is actually
802 that is called whenever \fBpcre2_compile()\fP starts to compile a parenthesized
807 nesting, and the second is user data that is set up by the last argument of
809 zero if all is well, or non-zero to force an error.
816 A match context is required if you want to:
828 A match context is created, copied, and freed by the following functions:
840 A match context is created with default values for its parameters. These can
842 PCRE2_ERROR_BADDATA if invalid data is detected.
863 advance in the subject string. The default value is PCRE2_UNSET. The
866 offset is not found. The \fBpcre2_substitute()\fP function makes no more
869 For example, if the pattern /abc/ is matched against "123abc" with an offset
870 limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match can never be
872 \fBpcre2_dfa_match()\fP, or \fBpcre2_substitute()\fP is greater than the offset
876 calling \fBpcre2_compile()\fP so that when JIT is in use, different code can be
877 compiled. If a match is started with a non-default offset limit when
878 PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
883 newline that follows the start of matching in the subject. If this is set with
885 offset limit. In other words, whichever limit comes first is used.
902 documentation for more details). If the limit is reached, the negative error
903 code PCRE2_ERROR_HEAPLIMIT is returned. The default limit can be set when PCRE2
904 is built; if it is not, the default is set very large and is essentially
912 where ddd is a decimal number. However, such a setting is ignored unless ddd is
914 limit is set, less than the default.
918 there are (that is, the deeper the search tree), the more memory is needed.
919 Heap memory is used only if the initial vector is too small. If the heap limit
924 Similarly, for \fBpcre2_dfa_match()\fP, a vector on the system stack is used
926 this is not big enough is heap memory used. In this case, too, setting a value
937 trees. The classic example is a pattern that uses nested unlimited repeats.
939 There is an internal counter in \fBpcre2_match()\fP that is incremented each
945 though the counting is done in a different way.
947 When \fBpcre2_match()\fP is called with a pattern that was successfully
948 processed by \fBpcre2_jit_compile()\fP, the way in which matching is executed
949 is entirely different. However, there is still the possibility of runaway
954 The default value for the limit can be set when PCRE2 is built; the default
955 for that is 10 million, which handles all but the most extreme cases. A value
961 where ddd is a decimal number. However, such a setting is ignored unless ddd is
963 \fBpcre2_dfa_match()\fP or, if no such limit is set, less than the default.
971 Each time a nested backtracking point is passed, a new memory "frame" is used
973 indirectly limits the amount of memory that is used in a match. However,
979 The depth limit is not relevant, and is ignored, when matching is done using
980 JIT compiled code. However, it is supported by \fBpcre2_dfa_match()\fP, which
983 limits, indirectly, the amount of system stack that is used. It was more useful
989 If the depth of internal recursive function calls is great enough, local
991 depth limit also indirectly limits the amount of heap memory that is used. A
993 using \fBpcre2_dfa_match()\fP, can use a great deal of memory. However, it is
997 The default value for the depth limit can be set when PCRE2 is built; if it is
998 not, the default is set to the same value as the default for the match limit.
999 If the limit is exceeded, \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP
1005 where ddd is a decimal number. However, such a setting is ignored unless ddd is
1007 \fBpcre2_dfa_match()\fP or, if no such limit is set, less than the default.
1022 The first argument for \fBpcre2_config()\fP specifies which information is
1023 required. The second argument is a pointer to memory into which the information
1024 is placed. If NULL is passed, the function returns the amount of memory that is
1026 the value is in bytes; when requesting these values, \fIwhere\fP should point
1028 length is given in code units, not counting the terminating zero.
1030 When requesting information, the returned value from \fBpcre2_config()\fP is
1032 the value in the first argument is not recognized. The following information is
1037 The output is a uint32_t integer whose value indicates what character
1041 default can be overridden when a pattern is compiled.
1045 The output is a uint32_t integer whose lower bits indicate which code unit
1051 The output is a uint32_t integer that gives the default limit for the depth of
1058 The output is a uint32_t integer that gives, in kibibytes, the default limit
1065 The output is a uint32_t integer that is set to one if support for just-in-time
1066 compiling is available; otherwise it is set to zero.
1070 The \fIwhere\fP argument should point to a buffer that is at least 48 code
1072 \fBpcre2_config()\fP with \fIwhere\fP set to NULL.) The buffer is filled with a
1073 string that contains the name of the architecture for which the JIT compiler is
1075 is not available, PCRE2_ERROR_BADOPTION is returned, otherwise the number of
1076 code units used is returned. This is the length of the string, plus one unit
1081 The output is a uint32_t integer that contains the number of bytes used for
1082 internal linkage in compiled regular expressions. When PCRE2 is configured, the
1083 value can be set to 2, 3, or 4, with the default being 2. This is the value
1084 that is returned by \fBpcre2_config()\fP. However, when the 16-bit library is
1085 compiled, a value of 3 is rounded up to 4, and when the 32-bit library is
1086 compiled, internal linkages always use 4 bytes, so the configured value is not
1089 The default value of 2 for the 8-bit and 16-bit libraries is sufficient for all
1096 The output is a uint32_t integer that gives the default match limit for
1102 The output is a uint32_t integer whose value specifies the default character
1103 sequence that is recognized as meaning "newline". The values are:
1117 The output is a uint32_t integer that is set to one if the use of \eC was
1118 permanently disabled when PCRE2 was built; otherwise it is set to zero.
1122 The output is a uint32_t integer that gives the maximum depth of nesting
1123 of parentheses (of any kind) in a pattern. This limit is imposed to cap the
1124 amount of system stack used when a pattern is compiled. It is specified when
1125 PCRE2 is built; the default is 250. This limit does not take into account the
1131 This parameter is obsolete and should not be used in new code. The output is a
1132 uint32_t integer that is always set to zero.
1136 The \fIwhere\fP argument should point to a buffer that is at least 24 code
1139 without Unicode support, the buffer is filled with the text "Unicode not
1140 supported". Otherwise, the Unicode version string (for example, "8.0.0") is
1141 inserted. The number of code units used is returned. This is the length of the
1146 The output is a uint32_t integer that is set to one if Unicode support is
1147 available; otherwise it is set to zero. Unicode support implies UTF support.
1151 The \fIwhere\fP argument should point to a buffer that is at least 24 code
1153 \fBpcre2_config()\fP with \fIwhere\fP set to NULL.) The buffer is filled with
1154 the PCRE2 version string, zero-terminated. The number of code units used is
1155 returned. This is the length of the string plus one unit for the terminating
1176 The pattern is defined by a pointer to a string of code units and a length (in
1177 code units). If the pattern is zero-terminated, the length can be specified as
1181 If the compile context argument \fIccontext\fP is NULL, memory for the compiled
1182 pattern is obtained by calling \fBmalloc()\fP. Otherwise, it is obtained from
1184 free the memory by calling \fBpcre2_code_free()\fP when it is no longer needed.
1185 If \fBpcre2_code_free()\fP is called with a NULL argument, it returns
1195 the JIT information cannot be copied (because it is position-dependent).
1197 passed to \fBpcre2_jit_compile()\fP if required. If \fBpcre2_code_copy()\fP is
1210 tables are used throughout, so this behaviour is appropriate. Nevertheless,
1214 the new tables. The memory for the new tables is automatically freed when
1215 \fBpcre2_code_free()\fP is called for the new copy of the compiled code. If
1216 \fBpcre2_code_copy_with_tables()\fP is called with a NULL argument, it returns
1219 NOTE: When one of the matching functions is called, pointers to the compiled
1253 If \fIerrorcode\fP or \fIerroroffset\fP is NULL, \fBpcre2_compile()\fP returns
1257 error has occurred. The values are not defined when compilation is successful
1267 page. There is no separate documentation for the positive error codes, because
1278 The value returned in \fIerroroffset\fP is an indication of where in the
1279 pattern the error occurred. It is not necessarily the furthest point in the
1280 pattern that was read. For example, after the error "lookbehind assertion is
1282 assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of the
1286 cases, the offset passed back is the length of the pattern. Note that the
1287 offset is in code units, not characters, even in a UTF mode. It may sometimes
1298 PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */
1309 If this bit is set, the pattern is forced to be "anchored", that is, it is
1310 constrained to match only at the first matching point in the string that is
1312 appropriate constructs in the pattern itself, which is the only way to do it in
1318 immediately follows an opening one is treated as a data character for the
1319 class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the class, which
1325 makes PCRE2's behaviour more like ECMAScript (aka JavaScript). When it is set:
1330 (2) \eu matches a lower case "u" character unless it is followed by four
1335 (3) \ex matches a lower case "x" character unless it is followed by two
1337 to match. By default, as in Perl, a hexadecimal number is always expected after
1343 In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter
1344 matches at the start of the subject (unless PCRE2_NOTBOL is set), and also
1353 (*MARK:NAME) is any sequence of characters that does not include a closing
1354 parenthesis. The name is not processed in any way, and it is not possible to
1356 option is set, normal backslash processing is applied to verb names and only an
1359 or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
1360 whitespace in verb names is skipped and #-comments are recognized, exactly as
1365 If this bit is set, \fBpcre2_compile()\fP automatically inserts callout items,
1376 If this bit is set, letters in the pattern match both upper and lower case
1377 letters in the subject. It is equivalent to Perl's /i option, and it can be
1378 changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
1381 characters with only one other case, a lookup table is used for speed. When
1382 PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
1388 If this bit is set, a dollar metacharacter in the pattern matches only at the
1391 newlines). The PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is
1392 set. There is no equivalent to this option in Perl, and no way to set it within
1397 If this bit is set, a dot metacharacter in the pattern matches any character,
1400 not match when the current position in the subject is at a newline. This option
1408 If this bit is set, names used to identify capturing subpatterns need not be
1409 unique. This can be helpful for certain types of pattern when it is known that
1419 If this bit is set, the end of any pattern match must be right at the end of
1432 achieved by appropriate constructs in the pattern itself, which is the only way
1436 to the first (that is, the longest) matched string. Other parallel matches,
1442 If this bit is set, most white space characters in the pattern are totally
1446 Ignorable white space is permitted between an item and a following quantifier
1448 PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
1451 When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
1453 flagged as white space in its low-character table. The table is normally
1463 When PCRE2 is compiled with Unicode support, in addition to these characters,
1467 separator). This set of characters is the same as recognized by Perl's /x
1474 complicated patterns. Note that the end of this type of comment is a literal
1479 the compile context that is passed to \fBpcre2_compile()\fP or by a special
1485 in the \fBpcre2pattern\fP documentation. A default is defined when PCRE2 is
1493 characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
1499 If this option is set, the start of an unanchored pattern match must be before
1501 though the matched text may continue over the newline. If \fIstartoffset\fP is
1502 non-zero, the limiting newline is not necessarily the first newline in the
1503 subject. For example, if the subject string is "abc\enxyz" (where \en
1505 PCRE2_FIRSTLINE if \fIstartoffset\fP is greater than 3. See also
1507 PCRE2_FIRSTLINE is set with an offset limit, a match must occur in the first
1509 first is used.
1513 If this option is set, all metacharacters in the pattern are disabled, and it
1515 expression engine is not the most efficient way of doing it. If you are doing a
1526 If this option is set, a backreference to an unset subpattern group matches an
1528 A pattern such as (\e1)(a) succeeds when this option is set (assuming it can
1540 (except when PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless
1541 PCRE2_DOTALL is set, the "any character" metacharacter (.) does not match at a
1542 newline. This behaviour (for ^, $, and dot) is the same as Perl.
1544 When PCRE2_MULTILINE is set, the "start of line" and "end of line"
1556 This option locks out the use of \eC in the pattern that is being compiled.
1560 external sources. Note that there is also a build-time option that permanently
1575 UTF-32, depending on which library is in use. In particular, it prevents the
1583 If this option is set, it disables the use of numbered capturing parentheses in
1584 the pattern. Any opening parenthesis that is not followed by ? behaves as if it
1586 they acquire numbers in the usual way). This is the same as Perl's /n option.
1587 Note that, when this option is set, references to capturing groups
1593 If this option is set, it disables "auto-possessification", which is an
1598 search and run all the callouts, but it is mainly provided for testing
1603 If this option is set, it disables an optimization that is applied when .* is
1605 other branches also start with .* or with \eA or \eG or ^. The optimization is
1606 automatically disabled for .* if it is inside an atomic group or a capturing
1607 group that is the subject of a backreference, or if the pattern contains
1608 (*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
1609 automatically anchored if PCRE2_DOTALL is set for all the .* items and
1610 PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
1611 must start either at the start of the subject or following a newline is
1616 This is an option whose main effect is at matching time. It does not change
1621 order to speed up the process. For example, if it is known that an unanchored
1625 such as (*COMMIT) at the start of a pattern is not considered until after a
1628 skipped if the pattern is never actually used. The start-up optimizations are
1629 in effect a pre-scan of the subject that takes place before the pattern is run.
1633 result is "no match", the callouts do occur, and that items such as (*COMMIT)
1642 When this is compiled, PCRE2 records the fact that a match must start with the
1643 character "A". Suppose the subject string is "DEFABC". The start-up
1647 match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
1648 subject string does not happen. The first match attempt is run starting from
1650 the overall result is "no match".
1657 The minimum length for a match is one character. If the subject is "ABC", there
1660 the subject is now too short, and so the (*MARK) is never encountered. In this
1661 case, the optimization does not affect the overall match result, which is still
1662 "no match", but it does affect the auxiliary information that is returned.
1666 When PCRE2_UTF is set, the validity of the pattern as a UTF string is
1685 document. If an invalid UTF sequence is found, \fBpcre2_compile()\fP returns a
1688 If you know that your pattern is a valid UTF string, and you want to skip this
1690 it is set, the effect of passing an invalid UTF string as a pattern is
1698 error that is given if an escape sequence for an invalid Unicode code point is
1707 However, this is possible only in UTF-8 and UTF-32 modes, because these values
1714 are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
1725 longer. The option is available only if PCRE2 has been compiled with Unicode
1726 support (which is the default).
1731 greedy by default, but become greedy if followed by "?". It is not compatible
1737 \fBpcre2_set_offset_limit()\fP is going to be used to set a non-default offset
1738 limit in a match context for matches that use this pattern. An error is
1739 generated if an offset limit is set without this option. For more details, see
1752 single-code-unit strings. It is available when PCRE2 is built to include
1753 Unicode support (which is the default). If Unicode support is not available,
1773 This option applies when compiling a pattern in UTF-8 or UTF-32 mode. It is
1779 in a UTF-8 or UTF-32 string that is being checked for validity by PCRE2.
1788 If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surrogate code
1791 characters if the matching function is called with PCRE2_NO_UTF_CHECK set.
1795 This is a dangerous option. Use with care. By default, an unrecognized escape
1797 detected by \fBpcre2_compile()\fP. Perl is somewhat inconsistent in handling
1798 such items: for example, \ej is treated as a literal "j", and non-hexadecimal
1800 Perl's warning switch is enabled. However, a malformed octal number after \eo{
1803 If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
1805 treated as single-character escapes. For example, \ej is a literal "j" and
1806 \ex{2z} is treated as the literal string "x{2z}". Setting this option means
1807 that typos in patterns may go undetected and have unexpected results. This is a
1812 This option is provided for use by the \fB-x\fP option of \fBpcre2grep\fP. It
1813 causes the pattern only to match complete lines. This is achieved by
1815 pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set, the matched
1821 This option is provided for use by the \fB-w\fP option of \fBpcre2grep\fP. It
1823 and the end. This is achieved by automatically inserting the code for "\eb(?:"
1825 used with PCRE2_LITERAL. However, it is ignored if PCRE2_EXTRA_MATCH_LINE is
1853 compiler is available, further processes a compiled pattern into machine code
1861 JIT compilation is a heavyweight optimization. It can take some time for
1875 However, if PCRE2 is built with Unicode support, all characters can be tested
1877 pattern is compiled; this causes \ew and friends to use Unicode property
1880 The use of locales with Unicode is discouraged. If you are handling characters
1886 recognize only ASCII characters. However, when PCRE2 is built, it is possible
1893 support is expected to die away.
1909 The locale name "fr_FR" is used on Linux and other Unix-like systems; if you
1910 are using Windows, the name for the French locale is "french". It is the
1912 available for as long as it is needed.
1914 The pointer that is passed (via the compile context) to \fBpcre2_compile()\fP
1935 The first argument for \fBpcre2_pattern_info()\fP is a pointer to the compiled
1936 pattern. The second argument specifies which piece of information is required,
1937 and the third argument is a pointer to a variable to receive the data. If the
1938 third argument is NULL, the first argument is ignored, and the function returns
1939 the size in bytes of the variable that is required for the information
1940 requested. Otherwise, the yield of the function is zero for success, or one of
1946 PCRE2_ERROR_UNSET the requested field is not set
1948 The "magic number" is placed at the start of each compiled pattern as a simple
1949 check against passing an arbitrary memory pointer. Here is a typical call of
1956 PCRE2_INFO_SIZE, /* what is required */
1974 For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
1975 option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED and PCRE2_UTF.
1980 A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
1981 the first significant item in every top-level branch is one of the following:
1983 ^ unless PCRE2_MULTILINE is set
1988 When .* is the first significant item, anchoring is possible only when all the
1991 .* is not in an atomic group
1993 .* is not in a capturing group that is the subject
1995 PCRE2_DOTALL is in force for .*
1997 PCRE2_NO_DOTSTAR_ANCHOR is not set
1999 For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
2008 given group, but in addition, the check that a capturing group is set in a
2009 conditional subpattern such as (?(3)a|b) is also a backreference. Zero is
2014 The output is a uint32_t integer whose value indicates what character sequences
2022 where (?| is not used, this is also the total number of capturing subpatterns.
2028 (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The third argument
2031 limit will only be used during matching if it is less than the limit set or
2041 value 255 or above". If such a table was constructed, a pointer to it is
2042 returned. Otherwise NULL is returned. The third argument should point to a
2049 variable. If there is a fixed first value, for example, the letter "c" from a
2050 pattern such as (cat|cow|coyote), 1 is returned, and the value can be retrieved
2051 using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but it is
2053 newline in the subject, 2 is returned. Otherwise, and for anchored patterns, 0
2061 value is always less than 256. In the 16-bit library the value can be up to
2068 backtracking positions when the pattern is processed by \fBpcre2_match()\fP
2082 explicit match is either a literal CR or LF character, or \er or \en or one of
2088 (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argument
2091 limit will only be used during matching if it is less than the limit set or
2096 Returns 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
2108 Returns 1 if there is a rightmost literal code unit that must exist in any
2110 \fBuint32_t\fP variable. If there is no such value, 0 is returned. When 1 is
2112 PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
2114 pattern /^a\ed+z\ed+/ the returned value is 1 (with "z" returned from
2115 PCRE2_INFO_LASTCODEUNIT), but for /^a\edz\ed/ the returned value is 0.
2128 recursive subroutine calls it is not always possible to determine whether or
2135 (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
2138 limit will only be used during matching if it is less than the limit set or
2145 integer. This information is useful when doing multi-segment matching using the
2148 lookbehind, though it does not actually inspect the previous character. This is
2149 to ensure that at least one character from the old segment is retained when a
2150 new segment is processed. Otherwise, if there are no lookbehinds in the
2156 If a minimum length for matching subject strings was computed, its value is
2157 returned. Otherwise the returned value is 0. The value is a number of
2159 The third argument should point to a \fBuint32_t\fP variable. The value is a
2161 of that length that do actually match, but every string that does match is at
2172 substrings by name. It is also possible to extract the data directly, by first
2175 you need to use the name-to-number map, which is described by these three
2183 PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
2189 the parenthesis number. The rest of the entry is the corresponding name, zero
2192 The names are in alphabetical order. If (?| is used to create multiple groups
2202 page, the groups may be given the same name, but there is only one entry in the
2206 if PCRE2_DUPNAMES is set. They appear in the table in the order in which they
2207 were found in the pattern. In the absence of (?| this is the order of
2208 increasing number; when (?| is used this is not necessarily the case because
2212 after compilation by the 8-bit library (assume PCRE2_EXTENDED is set, so white
2213 space - including newlines - is ignored):
2220 in the table is eight bytes long. The table is as follows, with non-printing
2229 name-to-number map, remember that the length of the entries is likely to be
2234 The output is one of the following \fBuint32_t\fP values:
2251 pattern itself. The value that is used when \fBpcre2_compile()\fP is getting
2270 be done by calling \fBpcre2_callout_enumerate()\fP. The first argument is a
2272 the third is arbitrary user data. The callback function is called for every
2273 callout in the pattern in the order in which they appear. Its first argument is
2274 a pointer to a callout enumeration block, and its second argument is the
2286 It is possible to save compiled patterns on disc or elsewhere, and reload them
2291 "serialized" form, which in the case of PCRE2 is really just a bytecode dump.
2315 Information about a successful or unsuccessful match is placed in a match
2316 data block, which is an opaque structure that is accessed by function calls. In
2319 captured. This is known as the \fIovector\fP.
2324 argument is the number of pairs of offsets in the \fIovector\fP. One pair of
2325 offsets is required to identify the string that matched the whole pattern, with
2328 captured substrings. A minimum of 1 pair is imposed by
2329 \fBpcre2_match_data_create()\fP, so it is always possible to return the overall
2332 The second argument of \fBpcre2_match_data_create()\fP is a pointer to a
2337 For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a
2338 pointer to a compiled pattern. The ovector is created to be exactly the right
2339 size to hold all the substrings a pattern might capture. The second argument is
2340 again a pointer to a general context, but in this case if NULL is passed, the
2341 memory is obtained using the same allocator that was used for the compiled
2358 When a call of \fBpcre2_match()\fP fails, valid data is available in the match
2359 block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ERROR_PARTIAL, or one
2360 of the error codes for an invalid UTF string. Exactly what is available depends
2361 on the error, and is detailed below.
2363 When one of the matching functions is called, pointers to the compiled pattern
2369 When a match data block itself is no longer needed, it should be freed by
2370 calling \fBpcre2_match_data_free()\fP. If this function is called with a NULL
2384 The function \fBpcre2_match()\fP is called to match a subject string against a
2385 compiled pattern, which is passed in the \fIcode\fP argument. You can call
2390 This function is the main matching facility of the library, and it operates in
2391 a Perl-like manner. For specialist use there is also an alternative matching
2392 function, which is described
2399 Here is an example of a simple call to \fBpcre2_match()\fP:
2411 If the subject string is zero-terminated, the length can be given as
2424 The subject string is passed to \fBpcre2_match()\fP as a pointer in
2427 That is, they are in bytes for the 8-bit library, 16-bit code units for the
2429 UTF processing is enabled.
2431 If \fIstartoffset\fP is greater than the length of the subject,
2432 \fBpcre2_match()\fP returns PCRE2_ERROR_BADOFFSET. When the starting offset is
2439 A non-zero starting offset is useful when searching for another match in the
2448 the current position in the subject is not a word boundary.) When applied to
2450 occurrence. If \fBpcre2_match()\fP is called again with just the remainder of
2451 the subject, namely "issipi", it does not match, because \eB is always false at
2452 the start of the subject, which is deemed to be a word boundary. However, if
2453 \fBpcre2_match()\fP is passed the entire string again, but with
2455 is able to look behind the starting point to discover that it is preceded by a
2458 Finding all the matches in a subject is tricky when the pattern can match an
2459 empty string. It is possible to emulate Perl's /g behaviour by first trying the
2462 and trying an ordinary match again. There is some code that demonstrates how to
2469 character is CR followed by LF, advance the starting offset by two characters
2472 If a non-zero starting offset is passed when the pattern is anchored, a single
2473 attempt to match at the given offset is made. This can only succeed if the
2487 Their action is described below.
2489 Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not supported by
2490 the just-in-time (JIT) compiler. If it is set, JIT matching is disabled and the
2491 interpretive code in \fBpcre2_match()\fP is run. Apart from PCRE2_NO_JIT
2504 If the PCRE2_ENDANCHORED option is set, any string that \fBpcre2_match()\fP
2510 This option specifies that the first character of the subject string is not the
2518 This option specifies that the end of the subject string is not the end of a
2527 An empty string is not considered to be a valid match if this option is set. If
2534 string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
2540 This is like PCRE2_NOTEMPTY, except that it locks out an empty string match
2541 only at the first matching position, that is, at the start of the subject plus
2542 the starting offset. An empty string match later in the subject is permitted.
2543 If the pattern is anchored, such a match can occur only if the pattern contains
2549 \fBpcre2_jit_compile()\fP, JIT is automatically used when \fBpcre2_match()\fP
2555 When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
2556 string is checked by default when \fBpcre2_match()\fP is subsequently called.
2557 If a non-zero starting offset is given, the check is applied only to that part
2558 of the subject that could be inspected during matching, and there is a check
2566 The check is carried out before any other processing takes place, and a
2567 negative error code is returned if the check fails. There are several UTF error
2589 If you know that your subject is valid, and you want to skip these checks for
2595 \fBWarning:\fP When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
2596 string as a subject, or an invalid value of \fIstartoffset\fP, is undefined.
2603 the end of the subject string is reached successfully, but there are not enough
2605 PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
2606 testing any remaining alternatives. Only if no complete match can be found is
2608 PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
2611 If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
2612 a partial match is found, \fBpcre2_match()\fP immediately returns
2614 words, when PCRE2_PARTIAL_HARD is set, a partial match is considered to be more
2617 There is a more detailed discussion of partial and multi-segment matching, with
2629 When PCRE2 is built, a default newline convention is set; this is usually the
2648 starting position is advanced after a match failure for an unanchored pattern.
2650 When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as
2652 when the current starting position is at a CRLF sequence, and the pattern
2653 contains no explicit matches for CR or LF characters, the match position is
2656 The above rule is a compromise that makes the most common cases work as
2657 expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is
2663 An explicit match for CR or LF is either a literal appearance of one of those
2668 Notwithstanding the above, anomalous effects may still occur when CRLF is a
2685 book, this is called "capturing" in what follows, and the phrase "capturing
2686 subpattern" or "capturing group" is used for a fragment of a pattern that picks
2705 called the \fIovector\fP, which contains the offsets of captured strings. It is
2715 Within the ovector, the first in each pair of values is set to the offset of
2716 the first code unit of a substring, and the second is set to the offset of the
2718 offsets, not character offsets. That is, they are byte offsets in the 8-bit
2723 of offsets (that is, \fIovector[0]\fP and \fIovector[1]\fP) are set. They
2732 pair is used for the first captured substring, and so on. The value returned by
2733 \fBpcre2_match()\fP is one more than the highest numbered pair that has been
2734 set. For example, if two substrings have been captured, the returned value is
2736 match is 1, indicating that just the first pair of offsets has been set.
2740 For example, if the pattern (?=ab\eK) is matched against "ab", the start and
2743 If a capturing group is matched repeatedly within a single match
2744 operation, it is the last portion of the subject that it matched that is
2747 If the ovector is too small to hold all the captured substring offsets, as much
2748 as possible is filled in, and the function returns a value of zero. If captured
2750 data block whose ovector is of minimum length (that is, one pair).
2752 It is possible for capturing subpattern number \fIn+1\fP to match some part of
2754 the string "abc" is matched against the pattern (a|(z))(bc) the return from the
2755 function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this
2760 expression are also set to PCRE2_UNSET. For example, if the string "abc" is
2762 The return from the function is 2, because the highest used capturing
2763 subpattern number is 1. The offsets for the second and third capturing
2764 subpatterns (assuming the vector is large enough, of course) are set to
2768 pattern are never changed. That is, if a pattern contains \fIn\fP capturing
2784 As well as the offsets in the ovector, other information about a match is
2786 appropriate circumstances. If they are called at other times, the result is
2793 zero-terminated name, which is within the compiled pattern. If no name is
2794 available, NULL is returned. The length of the name (excluding the terminating
2795 zero) is stored in the code unit that precedes the name. You should use this
2799 After a successful match, the name that is returned is the last (*MARK),
2802 if the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned.
2803 After a "no match" or a partial match, the last encountered name is returned.
2808 When it matches "bc", the returned name is A. The B mark is "seen" in the first
2809 branch of the group, but it is not on the matching path. On the other hand,
2810 when this pattern fails to match "bx", the returned name is B.
2814 is removed from the pattern above, there is an initial check for the presence
2825 escape sequence. After a partial match, however, this value is always the same
2848 with them. The codes are given names in the header file. If UTF checking is in
2849 force and an invalid UTF subject string is detected, one of a number of
2850 UTF-specific negative error codes is returned. Details are given in the
2872 catch the case when it is passed a junk pointer. This is the error that is
2873 returned when the magic number is not present.
2877 This error is given when a compiled pattern is passed to a function in a
2879 the 8-bit library is passed to a 16-bit or 32-bit library function.
2898 This error is never generated by \fBpcre2_match()\fP itself. It is provided for
2921 This error is returned when a pattern that was successfully studied using JIT
2923 stack is not large enough. See the
2935 If a pattern contains many nested backtracking points, heap memory is used to
2936 remember them. This error is given when the memory allocation function (default
2937 or custom) fails. Note that a different error, PCRE2_ERROR_HEAPLIMIT, is given
2947 This error is returned when \fBpcre2_match()\fP detects a recursion loop within
2968 code unit buffer and its length in code units, into which the text message is
2969 placed. The message is returned in code units of the appropriate width for the
2970 library that is being used.
2972 The returned message is terminated with a trailing zero, and the function
2974 error number is unknown, the negative error code PCRE2_ERROR_BADDATA is
2975 returned. If the buffer is too small, the message is truncated (but still with
2976 a trailing zero), and the negative error code PCRE2_ERROR_NOMEMORY is returned.
2977 None of the messages are very long; a buffer size of 120 code units is ample.
3006 a binary zero is correctly extracted and has a further zero added on the end,
3007 but the result is not, of course, a C string.
3012 substring zero is available. An attempt to extract any other substring gives
3018 For example, if the pattern (?=ab\eK) is matched against "ab", the start and
3024 argument is a pointer to the match data block, the second is the group number,
3025 and the third is a pointer to a variable into which the length is placed. If
3037 This is updated to contain the actual number of code units used for the
3043 zero. When the substring is no longer needed, the memory should be freed by
3046 The return value from all these functions is zero for success, or a negative
3047 error code. If the pattern match failed, the match failure code is returned.
3048 If a substring number greater than zero is used after a partial match,
3049 PCRE2_ERROR_PARTIAL is returned. Other possible error codes are:
3058 There is no substring with that number in the pattern, that is, the number is
3064 pattern, is greater than the number of slots in the ovector, so the substring
3069 The substring did not participate in the match. For example, if the pattern is
3070 (abc)|(def) and the subject is "def", and the ovector contains at least two
3071 capturing slots, substring number 1 is unset.
3087 that is added to each of them. All this is done in a single block of memory
3088 that is obtained using the same memory allocation function that was used to get
3092 partial match, the error code PCRE2_ERROR_PARTIAL is returned.
3094 The address of the memory block is returned via \fIlistptr\fP, which is also
3095 the start of the list of string pointers. The end of the list is marked by a
3096 NULL pointer. The address of the list of lengths is returned via
3100 function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the memory block
3101 could not be obtained. When the list is no longer needed, it should be freed by
3104 If this function encounters a substring that is unset, which can happen when
3137 the number of the subpattern called "xxx" is 2. If the name is known to be
3139 calling \fBpcre2_substring_number_from_name()\fP. The first argument is the
3140 compiled pattern, and the second is the name. The yield of the function is the
3141 subpattern number, PCRE2_ERROR_NOSUBSTRING if there is no subpattern of that
3142 name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one subpattern of
3147 "bynumber" functions, the only difference being that the second argument is a
3148 name instead of a number. If PCRE2_DUPNAMES is set and there are duplicate
3150 first named string that is set.
3152 If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
3154 number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is returned. If there
3155 is at least one group with a slot in the ovector, but no group is found to be
3156 set, PCRE2_ERROR_UNSET is returned.
3188 \fIreplacement\fP string, whose length is supplied in \fIrlength\fP. This can
3199 data block is obtained and freed within this function, using memory management
3203 If an external \fImatch_data\fP block is provided, its contents afterwards
3209 length, in code units, of the output buffer. If the function is successful, the
3210 value is updated to contain the length of the new string, excluding the
3211 trailing zero that is automatically added.
3213 If the function is not successful, the value set via \fIoutlengthptr\fP depends
3214 on the type of error. For syntax errors in the replacement string, the value is
3216 errors, the value is PCRE2_UNSET by default. This includes the case of the
3217 output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set
3218 (see below), in which case the value is the minimum length needed, including
3222 length is in code units, not bytes.
3224 In the replacement string, which is interpreted as a UTF string in UTF mode,
3225 and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
3226 dollar character is an escape character that can specify the insertion of
3237 For example, if the pattern a(b)c is matched with "=abc=" and the replacement
3238 string "+$1$0$1+", the result is "=+babcb+=".
3243 the name inserted is "A", but for (*MARK:A)(*PRUNE:B) the relevant name is "B".
3255 replacing every matching substring. If this option is not set, only the first
3256 matching substring is replaced. The search for matches takes place in the
3257 original subject string (that is, previous replacements do not affect it).
3258 Iteration is implemented by advancing the \fIstartoffset\fP value for each
3259 search, which is always passed the entire subject string. If an offset limit is
3260 set in the match context, searching stops when that limit is reached.
3264 limit. Here is a \fBpcre2test\fP example:
3271 length, an attempt to find a non-empty match at the same offset is performed.
3272 If this is not successful, the offset is advanced by one character except when
3273 CRLF is a valid newline sequence and the next two characters are CR, LF. In
3274 this case, the offset is advanced by two characters.
3276 PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
3277 too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
3278 this option is set, however, \fBpcre2_substitute()\fP continues to go through
3280 in order to compute the size of buffer that is needed. This value is passed
3284 Passing a buffer size of zero is a permitted way of finding out how much memory
3286 operation is carried out twice. Depending on the application, it may be more
3296 groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty
3297 strings when inserted as described above. If this option is not set, an attempt
3302 replacement string. Without this option, only the dollar character is special,
3304 PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
3306 Firstly, backslash in a replacement string is interpreted as an escape
3317 \eu and \el force the next character (if it is a letter) to upper or lower
3323 the result of processing "\eUaa\eLBB\eEcc\eE" is "AAbbcc"; the final \eE has no
3326 The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
3327 flexibility to group substitution. The syntax is similar to that used by Bash:
3333 default value. If group <n> is set, its value is inserted; if not, <string> is
3335 expanded and inserted when group <n> is set or unset, respectively. The first
3336 form is just a convenient shorthand for
3355 were made. This may be zero if no matches were found, and is never greater than
3356 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
3358 In the event of an error, a negative error code is returned. Except for
3359 PCRE2_ERROR_NOMATCH (which is never returned), errors from \fBpcre2_match()\fP
3362 PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring insertion,
3363 unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
3365 PCRE2_ERROR_UNSET is returned for an unset substring insertion (including an
3366 unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple
3367 (non-extended) syntax is used and PCRE2_SUBSTITUTE_UNSET_EMPTY is not set.
3369 PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the
3370 PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is
3371 needed is returned via \fIoutlengthptr\fP. Note that this does not happen by
3374 PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
3380 subject, which can happen if \eK is used in an assertion).
3399 When a pattern is compiled with the PCRE2_DUPNAMES option, names for
3405 one of the named subpatterns participates. An example is shown in the
3413 the given name that is set. Only if none are set is PCRE2_ERROR_UNSET
3419 argument is the compiled pattern, and the second is the name. If the third and
3427 PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
3429 The format of the name table is described
3447 callout facility, which is described in the
3453 What you have to do is to insert a callout right at the end of the pattern.
3454 When your callout function is called, extract and save the current matched
3472 The function \fBpcre2_dfa_match()\fP is called to match a subject string
3475 This has different characteristics to the normal algorithm, and is not
3487 is used in a different way, and this is described below. The other common
3489 description is not repeated here.
3492 vector should contain at least 20 elements. It is used for keeping track of
3493 multiple paths through the pattern tree. More workspace is needed for patterns
3496 Here is an example of a simple call to \fBpcre2_dfa_match()\fP:
3519 for \fBpcre2_match()\fP, so their description is not repeated here.
3525 details are slightly different. When PCRE2_PARTIAL_HARD is set for
3527 subject is reached and there is still at least one matching possibility that
3529 already been found. When PCRE2_PARTIAL_SOFT is set, the return code
3530 PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL if the end of the
3531 subject is reached, there have been no complete matches, but there is still at
3533 when the longest partial match was found is set as the first matching string in
3534 both cases. There is a more detailed discussion of partial and multi-segment
3545 works, this is necessarily the shortest possible match at the first possible
3550 When \fBpcre2_dfa_match()\fP returns a partial match, it is possible to call it
3552 match. The PCRE2_DFA_RESTART option requests this action; when it is set, the
3554 before because data about the match so far is left in them after a partial
3555 match. There is more discussion of this facility in the
3574 This is <something> <something else> <something further> no more
3582 On success, the yield of the function is a number greater than zero, which is
3595 is, the longest matching string is first. If there were too many matches to fit
3596 into the ovector, the yield of the function is zero, and the vector is filled
3601 pattern "a\ed+" is compiled as if it were "a\ed++". For DFA matching, this
3602 means that only one possible match is found. If you really do want multiple
3621 This return is given if \fBpcre2_dfa_match()\fP encounters an item in the
3627 This return is given if \fBpcre2_dfa_match()\fP encounters a condition item
3633 This return is given if \fBpcre2_dfa_match()\fP runs out of space in the
3638 When a recursive subpattern is processed, the matching function calls itself
3640 error is given if the internal ovector is not large enough. This should be
3641 extremely rare, as a vector of size 1000 is used.
3645 When \fBpcre2_dfa_match()\fP is called with the \fBPCRE2_DFA_RESTART\fP option,
3648 fail, this error is given.