pcre2api.3 - OpenGrok cross reference for /external/pcre/dist2/doc/pcre2api.3

Lines Matching full:the
8 functions. See the
12 document for an overview of all the PCRE2 documentation.
258 This contains the function prototypes and other definitions for all three
260 systems the libraries are called \fBlibpcre2-8\fP, \fBlibpcre2-16\fP, and
261 \fBlibpcre2-32\fP, and they can also co-exist with the original PCRE libraries.
264 unsigned integers in code units of the appropriate width. Every PCRE2 function
276 The UCHAR types define unsigned code units of the appropriate widths. For
277 example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR types are
278 constant pointers to the equivalent UCHAR types, that is, they are pointers to
282 are defined whose names are the generic forms such as \fBpcre2_compile()\fP and
283 PCRE2_SPTR. These macros use the value of the macro PCRE2_CODE_UNIT_WIDTH to
284 generate the appropriate width-specific function and macro names.
286 to be 8, 16, or 32 before including \fBpcre2.h\fP in order to make use of the
291 including \fBpcre2.h\fP, and then use the real function names. Any code that is
292 to be included in an environment where the value of PCRE2_CODE_UNIT_WIDTH is
293 unknown should also use the real function names. (Unfortunately, it is not
294 possible in C code to save and restore the value of a macro.)
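The way a single width macro can select width-specific names may be easier to see in a small stand-alone sketch. Everything prefixed `demo_` or `DEMO_` below is hypothetical and is not taken from \fBpcre2.h\fP; the sketch only illustrates the general token-pasting technique, assuming a width of 16.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PCRE2_CODE_UNIT_WIDTH; set before the macros are used. */
#define DEMO_CODE_UNIT_WIDTH 16

/* Two-level paste so that DEMO_CODE_UNIT_WIDTH is expanded before ## is applied. */
#define DEMO_GLUE(a, b) a##b
#define DEMO_JOIN(a, b) DEMO_GLUE(a, b)
#define DEMO_SUFFIX(name) DEMO_JOIN(name, DEMO_CODE_UNIT_WIDTH)

/* Width-specific code unit types (the "real" names in this sketch). */
typedef uint8_t  demo_uchar_8;
typedef uint16_t demo_uchar_16;
typedef uint32_t demo_uchar_32;

/* Width-specific functions: count code units up to the terminating zero. */
size_t demo_length_8(const demo_uchar_8 *s)   { size_t n = 0; while (s[n] != 0) n++; return n; }
size_t demo_length_16(const demo_uchar_16 *s) { size_t n = 0; while (s[n] != 0) n++; return n; }
size_t demo_length_32(const demo_uchar_32 *s) { size_t n = 0; while (s[n] != 0) n++; return n; }

/* Generic names that resolve to the width-specific ones at compile time. */
#define demo_uchar  DEMO_SUFFIX(demo_uchar_)
#define demo_length DEMO_SUFFIX(demo_length_)

/* Uses only generic names; with width 16 this calls demo_length_16(). */
size_t demo_sample_length(void)
{
    static const demo_uchar sample[] = { 'a', 'b', 'c', 0 };
    return demo_length(sample);
}
```

Because the generic names are fixed when the macros are expanded, code that must work with more than one width in the same source file has to call the width-specific names directly.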
305 In the function summaries above, and in the rest of this document and other
307 names, without the 8, 16, or 32 suffix.
314 also some wrapper functions for the 8-bit library that correspond to the
315 POSIX regular expression API, but they do not give access to all the
316 functionality. They are described in the
323 codes are defined in the header file \fBpcre2.h\fP, which contains definitions
324 of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release numbers for the
334 sample program that demonstrates the simplest way of using them is provided in
335 the file called \fIpcre2demo.c\fP in the PCRE2 source distribution. A listing
336 of this program is given in the
340 documentation, and the
347 in appropriate hardware environments. It greatly speeds up the matching
353 More complicated programs might need to make use of the specialist functions
355 \fBpcre2_jit_stack_assign()\fP in order to control the JIT code's memory usage.
358 unless the PCRE2_NO_JIT option is set. There is also a direct interface for JIT
359 matching, which gives improved performance. The JIT-specific functions are
360 discussed in the
367 Perl-compatible, is also provided. This uses a different algorithm for the
368 matching. The alternative algorithm finds all possible matches (at a given
369 point in the subject), and scans the subject just once (unless there are
371 substrings. A description of the two matching algorithms and their advantages
372 and disadvantages is given in the
378 In addition to the main compiling and matching functions, there are convenience
393 provided, to free the memory used for extracted strings.
396 return a copy of the subject string with substitutions for parts that were
403 pattern (\fBpcre2_pattern_info()\fP) and about the configuration with which
416 unsigned integer type, currently always defined as \fIsize_t\fP. The largest
419 Therefore, the longest string that can be handled is one less than this
429 character, the two-character sequence CRLF, any of the three preceding, or any
430 Unicode newline sequence. The Unicode newline sequences are the three just
431 mentioned, plus the single characters VT (vertical tab, U+000B), FF (form feed,
435 Each of the first three conventions is used by at least one operating system as
437 The default default is LF, which is the Unix standard. However, the newline
439 or it can be specified by special text at the start of the pattern itself; this
440 overrides any other settings. See the
444 page for details of the special character sequences.
446 In the PCRE2 documentation the word "newline" is used to mean "the character or
447 pair of characters that indicate a line break". The choice of newline
448 convention affects the handling of the dot, circumflex, and dollar
449 metacharacters, the handling of #-comments in /x mode, and, when CRLF is a
450 recognized line ending sequence, the match position advancement for a
451 non-anchored pattern. There is more detail about this in the
458 The choice of newline convention does not affect the interpretation of
467 separate from data that can be shared between threads. The PCRE2 library code
468 itself is thread-safe: it contains no static or global variables. The API is
469 designed to be fairly simple for non-threaded applications while at the same
473 between the application and the PCRE2 libraries.
476 .SS "The compiled pattern"
479 A pointer to the compiled form of a pattern is returned to the user when
480 \fBpcre2_compile()\fP is successful. The data in the compiled pattern is fixed,
481 and does not change when the pattern is matched. Therefore, it is thread-safe,
482 that is, the same compiled pattern can be used by more than one thread
483 simultaneously. For example, an application can compile all its patterns at the
484 start, before forking off multiple threads that use them. However, if the
486 areas for each thread. See the
495 least until a pattern has been compiled. The logic can be something like this:
503   Release the lock
506 Of course, testing for compilation errors should also be included in the code.
508 If JIT is being used, but the JIT compilation is not being done immediately,
509 (perhaps waiting to see if the pattern is used often enough) similar logic is
510 required. JIT compilation updates a pointer within the compiled code block, so
511 a thread must gain unique write access to the pointer before calling
513 to obtain a private copy of the compiled code.
519 The next main section below introduces the idea of "contexts" in which PCRE2
521 that control the way PCRE2 operates. Grouping a number of parameters together
523 using lots of arguments. The parameters that are stored in contexts are in some
524 sense "advanced features" of the API. Many straightforward applications will
527 In a multithreaded application, if the parameters in a context are values that
528 are never changed, the same context can be used by all the threads. However, if
538 additional information such as the name of a (*MARK) setting. Each thread must
548 reasonable size, and at the same time to keep the API extensible, "uncommon"
550 directly. A context is just a block of memory that holds the parameter values.
551 Applications that do not need to adjust any of the context parameters can pass
558 .SS "The general context"
562 memory management functions that are called from several places in the PCRE2
563 library. The context is named `general' rather than specifically `memory'
580 Whenever code in PCRE2 calls these functions, the final argument is the value
581 of \fImemory_data\fP. Either of the first two arguments of the creation
582 function may be NULL, in which case the system memory management functions
586 storing the context, and all three values are saved as part of the context.
588 Whenever PCRE2 creates a data block of any kind, the block contains a pointer
589 to the \fIfree()\fP function that matches the \fImalloc()\fP function that was
590 used. When the time comes to free the block, this function is called.
608 .SS "The compile context"
611 A compile context is required if you want to change the default values of any
612 of the following compile-time parameters:
616   The newline character sequence
617   The compile time nested parentheses limit
618   The maximum length of the pattern string
622 If none of these apply, just pass NULL as the context argument of
625 A compile context is created, copied, and freed by the following functions:
638 be changed by calling the following functions, which return 0 on success, or
648 ending sequence. The value is used by the JIT compiler and by the two
657 The value must be the result of a call to \fIpcre2_maketables()\fP, whose only
659 in the current locale.
666 This sets a maximum length, in code units, for the pattern string that is to be
667 compiled. If the pattern is longer, an error is generated. This facility is
669 limit their size. The default is the largest number that a PCRE2_SIZE variable
678 newlines. The value must be one of PCRE2_NEWLINE_CR (carriage return only),
679 PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the two-character
680 sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any of the above), or
683 When a pattern is compiled with the PCRE2_EXTENDED option, the value of this
684 parameter affects the recognition of white space and the end of internal
685 comments starting with #. The value is saved with the compiled pattern for
686 subsequent use by the JIT compiler and by the two interpreted matching
694 This parameter adjusts the limit, set when PCRE2 is built (default 250), on the
704 system stack, where running out of stack is to be avoided at all costs. The
708 pattern. This function can check the actual stack size (or anything else that
711 The first argument to the callout function gives the current depth of
712 nesting, and the second is user data that is set up by the last argument of
713 \fBpcre2_set_compile_recursion_guard()\fP. The callout function should return
718 .SS "The match context"
721 A match context is required if you want to change the default values of any
722 of the following match-time parameters:
725   The offset limit for matching an unanchored pattern
726   The limit for calling \fBmatch()\fP (see below)
727   The limit for calling \fBmatch()\fP recursively
730 If none of these apply, just pass NULL as the context argument of
733 A match context is created, copied, and freed by the following functions:
746 be changed by calling the following functions, which return 0 on success, or
756 during a matching operation. Details are given in the
768 advance in the subject string. The default value is PCRE2_UNSET. The
770 PCRE2_ERROR_NOMATCH if a match with a starting point before or at the given
771 offset is not found. For example, if the pattern /abc/ is matched against
772 "123abc" with an offset limit less than 3, the result is PCRE2_ERROR_NOMATCH.
773 A match can never be found if the \fIstartoffset\fP argument of
774 \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP is greater than the offset
783 subject strings. See also the PCRE2_FIRSTLINE option, which requires a match to
784 start within the first line of the subject. If this is set with an offset
785 limit, a match must occur in the first line and also within the offset limit.
795 which have a very large number of possibilities in their search trees. The
799 calls repeatedly (sometimes recursively). The limit set by \fImatch_limit\fP is
800 imposed on the number of times this function is called during a match, which
801 has the effect of limiting the amount of backtracking that can take place. For
802 patterns that are not anchored, the count restarts from zero for each position
803 in the subject string. This limit is not relevant to \fBpcre2_dfa_match()\fP,
807 processed by \fBpcre2_jit_compile()\fP, the way in which matching is executed
808 is entirely different. However, there is still the possibility of runaway
809 matching that goes on for a very long time, and so the \fImatch_limit\fP value
810 is also used in this case (but in a different way) to limit how long the
813 The default value for the limit can be set when PCRE2 is built; the default
814 default is 10 million, which handles all but the most extreme cases. If the
816 for the match limit may also be supplied by an item at the start of a pattern
817 of the form
822 less than the limit set by the caller of \fBpcre2_match()\fP or, if no such
823 limit is set, less than the default.
831 instead of limiting the total number of times that \fBmatch()\fP is called, it
832 limits the depth of recursion. The recursion depth is a smaller number than the
836 Limiting the recursion depth limits the amount of system stack that can be
837 used, or, when PCRE2 has been compiled to use memory on the heap instead of the
838 stack, the amount of heap memory that can be used. This limit is not relevant,
839 and is ignored, when matching is done using JIT compiled code or by the
842 The default value for \fIrecursion_limit\fP can be set when PCRE2 is built; the
843 default default is the same value as the default for \fImatch_limit\fP. If the
845 value for the recursion limit may also be supplied by an item at the start of a
846 pattern of the form
851 less than the limit set by the caller of \fBpcre2_match()\fP or, if no such
852 limit is set, less than the default.
862 by \fBpcre2_match()\fP when PCRE2 is compiled to use the heap for remembering
863 backtracking data, instead of recursive function calls that use the system
864 stack. There is a discussion about PCRE2's stack usage in the
868 documentation. See the
874 Using the heap for recursion is a non-standard way of building PCRE2, for use
875 in environments that have limited stacks. Because of the greater use of memory
877 to the general custom memory functions are provided so that special-purpose
878 external code can be used for this case, because the memory blocks are all the
879 same size. The blocks are retained by \fBpcre2_match()\fP until it is about to
880 exit so that they can be re-used when possible during the match. In the absence
881 of these functions, the normal custom memory management functions are used, if
882 supplied, otherwise the system functions.
891 discover which optional features have been compiled into the PCRE2 library. The
898 required. The second argument is a pointer to memory into which the information
899 is placed. If NULL is passed, the function returns the amount of memory that is
900 needed for the requested information. For calls that return numerical values,
902 to appropriately aligned memory. For calls that return strings, the required
903 length is given in code units, not counting the terminating zero.
905 When requesting information, the returned value from \fBpcre2_config()\fP is
906 non-negative on success, or the negative error code PCRE2_ERROR_BADOPTION if
907 the value in the first argument is not recognized. The following information is
913 sequences the \eR escape sequence matches by default. A value of
915 value of PCRE2_BSR_ANYCRLF means that \eR matches only CR, LF, or CRLF. The
926 units long. (The exact length required can be found by calling
927 \fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with a
928 string that contains the name of the architecture for which the JIT compiler is
930 is not available, PCRE2_ERROR_BADOPTION is returned, otherwise the number of
931 code units used is returned. This is the length of the string, plus one unit
932 for the terminating zero.
936 The output is a uint32_t integer that contains the number of bytes used for
937 internal linkage in compiled regular expressions. When PCRE2 is configured, the
938 value can be set to 2, 3, or 4, with the default being 2. This is the value
939 that is returned by \fBpcre2_config()\fP. However, when the 16-bit library is
940 compiled, a value of 3 is rounded up to 4, and when the 32-bit library is
941 compiled, internal linkages always use 4 bytes, so the configured value is not
944 The default value of 2 for the 8-bit and 16-bit libraries is sufficient for all
945 but the most massive patterns, since it allows the size of the compiled pattern
947 be compiled by those two libraries, but at the expense of slower matching.
951 The output is a uint32_t integer that gives the default limit for the number of
957 The output is a uint32_t integer whose value specifies the default character
958 sequence that is recognized as meaning "newline". The values are:
966 The default should normally correspond to the standard sequence for your
971 The output is a uint32_t integer that gives the maximum depth of nesting
972 of parentheses (of any kind) in a pattern. This limit is imposed to cap the
974 PCRE2 is built; the default is 250. This limit does not take into account the
975 stack that may already be used by the calling application. For finer control
980 The output is a uint32_t integer that gives the default limit for the depth of
981 recursion when calling the internal matching function in a \fBpcre2_match()\fP
988 the system stack to remember their state. This is the usual way that PCRE2 is
989 compiled. The output is zero if PCRE2 was compiled to use blocks of data on the
995 units long. (The exact length required can be found by calling
997 without Unicode support, the buffer is filled with the text "Unicode not
998 supported". Otherwise, the Unicode version string (for example, "8.0.0") is
999 inserted. The number of code units used is returned. This is the length of the
1000 string plus one unit for the terminating zero.
1010 units long. (The exact length required can be found by calling
1011 \fBpcre2_config()\fP with \fBwhere\fP set to NULL.) The buffer is filled with
1012 the PCRE2 version string, zero-terminated. The number of code units used is
1013 returned. This is the length of the string plus one unit for the terminating
1033 the pattern is zero-terminated, the length can be specified as
1034 PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that
1035 contains the compiled pattern and related data, or NULL if an error occurred.
1037 If the compile context argument \fIccontext\fP is NULL, memory for the compiled
1039 the same memory function that was used for the compile context. The caller must
1040 free the memory by calling \fBpcre2_code_free()\fP when it is no longer needed.
1042 The function \fBpcre2_code_copy()\fP makes a copy of the compiled code in new
1043 memory, using the same memory allocator as was used for the original. However,
1044 if the code has been processed by the JIT compiler (see
1051 passed to \fBpcre2_jit_compile()\fP if required. The \fBpcre2_code_copy()\fP
1055 NOTE: When one of the matching functions is called, pointers to the compiled
1056 pattern and the subject string are set in the match data block so that they can
1057 be referenced by the substring extraction functions. After running a match, you
1059 operations on the
1067 settings that affect the compilation. It should be zero if no options are
1068 required. The available options are described below. Some of them (in
1070 also be set and unset from within the pattern (see the detailed description in
1077 For those options that can be different in different parts of the pattern, the
1078 contents of the \fIoptions\fP argument specifies their settings at the start of
1079 compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK options can be set at
1082 Other, less frequently required compile-time parameters (for example, the
1090 NULL immediately. Otherwise, the variables to which these point are set to an
1091 error code and an offset (number of code units) within the pattern,
1093 error has occurred. The values are not defined when compilation is successful
1104 UTF-8 or UTF-16 string, the offset is that of the first code unit of the
1107 Some errors are not detected until the whole pattern has been scanned; in these
1108 cases, the offset passed back is the length of the pattern. Note that the
1110 point into the middle of a UTF-8 or UTF-16 character.
1119     "^A.*Z",                /* the pattern */
1120     PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
1126 The following names for option bits are defined in the \fBpcre2.h\fP header
1131 If this bit is set, the pattern is forced to be "anchored", that is, it is
1132 constrained to match only at the first matching point in the string that is
1133 being searched (the "subject string"). This effect can also be achieved by
1134 appropriate constructs in the pattern itself, which is the only way to do it in
1140 immediately follows an opening one is treated as a data character for the
1141 class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the class, which
1153 hexadecimal digits, in which case the hexadecimal number defines the code point
1155 case the following character).
1158 hexadecimal digits, in which case the hexadecimal number defines the code point
1165 In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter
1166 matches at the start of the subject (unless PCRE2_NOTBOL is set), and also
1167 after any internal newline. However, it does not match after a newline at the
1168 end of the subject, for compatibility with Perl. If you want a multiline
1174 By default, for compatibility with Perl, the name in any verb sequence such as
1176 parenthesis. The name is not processed in any way, and it is not possible to
1177 include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
1179 unescaped closing parenthesis terminates the name. A closing parenthesis can be
1180 included in a name either as \e) or between \eQ and \eE. If the PCRE2_EXTENDED
1182 recognized, exactly as in the rest of the pattern.
1187 all with number 255, before each pattern item. For discussion of the callout
1188 facility, see the
1196 If this bit is set, letters in the pattern match both upper and lower case
1197 letters in the subject. It is equivalent to Perl's /i option, and it can be
1202 If this bit is set, a dollar metacharacter in the pattern matches only at the
1203 end of the subject string. Without this option, a dollar also matches
1204 immediately before a newline at the end of the string (but not before any other
1205 newlines). The PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is
1211 If this bit is set, a dot metacharacter in the pattern matches any character,
1214 not match when the current position in the subject is at a newline. This option
1217 characters, independent of the setting of this option.
1223 only one instance of the named subpattern can ever be matched. There are more
1224 details of named subpatterns below; see also the
1232 If this bit is set, most white space characters in the pattern are totally
1240 character class and the next newline, inclusive, to be ignored, which makes it
1241 possible to include comments inside complicated patterns. Note that the end of
1242 this type of comment is a literal newline sequence in the pattern; escape
1249 sequence at the start of the pattern, as described in the section entitled
1254 in the \fBpcre2pattern\fP documentation. A default is defined when PCRE2 is
1260 the first newline in the subject string, though the matched text may continue
1261 over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
1263 match must occur in the first line and also within the offset limit. In other
1269 empty string (by default this causes the current matching alternative to fail).
1271 find an "a" in the subject), whereas it fails by default, for Perl
1277 By default, for the purposes of matching "start of line" and "end of line",
1278 PCRE2 treats the subject string as consisting of a single line of characters,
1279 even if it actually contains newlines. The "start of line" metacharacter (^)
1280 matches only at the start of the string, and the "end of line" metacharacter
1281 ($) matches only at the end of the string, or before a terminating newline
1283 PCRE2_DOTALL is set, the "any character" metacharacter (.) does not match at a
1284 newline. This behaviour (for ^, $, and dot) is the same as Perl.
1286 When PCRE2_MULTILINE is set, the "start of line" and "end of line"
1288 in the subject string, respectively, as well as at the very start and end. This
1290 (?m) option setting. Note that the "start of line" metacharacter does not match
1291 after a newline at the end of the subject, for compatibility with Perl.
1292 However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
1298 This option locks out the use of \eC in the pattern that is being compiled.
1300 it may leave the current matching point in the middle of a multi-code-unit
1303 locks out the use of \eC.
1307 This option locks out the use of Unicode properties for handling \eB, \eb, \eD,
1308 \ed, \eS, \es, \eW, \ew, and some of the POSIX character classes, as described
1309 for the PCRE2_UCP option below. In particular, it prevents the creator of the
1310 pattern from enabling this facility by starting the pattern with (*UCP). This
1312 sources. The option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
1316 This option locks out interpretation of the pattern as UTF-8, UTF-16, or
1317 UTF-32, depending on which library is in use. In particular, it prevents the
1318 creator of the pattern from switching to UTF interpretation by starting the
1320 patterns from external sources. The combination of PCRE2_UTF and
1325 If this option is set, it disables the use of numbered capturing parentheses in
1328 they acquire numbers in the usual way). There is no equivalent of this option
1331 though the reference can be by name or by number.
1339 set this option if you want the matching functions to do a full unoptimized
1340 search and run all the callouts, but it is mainly provided for testing
1346 the first significant item in a top-level branch of a pattern, and all the
1347 other branches also start with .* or with \eA or \eG or ^. The optimization is
1349 group that is the subject of a back reference, or if the pattern contains
1350 (*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
1351 automatically anchored if PCRE2_DOTALL is set for all the .* items and
1352 PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
1353 must start either at the start of the subject or following a newline is
1359 what \fBpcre2_compile()\fP generates, but it does affect the output of the JIT
1362 There are a number of optimizations that may occur at the start of a match, in
1363 order to speed up the process. For example, if it is known that an unanchored
1364 match must start with a specific character, the matching code searches the
1366 actually running the main matching function. This means that a special item
1367 such as (*COMMIT) at the start of a pattern is not considered until after a
1368 suitable starting point for the match has been found. Also, when callouts or
1370 skipped if the pattern is never actually used. The start-up optimizations are
1371 in effect a pre-scan of the subject that takes place before the pattern is run.
1373 The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1374 possibly causing performance to suffer, but ensuring that in cases where the
1375 result is "no match", the callouts do occur, and that items such as (*COMMIT)
1376 and (*MARK) are considered at every possible starting position in the subject
1379 Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching operation.
1380 Consider the pattern
1384 When this is compiled, PCRE2 records the fact that a match must start with the
1385 character "A". Suppose the subject string is "DEFABC". The start-up
1386 optimization scans along the subject, finds "A" and runs the first match
1387 attempt from there. The (*COMMIT) item means that the pattern must match the
1388 current starting position, which in this case, it does. However, if the same
1389 match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
1390 subject string does not happen. The first match attempt is run starting from
1393 For example, a minimum length for the subject may be recorded. Consider the
1398 The minimum length for a match is one character. If the subject is "ABC", there
1400 string at the end of the subject does not take place, because PCRE2 knows that
1401 the subject is now too short, and so the (*MARK) is never encountered. In this
1402 case, the optimization does not affect the overall match result, which is still
1403 "no match", but it does affect the auxiliary information that is returned.
1407 When PCRE2_UTF is set, the validity of the pattern as a UTF string is
1408 automatically checked. There are discussions about the validity of
1422 in the
1431 performance reasons, you can set the PCRE2_NO_UTF_CHECK option. When it is set,
1435 checking of the subject string.
1439 This option changes the way PCRE2 processes \eB, \eb, \eD, \ed, \eS, \es, \eW,
1440 \ew, and some of the POSIX character classes. By default, only ASCII characters
1442 classify characters. More details are given in the section on
1447 in the
1451 page. If you set PCRE2_UCP, matching one of the items it affects takes much
1452 longer. The option is available only if PCRE2 has been compiled with Unicode
1457 This option inverts the "greediness" of the quantifiers so that they are not
1459 with Perl. It can also be set by a (?U) option setting within the pattern.
1467 the description of \fBpcre2_set_offset_limit()\fP in the
1472 that describes match contexts. See also the PCRE2_FIRSTLINE
1477 This option causes PCRE2 to regard both the pattern and the subject strings
1480 Unicode support (which is the default). If Unicode support is not available,
1482 the behaviour of PCRE2 are given in the
1493 (via \fIerrorcode\fP) if it finds an error in the pattern. There are also some
1494 negative error codes that are used for invalid UTF strings. These are the same
1496 in the
1500 page. The \fBpcre2_get_error_message()\fP function (see "Obtaining a textual
1532 These functions provide support for JIT compilation, which, if the just-in-time
1534 that executes much faster than the \fBpcre2_match()\fP interpretive matching
1535 function. Full details are given in the
1542 patterns to be analyzed, and for one-off matches and simple patterns the
1544 Most, but not all, patterns can be optimized by the JIT compiler.
1556 \ep and \eP, or, alternatively, the PCRE2_UCP option can be set when a pattern
1558 instead of the built-in tables.
1562 use locales, but not try to mix the two.
1565 These are sufficient for many applications. Normally, the internal tables
1567 to cause the internal tables to be rebuilt in the default "C" locale of the
1570 The internal tables can be overridden by tables supplied by the application
1571 that calls PCRE2. These may be created in a different locale from the default.
1572 As more and more applications change to using Unicode, the need for this locale
1575 External tables are built by calling the \fBpcre2_maketables()\fP function, in
1576 the relevant locale. The result can be passed to \fBpcre2_compile()\fP as often
1578 \fBpcre2_set_character_tables()\fP to set the tables pointer therein. For
1579 example, to build and use tables that are appropriate for the French locale
1581 letters), the following code could be used:
1590 are using Windows, the name for the French locale is "french". It is the
1591 caller's responsibility to ensure that the memory containing the tables remains
1594 The pointer that is passed (via the compile context) to \fBpcre2_compile()\fP
1595 is saved with the compiled pattern, and the same tables are used by
1597 compilation, and matching all happen in the same locale, but different patterns
1610 compiled pattern. For information about callouts, see the
1615 The first argument for \fBpcre2_pattern_info()\fP is a pointer to the compiled
1616 pattern. The second argument specifies which piece of information is required,
1617 and the third argument is a pointer to a variable to receive the data. If the
1618 third argument is NULL, the first argument is ignored, and the function returns
1619 the size in bytes of the variable that is required for the information
1620 requested. Otherwise, the yield of the function is zero for success, or one of
1623   PCRE2_ERROR_NULL           the argument \fIcode\fP was NULL
1624   PCRE2_ERROR_BADMAGIC       the "magic number" was not found
1625   PCRE2_ERROR_BADOPTION      the value of \fIwhat\fP was invalid
1626   PCRE2_ERROR_UNSET          the requested field is not set
1628 The "magic number" is placed at the start of each compiled pattern as a simple
1630 \fBpcre2_pattern_info()\fP, to obtain the length of the compiled pattern:
1637     &length);         /* where to put the data */
1639 The possible values for the second argument are defined in \fBpcre2.h\fP, and
1645 Return a copy of the pattern's options. The third argument should point to a
1646 \fBuint32_t\fP variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
1649 (*UTF) at the start of the pattern itself.
1651 For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
1652 option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED and PCRE2_UTF.
1653 Option settings such as (?i) that can change within a pattern do not affect the
1654 result of PCRE2_INFO_ALLOPTIONS, even if they appear right at the start of the
1658 the first significant item in every top-level branch is one of the following:
1665 When .* is the first significant item, anchoring is possible only when all the
1670   .* is not in a capturing group that is the subject
1673   Neither (*PRUNE) nor (*SKIP) appears in the pattern.
1676 For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
1681 Return the number of the highest back reference in the pattern. The third
1683 numbers as well as names, and these count towards the highest back reference.
1684 Back references such as \e4 or \eg{12} match the captured characters of the
1685 given group, but in addition, the check that a capturing group is set in a
1691 The output is a uint32_t whose value indicates what character sequences the \eR
1698 Return the highest capturing subpattern number in the pattern. In patterns
1699 where (?| is not used, this is also the total number of capturing subpatterns.
1704 In the absence of a single first code unit for a non-anchored pattern,
1706 values for the first code unit in any match. For example, a pattern that starts
1708 greater than 255 are supported, the flag bit for 255 means "any code unit of
1710 returned. Otherwise NULL is returned. The third argument should point to an
1715 Return information about the first code unit of any matched string, for a
1716 non-anchored pattern. The third argument should point to a \fBuint32_t\fP
1717 variable. If there is a fixed first value, for example, the letter "c" from a
1718 pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
1720 it is known that a match can occur only at the start of the subject or
1721 following a newline in the subject, 2 is returned. Otherwise, and for anchored
1726 Return the value of the first code unit of any matched string in the situation
1727 where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. The third
1728 argument should point to a \fBuint32_t\fP variable. In the 8-bit library, the
1729 value is always less than 256. In the 16-bit library the value can be up to
1730 0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
1735 Return 1 if the pattern contains any instances of \eC, otherwise 0. The third
1740 Return 1 if the pattern contains any explicit matches for CR or LF characters,
1741 otherwise 0. The third argument should point to a \fBuint32_t\fP variable. An
1746 Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
1747 0. The third argument should point to a \fBuint32_t\fP variable. (?J) and
1748 (?-J) set and unset the local PCRE2_DUPNAMES option, respectively.
1752 If the compiled pattern was successfully processed by
1753 \fBpcre2_jit_compile()\fP, return the size of the JIT compiled code, otherwise
1754 return zero. The third argument should point to a \fBsize_t\fP variable.
1759 matched string, other than at its start. The third argument should point to an
1761 returned, the code unit value itself can be retrieved using
1763 recorded only if it follows something of variable length. For example, for the
1764 pattern /^a\ed+z\ed+/ the returned value is 1 (with "z" returned from
1765 PCRE2_INFO_LASTCODEUNIT), but for /^a\edz\ed/ the returned value is 0.
1769 Return the value of the rightmost literal data unit that must exist in any
1770 matched string, other than at its start, if such a value has been recorded. The
1776 Return 1 if the pattern might match an empty string, otherwise 0. The third
1784 If the pattern set a match limit by including an item of the form
1785 (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
1786 should point to an unsigned 32-bit integer. If no such value has been set, the
1787 call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET.
1791 Return the number of characters (not code units) in the longest lookbehind
1792 assertion in the pattern. The third argument should point to an unsigned 32-bit
1793 integer. This information is useful when doing multi-segment matching using the
1794 partial matching facilities. Note that the simple assertions \eb and \eB
1796 lookbehind, though it does not actually inspect the previous character. This is
1797 to ensure that at least one character from the old segment is retained when a
1798 new segment is processed. Otherwise, if there are no lookbehinds in the
1799 pattern, \eA might match incorrectly at the start of a new segment.
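
Because the lookbehind length is counted in characters rather than code units, a caller doing multi-segment matching on UTF-8 data has to walk backwards over whole characters to decide how much of the old segment to keep. The following is a minimal sketch of that step, not part of PCRE2: `retain_from` is a hypothetical helper, the segment is assumed to be valid UTF-8, and `max_lookbehind` is assumed to be the value obtained from the lookbehind query described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the offset within a UTF-8 segment from which data should be kept
   so that up to max_lookbehind characters precede the next segment.
   Hypothetical helper for illustration; assumes valid UTF-8 input. */
static size_t retain_from(const uint8_t *seg, size_t len,
                          uint32_t max_lookbehind)
{
    size_t pos = len;
    assert(seg != NULL || len == 0);
    while (max_lookbehind > 0 && pos > 0) {
        pos--;                                  /* step back one code unit */
        while (pos > 0 && (seg[pos] & 0xC0) == 0x80)
            pos--;                              /* skip continuation bytes */
        max_lookbehind--;
    }
    return pos;
}
```

The bytes from the returned offset to the end of the old segment would be placed in front of the new segment, and the start offset for the next match set to the number of bytes retained.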
1804 returned. Otherwise the returned value is 0. The value is a number of
1805 characters, which in UTF mode may be different from the number of code units.
1806 The third argument should point to a \fBuint32_t\fP variable. The value is a
1807 lower bound to the length of any matching string. There may not be any strings
1815 PCRE2 supports the use of named as well as numbered capturing parentheses. The
1816 names are just an additional way of identifying the parentheses, which still
1819 substrings by name. It is also possible to extract the data directly, by first
1820 converting the name to a number in order to access the correct pointers in the
1821 output vector (described with \fBpcre2_match()\fP below). To do the conversion,
1822 you need to use the name-to-number map, which is described by these three
1826 the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
1827 entry in code units; both of these return a \fBuint32_t\fP value. The entry
1828 size depends on the length of the longest name.
1830 PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
1831 a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first
1832 two bytes of each entry are the number of the capturing parenthesis, most
1833 significant byte first. In the 16-bit library, the pointer points to 16-bit
1834 code units, the first of which contains the parenthesis number. In the 32-bit
1835 library, the pointer points to 32-bit code units, the first of which contains
1836 the parenthesis number. The rest of the entry is the corresponding name, zero
1840 with the same number, as described in the
1845 in the
1849 page, the groups may be given the same name, but there is only one entry in the
1850 table. Different names for groups of the same number are not permitted.
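
The 8-bit table layout described above (a two-byte group number, most significant byte first, followed by a zero-terminated name in a fixed-size entry) can be searched with a few lines of plain C. This is a sketch under stated assumptions: `name_table_lookup` is a hypothetical helper, not a PCRE2 function, and its `table`, `count`, and `entry_size` parameters are assumed to hold the values obtained through PCRE2_INFO_NAMETABLE, the name count query, and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Search an 8-bit name table for a group name.
   Each entry is entry_size bytes: bytes 0-1 hold the group number
   (most significant byte first), then the zero-terminated name.
   Returns the group number of the first matching entry, or -1.
   Hypothetical helper for illustration only. */
static int name_table_lookup(const uint8_t *table, uint32_t count,
                             uint32_t entry_size, const char *name)
{
    assert(name != NULL);
    assert(count == 0 || (table != NULL && entry_size >= 3));
    for (uint32_t i = 0; i < count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

When duplicate names are present, this sketch returns only the first entry it finds; a caller that needs every group with a given name would keep scanning instead of returning.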
1853 if PCRE2_DUPNAMES is set. They appear in the table in the order in which they
1854 were found in the pattern. In the absence of (?| this is the order of
1855 increasing number; when (?| is used this is not necessarily the case because
1858 As a simple example of the name/number table, consider the following pattern
1859 after compilation by the 8-bit library (assume PCRE2_EXTENDED is set, so white
1866 There are four named subpatterns, so the table has four entries, and each entry
1867 in the table is eight bytes long. The table is as follows, with non-printing
1875 When writing code to extract data from named subpatterns using the
1876 name-to-number map, remember that the length of the entries is likely to be
1881 The output is a \fBuint32_t\fP with one of the following values:
1889 This specifies the default character sequence that will be recognized as
1894 If the pattern set a recursion limit by including an item of the form
1895 (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The third
1897 set, the call to \fBpcre2_pattern_info()\fP returns the error PCRE2_ERROR_UNSET.
1901 Return the size of the compiled pattern in bytes (for all three libraries). The
1902 third argument should point to a \fBsize_t\fP variable. This value includes the
1903 size of the general data block that precedes the code units of the compiled
1904 pattern itself. The value that is used when \fBpcre2_compile()\fP is getting
1905 memory in which to place the compiled pattern may be slightly larger than the
1906 value returned by this option, because there are cases where the code that
1907 calculates the size has to over-estimate. Processing a pattern with the JIT
1908 compiler does not alter the value returned by this option.
1921 A script language that supports the use of string arguments in callouts might
1922 like to scan all the callouts in a pattern before running the match. This can
1923 be done by calling \fBpcre2_callout_enumerate()\fP. The first argument is a
1924 pointer to a compiled pattern, the second points to a callback function, and
1925 the third is arbitrary user data. The callback function is called for every
1926 callout in the pattern in the order in which they appear. Its first argument is
1927 a pointer to a callout enumeration block, and its second argument is the
1928 \fIuser_data\fP value that was passed to \fBpcre2_callout_enumerate()\fP. The
1929 contents of the callout enumeration block are described in the
1940 later, subject to a number of restrictions. The functions whose names begin
1950 .SH "THE MATCH DATA BLOCK"
1965 particular, the match data block contains a vector of offsets into the subject
1966 string that define the matched part of the subject and any substrings that were
1967 captured. This is known as the \fIovector\fP.
1971 the creation functions above. For \fBpcre2_match_data_create()\fP, the first
1972 argument is the number of pairs of offsets in the \fIovector\fP. One pair of
1973 offsets is required to identify the string that matched the whole pattern, with
1975 enough space to record the matched portion of the subject plus three captured
1977 \fBpcre2_match_data_create()\fP, so it is always possible to return the overall
1981 general context, which can specify custom memory management for obtaining the
1982 memory for the match data block. If you are not using custom memory management,
1985 For \fBpcre2_match_data_create_from_pattern()\fP, the first argument is a
1986 pointer to a compiled pattern. The ovector is created to be exactly the right
1987 size to hold all the substrings a pattern might capture. The second argument is
1988 again a pointer to a general context, but in this case if NULL is passed, the
1989 memory is obtained using the same allocator that was used for the compiled
1992 A match data block can be used many times, with the same or different compiled
1994 operation has finished, using functions that are described in the sections on
2006 When a call of \fBpcre2_match()\fP fails, valid data is available in the match
2007 block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ERROR_PARTIAL, or one
2008 of the error codes for an invalid UTF string. Exactly what is available depends
2009 on the error, and is detailed below.
2011 When one of the matching functions is called, pointers to the compiled pattern
2012 and the subject string are set in the match data block so that they can be
2013 referenced by the extraction functions. After running a match, you must not
2014 free a compiled pattern or a subject string until after all operations on the
2021 .SH "MATCHING A PATTERN: THE TRADITIONAL FUNCTION"
2032 compiled pattern, which is passed in the \fIcode\fP argument. You can call
2033 \fBpcre2_match()\fP with the same \fIcode\fP argument as many times as you
2034 like, in order to find multiple matches in the subject string or to match
2035 different subject strings with the same pattern.
2037 This function is the main matching facility of the library, and it operates in
2044 in the section about the \fBpcre2_dfa_match()\fP function.
2051     "some string",  /* the subject string */
2052     11,             /* the length of the subject string */
2053     0,              /* start at offset 0 in the subject */
2055     match_data,     /* the match data block */
2058 If the subject string is zero-terminated, the length can be given as
2060 matching parameters are to be changed. For details, see the section on
2068 .SS "The string to be matched by \fBpcre2_match()\fP"
2073 \fIstartoffset\fP. The length and offset are in code units, not characters.
2074 That is, they are in bytes for the 8-bit library, 16-bit code units for the
2075 16-bit library, and 32-bit code units for the 32-bit library, whether or not
2078 If \fIstartoffset\fP is greater than the length of the subject,
2079 \fBpcre2_match()\fP returns PCRE2_ERROR_BADOFFSET. When the starting offset is
2080 zero, the search for a match starts at the beginning of the subject, and this
2081 is by far the most common case. In UTF-8 or UTF-16 mode, the starting offset
2082 must point to the start of a character, or to the end of the subject (in UTF-32
2083 mode, one code unit equals one character, so all offsets are valid). Like the
2084 pattern string, the subject may contain binary zeroes.
2086 A non-zero starting offset is useful when searching for another match in the
2089 setting PCRE2_NOTBOL in the case of a pattern that begins with any kind of
2090 lookbehind. For example, consider the pattern
2094 which finds occurrences of "iss" in the middle of words. (\eB matches only if
2095 the current position in the subject is not a word boundary.) When applied to
2096 the string "Mississippi" the first call to \fBpcre2_match()\fP finds the first
2097 occurrence. If \fBpcre2_match()\fP is called again with just the remainder of
2099 the start of the subject, which is deemed to be a word boundary. However, if
2100 \fBpcre2_match()\fP is passed the entire string again, but with
2101 \fIstartoffset\fP set to 4, it finds the second occurrence of "iss" because it
2102 is able to look behind the starting point to discover that it is preceded by a
2105 Finding all the matches in a subject is tricky when the pattern can match an
2106 empty string. It is possible to emulate Perl's /g behaviour by first trying the
2107 match again at the same offset, with the PCRE2_NOTEMPTY_ATSTART and
2108 PCRE2_ANCHORED options, and then if that fails, advancing the starting offset
2110 do this in the
2114 sample program. In the most general case, you have to check to see if the
2115 newline convention recognizes CRLF as a newline, and if so, and the current
2116 character is CR followed by LF, advance the starting offset by two characters
2119 If a non-zero starting offset is passed when the pattern is anchored, one
2120 attempt to match at the given offset is made. This can only succeed if the
2121 pattern does not require the match to be at the start of the subject.
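
For the empty-match case described earlier in this section, the step of advancing the starting offset after a failed retry can be isolated in a small function. The sketch below is not PCRE2 code: `advance_after_empty_match` is a hypothetical helper for the 8-bit library, and the `crlf_is_newline` and `utf8` flags are assumed to have been determined by the caller from the pattern's newline convention and options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute the next starting offset after an empty match at 'offset' could
   not be extended. Advances by one character, or past a whole CRLF pair
   when CRLF counts as a newline. The caller must stop the loop instead of
   calling this when offset == length. Hypothetical helper. */
static size_t advance_after_empty_match(const uint8_t *subject, size_t length,
                                        size_t offset, int crlf_is_newline,
                                        int utf8)
{
    assert(subject != NULL && offset < length);
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;                      /* skip CR and LF together */
    offset++;                                   /* advance one code unit */
    if (utf8)
        while (offset < length && (subject[offset] & 0xC0) == 0x80)
            offset++;                           /* finish the UTF-8 character */
    return offset;
}
```

The returned offset is always a character boundary in valid UTF-8 input, which keeps it acceptable as a starting offset for the next call.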
2128 The unused bits of the \fIoptions\fP argument for \fBpcre2_match()\fP must be
2129 zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
2134 Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
2135 compiler. If it is set, JIT matching is disabled and the normal interpretive
2136 code in \fBpcre2_match()\fP is run. Apart from PCRE2_NO_JIT (obviously), the
2141 The PCRE2_ANCHORED option limits \fBpcre2_match()\fP to matching at the first
2144 matching time. Note that setting the option at match time disables JIT
2149 This option specifies that the first character of the subject string is not the
2150 beginning of a line, so the circumflex metacharacter should not match before
2152 circumflex never to match. This option affects only the behaviour of the
2157 This option specifies that the end of the subject string is not the end of a
2158 line, so the dollar metacharacter should not match it nor (except in multiline
2161 affects only the behaviour of the dollar metacharacter. It does not affect \eZ
2167 there are alternatives in the pattern, they are tried. If all the alternatives
2168 match the empty string, the entire match fails. For example, if the pattern
2173 string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
2174 valid, so \fBpcre2_match()\fP searches further into the string for occurrences
2180 only at the first matching position, that is, at the start of the subject plus
2181 the starting offset. An empty string match later in the subject is permitted.
2182 If the pattern is anchored, such a match can occur only if the pattern contains
2189 is called with options that JIT supports. Setting PCRE2_NO_JIT disables the use
2190 of JIT; it forces matching to be done by the interpreter.
2194 When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
2196 If a non-zero starting offset is given, the check is applied only to that part
2197 of the subject that could be inspected during matching, and there is a check
2198 that the starting offset points to the first code unit of a character or to the
2199 end of the subject. If there are no lookbehind assertions in the pattern, the
2200 check starts at the starting offset. Otherwise, it starts at the length of the
2201 longest lookbehind before the starting offset, or at the start of the subject
2202 if there are not that many characters before the starting offset. Note that the
2206 negative error code is returned if the check fails. There are several UTF error
2207 codes for each code unit width, corresponding to different problems with the
2208 code unit sequence. There are discussions about the validity of
2222 in the
2229 performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
2230 \fBpcre2_match()\fP. You might want to do this for the second and subsequent
2231 calls to \fBpcre2_match()\fP if you are making repeated calls to find all the
2234 NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid string
2241 These options turn on the partial matching feature. A partial match occurs if
2242 the end of the subject string is reached successfully, but there are not enough
2243 subject characters to complete the match. If this happens when
2247 PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
2257 examples, in the
2268 When PCRE2 is built, a default newline convention is set; this is usually the
2269 standard convention for the operating system. The default can be overridden in
2276 pattern string with, for example, (*CRLF), as described in the
2281 in the
2285 page. During matching, the newline choice affects the behaviour of the dot,
2286 circumflex, and dollar metacharacters. It may also alter the way the match
2291 when the current starting position is at a CRLF sequence, and the pattern
2292 contains no explicit matches for CR or LF characters, the match position is
2293 advanced by two characters instead of one, in other words, to after the CRLF.
2295 The above rule is a compromise that makes the most common cases work as
2296 expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is
2297 not set), it does not match the string "\er\enA" because, after failing at the
2298 start, it skips both the CR and the LF before retrying. However, the pattern
2300 reference, and so advances only by one character after the first failure.
2303 characters in the pattern, or one of the \er or \en escape sequences. Implicit
2305 LF in the characters that it matches.
2307 Notwithstanding the above, anomalous effects may still occur when CRLF is a
2308 valid newline sequence and explicit \er or \en escapes appear in the pattern.
2321 In general, a pattern matches a certain portion of the subject, and in
2322 addition, further substrings from the subject may be picked out by
2323 parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
2324 book, this is called "capturing" in what follows, and the phrase "capturing
2327 that do not cause substrings to be captured. The \fBpcre2_pattern_info()\fP
2343 Alternatively, you can make direct use of the vector of PCRE2_SIZE values,
2344 called the \fBovector\fP, which contains the offsets of captured strings. It is
2345 part of the
2350 The function \fBpcre2_get_ovector_pointer()\fP returns the address of the
2351 ovector, and \fBpcre2_get_ovector_count()\fP returns the number of pairs of
2354 Within the ovector, the first in each pair of values is set to the offset of
2355 the first code unit of a substring, and the second is set to the offset of the
2356 first code unit after the end of a substring. These values are always code unit
2357 offsets, not character offsets. That is, they are byte offsets in the 8-bit
2358 library, 16-bit offsets in the 16-bit library, and 32-bit offsets in the 32-bit
2361 After a partial match (error return PCRE2_ERROR_PARTIAL), only the first pair
2363 identify the part of the subject that was partially matched. See the
2369 After a successful match, the first pair of offsets identifies the portion of
2370 the subject string that was matched by the entire pattern. The next pair is
2371 used for the first capturing subpattern, and so on. The value returned by
2372 \fBpcre2_match()\fP is one more than the highest numbered pair that has been
2373 set. For example, if two substrings have been captured, the returned value is
2374 3. If there are no capturing subpatterns, the return value from a successful
2375 match is 1, indicating that just the first pair of offsets has been set.
2377 If a pattern uses the \eK escape sequence within a positive assertion, the
2378 reported start of a successful match can be greater than the end of the match.
2379 For example, if the pattern (?=ab\eK) is matched against "ab", the start and
2380 end offset values for the match are 2 and 0.
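
The rules above for reading offset pairs can be collected into one defensive accessor. This is a sketch rather than library code: `group_span` is a hypothetical helper, `size_t` stands in for PCRE2_SIZE, and `UNSET_OFFSET` is a local stand-in assumed to equal PCRE2_UNSET (all bits set), the value this document says is stored for groups that did not take part in the match. The `rc` parameter is assumed to be the positive return value of a successful match.

```c
#include <assert.h>
#include <stddef.h>

#define UNSET_OFFSET (~(size_t)0)   /* local stand-in for PCRE2_UNSET */

/* Fetch the start and length (in code units) of group n from an ovector.
   rc is the return value of a successful match: one more than the highest
   numbered pair that was set. Returns 1 on success, or 0 if the pair was
   not set, the group is unset, or the start lies after the end (possible
   when \K is used inside an assertion). Hypothetical helper. */
static int group_span(const size_t *ovector, int rc, int n,
                      size_t *start, size_t *length)
{
    assert(ovector != NULL && start != NULL && length != NULL);
    if (rc <= 0 || n < 0 || n >= rc)
        return 0;                               /* pair n was not set */
    size_t s = ovector[2 * n];
    size_t e = ovector[2 * n + 1];
    if (s == UNSET_OFFSET || e == UNSET_OFFSET)
        return 0;                               /* group did not participate */
    if (s > e)
        return 0;                               /* start beyond end: \K case */
    *start = s;
    *length = e - s;
    return 1;
}
```

Rejecting the case where the start exceeds the end is a design choice of this sketch; a caller that wants to report such matches could return the two offsets unchanged instead.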
2383 operation, it is the last portion of the subject that it matched that is
2386 If the ovector is too small to hold all the captured substring offsets, as much
2387 as possible is filled in, and the function returns a value of zero. If captured
2390 the pattern contains back references and the \fIovector\fP is not big enough to
2391 remember the related substrings, PCRE2 has to get additional memory for use
2397 the string "abc" is matched against the pattern (a|(z))(bc) the return from the
2399 happens, both values in the offset pairs corresponding to unused subpatterns
2402 Offset values that correspond to unused subpatterns at the end of the
2403 expression are also set to PCRE2_UNSET. For example, if the string "abc" is
2404 matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched.
2405 The return from the function is 2, because the highest used capturing
2406 subpattern number is 1. The offsets for the second and third capturing
2407 subpatterns (assuming the vector is large enough, of course) are set to
2410 Elements in the ovector that do not correspond to capturing parentheses in the
2413 \fBpcre2_match()\fP. The other elements retain whatever values they previously
2427 As well as the offsets in the ovector, other information about a match is
2428 retained in the match data block and can be retrieved by the above functions in
2429 appropriate circumstances. If they are called at other times, the result is
2434 \fBpcre2_get_mark()\fP can be called. It returns a pointer to the
2435 zero-terminated name, which is within the compiled pattern. Otherwise NULL is
2436 returned. The length of the (*MARK) name (excluding the terminating zero) is
2437 stored in the code unit that precedes the name. You should use this instead of
2438 relying on the terminating zero if the (*MARK) name might contain a binary
2441 After a successful match, the (*MARK) name that is returned is the
2442 last one encountered on the matching path through the pattern. After a "no
2443 match" or a partial match, the last encountered (*MARK) name is returned. For
2448 When it matches "bc", the returned mark is A. The B mark is "seen" in the first
2449 branch of the group, but it is not on the matching path. On the other hand,
2450 when this pattern fails to match "bx", the returned mark is B.
2452 After a successful match, a partial match, or one of the invalid UTF errors
2454 called. After a successful or partial match it returns the code unit offset of
2455 the character at which the match started. For a non-partial match, this can be
2456 different to the value of \fIovector[0]\fP if the pattern contains the \eK
2457 escape sequence. After a partial match, however, this value is always the same
2458 as \fIovector[0]\fP because \eK does not affect the result of a partial match.
2461 the code unit offset of the invalid UTF character. Details are given in the
2473 converted to a text string by calling the \fBpcre2_get_error_message()\fP
2480 with them. The codes are given names in the header file. If UTF checking is in
2482 UTF-specific negative error codes is returned. Details are given in the
2486 page. The following are the other errors that may be returned by
2491 The subject string did not match the pattern.
2495 The subject string did not match, but it did match partially. See the
2503 PCRE2 stores a 4-byte "magic number" at the start of the compiled code, to
2504 catch the case when it is passed a junk pointer. This is the error that is
2505 returned when the magic number is not present.
2509 This error is given when a pattern that was compiled by the 8-bit library is
2514 The value of \fIstartoffset\fP was greater than the length of the subject.
2518 An unrecognized bit was set in the \fIoptions\fP argument.
2523 to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the value of
2524 \fIstartoffset\fP did not point to the beginning of a UTF character or the end
2525 of the subject.
2531 \fBpcre2_callout_enumerate()\fP to return a distinctive error code. See the
2540 in PCRE2 or by overwriting of the compiled pattern.
2545 is being matched, but the matching mode (partial or complete match) does not
2546 correspond to any JIT compilation mode. When the JIT fast path function is
2547 used, this error may also be given for invalid options. See the
2556 is being matched, but the memory available for the just-in-time processing
2557 stack is not large enough. See the
2569 If a pattern contains back references, but the ovector is not big enough to
2570 remember the referenced substrings, PCRE2 gets a block of memory at the start
2577 Either the \fIcode\fP, \fIsubject\fP, or \fImatch_data\fP argument was passed
2583 the pattern. Specifically, it means that either the whole pattern or a
2584 subpattern has been called recursively for the second time at the same position
2585 in the subject string. Some simple patterns that might do this are detected and
2605 auxiliary) can be obtained by calling \fBpcre2_get_error_message()\fP. The code
2606 is passed as the first argument, with the remaining two arguments specifying a
2607 code unit buffer and its length, into which the text message is placed. Note
2608 that the message is returned in code units of the appropriate width for the
2611 The returned message is terminated with a trailing zero, and the function
2612 returns the number of code units used, excluding the trailing zero. If the
2613 error number is unknown, the negative error code PCRE2_ERROR_BADDATA is
2614 returned. If the buffer is too small, the message is truncated (but still with
2615 a trailing zero), and the negative error code PCRE2_ERROR_NOMEMORY is returned.
2616 None of the messages are very long; a buffer size of 120 code units is ample.
2638 Captured substrings can be accessed directly by using the ovector as described
2645 a binary zero is correctly extracted and has a further zero added on the end,
2646 but the result is not, of course, a C string.
2648 The functions in this section identify substrings by number. The number zero
2649 refers to the entire matched substring, with higher numbers referring to
2652 the error PCRE2_ERROR_PARTIAL. The next section describes similar functions for
2655 If a pattern uses the \eK escape sequence within a positive assertion, the
2656 reported start of a successful match can be greater than the end of the match.
2657 For example, if the pattern (?=ab\eK) is matched against "ab", the start and
2658 end offset values for the match are 2 and 0. In this situation, calling these
2661 You can find the length in code units of a captured substring without
2662 extracting it by calling \fBpcre2_substring_length_bynumber()\fP. The first
2663 argument is a pointer to the match data block, the second is the group number,
2664 and the third is a pointer to a variable into which the length is placed. If
2665 you just want to know whether or not the substring has been captured, you can
2666 pass the third argument as NULL.
2670 into new memory, obtained using the same memory allocation function that was
2671 used for the match data block. The first two arguments of these functions are a
2672 pointer to the match data block and a capturing group number.
2676 This is updated to contain the actual number of code units used for the
2677 extracted substring, excluding the terminating zero.
2679 For \fBpcre2_substring_get_bynumber()\fP the third and fourth arguments point
2680 to variables that are updated with a pointer to the new memory and the number
2681 of code units that comprise the substring, again excluding the terminating
2682 zero. When the substring is no longer needed, the memory should be freed by
2686 error code. If the pattern match failed, the match failure code is returned.
2692 The buffer was too small for \fBpcre2_substring_copy_bynumber()\fP, or the
2697 There is no substring with that number in the pattern, that is, the number is
2698 greater than the number of capturing parentheses.
2702 The substring number, though not greater than the number of captures in the
2703 pattern, is greater than the number of slots in the ovector, so the substring
2708 The substring did not participate in the match. For example, if the pattern is
2709 (abc)|(def) and the subject is "def", and the ovector contains at least two
2727 that is obtained using the same memory allocation function that was used to get
2731 partial match, the error code PCRE2_ERROR_PARTIAL is returned.
2733 The address of the memory block is returned via \fIlistptr\fP, which is also
2734 the start of the list of string pointers. The end of the list is marked by a
2735 NULL pointer. The address of the list of lengths is returned via
2737 therefore need the lengths, you may supply NULL as the \fBlengthsptr\fP
2738 argument to disable the creation of a list of lengths. The yield of the
2739 function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the memory block
2740 could not be obtained. When the list is no longer needed, it should be freed by
2744 capturing subpattern number \fIn+1\fP matches some part of the subject, but
2746 can be distinguished from a genuine zero-length substring by inspecting the
2747 appropriate offset in the ovector, which contains PCRE2_UNSET for unset
2776 the number of the subpattern called "xxx" is 2. If the name is known to be
2777 unique (PCRE2_DUPNAMES was not set), you can find the number from the name by
2778 calling \fBpcre2_substring_number_from_name()\fP. The first argument is the
2779 compiled pattern, and the second is the name. The yield of the function is the
2782 that name. Given the number, you can extract the substring directly, or use one
2783 of the functions described above.
2785 For convenience, there are also "byname" functions that correspond to the
2786 "bynumber" functions, the only difference being that the second argument is a
2788 names, these functions scan all the groups with the given name, and return the
2791 If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
2792 returned. If all groups with the name have numbers that are greater than the
2793 number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is returned. If there
2794 is at least one group with a slot in the ovector, but no group is found to be
2797 \fBWarning:\fP If the pattern uses the (?| feature to set up multiple
2798 subpatterns with the same number, as described in the
2803 in the
2807 page, you cannot use names to distinguish the different subpatterns, because
2808 names are not included in the compiled code. The matching process uses only
2809 numbers. For this reason, the use of different names for subpatterns of the
2825 This function calls \fBpcre2_match()\fP and then makes a copy of the subject
2826 string in \fIoutputbuffer\fP, replacing the part that was matched with the
2829 which a \eK item in a lookahead in the pattern causes the match to end before
2832 The first seven arguments of \fBpcre2_substitute()\fP are the same as for
2833 \fBpcre2_match()\fP, except that the partial matching options are not
2836 functions from the match context, if provided, or else those that were used to
2837 allocate memory for the compiled code.
2839 The \fIoutlengthptr\fP argument must point to a variable that contains the
2840 length, in code units, of the output buffer. If the function is successful, the
2841 value is updated to contain the length of the new string, excluding the
2844 If the function is not successful, the value set via \fIoutlengthptr\fP depends
2845 on the type of error. For syntax errors in the replacement string, the value is
2846 the offset in the replacement string where the error was detected. For other
2847 errors, the value is PCRE2_UNSET by default. This includes the case of the
2849 (see below), in which case the value is the minimum length needed, including
2850 space for the trailing zero. Note that in order to compute the required length,
2851 \fBpcre2_substitute()\fP has to simulate all the matching and copying, instead
2852 of giving an error return as soon as the buffer overflows. Note also that the
2855 In the replacement string, which is interpreted as a UTF string in UTF mode,
2856 and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
2857 dollar character is an escape character that can specify the insertion of
2858 characters from capturing groups or (*MARK) items in the pattern. The following
2862   $<n> or ${<n>}      insert the contents of group <n>
2863   $*MARK or ${*MARK}  insert the name of the last (*MARK) encountered
2866 required only if the following character would be interpreted as part of the
2867 number or name. The number may be zero to include the entire matched string.
2868 For example, if the pattern a(b)c is matched with "=abc=" and the replacement
2869 string "+$1$0$1+", the result is "=+babcb+=".
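The group-insertion rule above can be illustrated with a small stand-alone model. This is only a sketch: `expand_simple` is a hypothetical helper, not a PCRE2 function, and it handles just the numeric `$<n>` and `${<n>}` forms (no names, no `$*MARK`, no extended syntax). It assumes the offsets are laid out as (start, end) pairs in the manner of the ovector, with pair 0 describing the whole match.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy model of "$n" / "${n}" expansion (hypothetical; NOT the PCRE2 code).
   ovector holds (start, end) offset pairs into subject; pair 0 is the
   whole match. Returns the length written, excluding the trailing zero,
   or -1 on overflow, bad syntax, or an unknown group number. */
long expand_simple(const char *repl, const char *subject,
                   const size_t *ovector, size_t pairs,
                   char *out, size_t outsize)
{
    size_t o = 0;
    if (outsize == 0) return -1;
    for (const char *p = repl; *p != '\0'; ) {
        if (*p != '$') {                       /* literal character */
            if (o + 1 >= outsize) return -1;
            out[o++] = *p++;
            continue;
        }
        p++;                                   /* skip the dollar */
        int braced = (*p == '{');
        if (braced) p++;
        if (!isdigit((unsigned char)*p)) return -1;
        size_t n = 0;
        while (isdigit((unsigned char)*p))
            n = n * 10 + (size_t)(*p++ - '0');
        if (braced) {                          /* braces delimit the number */
            if (*p != '}') return -1;
            p++;
        }
        if (n >= pairs) return -1;
        size_t len = ovector[2 * n + 1] - ovector[2 * n];
        if (o + len >= outsize) return -1;     /* keep room for the zero */
        memcpy(out + o, subject + ovector[2 * n], len);
        o += len;
    }
    out[o] = '\0';
    return (long)o;
}
```

With the subject "=abc=" and offsets {1,4, 2,3} (the match of a(b)c and its group 1), the replacement "+$1$0$1+" expands to "+babcb+", which gives "=+babcb+=" once the unmatched parts of the subject are copied around it. The braced form matters when a digit follows: "${1}1" yields "b1", whereas "$11" would be read as group 11.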
2878 As well as the usual options for \fBpcre2_match()\fP, a number of additional
2879 options can be set in the \fIoptions\fP argument.
2881 PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
2882 replacing every matching substring. If this is not set, only the first matching
2883 substring is replaced. If any matched substring has zero length, after the
2884 substitution has happened, an attempt to find a non-empty match at the same
2885 position is performed. If this is not successful, the current position is
2886 advanced by one character except when CRLF is a valid newline sequence and the
2887 next two characters are CR, LF. In this case, the current position is advanced
2890 PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
2891 too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
2894 in order to compute the size of buffer that is needed. This value is passed
2895 back via the \fIoutlengthptr\fP variable, with the result of the function still
2899 is needed for a given substitution. However, this does mean that the entire
2900 operation is carried out twice. Depending on the application, it may be more
2901 efficient to allocate a large buffer and free the excess afterwards, instead of
2905 not appear in the pattern to be treated as unset groups. This option should be
2907 longer causes the PCRE2_ERROR_NOSUBSTRING error.
2912 to insert an unset group causes the PCRE2_ERROR_UNSET error. This option does
2913 not influence the extended substitution syntax described below.
2915 PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
2916 replacement string. Without this option, only the dollar character is special,
2917 and only the group insertion forms listed above are valid. When
2921 character. The usual forms such as \en or \ex{ddd} can be used to specify
2926 There are also four escape sequences for forcing the case of inserted letters.
2928 and force lower case. The escape sequences change the current state: \eU and
2930 terminating a \eQ quoted sequence) reverts to no case forcing. The sequences
2931 \eu and \el force the next character (if it is a letter) to upper or lower
2932 case, respectively, and then the state automatically reverts to no case
2937 the result of processing "\eUaa\eLBB\eEcc\eE" is "AAbbcc"; the final \eE has no
2941 flexibility to group substitution. The syntax is similar to that used by Bash:
2946 As before, <n> may be a group number or a name. The first form specifies a
2948 expanded and the result inserted. The second form specifies strings that are
2949 expanded and inserted when group <n> is set or unset, respectively. The first
2954 Backslash can be used to escape colons and closing curly brackets in the
2955 replacement strings. A change of the case forcing state within a replacement
2966 groups in the extended syntax forms to be treated as unset.
2968 If successful, \fBpcre2_substitute()\fP returns the number of replacements that
2972 In the event of an error, a negative error code is returned. Except for
2980 unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple
2983 PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the
2984 PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is
2988 PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
2992 substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it
2995 As for all PCRE2 errors, a text message that describes the error can be
2996 obtained by calling the \fBpcre2_get_error_message()\fP function (see
3012 When a pattern is compiled with the PCRE2_DUPNAMES option, names for
3014 for subpatterns with the same number, created by using the (?| feature. Indeed,
3015 if such subpatterns are named, they are required to use the same names.
3018 one of the named subpatterns participates. An example is shown in the
3025 \fBpcre2_substring_get_byname()\fP return the first substring corresponding to
3027 returned. The \fBpcre2_substring_number_from_name()\fP function returns the
3031 you must use the \fBpcre2_substring_nametable_scan()\fP function. The first
3032 argument is the compiled pattern, and the second is the name. If the third and
3033 fourth arguments are NULL, the function returns a group number for a unique
3036 When the third and fourth arguments are not NULL, they must be pointers to
3037 variables that are updated by the function. After it has run, they point to the
3038 first and last entries in the name-to-number table for the given name, and the
3039 function returns the length of each entry in code units. In both cases,
3040 PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
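As a rough sketch of what can be done with the first and last pointers, the helper below walks the entries between them and collects the group numbers. The function name and its parameters are hypothetical. The entry layout is an assumption taken from the 8-bit library's name table format: a two-byte group number, most significant byte first, followed by the zero-terminated name. In real use, `first`, `last`, and `entry_size` would come from a call to \fBpcre2_substring_nametable_scan()\fP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2. Walks name-table entries from
   first to last (inclusive), each entry_size code units wide, and stores
   up to max group numbers. Assumes the 8-bit layout: a big-endian 16-bit
   group number followed by the zero-terminated name. Returns the count. */
size_t collect_group_numbers(const uint8_t *first, const uint8_t *last,
                             size_t entry_size,
                             unsigned *numbers, size_t max)
{
    size_t count = 0;
    if (entry_size < 3) return 0;   /* number (2) + at least a terminator */
    for (const uint8_t *e = first; e <= last && count < max; e += entry_size)
        numbers[count++] = ((unsigned)e[0] << 8) | e[1];
    return count;
}
```

Each collected number can then be passed to one of the "bynumber" extraction functions, or used to inspect the corresponding ovector pair, to find which of the duplicate-named groups actually took part in the match.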
3042 The format of the name table is described
3047 in the section entitled \fIInformation about a pattern\fP. Given all the
3048 relevant entries for the name, you can extract each of their numbers, and hence
3056 when it finds the first match at a given point in the subject. If you want to
3057 find all possible matches, or the longest possible match at a given position,
3058 consider using the alternative matching function (see below) instead. If you
3059 cannot use the alternative function, you can kludge it up by making use of the
3060 callout facility, which is described in the
3066 What you have to do is to insert a callout right at the end of the pattern.
3067 When your callout function is called, extract and save the current matched
3074 .SH "MATCHING A PATTERN: THE ALTERNATIVE FUNCTION"
3086 against a compiled pattern, using a matching algorithm that scans the subject
3088 the normal algorithm, and is not compatible with Perl. Some of the features of
3090 of matching can be useful. For a discussion of the two matching algorithms, and
3091 a list of features that \fBpcre2_dfa_match()\fP does not support, see the
3097 The arguments for the \fBpcre2_dfa_match()\fP function are the same as for
3098 \fBpcre2_match()\fP, plus two extras. The ovector within the match data block
3099 is used in a different way, and this is described below. The other common
3100 arguments are used in the same way as for \fBpcre2_match()\fP, so their
3103 The two additional arguments provide workspace for the function. The workspace
3105 multiple paths through the pattern tree. More workspace is needed for patterns
3114     "some string",  /* the subject string */
3115     11,             /* the length of the subject string */
3116     0,              /* start at offset 0 in the subject */
3118     match_data,     /* the match data block */
3126 The unused bits of the \fIoptions\fP argument for \fBpcre2_dfa_match()\fP must
3127 be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
3130 PCRE2_DFA_RESTART. All but the last four of these are exactly the same as for
3136 These have the same general effect as they do for \fBpcre2_match()\fP, but the
3138 \fBpcre2_dfa_match()\fP, it returns PCRE2_ERROR_PARTIAL if the end of the
3141 already been found. When PCRE2_PARTIAL_SOFT is set, the return code
3142 PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL if the end of the
3144 least one matching possibility. The portion of the string that was inspected
3145 when the longest partial match was found is set as the first matching string in
3147 matching, with examples, in the
3155 Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to stop as
3156 soon as it has found one match. Because of the way the alternative algorithm
3157 works, this is necessarily the shortest possible match at the first possible
3158 matching point in the subject string.
3163 again, with additional subject characters, and have it continue with the same
3164 match. The PCRE2_DFA_RESTART option requests this action; when it is set, the
3165 \fIworkspace\fP and \fIwscount\fP arguments must reference the same vector as
3166 before because data about the match so far is left in them after a partial
3167 match. There is more discussion of this facility in the
3178 substring in the subject. Note, however, that all the matches from one run of
3179 the function start at the same point in the subject. The shorter matches are
3180 all initial substrings of the longer matches. For example, if the pattern
3184 is matched against the string
3194 On success, the yield of the function is a number greater than zero, which is
3195 the number of matched substrings. The offsets of the substrings are returned in
3196 the ovector, and can be extracted by number in the same way as for
3197 \fBpcre2_match()\fP, but the numbers bear no relation to any capturing groups
3198 that may exist in the pattern, because DFA matching does not support group
3201 Calls to the convenience functions that extract substrings by name
3202 return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a
3203 DFA match. The convenience functions that extract substrings by number never
3204 return PCRE2_ERROR_NOSUBSTRING, and the meanings of some other errors are
3209 The ovector is not big enough to include a slot for the given substring number.
3213 There is a slot in the ovector for this substring, but there were insufficient
3216 The matched strings are stored in the ovector in reverse order of length; that
3217 is, the longest matching string is first. If there were too many matches to fit
3218 into the ovector, the yield of the function is zero, and the vector is filled
3219 with the longest matches.
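The ovector layout just described can be modelled in a few lines. This is a toy illustration only: `fill_dfa_ovector` is a hypothetical function that does no matching. It takes the end offsets of matches that were already found from one start point (at least one, each of a different length) and arranges them the way the text describes: longest first, with a zero return when not all of them fit.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model (hypothetical; not the PCRE2 code). ends[] holds the end
   offsets of n >= 1 matches that all begin at start, in any order and
   with distinct values. Fills ovector with (start, end) pairs, longest
   match first, using at most 'pairs' slots. Returns the number of
   matches, or 0 if there were more matches than slots. */
int fill_dfa_ovector(size_t start, const size_t *ends, size_t n,
                     size_t *ovector, size_t pairs)
{
    size_t limit = SIZE_MAX;        /* next pick must end below this */
    size_t stored = 0;
    while (stored < pairs && stored < n) {
        size_t best = 0;
        int found = 0;
        for (size_t i = 0; i < n; i++) {
            if (ends[i] < limit && (!found || ends[i] > best)) {
                best = ends[i];
                found = 1;
            }
        }
        if (!found) break;
        ovector[2 * stored] = start;        /* every match shares the start */
        ovector[2 * stored + 1] = best;
        stored++;
        limit = best;
    }
    return n > pairs ? 0 : (int)stored;
}
```

For three matches starting at offset 8 and ending at 11, 9, and 14, a three-pair vector receives {8,14, 8,11, 8,9} and the function returns 3. With only two pairs available it keeps the two longest matches and returns 0, so a caller has to treat a zero return as "matched, but the vector was too small" rather than as a failure.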
3222 repeats at the end of a pattern (as well as internally). For example, the
3233 Many of the errors are the same as for \fBpcre2_match()\fP, as described
3238 There are in addition the following errors that are specific to
3243 This return is given if \fBpcre2_dfa_match()\fP encounters an item in the
3244 pattern that it does not support, for instance, the use of \eC in UTF mode or
3250 that uses a back reference for the condition, or a test for recursion in a
3255 This return is given if \fBpcre2_dfa_match()\fP runs out of space in the
3260 When a recursive subpattern is processed, the matching function calls itself
3261 recursively, using private memory for the ovector and \fIworkspace\fP. This
3262 error is given if the internal ovector is not large enough. This should be
3267 When \fBpcre2_dfa_match()\fP is called with the \fBPCRE2_DFA_RESTART\fP option,
3268 some plausibility checks are made on the contents of the workspace, which
3269 should contain data about the previous partial match. If any of these checks