<html>
<head>
<title>pcreapi specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcreapi man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE NATIVE API</a>
<li><a name="TOC2" href="#SEC2">PCRE API OVERVIEW</a>
<li><a name="TOC3" href="#SEC3">NEWLINES</a>
<li><a name="TOC4" href="#SEC4">MULTITHREADING</a>
<li><a name="TOC5" href="#SEC5">SAVING PRECOMPILED PATTERNS FOR LATER USE</a>
<li><a name="TOC6" href="#SEC6">CHECKING BUILD-TIME OPTIONS</a>
<li><a name="TOC7" href="#SEC7">COMPILING A PATTERN</a>
<li><a name="TOC8" href="#SEC8">COMPILATION ERROR CODES</a>
<li><a name="TOC9" href="#SEC9">STUDYING A PATTERN</a>
<li><a name="TOC10" href="#SEC10">LOCALE SUPPORT</a>
<li><a name="TOC11" href="#SEC11">INFORMATION ABOUT A PATTERN</a>
<li><a name="TOC12" href="#SEC12">OBSOLETE INFO FUNCTION</a>
<li><a name="TOC13" href="#SEC13">REFERENCE COUNTS</a>
<li><a name="TOC14" href="#SEC14">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
<li><a name="TOC15" href="#SEC15">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
<li><a name="TOC16" href="#SEC16">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
<li><a name="TOC17" href="#SEC17">DUPLICATE SUBPATTERN NAMES</a>
<li><a name="TOC18" href="#SEC18">FINDING ALL POSSIBLE MATCHES</a>
<li><a name="TOC19" href="#SEC19">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
<li><a name="TOC20" href="#SEC20">SEE ALSO</a>
<li><a name="TOC21" href="#SEC21">AUTHOR</a>
<li><a name="TOC22" href="#SEC22">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE NATIVE API</a><br>
<P>
<b>#include &lt;pcre.h&gt;</b>
</P>
<P>
<b>pcre
*pcre_compile(const char *<i>pattern</i>, int <i>options</i>,</b>
<b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
<b>const unsigned char *<i>tableptr</i>);</b>
</P>
<P>
<b>pcre *pcre_compile2(const char *<i>pattern</i>, int <i>options</i>,</b>
<b>int *<i>errorcodeptr</i>,</b>
<b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
<b>const unsigned char *<i>tableptr</i>);</b>
</P>
<P>
<b>pcre_extra *pcre_study(const pcre *<i>code</i>, int <i>options</i>,</b>
<b>const char **<i>errptr</i>);</b>
</P>
<P>
<b>int pcre_exec(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
<b>const char *<i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>);</b>
</P>
<P>
<b>int pcre_dfa_exec(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
<b>const char *<i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>,</b>
<b>int *<i>workspace</i>, int <i>wscount</i>);</b>
</P>
<P>
<b>int pcre_copy_named_substring(const pcre *<i>code</i>,</b>
<b>const char *<i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, const char *<i>stringname</i>,</b>
<b>char *<i>buffer</i>, int <i>buffersize</i>);</b>
</P>
<P>
<b>int pcre_copy_substring(const char *<i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, int <i>stringnumber</i>, char *<i>buffer</i>,</b>
<b>int <i>buffersize</i>);</b>
</P>
<P>
<b>int pcre_get_named_substring(const pcre *<i>code</i>,</b>
<b>const char *<i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, const char *<i>stringname</i>,</b>
<b>const char **<i>stringptr</i>);</b>
</P>
<P>
<b>int pcre_get_stringnumber(const pcre *<i>code</i>,</b>
<b>const char *<i>name</i>);</b>
</P>
<P>
<b>int pcre_get_stringtable_entries(const pcre *<i>code</i>,</b>
<b>const char *<i>name</i>,
char **<i>first</i>, char **<i>last</i>);</b>
</P>
<P>
<b>int pcre_get_substring(const char *<i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, int <i>stringnumber</i>,</b>
<b>const char **<i>stringptr</i>);</b>
</P>
<P>
<b>int pcre_get_substring_list(const char *<i>subject</i>,</b>
<b>int *<i>ovector</i>, int <i>stringcount</i>, const char ***<i>listptr</i>);</b>
</P>
<P>
<b>void pcre_free_substring(const char *<i>stringptr</i>);</b>
</P>
<P>
<b>void pcre_free_substring_list(const char **<i>stringptr</i>);</b>
</P>
<P>
<b>const unsigned char *pcre_maketables(void);</b>
</P>
<P>
<b>int pcre_fullinfo(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
<b>int <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
<b>int pcre_info(const pcre *<i>code</i>, int *<i>optptr</i>, int</b>
<b>*<i>firstcharptr</i>);</b>
</P>
<P>
<b>int pcre_refcount(pcre *<i>code</i>, int <i>adjust</i>);</b>
</P>
<P>
<b>int pcre_config(int <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
<b>char *pcre_version(void);</b>
</P>
<P>
<b>void *(*pcre_malloc)(size_t);</b>
</P>
<P>
<b>void (*pcre_free)(void *);</b>
</P>
<P>
<b>void *(*pcre_stack_malloc)(size_t);</b>
</P>
<P>
<b>void (*pcre_stack_free)(void *);</b>
</P>
<P>
<b>int (*pcre_callout)(pcre_callout_block *);</b>
</P>
<br><a name="SEC2" href="#TOC1">PCRE API OVERVIEW</a><br>
<P>
PCRE has its own native API, which is described in this document. There are
also some wrapper functions that correspond to the POSIX regular expression
API. These are described in the
<a href="pcreposix.html"><b>pcreposix</b></a>
documentation. Both of these APIs define a set of C function calls. A C++
wrapper is distributed with PCRE. It is documented in the
<a href="pcrecpp.html"><b>pcrecpp</b></a>
page.
</P>
<P>
The native API C function prototypes are defined in the header file
<b>pcre.h</b>, and on Unix systems the library itself is called <b>libpcre</b>.
It can normally be accessed by adding <b>-lpcre</b> to the command for linking
an application that uses PCRE. The header file defines the macros PCRE_MAJOR
and PCRE_MINOR to contain the major and minor release numbers for the library.
Applications can use these to include support for different releases of PCRE.
</P>
<P>
In a Windows environment, if you want to statically link an application program
against a non-dll <b>pcre.a</b> file, you must define PCRE_STATIC before
including <b>pcre.h</b> or <b>pcrecpp.h</b>, because otherwise the
<b>pcre_malloc()</b> and <b>pcre_free()</b> exported functions will be declared
<b>__declspec(dllimport)</b>, with unwanted results.
</P>
<P>
The functions <b>pcre_compile()</b>, <b>pcre_compile2()</b>, <b>pcre_study()</b>,
and <b>pcre_exec()</b> are used for compiling and matching regular expressions
in a Perl-compatible manner. A sample program that demonstrates the simplest
way of using them is provided in the file called <i>pcredemo.c</i> in the PCRE
source distribution. A listing of this program is given in the
<a href="pcredemo.html"><b>pcredemo</b></a>
documentation, and the
<a href="pcresample.html"><b>pcresample</b></a>
documentation describes how to compile and run it.
</P>
<P>
A second matching function, <b>pcre_dfa_exec()</b>, which is not
Perl-compatible, is also provided. This uses a different algorithm for the
matching. The alternative algorithm finds all possible matches (at a given
point in the subject), and scans the subject just once (unless there are
lookbehind assertions). However, this algorithm does not return captured
substrings.
A description of the two matching algorithms and their advantages
and disadvantages is given in the
<a href="pcrematching.html"><b>pcrematching</b></a>
documentation.
</P>
<P>
In addition to the main compiling and matching functions, there are convenience
functions for extracting captured substrings from a subject string that is
matched by <b>pcre_exec()</b>. They are:
<pre>
  <b>pcre_copy_substring()</b>
  <b>pcre_copy_named_substring()</b>
  <b>pcre_get_substring()</b>
  <b>pcre_get_named_substring()</b>
  <b>pcre_get_substring_list()</b>
  <b>pcre_get_stringnumber()</b>
  <b>pcre_get_stringtable_entries()</b>
</pre>
<b>pcre_free_substring()</b> and <b>pcre_free_substring_list()</b> are also
provided, to free the memory used for extracted strings.
</P>
<P>
The function <b>pcre_maketables()</b> is used to build a set of character tables
in the current locale for passing to <b>pcre_compile()</b>, <b>pcre_exec()</b>,
or <b>pcre_dfa_exec()</b>. This is an optional facility that is provided for
specialist use. Most commonly, no special tables are passed, in which case
internal tables that are generated when PCRE is built are used.
</P>
<P>
The function <b>pcre_fullinfo()</b> is used to find out information about a
compiled pattern; <b>pcre_info()</b> is an obsolete version that returns only
some of the available information, but is retained for backwards compatibility.
The function <b>pcre_version()</b> returns a pointer to a string containing the
version of PCRE and its date of release.
</P>
<P>
The function <b>pcre_refcount()</b> maintains a reference count in a data block
containing a compiled pattern. This is provided for the benefit of
object-oriented applications.
</P>
<P>
The global variables <b>pcre_malloc</b> and <b>pcre_free</b> initially contain
the entry points of the standard <b>malloc()</b> and <b>free()</b> functions,
respectively.
PCRE calls the memory management functions via these variables,
so a calling program can replace them if it wishes to intercept the calls. This
should be done before calling any PCRE functions.
</P>
<P>
The global variables <b>pcre_stack_malloc</b> and <b>pcre_stack_free</b> are also
indirections to memory management functions. These special functions are used
only when PCRE is compiled to use the heap for remembering data, instead of
recursive function calls, when running the <b>pcre_exec()</b> function. See the
<a href="pcrebuild.html"><b>pcrebuild</b></a>
documentation for details of how to do this. It is a non-standard way of
building PCRE, for use in environments that have limited stacks. Because of the
greater use of memory management, it runs more slowly. Separate functions are
provided so that special-purpose external code can be used for this case. When
used, these functions are always called in a stack-like manner (last obtained,
first freed), and always for memory blocks of the same size. There is a
discussion about PCRE's stack usage in the
<a href="pcrestack.html"><b>pcrestack</b></a>
documentation.
</P>
<P>
The global variable <b>pcre_callout</b> initially contains NULL. It can be set
by the caller to a "callout" function, which PCRE will then call at specified
points during a matching operation. Details are given in the
<a href="pcrecallout.html"><b>pcrecallout</b></a>
documentation.
<a name="newlines"></a></P>
<br><a name="SEC3" href="#TOC1">NEWLINES</a><br>
<P>
PCRE supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
character, the two-character sequence CRLF, any of the three preceding, or any
Unicode newline sequence.
The Unicode newline sequences are the three just
mentioned, plus the single characters VT (vertical tab, U+000B), FF (formfeed,
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
(paragraph separator, U+2029).
</P>
<P>
Each of the first three conventions is used by at least one operating system as
its standard newline sequence. When PCRE is built, a default can be specified.
The built-in default is LF, which is the Unix standard. When PCRE is run, the
default can be overridden, either when a pattern is compiled, or when it is
matched.
</P>
<P>
At compile time, the newline convention can be specified by the <i>options</i>
argument of <b>pcre_compile()</b>, or it can be specified by special text at the
start of the pattern itself; this overrides any other settings. See the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
page for details of the special character sequences.
</P>
<P>
In the PCRE documentation the word "newline" is used to mean "the character or
pair of characters that indicate a line break". The choice of newline
convention affects the handling of the dot, circumflex, and dollar
metacharacters, the handling of #-comments in /x mode, and, when CRLF is a
recognized line ending sequence, the match position advancement for a
non-anchored pattern. There is more detail about this in the
<a href="#execoptions">section on <b>pcre_exec()</b> options</a>
below.
</P>
<P>
The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches, which is
controlled in a similar way, but by separate options.
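<br>
As an illustration of the first four conventions described in this section, the
following sketch reports whether a newline starts at a given byte offset in a
buffer. It is not PCRE's internal code: the enum <b>nl_convention</b> and the
function <b>newline_length()</b> are hypothetical names introduced here, and
the "any Unicode newline sequence" convention is left out for brevity.

```c
#include <stddef.h>

/* Hypothetical names for illustration only; they are not part of pcre.h. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Return the length in bytes (0, 1, or 2) of the newline that starts at
   s[pos] under the given convention; 0 means no newline starts there. */
static size_t newline_length(const char *s, size_t len, size_t pos,
                             enum nl_convention conv)
{
  int cr, lf, crlf;
  if (pos >= len) return 0;
  cr = (s[pos] == '\r');
  lf = (s[pos] == '\n');
  crlf = cr && pos + 1 < len && s[pos + 1] == '\n';
  switch (conv)
    {
    case NL_CR:      return cr ? 1 : 0;
    case NL_LF:      return lf ? 1 : 0;
    case NL_CRLF:    return crlf ? 2 : 0;
    /* CRLF is tried first so that the pair counts as one newline. */
    case NL_ANYCRLF: return crlf ? 2 : ((cr || lf) ? 1 : 0);
    }
  return 0;
}
```

Under this sketch, the CR in "a\r\nb" is a one-byte newline for the CR
convention, the start of a two-byte newline for the CRLF and ANYCRLF
conventions, and not a newline at all for the LF convention.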
</P>
<br><a name="SEC4" href="#TOC1">MULTITHREADING</a><br>
<P>
The PCRE functions can be used in multithreaded applications, with the
proviso that the memory management functions pointed to by <b>pcre_malloc</b>,
<b>pcre_free</b>, <b>pcre_stack_malloc</b>, and <b>pcre_stack_free</b>, and the
callout function pointed to by <b>pcre_callout</b>, are shared by all threads.
</P>
<P>
The compiled form of a regular expression is not altered during matching, so
the same compiled pattern can safely be used by several threads at once.
</P>
<br><a name="SEC5" href="#TOC1">SAVING PRECOMPILED PATTERNS FOR LATER USE</a><br>
<P>
The compiled form of a regular expression can be saved and re-used at a later
time, possibly by a different program, and even on a host other than the one on
which it was compiled. Details are given in the
<a href="pcreprecompile.html"><b>pcreprecompile</b></a>
documentation. However, compiling a regular expression with one version of PCRE
for use with a different version is not guaranteed to work and may cause
crashes.
</P>
<br><a name="SEC6" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br>
<P>
<b>int pcre_config(int <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
The function <b>pcre_config()</b> makes it possible for a PCRE client to
discover which optional features have been compiled into the PCRE library. The
<a href="pcrebuild.html"><b>pcrebuild</b></a>
documentation has more details about these optional features.
</P>
<P>
The first argument for <b>pcre_config()</b> is an integer, specifying which
information is required; the second argument is a pointer to a variable into
which the information is placed. The following information is available:
<pre>
  PCRE_CONFIG_UTF8
</pre>
The output is an integer that is set to one if UTF-8 support is available;
otherwise it is set to zero.
<pre>
  PCRE_CONFIG_UNICODE_PROPERTIES
</pre>
The output is an integer that is set to one if support for Unicode character
properties is available; otherwise it is set to zero.
<pre>
  PCRE_CONFIG_NEWLINE
</pre>
The output is an integer whose value specifies the default character sequence
that is recognized as meaning "newline". The five values that are supported
are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, and -1 for ANY.
Though they are derived from ASCII, the same values are returned in EBCDIC
environments. The default should normally correspond to the standard sequence
for your operating system.
<pre>
  PCRE_CONFIG_BSR
</pre>
The output is an integer whose value indicates what character sequences the \R
escape sequence matches by default. A value of 0 means that \R matches any
Unicode line ending sequence; a value of 1 means that \R matches only CR, LF,
or CRLF. The default can be overridden when a pattern is compiled or matched.
<pre>
  PCRE_CONFIG_LINK_SIZE
</pre>
The output is an integer that contains the number of bytes used for internal
linkage in compiled regular expressions. The value is 2, 3, or 4. Larger values
allow larger regular expressions to be compiled, at the expense of slower
matching. The default value of 2 is sufficient for all but the most massive
patterns, since it allows the compiled pattern to be up to 64K in size.
<pre>
  PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
</pre>
The output is an integer that contains the threshold above which the POSIX
interface uses <b>malloc()</b> for output vectors. Further details are given in
the
<a href="pcreposix.html"><b>pcreposix</b></a>
documentation.
<pre>
  PCRE_CONFIG_MATCH_LIMIT
</pre>
The output is a long integer that gives the default limit for the number of
internal matching function calls in a <b>pcre_exec()</b> execution.
Further
details are given with <b>pcre_exec()</b> below.
<pre>
  PCRE_CONFIG_MATCH_LIMIT_RECURSION
</pre>
The output is a long integer that gives the default limit for the depth of
recursion when calling the internal matching function in a <b>pcre_exec()</b>
execution. Further details are given with <b>pcre_exec()</b> below.
<pre>
  PCRE_CONFIG_STACKRECURSE
</pre>
The output is an integer that is set to one if internal recursion when running
<b>pcre_exec()</b> is implemented by recursive function calls that use the stack
to remember their state. This is the usual way that PCRE is compiled. The
output is zero if PCRE was compiled to use blocks of data on the heap instead
of recursive function calls. In this case, <b>pcre_stack_malloc</b> and
<b>pcre_stack_free</b> are called to manage memory blocks on the heap, thus
avoiding the use of the stack.
</P>
<br><a name="SEC7" href="#TOC1">COMPILING A PATTERN</a><br>
<P>
<b>pcre *pcre_compile(const char *<i>pattern</i>, int <i>options</i>,</b>
<b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
<b>const unsigned char *<i>tableptr</i>);</b>
<b>pcre *pcre_compile2(const char *<i>pattern</i>, int <i>options</i>,</b>
<b>int *<i>errorcodeptr</i>,</b>
<b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
<b>const unsigned char *<i>tableptr</i>);</b>
</P>
<P>
Either of the functions <b>pcre_compile()</b> or <b>pcre_compile2()</b> can be
called to compile a pattern into an internal form. The only difference between
the two interfaces is that <b>pcre_compile2()</b> has an additional argument,
<i>errorcodeptr</i>, via which a numerical error code can be returned. To avoid
too much repetition, we refer just to <b>pcre_compile()</b> below, but the
information applies equally to <b>pcre_compile2()</b>.
</P>
<P>
The pattern is a C string terminated by a binary zero, and is passed in the
<i>pattern</i> argument.
A pointer to a single block of memory that is obtained
via <b>pcre_malloc</b> is returned. This contains the compiled code and related
data. The <b>pcre</b> type is defined for the returned block; this is a typedef
for a structure whose contents are not externally defined. It is up to the
caller to free the memory (via <b>pcre_free</b>) when it is no longer required.
</P>
<P>
Although the compiled code of a PCRE regex is relocatable, that is, it does not
depend on memory location, the complete <b>pcre</b> data block is not
fully relocatable, because it may contain a copy of the <i>tableptr</i>
argument, which is an address (see below).
</P>
<P>
The <i>options</i> argument contains various bit settings that affect the
compilation. It should be zero if no options are required. The available
options are described below. Some of them (in particular, those that are
compatible with Perl, but some others as well) can also be set and unset from
within the pattern (see the detailed description in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation). For those options that can be different in different parts of
the pattern, the contents of the <i>options</i> argument specify their
settings at the start of compilation and execution. The PCRE_ANCHORED,
PCRE_BSR_<i>xxx</i>, PCRE_NEWLINE_<i>xxx</i>, PCRE_NO_UTF8_CHECK, and
PCRE_NO_START_OPTIMIZE options can be set at the time of matching as well as at
compile time.
</P>
<P>
If <i>errptr</i> is NULL, <b>pcre_compile()</b> returns NULL immediately.
Otherwise, if compilation of a pattern fails, <b>pcre_compile()</b> returns
NULL, and sets the variable pointed to by <i>errptr</i> to point to a textual
error message. This is a static string that is part of the library. You must
not try to free it.
The offset from the start of the pattern to the byte that
was being processed when the error was discovered is placed in the variable
pointed to by <i>erroffset</i>, which must not be NULL. If it is, an immediate
error is given. Some errors are not detected until checks are carried out when
the whole pattern has been scanned; in this case the offset is set to the end
of the pattern.
</P>
<P>
Note that the offset is in bytes, not characters, even in UTF-8 mode. It may
point into the middle of a UTF-8 character (for example, when
PCRE_ERROR_BADUTF8 is returned for an invalid UTF-8 string).
</P>
<P>
If <b>pcre_compile2()</b> is used instead of <b>pcre_compile()</b>, and the
<i>errorcodeptr</i> argument is not NULL, a non-zero error code number is
returned via this argument in the event of an error. This is in addition to the
textual error message. Error codes and messages are listed below.
</P>
<P>
If the final argument, <i>tableptr</i>, is NULL, PCRE uses a default set of
character tables that are built when PCRE is compiled, using the default C
locale. Otherwise, <i>tableptr</i> must be an address that is the result of a
call to <b>pcre_maketables()</b>. This value is stored with the compiled
pattern, and used again by <b>pcre_exec()</b>, unless another table pointer is
passed to it. For more discussion, see the section on locale support below.
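<br>
The byte offset returned through <i>erroffset</i>, described above, can be used
to point at the failing position when an error is reported. The following is a
minimal sketch that does not call PCRE itself: <b>make_error_marker()</b> is a
hypothetical helper, and the offset passed to it is assumed to come from a
failed <b>pcre_compile()</b> call. Because the offset counts bytes, the marker
lines up visually only when the preceding characters are single bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Fills buf with erroffset spaces
   followed by '^', so that when buf is printed on the line below the
   pattern, the caret sits under byte number erroffset. Returns 0 on
   success, or -1 if the offset lies outside the pattern or buf is too
   small to hold the marker and its terminating zero. */
static int make_error_marker(const char *pattern, int erroffset,
                             char *buf, size_t bufsize)
{
  size_t len = strlen(pattern);
  if (erroffset < 0 || (size_t)erroffset > len) return -1;
  if (bufsize < (size_t)erroffset + 2) return -1;
  memset(buf, ' ', (size_t)erroffset);
  buf[erroffset] = '^';
  buf[erroffset + 1] = '\0';
  return 0;
}
```

An offset equal to the length of the pattern is accepted, because that is the
value set for errors that are detected only after the whole pattern has been
scanned; in that case the caret is placed just past the last byte.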
</P>
<P>
This code fragment shows a typical straightforward call to <b>pcre_compile()</b>:
<pre>
  pcre *re;
  const char *error;
  int erroffset;
  re = pcre_compile(
    "^A.*Z",          /* the pattern */
    0,                /* default options */
    &amp;error,           /* for error message */
    &amp;erroffset,       /* for error offset */
    NULL);            /* use default character tables */
</pre>
The following names for option bits are defined in the <b>pcre.h</b> header
file:
<pre>
  PCRE_ANCHORED
</pre>
If this bit is set, the pattern is forced to be "anchored", that is, it is
constrained to match only at the first matching point in the string that is
being searched (the "subject string"). This effect can also be achieved by
appropriate constructs in the pattern itself, which is the only way to do it in
Perl.
<pre>
  PCRE_AUTO_CALLOUT
</pre>
If this bit is set, <b>pcre_compile()</b> automatically inserts callout items,
all with number 255, before each pattern item. For discussion of the callout
facility, see the
<a href="pcrecallout.html"><b>pcrecallout</b></a>
documentation.
<pre>
  PCRE_BSR_ANYCRLF
  PCRE_BSR_UNICODE
</pre>
These options (which are mutually exclusive) control what the \R escape
sequence matches. The choice is either to match only CR, LF, or CRLF, or to
match any Unicode newline sequence. The default is specified when PCRE is
built. It can be overridden from within the pattern, or by setting an option
when a compiled pattern is matched.
<pre>
  PCRE_CASELESS
</pre>
If this bit is set, letters in the pattern match both upper and lower case
letters. It is equivalent to Perl's /i option, and it can be changed within a
pattern by a (?i) option setting. In UTF-8 mode, PCRE always understands the
concept of case for characters whose values are less than 128, so caseless
matching is always possible.
For characters with higher values, the concept of
case is supported if PCRE is compiled with Unicode property support, but not
otherwise. If you want to use caseless matching for characters 128 and above,
you must ensure that PCRE is compiled with Unicode property support as well as
with UTF-8 support.
<pre>
  PCRE_DOLLAR_ENDONLY
</pre>
If this bit is set, a dollar metacharacter in the pattern matches only at the
end of the subject string. Without this option, a dollar also matches
immediately before a newline at the end of the string (but not before any other
newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
There is no equivalent to this option in Perl, and no way to set it within a
pattern.
<pre>
  PCRE_DOTALL
</pre>
If this bit is set, a dot metacharacter in the pattern matches a character of
any value, including one that indicates a newline. However, it only ever
matches one character, even if newlines are coded as CRLF. Without this option,
a dot does not match when the current position is at a newline. This option is
equivalent to Perl's /s option, and it can be changed within a pattern by a
(?s) option setting. A negative class such as [^a] always matches newline
characters, independent of the setting of this option.
<pre>
  PCRE_DUPNAMES
</pre>
If this bit is set, names used to identify capturing subpatterns need not be
unique. This can be helpful for certain types of pattern when it is known that
only one instance of the named subpattern can ever be matched. There are more
details of named subpatterns below; see also the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation.
<pre>
  PCRE_EXTENDED
</pre>
If this bit is set, whitespace data characters in the pattern are totally
ignored except when escaped or inside a character class. Whitespace does not
include the VT character (code 11).
In addition, characters between an
unescaped # outside a character class and the next newline, inclusive, are also
ignored. This is equivalent to Perl's /x option, and it can be changed within a
pattern by a (?x) option setting.
</P>
<P>
Which characters are interpreted as newlines
is controlled by the options passed to <b>pcre_compile()</b> or by a special
sequence at the start of the pattern, as described in the section entitled
<a href="pcrepattern.html#newlines">"Newline conventions"</a>
in the <b>pcrepattern</b> documentation. Note that the end of this type of
comment is a literal newline sequence in the pattern; escape sequences that
happen to represent a newline do not count.
</P>
<P>
This option makes it possible to include comments inside complicated patterns.
Note, however, that this applies only to data characters. Whitespace characters
may never appear within special character sequences in a pattern, for example
within the sequence (?( that introduces a conditional subpattern.
<pre>
  PCRE_EXTRA
</pre>
This option was invented in order to turn on additional functionality of PCRE
that is incompatible with Perl, but it is currently of very little use. When
set, any backslash in a pattern that is followed by a letter that has no
special meaning causes an error, thus reserving these combinations for future
expansion. By default, as in Perl, a backslash followed by a letter with no
special meaning is treated as a literal. (Perl can, however, be persuaded to
give an error for this, by running it with the -w option.) There are at present
no other features controlled by this option. It can also be set by a (?X)
option setting within a pattern.
<pre>
  PCRE_FIRSTLINE
</pre>
If this option is set, an unanchored pattern is required to match before or at
the first newline in the subject string, though the matched text may continue
over the newline.
<pre>
  PCRE_JAVASCRIPT_COMPAT
</pre>
If this option is set, PCRE's behaviour is changed in some ways so that it is
compatible with JavaScript rather than Perl. The changes are as follows:
</P>
<P>
(1) A lone closing square bracket in a pattern causes a compile-time error,
because this is illegal in JavaScript (by default it is treated as a data
character). Thus, the pattern AB]CD becomes illegal when this option is set.
</P>
<P>
(2) At run time, a back reference to an unset subpattern group matches an empty
string (by default this causes the current matching alternative to fail). A
pattern such as (\1)(a) succeeds when this option is set (assuming it can find
an "a" in the subject), whereas it fails by default, for Perl compatibility.
<pre>
  PCRE_MULTILINE
</pre>
By default, PCRE treats the subject string as consisting of a single line of
characters (even if it actually contains newlines). The "start of line"
metacharacter (^) matches only at the start of the string, while the "end of
line" metacharacter ($) matches only at the end of the string, or before a
terminating newline (unless PCRE_DOLLAR_ENDONLY is set). This is the same as
Perl.
</P>
<P>
When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs
match immediately following or immediately before internal newlines in the
subject string, respectively, as well as at the very start and end. This is
equivalent to Perl's /m option, and it can be changed within a pattern by a
(?m) option setting. If there are no newlines in a subject string, or no
occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no effect.
<pre>
  PCRE_NEWLINE_CR
  PCRE_NEWLINE_LF
  PCRE_NEWLINE_CRLF
  PCRE_NEWLINE_ANYCRLF
  PCRE_NEWLINE_ANY
</pre>
These options override the default newline definition that was chosen when PCRE
was built.
Setting the first or the second specifies that a newline is
indicated by a single character (CR or LF, respectively). Setting
PCRE_NEWLINE_CRLF specifies that a newline is indicated by the two-character
CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies that any of the three
preceding sequences should be recognized. Setting PCRE_NEWLINE_ANY specifies
that any Unicode newline sequence should be recognized. The Unicode newline
sequences are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029). The last two are
recognized only in UTF-8 mode.
</P>
<P>
The newline setting in the options word uses three bits that are treated
as a number, giving eight possibilities. Currently only six are used (default
plus the five values above). This means that if you set more than one newline
option, the combination may or may not be sensible. For example,
PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to PCRE_NEWLINE_CRLF, but
other combinations may yield unused numbers and cause an error.
</P>
<P>
The only time that a line break in a pattern is specially recognized when
compiling is when PCRE_EXTENDED is set. CR and LF are whitespace characters,
and so are ignored in this mode. Also, an unescaped # outside a character class
indicates a comment that lasts until after the next line break sequence. In
other circumstances, line break sequences in patterns are treated as literal
data.
</P>
<P>
The newline option that is set at compile time becomes the default that is used
for <b>pcre_exec()</b> and <b>pcre_dfa_exec()</b>, but it can be overridden.
<pre>
  PCRE_NO_AUTO_CAPTURE
</pre>
If this option is set, it disables the use of numbered capturing parentheses in
the pattern. Any opening parenthesis that is not followed by ?
behaves as if it 671were followed by ?: but named parentheses can still be used for capturing (and 672they acquire numbers in the usual way). There is no equivalent of this option 673in Perl. 674<pre> 675 PCRE_NO_START_OPTIMIZE 676</pre> 677This is an option that acts at matching time; that is, it is really an option 678for <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. If it is set at compile time, 679it is remembered with the compiled pattern and assumed at matching time. For 680details see the discussion of PCRE_NO_START_OPTIMIZE 681<a href="#execoptions">below.</a> 682<pre> 683 PCRE_UCP 684</pre> 685This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, 686\w, and some of the POSIX character classes. By default, only ASCII characters 687are recognized, but if PCRE_UCP is set, Unicode properties are used instead to 688classify characters. More details are given in the section on 689<a href="pcrepattern.html#genericchartypes">generic character types</a> 690in the 691<a href="pcrepattern.html"><b>pcrepattern</b></a> 692page. If you set PCRE_UCP, matching one of the items it affects takes much 693longer. The option is available only if PCRE has been compiled with Unicode 694property support. 695<pre> 696 PCRE_UNGREEDY 697</pre> 698This option inverts the "greediness" of the quantifiers so that they are not 699greedy by default, but become greedy if followed by "?". It is not compatible 700with Perl. It can also be set by a (?U) option setting within the pattern. 701<pre> 702 PCRE_UTF8 703</pre> 704This option causes PCRE to regard both the pattern and the subject as strings 705of UTF-8 characters instead of single-byte character strings. However, it is 706available only when PCRE is built to include UTF-8 support. If not, the use 707of this option provokes an error.
Details of how this option changes the 708behaviour of PCRE are given in the 709<a href="pcre.html#utf8support">section on UTF-8 support</a> 710in the main 711<a href="pcre.html"><b>pcre</b></a> 712page. 713<pre> 714 PCRE_NO_UTF8_CHECK 715</pre> 716When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is 717automatically checked. There is a discussion about the 718<a href="pcre.html#utf8strings">validity of UTF-8 strings</a> 719in the main 720<a href="pcre.html"><b>pcre</b></a> 721page. If an invalid UTF-8 sequence of bytes is found, <b>pcre_compile()</b> 722returns an error. If you already know that your pattern is valid, and you want 723to skip this check for performance reasons, you can set the PCRE_NO_UTF8_CHECK 724option. When it is set, the effect of passing an invalid UTF-8 string as a 725pattern is undefined. It may cause your program to crash. Note that this option 726can also be passed to <b>pcre_exec()</b> and <b>pcre_dfa_exec()</b>, to suppress 727the UTF-8 validity checking of subject strings. 728</P> 729<br><a name="SEC8" href="#TOC1">COMPILATION ERROR CODES</a><br> 730<P> 731The following table lists the error codes that may be returned by 732<b>pcre_compile2()</b>, along with the error messages that may be returned by 733both compiling functions. As PCRE has developed, some error codes have fallen 734out of use. To avoid confusion, they have not been re-used. 735<pre> 736 0 no error 737 1 \ at end of pattern 738 2 \c at end of pattern 739 3 unrecognized character follows \ 740 4 numbers out of order in {} quantifier 741 5 number too big in {} quantifier 742 6 missing terminating ] for character class 743 7 invalid escape sequence in character class 744 8 range out of order in character class 745 9 nothing to repeat 746 10 [this code is not in use] 747 11 internal error: unexpected repeat 748 12 unrecognized character after (?
or (?- 749 13 POSIX named classes are supported only within a class 750 14 missing ) 751 15 reference to non-existent subpattern 752 16 erroffset passed as NULL 753 17 unknown option bit(s) set 754 18 missing ) after comment 755 19 [this code is not in use] 756 20 regular expression is too large 757 21 failed to get memory 758 22 unmatched parentheses 759 23 internal error: code overflow 760 24 unrecognized character after (?< 761 25 lookbehind assertion is not fixed length 762 26 malformed number or name after (?( 763 27 conditional group contains more than two branches 764 28 assertion expected after (?( 765 29 (?R or (?[+-]digits must be followed by ) 766 30 unknown POSIX class name 767 31 POSIX collating elements are not supported 768 32 this version of PCRE is not compiled with PCRE_UTF8 support 769 33 [this code is not in use] 770 34 character value in \x{...} sequence is too large 771 35 invalid condition (?(0) 772 36 \C not allowed in lookbehind assertion 773 37 PCRE does not support \L, \l, \N, \U, or \u 774 38 number after (?C is > 255 775 39 closing ) for (?C expected 776 40 recursive call could loop indefinitely 777 41 unrecognized character after (?P 778 42 syntax error in subpattern name (missing terminator) 779 43 two named subpatterns have the same name 780 44 invalid UTF-8 string 781 45 support for \P, \p, and \X has not been compiled 782 46 malformed \P or \p sequence 783 47 unknown property name after \P or \p 784 48 subpattern name is too long (maximum 32 characters) 785 49 too many named subpatterns (maximum 10000) 786 50 [this code is not in use] 787 51 octal value is greater than \377 (not in UTF-8 mode) 788 52 internal error: overran compiling workspace 789 53 internal error: previously-checked referenced subpattern 790 not found 791 54 DEFINE group contains more than one branch 792 55 repeating a DEFINE group is not allowed 793 56 inconsistent NEWLINE options 794 57 \g is not followed by a braced, angle-bracketed, or quoted 795 name/number 
or by a plain number 796 58 a numbered reference must not be zero 797 59 an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT) 798 60 (*VERB) not recognized 799 61 number is too big 800 62 subpattern name expected 801 63 digit expected after (?+ 802 64 ] is an invalid data character in JavaScript compatibility mode 803 65 different names for subpatterns of the same number are 804 not allowed 805 66 (*MARK) must have an argument 806 67 this version of PCRE is not compiled with PCRE_UCP support 807</pre> 808The numbers 32 and 10000 in errors 48 and 49 are defaults; different values may 809be used if the limits were changed when PCRE was built. 810</P> 811<br><a name="SEC9" href="#TOC1">STUDYING A PATTERN</a><br> 812<P> 813<b>pcre_extra *pcre_study(const pcre *<i>code</i>, int <i>options</i>,</b> 814<b>const char **<i>errptr</i>);</b> 815</P> 816<P> 817If a compiled pattern is going to be used several times, it is worth spending 818more time analyzing it in order to speed up the time taken for matching. The 819function <b>pcre_study()</b> takes a pointer to a compiled pattern as its first 820argument. If studying the pattern produces additional information that will 821help speed up matching, <b>pcre_study()</b> returns a pointer to a 822<b>pcre_extra</b> block, in which the <i>study_data</i> field points to the 823results of the study. 824</P> 825<P> 826The returned value from <b>pcre_study()</b> can be passed directly to 827<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. However, a <b>pcre_extra</b> block 828also contains other fields that can be set by the caller before the block is 829passed; these are described 830<a href="#extradata">below</a> 831in the section on matching a pattern. 832</P> 833<P> 834If studying the pattern does not produce any useful information, 835<b>pcre_study()</b> returns NULL.
In that circumstance, if the calling program 836wants to pass any of the other fields to <b>pcre_exec()</b> or 837<b>pcre_dfa_exec()</b>, it must set up its own <b>pcre_extra</b> block. 838</P> 839<P> 840The second argument of <b>pcre_study()</b> contains option bits. At present, no 841options are defined, and this argument should always be zero. 842</P> 843<P> 844The third argument for <b>pcre_study()</b> is a pointer for an error message. If 845studying succeeds (even if no data is returned), the variable it points to is 846set to NULL. Otherwise it is set to point to a textual error message. This is a 847static string that is part of the library. You must not try to free it. You 848should test the error pointer for NULL after calling <b>pcre_study()</b>, to be 849sure that it has run successfully. 850</P> 851<P> 852This is a typical call to <b>pcre_study</b>(): 853<pre> 854 pcre_extra *pe; 855 pe = pcre_study( 856 re, /* result of pcre_compile() */ 857 0, /* no options exist */ 858 &error); /* set to NULL or points to a message */ 859</pre> 860Studying a pattern does two things: first, a lower bound for the length of 861subject string that is needed to match the pattern is computed. This does not 862mean that there are any strings of that length that match, but it does 863guarantee that no shorter strings match. The value is used by 864<b>pcre_exec()</b> and <b>pcre_dfa_exec()</b> to avoid wasting time by trying to 865match strings that are shorter than the lower bound. You can find out the value 866in a calling program via the <b>pcre_fullinfo()</b> function. 867</P> 868<P> 869Studying a pattern is also useful for non-anchored patterns that do not have a 870single fixed starting character. A bitmap of possible starting bytes is 871created. This speeds up finding a position in the subject at which to start 872matching. 
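</P>
<P>
As a rough illustration of why a bitmap of possible starting bytes helps, the sketch below scans a subject for the first byte whose bit is set in a 256-bit map, so that positions which cannot begin a match are skipped cheaply. It is a model only: the names <b>start_map</b>, <b>map_clear()</b>, <b>map_set()</b>, <b>map_test()</b> and <b>find_start()</b> are hypothetical, and the bit layout (bit c%8 of byte c/8) is an assumption made for this example rather than something taken from the PCRE API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative 256-bit map: one bit per possible byte value.
   The layout chosen here is an assumption made for this sketch. */
typedef struct { unsigned char bits[32]; } start_map;

static void map_clear(start_map *m)
{
    memset(m->bits, 0, sizeof m->bits);
}

static void map_set(start_map *m, unsigned char c)
{
    m->bits[c / 8] |= (unsigned char)(1u << (c % 8));
}

static int map_test(const start_map *m, unsigned char c)
{
    return (m->bits[c / 8] >> (c % 8)) & 1;
}

/* Return the first offset at or after 'from' whose byte is a possible
   starting byte, or -1 if there is none. */
static long find_start(const start_map *m, const char *subject,
                       size_t length, size_t from)
{
    size_t i;
    assert(m != NULL && subject != NULL);
    for (i = from; i < length; i++) {
        if (map_test(m, (unsigned char)subject[i]))
            return (long)i;
    }
    return -1;
}
```

A caller would attempt a full match only at the offsets that <b>find_start()</b> reports, instead of at every position in the subject.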
873</P> 874<P> 875The two optimizations just described can be disabled by setting the 876PCRE_NO_START_OPTIMIZE option when calling <b>pcre_exec()</b> or 877<b>pcre_dfa_exec()</b>. You might want to do this if your pattern contains 878callouts or (*MARK), and you want to make use of these facilities in cases 879where matching fails. See the discussion of PCRE_NO_START_OPTIMIZE 880<a href="#execoptions">below.</a> 881<a name="localesupport"></a></P> 882<br><a name="SEC10" href="#TOC1">LOCALE SUPPORT</a><br> 883<P> 884PCRE handles caseless matching, and determines whether characters are letters, 885digits, or whatever, by reference to a set of tables, indexed by character 886value. When running in UTF-8 mode, this applies only to characters with codes 887less than 128. By default, higher-valued codes never match escapes such as \w 888or \d, but they can be tested with \p if PCRE is built with Unicode character 889property support. Alternatively, the PCRE_UCP option can be set at compile 890time; this causes \w and friends to use Unicode property support instead of 891built-in tables. The use of locales with Unicode is discouraged. If you are 892handling characters with codes greater than 128, you should either use UTF-8 893and Unicode, or use locales, but not try to mix the two. 894</P> 895<P> 896PCRE contains an internal set of tables that are used when the final argument 897of <b>pcre_compile()</b> is NULL. These are sufficient for many applications. 898Normally, the internal tables recognize only ASCII characters. However, when 899PCRE is built, it is possible to cause the internal tables to be rebuilt in the 900default "C" locale of the local system, which may cause them to be different. 901</P> 902<P> 903The internal tables can always be overridden by tables supplied by the 904application that calls PCRE. These may be created in a different locale from 905the default. 
As more and more applications change to using Unicode, the need 906for this locale support is expected to die away. 907</P> 908<P> 909External tables are built by calling the <b>pcre_maketables()</b> function, 910which has no arguments, in the relevant locale. The result can then be passed 911to <b>pcre_compile()</b> or <b>pcre_exec()</b> as often as necessary. For 912example, to build and use tables that are appropriate for the French locale 913(where accented characters with values greater than 128 are treated as letters), 914the following code could be used: 915<pre> 916 setlocale(LC_CTYPE, "fr_FR"); 917 tables = pcre_maketables(); 918 re = pcre_compile(..., tables); 919</pre> 920The locale name "fr_FR" is used on Linux and other Unix-like systems; if you 921are using Windows, the name for the French locale is "french". 922</P> 923<P> 924When <b>pcre_maketables()</b> runs, the tables are built in memory that is 925obtained via <b>pcre_malloc</b>. It is the caller's responsibility to ensure 926that the memory containing the tables remains available for as long as it is 927needed. 928</P> 929<P> 930The pointer that is passed to <b>pcre_compile()</b> is saved with the compiled 931pattern, and the same tables are used via this pointer by <b>pcre_study()</b> 932and normally also by <b>pcre_exec()</b>. Thus, by default, for any single 933pattern, compilation, studying and matching all happen in the same locale, but 934different patterns can be compiled in different locales. 935</P> 936<P> 937It is possible to pass a table pointer or NULL (indicating the use of the 938internal tables) to <b>pcre_exec()</b>. Although not intended for this purpose, 939this facility could be used to match a pattern in a different locale from the 940one in which it was compiled. Passing table pointers at run time is discussed 941below in the section on matching a pattern. 
942</P> 943<br><a name="SEC11" href="#TOC1">INFORMATION ABOUT A PATTERN</a><br> 944<P> 945<b>int pcre_fullinfo(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b> 946<b>int <i>what</i>, void *<i>where</i>);</b> 947</P> 948<P> 949The <b>pcre_fullinfo()</b> function returns information about a compiled 950pattern. It replaces the obsolete <b>pcre_info()</b> function, which is 951nevertheless retained for backwards compatibility (and is documented below). 952</P> 953<P> 954The first argument for <b>pcre_fullinfo()</b> is a pointer to the compiled 955pattern. The second argument is the result of <b>pcre_study()</b>, or NULL if 956the pattern was not studied. The third argument specifies which piece of 957information is required, and the fourth argument is a pointer to a variable 958to receive the data. The yield of the function is zero for success, or one of 959the following negative numbers: 960<pre> 961 PCRE_ERROR_NULL the argument <i>code</i> was NULL 962 the argument <i>where</i> was NULL 963 PCRE_ERROR_BADMAGIC the "magic number" was not found 964 PCRE_ERROR_BADOPTION the value of <i>what</i> was invalid 965</pre> 966The "magic number" is placed at the start of each compiled pattern as a simple 967check against passing an arbitrary memory pointer. Here is a typical call of 968<b>pcre_fullinfo()</b>, to obtain the length of the compiled pattern: 969<pre> 970 int rc; 971 size_t length; 972 rc = pcre_fullinfo( 973 re, /* result of pcre_compile() */ 974 pe, /* result of pcre_study(), or NULL */ 975 PCRE_INFO_SIZE, /* what is required */ 976 &length); /* where to put the data */ 977</pre> 978The possible values for the third argument are defined in <b>pcre.h</b>, and are 979as follows: 980<pre> 981 PCRE_INFO_BACKREFMAX 982</pre> 983Return the number of the highest back reference in the pattern. The fourth 984argument should point to an <b>int</b> variable. Zero is returned if there are 985no back references.
986<pre> 987 PCRE_INFO_CAPTURECOUNT 988</pre> 989Return the number of capturing subpatterns in the pattern. The fourth argument 990should point to an <b>int</b> variable. 991<pre> 992 PCRE_INFO_DEFAULT_TABLES 993</pre> 994Return a pointer to the internal default character tables within PCRE. The 995fourth argument should point to an <b>unsigned char *</b> variable. This 996information call is provided for internal use by the <b>pcre_study()</b> 997function. External callers can cause PCRE to use its internal tables by passing 998a NULL table pointer. 999<pre> 1000 PCRE_INFO_FIRSTBYTE 1001</pre> 1002Return information about the first byte of any matched string, for a 1003non-anchored pattern. The fourth argument should point to an <b>int</b> 1004variable. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name is 1005still recognized for backwards compatibility.) 1006</P> 1007<P> 1008If there is a fixed first byte, for example, from a pattern such as 1009(cat|cow|coyote), its value is returned. Otherwise, if either 1010<br> 1011<br> 1012(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch 1013starts with "^", or 1014<br> 1015<br> 1016(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set 1017(if it were set, the pattern would be anchored), 1018<br> 1019<br> 1020-1 is returned, indicating that the pattern matches only at the start of a 1021subject string or after any newline within the string. Otherwise -2 is 1022returned. For anchored patterns, -2 is returned. 1023<pre> 1024 PCRE_INFO_FIRSTTABLE 1025</pre> 1026If the pattern was studied, and this resulted in the construction of a 256-bit 1027table indicating a fixed set of bytes for the first byte in any matching 1028string, a pointer to the table is returned. Otherwise NULL is returned. The 1029fourth argument should point to an <b>unsigned char *</b> variable. 
1030<pre> 1031 PCRE_INFO_HASCRORLF 1032</pre> 1033Return 1 if the pattern contains any explicit matches for CR or LF characters, 1034otherwise 0. The fourth argument should point to an <b>int</b> variable. An 1035explicit match is either a literal CR or LF character, or \r or \n. 1036<pre> 1037 PCRE_INFO_JCHANGED 1038</pre> 1039Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise 10400. The fourth argument should point to an <b>int</b> variable. (?J) and 1041(?-J) set and unset the local PCRE_DUPNAMES option, respectively. 1042<pre> 1043 PCRE_INFO_LASTLITERAL 1044</pre> 1045Return the value of the rightmost literal byte that must exist in any matched 1046string, other than at its start, if such a byte has been recorded. The fourth 1047argument should point to an <b>int</b> variable. If there is no such byte, -1 is 1048returned. For anchored patterns, a last literal byte is recorded only if it 1049follows something of variable length. For example, for the pattern 1050/^a\d+z\d+/ the returned value is "z", but for /^a\dz\d/ the returned value 1051is -1. 1052<pre> 1053 PCRE_INFO_MINLENGTH 1054</pre> 1055If the pattern was studied and a minimum length for matching subject strings 1056was computed, its value is returned. Otherwise the returned value is -1. The 1057value is a number of characters, not bytes (this may be relevant in UTF-8 1058mode). The fourth argument should point to an <b>int</b> variable. A 1059non-negative value is a lower bound to the length of any matching string. There 1060may not be any strings of that length that do actually match, but every string 1061that does match is at least that long. 1062<pre> 1063 PCRE_INFO_NAMECOUNT 1064 PCRE_INFO_NAMEENTRYSIZE 1065 PCRE_INFO_NAMETABLE 1066</pre> 1067PCRE supports the use of named as well as numbered capturing parentheses. The 1068names are just an additional way of identifying the parentheses, which still 1069acquire numbers. 
Several convenience functions such as 1070<b>pcre_get_named_substring()</b> are provided for extracting captured 1071substrings by name. It is also possible to extract the data directly, by first 1072converting the name to a number in order to access the correct pointers in the 1073output vector (described with <b>pcre_exec()</b> below). To do the conversion, 1074you need to use the name-to-number map, which is described by these three 1075values. 1076</P> 1077<P> 1078The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT gives 1079the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size of each 1080entry; both of these return an <b>int</b> value. The entry size depends on the 1081length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the first 1082entry of the table (a pointer to <b>char</b>). The first two bytes of each entry 1083are the number of the capturing parenthesis, most significant byte first. The 1084rest of the entry is the corresponding name, zero terminated. 1085</P> 1086<P> 1087The names are in alphabetical order. Duplicate names may appear if (?| is used 1088to create multiple groups with the same number, as described in the 1089<a href="pcrepattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a> 1090in the 1091<a href="pcrepattern.html"><b>pcrepattern</b></a> 1092page. Duplicate names for subpatterns with different numbers are permitted only 1093if PCRE_DUPNAMES is set. In all cases of duplicate names, they appear in the 1094table in the order in which they were found in the pattern. In the absence of 1095(?| this is the order of increasing number; when (?| is used this is not 1096necessarily the case because later subpatterns may have lower numbers. 
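</P>
<P>
The entry layout just described can be walked with a few lines of C. The following is a minimal sketch, assuming the three values have already been obtained (in a real program they would come from <b>pcre_fullinfo()</b>); the function name <b>lookup_group_number()</b> is hypothetical, and the table is accessed through an <b>unsigned char</b> pointer so that the two number bytes combine without sign problems.

```c
#include <assert.h>
#include <string.h>

/* Look up 'name' in a name-to-number table laid out as described above:
   'count' entries of 'entry_size' bytes each; the first two bytes of an
   entry hold the group number (most significant byte first), followed by
   the zero-terminated name. Returns the group number, or -1 if absent. */
static int lookup_group_number(const unsigned char *table, int count,
                               int entry_size, const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entry_size >= 3);
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

The sketch uses a linear scan and returns the first entry whose name matches, so with duplicate names it reports only one of the numbers. Because the names are in alphabetical order, a binary search would also be possible.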
1097</P> 1098<P> 1099As a simple example of the name/number table, consider the following pattern 1100(assume PCRE_EXTENDED is set, so white space - including newlines - is 1101ignored): 1102<pre> 1103 (?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) ) 1104</pre> 1105There are four named subpatterns, so the table has four entries, and each entry 1106in the table is eight bytes long. The table is as follows, with non-printing 1107bytes shown in hexadecimal, and undefined bytes shown as ??: 1108<pre> 1109 00 01 d a t e 00 ?? 1110 00 05 d a y 00 ?? ?? 1111 00 04 m o n t h 00 1112 00 02 y e a r 00 ?? 1113</pre> 1114When writing code to extract data from named subpatterns using the 1115name-to-number map, remember that the length of the entries is likely to be 1116different for each compiled pattern. 1117<pre> 1118 PCRE_INFO_OKPARTIAL 1119</pre> 1120Return 1 if the pattern can be used for partial matching with 1121<b>pcre_exec()</b>, otherwise 0. The fourth argument should point to an 1122<b>int</b> variable. From release 8.00, this always returns 1, because the 1123restrictions that previously applied to partial matching have been lifted. The 1124<a href="pcrepartial.html"><b>pcrepartial</b></a> 1125documentation gives details of partial matching. 1126<pre> 1127 PCRE_INFO_OPTIONS 1128</pre> 1129Return a copy of the options with which the pattern was compiled. The fourth 1130argument should point to an <b>unsigned long int</b> variable. These option bits 1131are those specified in the call to <b>pcre_compile()</b>, modified by any 1132top-level option settings at the start of the pattern itself. In other words, 1133they are the options that will be in force when matching starts. For example, 1134if the pattern /(?im)abc(?-i)d/ is compiled with the PCRE_EXTENDED option, the 1135result is PCRE_CASELESS, PCRE_MULTILINE, and PCRE_EXTENDED.
1136</P> 1137<P> 1138A pattern is automatically anchored by PCRE if all of its top-level 1139alternatives begin with one of the following: 1140<pre> 1141 ^ unless PCRE_MULTILINE is set 1142 \A always 1143 \G always 1144 .* if PCRE_DOTALL is set and there are no back references to the subpattern in which .* appears 1145</pre> 1146For such patterns, the PCRE_ANCHORED bit is set in the options returned by 1147<b>pcre_fullinfo()</b>. 1148<pre> 1149 PCRE_INFO_SIZE 1150</pre> 1151Return the size of the compiled pattern, that is, the value that was passed as 1152the argument to <b>pcre_malloc()</b> when PCRE was getting memory in which to 1153place the compiled data. The fourth argument should point to a <b>size_t</b> 1154variable. 1155<pre> 1156 PCRE_INFO_STUDYSIZE 1157</pre> 1158Return the size of the data block pointed to by the <i>study_data</i> field in 1159a <b>pcre_extra</b> block. That is, it is the value that was passed to 1160<b>pcre_malloc()</b> when PCRE was getting memory into which to place the data 1161created by <b>pcre_study()</b>. If <b>pcre_extra</b> is NULL, or there is no 1162study data, zero is returned. The fourth argument should point to a 1163<b>size_t</b> variable. 1164</P> 1165<br><a name="SEC12" href="#TOC1">OBSOLETE INFO FUNCTION</a><br> 1166<P> 1167<b>int pcre_info(const pcre *<i>code</i>, int *<i>optptr</i>, int</b> 1168<b>*<i>firstcharptr</i>);</b> 1169</P> 1170<P> 1171The <b>pcre_info()</b> function is now obsolete because its interface is too 1172restrictive to return all the available data about a compiled pattern. New 1173programs should use <b>pcre_fullinfo()</b> instead. 
The yield of 1174<b>pcre_info()</b> is the number of capturing subpatterns, or one of the 1175following negative numbers: 1176<pre> 1177 PCRE_ERROR_NULL the argument <i>code</i> was NULL 1178 PCRE_ERROR_BADMAGIC the "magic number" was not found 1179</pre> 1180If the <i>optptr</i> argument is not NULL, a copy of the options with which the 1181pattern was compiled is placed in the integer it points to (see 1182PCRE_INFO_OPTIONS above). 1183</P> 1184<P> 1185If the pattern is not anchored and the <i>firstcharptr</i> argument is not NULL, 1186it is used to pass back information about the first character of any matched 1187string (see PCRE_INFO_FIRSTBYTE above). 1188</P> 1189<br><a name="SEC13" href="#TOC1">REFERENCE COUNTS</a><br> 1190<P> 1191<b>int pcre_refcount(pcre *<i>code</i>, int <i>adjust</i>);</b> 1192</P> 1193<P> 1194The <b>pcre_refcount()</b> function is used to maintain a reference count in the 1195data block that contains a compiled pattern. It is provided for the benefit of 1196applications that operate in an object-oriented manner, where different parts 1197of the application may be using the same compiled pattern, but you want to free 1198the block when they are all done. 1199</P> 1200<P> 1201When a pattern is compiled, the reference count field is initialized to zero. 1202It is changed only by calling this function, whose action is to add the 1203<i>adjust</i> value (which may be positive or negative) to it. The yield of the 1204function is the new value. However, the value of the count is constrained to 1205lie between 0 and 65535, inclusive. If the new value is outside these limits, 1206it is forced to the appropriate limit value. 1207</P> 1208<P> 1209Except when it is zero, the reference count is not correctly preserved if a 1210pattern is compiled on one host and then transferred to a host whose byte-order 1211is different. (This seems a highly unlikely scenario.) 
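</P>
<P>
The arithmetic described above for <b>pcre_refcount()</b> can be modelled in isolation. This sketch is not the library's implementation: <b>model_refcount()</b> is a hypothetical helper that only reproduces the documented rule (add <i>adjust</i>, then force the result into the range 0 to 65535), which may help when reasoning about what a sequence of calls will return.

```c
#include <assert.h>

/* Model of the documented arithmetic only: add 'adjust' to the current
   count and force the result into the range 0..65535. */
static int model_refcount(int current, int adjust)
{
    long long value;
    assert(current >= 0 && current <= 65535);
    value = (long long)current + (long long)adjust;
    if (value < 0) value = 0;
    if (value > 65535) value = 65535;
    return (int)value;
}
```

Under this rule, an application that adds 1 for each user of a pattern and adds -1 when a user is done could treat a returned value of zero as the signal that the block may be freed. Note that the clamping means over-releasing never produces a negative count.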
1212</P> 1213<br><a name="SEC14" href="#TOC1">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a><br> 1214<P> 1215<b>int pcre_exec(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b> 1216<b>const char *<i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b> 1217<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>);</b> 1218</P> 1219<P> 1220The function <b>pcre_exec()</b> is called to match a subject string against a 1221compiled pattern, which is passed in the <i>code</i> argument. If the 1222pattern was studied, the result of the study should be passed in the 1223<i>extra</i> argument. This function is the main matching facility of the 1224library, and it operates in a Perl-like manner. For specialist use there is 1225also an alternative matching function, which is described 1226<a href="#dfamatch">below</a> 1227in the section about the <b>pcre_dfa_exec()</b> function. 1228</P> 1229<P> 1230In most applications, the pattern will have been compiled (and optionally 1231studied) in the same process that calls <b>pcre_exec()</b>. However, it is 1232possible to save compiled patterns and study data, and then use them later 1233in different processes, possibly even on different hosts. For a discussion 1234about this, see the 1235<a href="pcreprecompile.html"><b>pcreprecompile</b></a> 1236documentation. 
1237</P> 1238<P> 1239Here is an example of a simple call to <b>pcre_exec()</b>: 1240<pre> 1241 int rc; 1242 int ovector[30]; 1243 rc = pcre_exec( 1244 re, /* result of pcre_compile() */ 1245 NULL, /* we didn't study the pattern */ 1246 "some string", /* the subject string */ 1247 11, /* the length of the subject string */ 1248 0, /* start at offset 0 in the subject */ 1249 0, /* default options */ 1250 ovector, /* vector of integers for substring information */ 1251 30); /* number of elements (NOT size in bytes) */ 1252<a name="extradata"></a></PRE> 1253</P> 1254<br><b> 1255Extra data for <b>pcre_exec()</b> 1256</b><br> 1257<P> 1258If the <i>extra</i> argument is not NULL, it must point to a <b>pcre_extra</b> 1259data block. The <b>pcre_study()</b> function returns such a block (when it 1260doesn't return NULL), but you can also create one for yourself, and pass 1261additional information in it. The <b>pcre_extra</b> block contains the following 1262fields (not necessarily in this order): 1263<pre> 1264 unsigned long int <i>flags</i>; 1265 void *<i>study_data</i>; 1266 unsigned long int <i>match_limit</i>; 1267 unsigned long int <i>match_limit_recursion</i>; 1268 void *<i>callout_data</i>; 1269 const unsigned char *<i>tables</i>; 1270 unsigned char **<i>mark</i>; 1271</pre> 1272The <i>flags</i> field is a bitmap that specifies which of the other fields 1273are set. The flag bits are: 1274<pre> 1275 PCRE_EXTRA_STUDY_DATA 1276 PCRE_EXTRA_MATCH_LIMIT 1277 PCRE_EXTRA_MATCH_LIMIT_RECURSION 1278 PCRE_EXTRA_CALLOUT_DATA 1279 PCRE_EXTRA_TABLES 1280 PCRE_EXTRA_MARK 1281</pre> 1282Other flag bits should be set to zero. The <i>study_data</i> field is set in the 1283<b>pcre_extra</b> block that is returned by <b>pcre_study()</b>, together with 1284the appropriate flag bit. You should not set this yourself, but you may add to 1285the block by setting the other fields and their corresponding flag bits. 
1286</P> 1287<P> 1288The <i>match_limit</i> field provides a means of preventing PCRE from using up a 1289vast amount of resources when running patterns that are not going to match, 1290but which have a very large number of possibilities in their search trees. The 1291classic example is a pattern that uses nested unlimited repeats. 1292</P> 1293<P> 1294Internally, PCRE uses a function called <b>match()</b> which it calls repeatedly 1295(sometimes recursively). The limit set by <i>match_limit</i> is imposed on the 1296number of times this function is called during a match, which has the effect of 1297limiting the amount of backtracking that can take place. For patterns that are 1298not anchored, the count restarts from zero for each position in the subject 1299string. 1300</P> 1301<P> 1302The default value for the limit can be set when PCRE is built; the default 1303default is 10 million, which handles all but the most extreme cases. You can 1304override the default by supplying <b>pcre_exec()</b> with a <b>pcre_extra</b> 1305block in which <i>match_limit</i> is set, and PCRE_EXTRA_MATCH_LIMIT is set in 1306the <i>flags</i> field. If the limit is exceeded, <b>pcre_exec()</b> returns 1307PCRE_ERROR_MATCHLIMIT. 1308</P> 1309<P> 1310The <i>match_limit_recursion</i> field is similar to <i>match_limit</i>, but 1311instead of limiting the total number of times that <b>match()</b> is called, it 1312limits the depth of recursion. The recursion depth is a smaller number than the 1313total number of calls, because not all calls to <b>match()</b> are recursive. 1314This limit is of use only if it is set smaller than <i>match_limit</i>. 1315</P> 1316<P> 1317Limiting the recursion depth limits the amount of stack that can be used, or, 1318when PCRE has been compiled to use memory on the heap instead of the stack, the 1319amount of heap memory that can be used.
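</P>
<P>
To see why a limit on the number of calls bounds the backtracking caused by nested unlimited repeats, consider the toy matcher below. It is a sketch written for this illustration, not PCRE's internal <b>match()</b> function: <b>toy_match()</b> is a hypothetical name, and it handles only the single fixed pattern (a+)+b, anchored at a given position, counting its own invocations and giving up once a caller-supplied limit is exceeded.

```c
#include <assert.h>
#include <stddef.h>

/* Toy backtracking matcher for the single fixed pattern (a+)+b, anchored
   at 'pos' in the zero-terminated string 's'. Every invocation is counted
   in *calls; once the count exceeds 'limit' the search is abandoned.
   Returns 1 for a match, 0 for no match, -1 if the limit was exceeded. */
static int toy_match(const char *s, size_t pos,
                     unsigned long *calls, unsigned long limit)
{
    size_t run = 0, k;
    assert(s != NULL && calls != NULL);
    if (++*calls > limit)
        return -1;
    while (s[pos + run] == 'a')
        run++;
    /* Try the longest run of a's first, then back off one at a time. */
    for (k = run; k >= 1; k--) {
        int rc;
        if (s[pos + k] == 'b')
            return 1;                   /* the group is followed by b */
        rc = toy_match(s, pos + k, calls, limit);  /* repeat the group */
        if (rc != 0)
            return rc;                  /* match found, or limit exceeded */
    }
    return 0;
}
```

For this toy, a subject of n "a" characters with no "b" needs 2^n calls before failure can be reported, so each extra character doubles the work. A limit turns that into a prompt error return, which is the role <i>match_limit</i> plays for <b>pcre_exec()</b>.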
1320</P> 1321<P> 1322The default value for <i>match_limit_recursion</i> can be set when PCRE is 1323built; the default default is the same value as the default for 1324<i>match_limit</i>. You can override the default by supplying <b>pcre_exec()</b> 1325with a <b>pcre_extra</b> block in which <i>match_limit_recursion</i> is set, and 1326PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the <i>flags</i> field. If the limit 1327is exceeded, <b>pcre_exec()</b> returns PCRE_ERROR_RECURSIONLIMIT. 1328</P> 1329<P> 1330The <i>callout_data</i> field is used in conjunction with the "callout" feature, 1331and is described in the 1332<a href="pcrecallout.html"><b>pcrecallout</b></a> 1333documentation. 1334</P> 1335<P> 1336The <i>tables</i> field is used to pass a character tables pointer to 1337<b>pcre_exec()</b>; this overrides the value that is stored with the compiled 1338pattern. A non-NULL value is stored with the compiled pattern only if custom 1339tables were supplied to <b>pcre_compile()</b> via its <i>tableptr</i> argument. 1340If NULL is passed to <b>pcre_exec()</b> using this mechanism, it forces PCRE's 1341internal tables to be used. This facility is helpful when re-using patterns 1342that have been saved after compiling with an external set of tables, because 1343the external tables might be at a different address when <b>pcre_exec()</b> is 1344called. See the 1345<a href="pcreprecompile.html"><b>pcreprecompile</b></a> 1346documentation for a discussion of saving compiled patterns for later use. 1347</P> 1348<P> 1349If PCRE_EXTRA_MARK is set in the <i>flags</i> field, the <i>mark</i> field must 1350be set to point to a <b>char *</b> variable. If the pattern contains any 1351backtracking control verbs such as (*MARK:NAME), and the execution ends up with 1352a name to pass back, a pointer to the name string (zero terminated) is placed 1353in the variable pointed to by the <i>mark</i> field.
The names are within the
compiled pattern; if you wish to retain such a name you must copy it before
freeing the memory of a compiled pattern. If there is no name to pass back, the
variable pointed to by the <i>mark</i> field is set to NULL. For details of the
backtracking control verbs, see the section entitled
<a href="pcrepattern.html#backtrackcontrol">"Backtracking control"</a>
in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation.
<a name="execoptions"></a></P>
<br><b>
Option bits for <b>pcre_exec()</b>
</b><br>
<P>
The unused bits of the <i>options</i> argument for <b>pcre_exec()</b> must be
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_BSR_<i>xxx</i>,
PCRE_NEWLINE_<i>xxx</i>, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
PCRE_NOTEMPTY_ATSTART, PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK,
PCRE_PARTIAL_SOFT, and PCRE_PARTIAL_HARD.
<pre>
  PCRE_ANCHORED
</pre>
The PCRE_ANCHORED option limits <b>pcre_exec()</b> to matching at the first
matching position. If a pattern was compiled with PCRE_ANCHORED, or turned out
to be anchored by virtue of its contents, it cannot be made unanchored at
matching time.
<pre>
  PCRE_BSR_ANYCRLF
  PCRE_BSR_UNICODE
</pre>
These options (which are mutually exclusive) control what the \R escape
sequence matches. The choice is either to match only CR, LF, or CRLF, or to
match any Unicode newline sequence. These options override the choice that was
made or defaulted when the pattern was compiled.
<pre>
  PCRE_NEWLINE_CR
  PCRE_NEWLINE_LF
  PCRE_NEWLINE_CRLF
  PCRE_NEWLINE_ANYCRLF
  PCRE_NEWLINE_ANY
</pre>
These options override the newline definition that was chosen or defaulted when
the pattern was compiled. For details, see the description of
<b>pcre_compile()</b> above.
During matching, the newline choice affects the
behaviour of the dot, circumflex, and dollar metacharacters. It may also alter
the way the match position is advanced after a match failure for an unanchored
pattern.
</P>
<P>
When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a
match attempt for an unanchored pattern fails when the current position is at a
CRLF sequence, and the pattern contains no explicit matches for CR or LF
characters, the match position is advanced by two characters instead of one, in
other words, to after the CRLF.
</P>
<P>
The above rule is a compromise that makes the most common cases work as
expected. For example, if the pattern is .+A (and the PCRE_DOTALL option is not
set), it does not match the string "\r\nA" because, after failing at the
start, it skips both the CR and the LF before retrying. However, the pattern
[\r\n]A does match that string, because it contains an explicit CR or LF
reference, and so advances only by one character after the first failure.
</P>
<P>
An explicit match for CR or LF is either a literal appearance of one of those
characters, or one of the \r or \n escape sequences. Implicit matches such as
[^X] do not count, nor does \s (which includes CR and LF in the characters
that it matches).
</P>
<P>
Notwithstanding the above, anomalous effects may still occur when CRLF is a
valid newline sequence and explicit \r or \n escapes appear in the pattern.
<pre>
  PCRE_NOTBOL
</pre>
This option specifies that the first character of the subject string is not the
beginning of a line, so the circumflex metacharacter should not match before
it. Setting this without PCRE_MULTILINE (at compile time) causes circumflex
never to match. This option affects only the behaviour of the circumflex
metacharacter. It does not affect \A.
1433<pre> 1434 PCRE_NOTEOL 1435</pre> 1436This option specifies that the end of the subject string is not the end of a 1437line, so the dollar metacharacter should not match it nor (except in multiline 1438mode) a newline immediately before it. Setting this without PCRE_MULTILINE (at 1439compile time) causes dollar never to match. This option affects only the 1440behaviour of the dollar metacharacter. It does not affect \Z or \z. 1441<pre> 1442 PCRE_NOTEMPTY 1443</pre> 1444An empty string is not considered to be a valid match if this option is set. If 1445there are alternatives in the pattern, they are tried. If all the alternatives 1446match the empty string, the entire match fails. For example, if the pattern 1447<pre> 1448 a?b? 1449</pre> 1450is applied to a string not beginning with "a" or "b", it matches an empty 1451string at the start of the subject. With PCRE_NOTEMPTY set, this match is not 1452valid, so PCRE searches further into the string for occurrences of "a" or "b". 1453<pre> 1454 PCRE_NOTEMPTY_ATSTART 1455</pre> 1456This is like PCRE_NOTEMPTY, except that an empty string match that is not at 1457the start of the subject is permitted. If the pattern is anchored, such a match 1458can occur only if the pattern contains \K. 1459</P> 1460<P> 1461Perl has no direct equivalent of PCRE_NOTEMPTY or PCRE_NOTEMPTY_ATSTART, but it 1462does make a special case of a pattern match of the empty string within its 1463<b>split()</b> function, and when using the /g modifier. It is possible to 1464emulate Perl's behaviour after matching a null string by first trying the match 1465again at the same offset with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then 1466if that fails, by advancing the starting offset (see below) and trying an 1467ordinary match again. There is some code that demonstrates how to do this in 1468the 1469<a href="pcredemo.html"><b>pcredemo</b></a> 1470sample program. 
In the most general case, you have to check to see if the 1471newline convention recognizes CRLF as a newline, and if so, and the current 1472character is CR followed by LF, advance the starting offset by two characters 1473instead of one. 1474<pre> 1475 PCRE_NO_START_OPTIMIZE 1476</pre> 1477There are a number of optimizations that <b>pcre_exec()</b> uses at the start of 1478a match, in order to speed up the process. For example, if it is known that an 1479unanchored match must start with a specific character, it searches the subject 1480for that character, and fails immediately if it cannot find it, without 1481actually running the main matching function. This means that a special item 1482such as (*COMMIT) at the start of a pattern is not considered until after a 1483suitable starting point for the match has been found. When callouts or (*MARK) 1484items are in use, these "start-up" optimizations can cause them to be skipped 1485if the pattern is never actually used. The start-up optimizations are in effect 1486a pre-scan of the subject that takes place before the pattern is run. 1487</P> 1488<P> 1489The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations, possibly 1490causing performance to suffer, but ensuring that in cases where the result is 1491"no match", the callouts do occur, and that items such as (*COMMIT) and (*MARK) 1492are considered at every possible starting position in the subject string. If 1493PCRE_NO_START_OPTIMIZE is set at compile time, it cannot be unset at matching 1494time. 1495</P> 1496<P> 1497Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching operation. 1498Consider the pattern 1499<pre> 1500 (*COMMIT)ABC 1501</pre> 1502When this is compiled, PCRE records the fact that a match must start with the 1503character "A". Suppose the subject string is "DEFABC". The start-up 1504optimization scans along the subject, finds "A" and runs the first match 1505attempt from there. 
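</P>
<P>
The offset adjustment described above can be sketched as a small helper. This
is an illustration only: <b>next_start_offset()</b> and its parameters are
hypothetical names, not part of the PCRE API, and the caller is assumed to have
worked out already whether the newline convention in force recognizes CRLF. The
<i>utf8</i> branch is an addition that follows from the rule that a starting
offset must point to the start of a UTF-8 character.
<pre>
```c
/*
 * Return the offset at which to retry with an ordinary match after the
 * anchored, not-empty retry at `offset` has failed.
 * crlf_is_newline: non-zero if the newline convention recognizes CRLF.
 * utf8:            non-zero if the subject is being matched in UTF-8 mode.
 */
static int next_start_offset(const char *subject, int length, int offset,
                             int crlf_is_newline, int utf8)
{
    int next = offset + 1;

    if (crlf_is_newline && offset < length - 1 &&
        subject[offset] == '\r' && subject[offset + 1] == '\n') {
        next++;                      /* step over the LF as well as the CR */
    } else if (utf8) {
        /* Skip continuation bytes so as not to stop inside a character. */
        while (next < length && ((unsigned char)subject[next] & 0xC0) == 0x80)
            next++;
    }
    return next;
}
```
</pre>
A caller would normally stop looping, rather than call this helper, once
<i>offset</i> has reached the end of the subject.
</P>
<P>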
The (*COMMIT) item means that the pattern must match the 1506current starting position, which in this case, it does. However, if the same 1507match is run with PCRE_NO_START_OPTIMIZE set, the initial scan along the 1508subject string does not happen. The first match attempt is run starting from 1509"D" and when this fails, (*COMMIT) prevents any further matches being tried, so 1510the overall result is "no match". If the pattern is studied, more start-up 1511optimizations may be used. For example, a minimum length for the subject may be 1512recorded. Consider the pattern 1513<pre> 1514 (*MARK:A)(X|Y) 1515</pre> 1516The minimum length for a match is one character. If the subject is "ABC", there 1517will be attempts to match "ABC", "BC", "C", and then finally an empty string. 1518If the pattern is studied, the final attempt does not take place, because PCRE 1519knows that the subject is too short, and so the (*MARK) is never encountered. 1520In this case, studying the pattern does not affect the overall match result, 1521which is still "no match", but it does affect the auxiliary information that is 1522returned. 1523<pre> 1524 PCRE_NO_UTF8_CHECK 1525</pre> 1526When PCRE_UTF8 is set at compile time, the validity of the subject as a UTF-8 1527string is automatically checked when <b>pcre_exec()</b> is subsequently called. 1528The value of <i>startoffset</i> is also checked to ensure that it points to the 1529start of a UTF-8 character. There is a discussion about the validity of UTF-8 1530strings in the 1531<a href="pcre.html#utf8strings">section on UTF-8 support</a> 1532in the main 1533<a href="pcre.html"><b>pcre</b></a> 1534page. If an invalid UTF-8 sequence of bytes is found, <b>pcre_exec()</b> returns 1535the error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is 1536a truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORTUTF8. 
If 1537<i>startoffset</i> contains a value that does not point to the start of a UTF-8 1538character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is 1539returned. 1540</P> 1541<P> 1542If you already know that your subject is valid, and you want to skip these 1543checks for performance reasons, you can set the PCRE_NO_UTF8_CHECK option when 1544calling <b>pcre_exec()</b>. You might want to do this for the second and 1545subsequent calls to <b>pcre_exec()</b> if you are making repeated calls to find 1546all the matches in a single subject string. However, you should be sure that 1547the value of <i>startoffset</i> points to the start of a UTF-8 character (or the 1548end of the subject). When PCRE_NO_UTF8_CHECK is set, the effect of passing an 1549invalid UTF-8 string as a subject or an invalid value of <i>startoffset</i> is 1550undefined. Your program may crash. 1551<pre> 1552 PCRE_PARTIAL_HARD 1553 PCRE_PARTIAL_SOFT 1554</pre> 1555These options turn on the partial matching feature. For backwards 1556compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial match 1557occurs if the end of the subject string is reached successfully, but there are 1558not enough subject characters to complete the match. If this happens when 1559PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, matching continues by 1560testing any remaining alternatives. Only if no complete match can be found is 1561PCRE_ERROR_PARTIAL returned instead of PCRE_ERROR_NOMATCH. In other words, 1562PCRE_PARTIAL_SOFT says that the caller is prepared to handle a partial match, 1563but only if no complete match can be found. 1564</P> 1565<P> 1566If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this case, if a 1567partial match is found, <b>pcre_exec()</b> immediately returns 1568PCRE_ERROR_PARTIAL, without considering any other alternatives. 
In other words,
when PCRE_PARTIAL_HARD is set, a partial match is considered to be more
important than an alternative complete match.
</P>
<P>
In both cases, the portion of the string that was inspected when the partial
match was found is set as the first matching string. There is a more detailed
discussion of partial and multi-segment matching, with examples, in the
<a href="pcrepartial.html"><b>pcrepartial</b></a>
documentation.
</P>
<br><b>
The string to be matched by <b>pcre_exec()</b>
</b><br>
<P>
The subject string is passed to <b>pcre_exec()</b> as a pointer in
<i>subject</i>, a length (in bytes) in <i>length</i>, and a starting byte offset
in <i>startoffset</i>. If this is negative or greater than the length of the
subject, <b>pcre_exec()</b> returns PCRE_ERROR_BADOFFSET. When the starting
offset is zero, the search for a match starts at the beginning of the subject,
and this is by far the most common case. In UTF-8 mode, the byte offset must
point to the start of a UTF-8 character (or the end of the subject). Unlike the
pattern string, the subject may contain binary zero bytes.
</P>
<P>
A non-zero starting offset is useful when searching for another match in the
same subject by calling <b>pcre_exec()</b> again after a previous success.
Setting <i>startoffset</i> differs from just passing over a shortened string and
setting PCRE_NOTBOL in the case of a pattern that begins with any kind of
lookbehind. For example, consider the pattern
<pre>
  \Biss\B
</pre>
which finds occurrences of "iss" in the middle of words. (\B matches only if
the current position in the subject is not a word boundary.) When applied to
the string "Mississipi" the first call to <b>pcre_exec()</b> finds the first
occurrence.
If <b>pcre_exec()</b> is called again with just the remainder of the 1605subject, namely "issipi", it does not match, because \B is always false at the 1606start of the subject, which is deemed to be a word boundary. However, if 1607<b>pcre_exec()</b> is passed the entire string again, but with <i>startoffset</i> 1608set to 4, it finds the second occurrence of "iss" because it is able to look 1609behind the starting point to discover that it is preceded by a letter. 1610</P> 1611<P> 1612Finding all the matches in a subject is tricky when the pattern can match an 1613empty string. It is possible to emulate Perl's /g behaviour by first trying the 1614match again at the same offset, with the PCRE_NOTEMPTY_ATSTART and 1615PCRE_ANCHORED options, and then if that fails, advancing the starting offset 1616and trying an ordinary match again. There is some code that demonstrates how to 1617do this in the 1618<a href="pcredemo.html"><b>pcredemo</b></a> 1619sample program. In the most general case, you have to check to see if the 1620newline convention recognizes CRLF as a newline, and if so, and the current 1621character is CR followed by LF, advance the starting offset by two characters 1622instead of one. 1623</P> 1624<P> 1625If a non-zero starting offset is passed when the pattern is anchored, one 1626attempt to match at the given offset is made. This can only succeed if the 1627pattern does not require the match to be at the start of the subject. 1628</P> 1629<br><b> 1630How <b>pcre_exec()</b> returns captured substrings 1631</b><br> 1632<P> 1633In general, a pattern matches a certain portion of the subject, and in 1634addition, further substrings from the subject may be picked out by parts of the 1635pattern. Following the usage in Jeffrey Friedl's book, this is called 1636"capturing" in what follows, and the phrase "capturing subpattern" is used for 1637a fragment of a pattern that picks out a substring. 
PCRE supports several other 1638kinds of parenthesized subpattern that do not cause substrings to be captured. 1639</P> 1640<P> 1641Captured substrings are returned to the caller via a vector of integers whose 1642address is passed in <i>ovector</i>. The number of elements in the vector is 1643passed in <i>ovecsize</i>, which must be a non-negative number. <b>Note</b>: this 1644argument is NOT the size of <i>ovector</i> in bytes. 1645</P> 1646<P> 1647The first two-thirds of the vector is used to pass back captured substrings, 1648each substring using a pair of integers. The remaining third of the vector is 1649used as workspace by <b>pcre_exec()</b> while matching capturing subpatterns, 1650and is not available for passing back information. The number passed in 1651<i>ovecsize</i> should always be a multiple of three. If it is not, it is 1652rounded down. 1653</P> 1654<P> 1655When a match is successful, information about captured substrings is returned 1656in pairs of integers, starting at the beginning of <i>ovector</i>, and 1657continuing up to two-thirds of its length at the most. The first element of 1658each pair is set to the byte offset of the first character in a substring, and 1659the second is set to the byte offset of the first character after the end of a 1660substring. <b>Note</b>: these values are always byte offsets, even in UTF-8 1661mode. They are not character counts. 1662</P> 1663<P> 1664The first pair of integers, <i>ovector[0]</i> and <i>ovector[1]</i>, identify the 1665portion of the subject string matched by the entire pattern. The next pair is 1666used for the first capturing subpattern, and so on. The value returned by 1667<b>pcre_exec()</b> is one more than the highest numbered pair that has been set. 1668For example, if two substrings have been captured, the returned value is 3. If 1669there are no capturing subpatterns, the return value from a successful match is 16701, indicating that just the first pair of offsets has been set. 
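</P>
<P>
As an illustration of the layout just described, the following sketch reads
the offset pairs directly. The helper names <b>capture_length()</b> and
<b>print_captures()</b> are hypothetical, and <i>rc</i> is assumed to be a
positive value returned by <b>pcre_exec()</b>. A pair whose offsets are negative
(an unset subpattern, discussed below) is printed as an empty line.
<pre>
```c
#include <stdio.h>

/*
 * Length in bytes of the substring for pair n, or -1 if pair n lies
 * outside the pairs that the match has set (n < 0 or n >= rc).
 */
static int capture_length(const int *ovector, int rc, int n)
{
    if (n < 0 || n >= rc)
        return -1;
    return ovector[2 * n + 1] - ovector[2 * n];
}

/* Print every pair that was set, one per line; pair 0 is the whole match. */
static void print_captures(const char *subject, const int *ovector, int rc)
{
    int i;

    for (i = 0; i < rc; i++) {
        int len = capture_length(ovector, rc, i);
        printf("%d: ", i);
        if (ovector[2 * i] >= 0)     /* skip the bytes of an unset pair */
            fwrite(subject + ovector[2 * i], 1, (size_t)len, stdout);
        putchar('\n');
    }
}
```
</pre>
Using <b>fwrite()</b> with an explicit length, rather than a zero-terminated
copy, keeps the sketch correct for subjects that contain binary zero bytes.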
1671</P> 1672<P> 1673If a capturing subpattern is matched repeatedly, it is the last portion of the 1674string that it matched that is returned. 1675</P> 1676<P> 1677If the vector is too small to hold all the captured substring offsets, it is 1678used as far as possible (up to two-thirds of its length), and the function 1679returns a value of zero. If the substring offsets are not of interest, 1680<b>pcre_exec()</b> may be called with <i>ovector</i> passed as NULL and 1681<i>ovecsize</i> as zero. However, if the pattern contains back references and 1682the <i>ovector</i> is not big enough to remember the related substrings, PCRE 1683has to get additional memory for use during matching. Thus it is usually 1684advisable to supply an <i>ovector</i>. 1685</P> 1686<P> 1687The <b>pcre_fullinfo()</b> function can be used to find out how many capturing 1688subpatterns there are in a compiled pattern. The smallest size for 1689<i>ovector</i> that will allow for <i>n</i> captured substrings, in addition to 1690the offsets of the substring matched by the whole pattern, is (<i>n</i>+1)*3. 1691</P> 1692<P> 1693It is possible for capturing subpattern number <i>n+1</i> to match some part of 1694the subject when subpattern <i>n</i> has not been used at all. For example, if 1695the string "abc" is matched against the pattern (a|(z))(bc) the return from the 1696function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this 1697happens, both values in the offset pairs corresponding to unused subpatterns 1698are set to -1. 1699</P> 1700<P> 1701Offset values that correspond to unused subpatterns at the end of the 1702expression are also set to -1. For example, if the string "abc" is matched 1703against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. 
The
return from the function is 2, because the highest used capturing subpattern
number is 1, and the offsets for the second and third capturing subpatterns
(assuming the vector is large enough, of course) are set to -1.
</P>
<P>
<b>Note</b>: Elements of <i>ovector</i> that do not correspond to capturing
parentheses in the pattern are never changed. That is, if a pattern contains
<i>n</i> capturing parentheses, no more than <i>ovector[0]</i> to
<i>ovector[2n+1]</i> are set by <b>pcre_exec()</b>. The other elements retain
whatever values they previously had.
</P>
<P>
Some convenience functions are provided for extracting the captured substrings
as separate strings. These are described below.
<a name="errorlist"></a></P>
<br><b>
Error return values from <b>pcre_exec()</b>
</b><br>
<P>
If <b>pcre_exec()</b> fails, it returns a negative number. The following are
defined in the header file:
<pre>
  PCRE_ERROR_NOMATCH          (-1)
</pre>
The subject string did not match the pattern.
<pre>
  PCRE_ERROR_NULL             (-2)
</pre>
Either <i>code</i> or <i>subject</i> was passed as NULL, or <i>ovector</i> was
NULL and <i>ovecsize</i> was not zero.
<pre>
  PCRE_ERROR_BADOPTION        (-3)
</pre>
An unrecognized bit was set in the <i>options</i> argument.
<pre>
  PCRE_ERROR_BADMAGIC         (-4)
</pre>
PCRE stores a 4-byte "magic number" at the start of the compiled code, to catch
the case when it is passed a junk pointer and to detect when a pattern that was
compiled in an environment of one endianness is run in an environment with the
other endianness. This is the error that PCRE gives when the magic number is
not present.
<pre>
  PCRE_ERROR_UNKNOWN_OPCODE   (-5)
</pre>
While running the pattern match, an unknown item was encountered in the
compiled pattern.
This error could be caused by a bug in PCRE or by overwriting 1751of the compiled pattern. 1752<pre> 1753 PCRE_ERROR_NOMEMORY (-6) 1754</pre> 1755If a pattern contains back references, but the <i>ovector</i> that is passed to 1756<b>pcre_exec()</b> is not big enough to remember the referenced substrings, PCRE 1757gets a block of memory at the start of matching to use for this purpose. If the 1758call via <b>pcre_malloc()</b> fails, this error is given. The memory is 1759automatically freed at the end of matching. 1760</P> 1761<P> 1762This error is also given if <b>pcre_stack_malloc()</b> fails in 1763<b>pcre_exec()</b>. This can happen only when PCRE has been compiled with 1764<b>--disable-stack-for-recursion</b>. 1765<pre> 1766 PCRE_ERROR_NOSUBSTRING (-7) 1767</pre> 1768This error is used by the <b>pcre_copy_substring()</b>, 1769<b>pcre_get_substring()</b>, and <b>pcre_get_substring_list()</b> functions (see 1770below). It is never returned by <b>pcre_exec()</b>. 1771<pre> 1772 PCRE_ERROR_MATCHLIMIT (-8) 1773</pre> 1774The backtracking limit, as specified by the <i>match_limit</i> field in a 1775<b>pcre_extra</b> structure (or defaulted) was reached. See the description 1776above. 1777<pre> 1778 PCRE_ERROR_CALLOUT (-9) 1779</pre> 1780This error is never generated by <b>pcre_exec()</b> itself. It is provided for 1781use by callout functions that want to yield a distinctive error code. See the 1782<a href="pcrecallout.html"><b>pcrecallout</b></a> 1783documentation for details. 1784<pre> 1785 PCRE_ERROR_BADUTF8 (-10) 1786</pre> 1787A string that contains an invalid UTF-8 byte sequence was passed as a subject. 1788However, if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 1789character at the end of the subject, PCRE_ERROR_SHORTUTF8 is used instead. 
1790<pre> 1791 PCRE_ERROR_BADUTF8_OFFSET (-11) 1792</pre> 1793The UTF-8 byte sequence that was passed as a subject was valid, but the value 1794of <i>startoffset</i> did not point to the beginning of a UTF-8 character or the 1795end of the subject. 1796<pre> 1797 PCRE_ERROR_PARTIAL (-12) 1798</pre> 1799The subject string did not match, but it did match partially. See the 1800<a href="pcrepartial.html"><b>pcrepartial</b></a> 1801documentation for details of partial matching. 1802<pre> 1803 PCRE_ERROR_BADPARTIAL (-13) 1804</pre> 1805This code is no longer in use. It was formerly returned when the PCRE_PARTIAL 1806option was used with a compiled pattern containing items that were not 1807supported for partial matching. From release 8.00 onwards, there are no 1808restrictions on partial matching. 1809<pre> 1810 PCRE_ERROR_INTERNAL (-14) 1811</pre> 1812An unexpected internal error has occurred. This error could be caused by a bug 1813in PCRE or by overwriting of the compiled pattern. 1814<pre> 1815 PCRE_ERROR_BADCOUNT (-15) 1816</pre> 1817This error is given if the value of the <i>ovecsize</i> argument is negative. 1818<pre> 1819 PCRE_ERROR_RECURSIONLIMIT (-21) 1820</pre> 1821The internal recursion limit, as specified by the <i>match_limit_recursion</i> 1822field in a <b>pcre_extra</b> structure (or defaulted) was reached. See the 1823description above. 1824<pre> 1825 PCRE_ERROR_BADNEWLINE (-23) 1826</pre> 1827An invalid combination of PCRE_NEWLINE_<i>xxx</i> options was given. 1828<pre> 1829 PCRE_ERROR_BADOFFSET (-24) 1830</pre> 1831The value of <i>startoffset</i> was negative or greater than the length of the 1832subject, that is, the value in <i>length</i>. 1833<pre> 1834 PCRE_ERROR_SHORTUTF8 (-25) 1835</pre> 1836The subject string ended with an incomplete (truncated) UTF-8 character, and 1837the PCRE_PARTIAL_HARD option was set. Without this option, PCRE_ERROR_BADUTF8 1838is returned in this situation. 
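</P>
<P>
The return values listed above can be turned into readable text with a simple
switch. The sketch below is hypothetical (<b>exec_result_text()</b> is not a
PCRE function) and uses the numeric values from this list as literals so that
it is self-contained; a real program would include <b>pcre.h</b> and use the
PCRE_ERROR_<i>xxx</i> names instead.
<pre>
```c
/* Map a pcre_exec() return value to a short description. */
static const char *exec_result_text(int rc)
{
    if (rc > 0)
        return "match";
    if (rc == 0)
        return "match, but ovector too small for all substrings";

    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    case -24: return "PCRE_ERROR_BADOFFSET";
    case -25: return "PCRE_ERROR_SHORTUTF8";
    default:  return "unexpected return value";   /* e.g. -7 or -13 */
    }
}
```
</pre>
Of these, only PCRE_ERROR_NOMATCH and PCRE_ERROR_PARTIAL are ordinary outcomes
of matching; most programs treat the remainder as usage or resource errors.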
1839</P> 1840<P> 1841Error numbers -16 to -20 and -22 are not used by <b>pcre_exec()</b>. 1842</P> 1843<br><a name="SEC15" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br> 1844<P> 1845<b>int pcre_copy_substring(const char *<i>subject</i>, int *<i>ovector</i>,</b> 1846<b>int <i>stringcount</i>, int <i>stringnumber</i>, char *<i>buffer</i>,</b> 1847<b>int <i>buffersize</i>);</b> 1848</P> 1849<P> 1850<b>int pcre_get_substring(const char *<i>subject</i>, int *<i>ovector</i>,</b> 1851<b>int <i>stringcount</i>, int <i>stringnumber</i>,</b> 1852<b>const char **<i>stringptr</i>);</b> 1853</P> 1854<P> 1855<b>int pcre_get_substring_list(const char *<i>subject</i>,</b> 1856<b>int *<i>ovector</i>, int <i>stringcount</i>, const char ***<i>listptr</i>);</b> 1857</P> 1858<P> 1859Captured substrings can be accessed directly by using the offsets returned by 1860<b>pcre_exec()</b> in <i>ovector</i>. For convenience, the functions 1861<b>pcre_copy_substring()</b>, <b>pcre_get_substring()</b>, and 1862<b>pcre_get_substring_list()</b> are provided for extracting captured substrings 1863as new, separate, zero-terminated strings. These functions identify substrings 1864by number. The next section describes functions for extracting named 1865substrings. 1866</P> 1867<P> 1868A substring that contains a binary zero is correctly extracted and has a 1869further zero added on the end, but the result is not, of course, a C string. 1870However, you can process such a string by referring to the length that is 1871returned by <b>pcre_copy_substring()</b> and <b>pcre_get_substring()</b>. 1872Unfortunately, the interface to <b>pcre_get_substring_list()</b> is not adequate 1873for handling strings containing binary zeros, because the end of the final 1874string is not independently indicated. 
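</P>
<P>
The point about binary zeros can be made concrete with a small sketch.
<b>format_bytes()</b> is a hypothetical helper, not part of PCRE; it renders
exactly <i>len</i> bytes, where <i>len</i> is assumed to be the length returned
by <b>pcre_copy_substring()</b> or <b>pcre_get_substring()</b>, and so does not
stop at the first zero byte as <b>strlen()</b> would.
<pre>
```c
#include <ctype.h>
#include <stdio.h>

/*
 * Render len bytes of s into out, writing non-printable bytes as \xHH.
 * Output is truncated, at a whole-byte boundary, if out is too small.
 */
static void format_bytes(const char *s, int len, char *out, size_t outsize)
{
    size_t used = 0;
    int i;

    if (outsize == 0)
        return;
    out[0] = '\0';

    /* Each byte needs at most 4 characters plus the terminating zero. */
    for (i = 0; i < len && used + 5 <= outsize; i++) {
        unsigned char c = (unsigned char)s[i];
        if (isprint(c))
            used += (size_t)snprintf(out + used, outsize - used, "%c", c);
        else
            used += (size_t)snprintf(out + used, outsize - used, "\\x%02X", c);
    }
}
```
</pre>
For the five bytes "ab", zero, "cd", <b>strlen()</b> reports 2, whereas this
helper renders all five bytes when given the returned length.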
1875</P> 1876<P> 1877The first three arguments are the same for all three of these functions: 1878<i>subject</i> is the subject string that has just been successfully matched, 1879<i>ovector</i> is a pointer to the vector of integer offsets that was passed to 1880<b>pcre_exec()</b>, and <i>stringcount</i> is the number of substrings that were 1881captured by the match, including the substring that matched the entire regular 1882expression. This is the value returned by <b>pcre_exec()</b> if it is greater 1883than zero. If <b>pcre_exec()</b> returned zero, indicating that it ran out of 1884space in <i>ovector</i>, the value passed as <i>stringcount</i> should be the 1885number of elements in the vector divided by three. 1886</P> 1887<P> 1888The functions <b>pcre_copy_substring()</b> and <b>pcre_get_substring()</b> 1889extract a single substring, whose number is given as <i>stringnumber</i>. A 1890value of zero extracts the substring that matched the entire pattern, whereas 1891higher values extract the captured substrings. For <b>pcre_copy_substring()</b>, 1892the string is placed in <i>buffer</i>, whose length is given by 1893<i>buffersize</i>, while for <b>pcre_get_substring()</b> a new block of memory is 1894obtained via <b>pcre_malloc</b>, and its address is returned via 1895<i>stringptr</i>. The yield of the function is the length of the string, not 1896including the terminating zero, or one of these error codes: 1897<pre> 1898 PCRE_ERROR_NOMEMORY (-6) 1899</pre> 1900The buffer was too small for <b>pcre_copy_substring()</b>, or the attempt to get 1901memory failed for <b>pcre_get_substring()</b>. 1902<pre> 1903 PCRE_ERROR_NOSUBSTRING (-7) 1904</pre> 1905There is no substring whose number is <i>stringnumber</i>. 1906</P> 1907<P> 1908The <b>pcre_get_substring_list()</b> function extracts all available substrings 1909and builds a list of pointers to them. All this is done in a single block of 1910memory that is obtained via <b>pcre_malloc</b>. 
The address of the memory block 1911is returned via <i>listptr</i>, which is also the start of the list of string 1912pointers. The end of the list is marked by a NULL pointer. The yield of the 1913function is zero if all went well, or the error code 1914<pre> 1915 PCRE_ERROR_NOMEMORY (-6) 1916</pre> 1917if the attempt to get the memory block failed. 1918</P> 1919<P> 1920When any of these functions encounter a substring that is unset, which can 1921happen when capturing subpattern number <i>n+1</i> matches some part of the 1922subject, but subpattern <i>n</i> has not been used at all, they return an empty 1923string. This can be distinguished from a genuine zero-length substring by 1924inspecting the appropriate offset in <i>ovector</i>, which is negative for unset 1925substrings. 1926</P> 1927<P> 1928The two convenience functions <b>pcre_free_substring()</b> and 1929<b>pcre_free_substring_list()</b> can be used to free the memory returned by 1930a previous call of <b>pcre_get_substring()</b> or 1931<b>pcre_get_substring_list()</b>, respectively. They do nothing more than call 1932the function pointed to by <b>pcre_free</b>, which of course could be called 1933directly from a C program. However, PCRE is used in some situations where it is 1934linked via a special interface to another programming language that cannot use 1935<b>pcre_free</b> directly; it is for these cases that the functions are 1936provided. 
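</P>
<P>
The behaviour described in this section for extraction by number can be
summarized in a few lines of C. The sketch below is an illustration, not
PCRE's source: <b>sketch_copy_substring()</b>, <b>substring_is_unset()</b>, and
the SKETCH_ERROR_<i>xxx</i> macros are hypothetical names, with the error
values copied from the lists above.
<pre>
```c
#include <string.h>

#define SKETCH_ERROR_NOMEMORY    (-6)   /* buffer too small */
#define SKETCH_ERROR_NOSUBSTRING (-7)   /* no substring with that number */

/* Copy substring `stringnumber` into buffer; return its length or an error. */
static int sketch_copy_substring(const char *subject, const int *ovector,
                                 int stringcount, int stringnumber,
                                 char *buffer, int buffersize)
{
    int start, len;

    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;

    start = ovector[2 * stringnumber];
    len = ovector[2 * stringnumber + 1] - start;  /* 0 for an unset pair */
    if (buffersize < len + 1)                     /* room for the final zero */
        return SKETCH_ERROR_NOMEMORY;

    if (len > 0)
        memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}

/* Distinguish an unset substring from a genuine zero-length one. */
static int substring_is_unset(const int *ovector, int stringnumber)
{
    return ovector[2 * stringnumber] < 0;
}
```
</pre>
With the earlier example of "abc" matched against (a|(z))(bc), substring 2 is
copied as an empty string, and only the negative offset shows that it was
unset.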
</P>
<br><a name="SEC16" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
<P>
<b>int pcre_get_stringnumber(const pcre *<i>code</i>,</b>
<b>const char *<i>name</i>);</b>
</P>
<P>
<b>int pcre_copy_named_substring(const pcre *<i>code</i>,</b>
<b>const char *<i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, const char *<i>stringname</i>,</b>
<b>char *<i>buffer</i>, int <i>buffersize</i>);</b>
</P>
<P>
<b>int pcre_get_named_substring(const pcre *<i>code</i>,</b>
<b>const char *<i>subject</i>, int *<i>ovector</i>,</b>
<b>int <i>stringcount</i>, const char *<i>stringname</i>,</b>
<b>const char **<i>stringptr</i>);</b>
</P>
<P>
To extract a substring by name, you first have to find the associated number.
For example, for this pattern
<pre>
  (a+)b(?<xxx>\d+)...
</pre>
the number of the subpattern called "xxx" is 2. If the name is known to be
unique (PCRE_DUPNAMES was not set), you can find the number from the name by
calling <b>pcre_get_stringnumber()</b>. The first argument is the compiled
pattern, and the second is the name. The yield of the function is the
subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no subpattern of
that name.
</P>
<P>
Given the number, you can extract the substring directly, or use one of the
functions described in the previous section. For convenience, there are also
two functions that do the whole job.
</P>
<P>
Most of the arguments of <b>pcre_copy_named_substring()</b> and
<b>pcre_get_named_substring()</b> are the same as those for the similarly named
functions that extract by number. As these are described in the previous
section, they are not re-described here. There are just two differences:
</P>
<P>
First, instead of a substring number, a substring name is given.
Second, there 1981is an extra argument, given at the start, which is a pointer to the compiled 1982pattern. This is needed in order to gain access to the name-to-number 1983translation table. 1984</P> 1985<P> 1986These functions call <b>pcre_get_stringnumber()</b>, and if it succeeds, they 1987then call <b>pcre_copy_substring()</b> or <b>pcre_get_substring()</b>, as 1988appropriate. <b>NOTE:</b> If PCRE_DUPNAMES is set and there are duplicate names, 1989the behaviour may not be what you want (see the next section). 1990</P> 1991<P> 1992<b>Warning:</b> If the pattern uses the (?| feature to set up multiple 1993subpatterns with the same number, as described in the 1994<a href="pcrepattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a> 1995in the 1996<a href="pcrepattern.html"><b>pcrepattern</b></a> 1997page, you cannot use names to distinguish the different subpatterns, because 1998names are not included in the compiled code. The matching process uses only 1999numbers. For this reason, the use of different names for subpatterns of the 2000same number causes an error at compile time. 2001</P> 2002<br><a name="SEC17" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br> 2003<P> 2004<b>int pcre_get_stringtable_entries(const pcre *<i>code</i>,</b> 2005<b>const char *<i>name</i>, char **<i>first</i>, char **<i>last</i>);</b> 2006</P> 2007<P> 2008When a pattern is compiled with the PCRE_DUPNAMES option, names for subpatterns 2009are not required to be unique. (Duplicate names are always allowed for 2010subpatterns with the same number, created by using the (?| feature. Indeed, if 2011such subpatterns are named, they are required to use the same names.) 2012</P> 2013<P> 2014Normally, patterns with duplicate names are such that in any one match, only 2015one of the named subpatterns participates. An example is shown in the 2016<a href="pcrepattern.html"><b>pcrepattern</b></a> 2017documentation. 
</P>
<P>
When duplicates are present, <b>pcre_copy_named_substring()</b> and
<b>pcre_get_named_substring()</b> return the first substring corresponding to
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING (-7) is
returned; no data is returned. The <b>pcre_get_stringnumber()</b> function
returns one of the numbers that are associated with the name, but it is not
defined which it is.
</P>
<P>
If you want to get full details of all captured substrings for a given name,
you must use the <b>pcre_get_stringtable_entries()</b> function. The first
argument is the compiled pattern, and the second is the name. The third and
fourth are pointers to variables which are updated by the function. After it
has run, they point to the first and last entries in the name-to-number table
for the given name. The function itself returns the length of each entry, or
PCRE_ERROR_NOSUBSTRING (-7) if there are none. The format of the table is
described above in the section entitled <i>Information about a pattern</i>.
Given all the relevant entries for the name, you can extract each of their
numbers, and hence the captured data, if any.
</P>
<br><a name="SEC18" href="#TOC1">FINDING ALL POSSIBLE MATCHES</a><br>
<P>
The traditional matching function uses a similar algorithm to Perl, which stops
when it finds the first match, starting at a given point in the subject. If you
want to find all possible matches, or the longest possible match, consider
using the alternative matching function (see below) instead. If you cannot use
the alternative function, but still need to find all possible matches, you
can kludge it up by making use of the callout facility, which is described in
the
<a href="pcrecallout.html"><b>pcrecallout</b></a>
documentation.
</P>
<P>
What you have to do is to insert a callout right at the end of the pattern.
When your callout function is called, extract and save the current matched
substring. Then return 1, which forces <b>pcre_exec()</b> to backtrack and try
other alternatives. Ultimately, when it runs out of matches, <b>pcre_exec()</b>
will yield PCRE_ERROR_NOMATCH.
<a name="dfamatch"></a></P>
<br><a name="SEC19" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
<P>
<b>int pcre_dfa_exec(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
<b>const char *<i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
<b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>,</b>
<b>int *<i>workspace</i>, int <i>wscount</i>);</b>
</P>
<P>
The function <b>pcre_dfa_exec()</b> is called to match a subject string against
a compiled pattern, using a matching algorithm that scans the subject string
just once, and does not backtrack. This has different characteristics to the
normal algorithm, and is not compatible with Perl. Some of the features of PCRE
patterns are not supported. Nevertheless, there are times when this kind of
matching can be useful. For a discussion of the two matching algorithms, and a
list of features that <b>pcre_dfa_exec()</b> does not support, see the
<a href="pcrematching.html"><b>pcrematching</b></a>
documentation.
</P>
<P>
The arguments for the <b>pcre_dfa_exec()</b> function are the same as for
<b>pcre_exec()</b>, plus two extras. The <i>ovector</i> argument is used in a
different way, and this is described below. The other common arguments are used
in the same way as for <b>pcre_exec()</b>, so their description is not repeated
here.
</P>
<P>
The two additional arguments provide workspace for the function. The workspace
vector should contain at least 20 elements. It is used for keeping track of
multiple paths through the pattern tree.
More workspace will be needed for
patterns and subjects where there are a lot of potential matches.
</P>
<P>
Here is an example of a simple call to <b>pcre_dfa_exec()</b>:
<pre>
  int rc;
  int ovector[10];
  int wspace[20];
  rc = pcre_dfa_exec(
    re,             /* result of pcre_compile() */
    NULL,           /* we didn't study the pattern */
    "some string",  /* the subject string */
    11,             /* the length of the subject string */
    0,              /* start at offset 0 in the subject */
    0,              /* default options */
    ovector,        /* vector of integers for substring information */
    10,             /* number of elements (NOT size in bytes) */
    wspace,         /* working space vector */
    20);            /* number of elements (NOT size in bytes) */
</PRE>
</P>
<br><b>
Option bits for <b>pcre_dfa_exec()</b>
</b><br>
<P>
The unused bits of the <i>options</i> argument for <b>pcre_dfa_exec()</b> must be
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_<i>xxx</i>,
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF, PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE,
PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART.
All but the last four of these are exactly the same as for <b>pcre_exec()</b>,
so their description is not repeated here.
<pre>
  PCRE_PARTIAL_HARD
  PCRE_PARTIAL_SOFT
</pre>
These have the same general effect as they do for <b>pcre_exec()</b>, but the
details are slightly different. When PCRE_PARTIAL_HARD is set for
<b>pcre_dfa_exec()</b>, it returns PCRE_ERROR_PARTIAL if the end of the subject
is reached and there is still at least one matching possibility that requires
additional characters. This happens even if some complete matches have also
been found.
When PCRE_PARTIAL_SOFT is set, the return code PCRE_ERROR_NOMATCH
is converted into PCRE_ERROR_PARTIAL if the end of the subject is reached,
there have been no complete matches, but there is still at least one matching
possibility. The portion of the string that was inspected when the longest
partial match was found is set as the first matching string in both cases.
There is a more detailed discussion of partial and multi-segment matching, with
examples, in the
<a href="pcrepartial.html"><b>pcrepartial</b></a>
documentation.
<pre>
  PCRE_DFA_SHORTEST
</pre>
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to stop as
soon as it has found one match. Because of the way the alternative algorithm
works, this is necessarily the shortest possible match at the first possible
matching point in the subject string.
<pre>
  PCRE_DFA_RESTART
</pre>
When <b>pcre_dfa_exec()</b> returns a partial match, it is possible to call it
again, with additional subject characters, and have it continue with the same
match. The PCRE_DFA_RESTART option requests this action; when it is set, the
<i>workspace</i> and <i>wscount</i> arguments must reference the same vector as
before because data about the match so far is left in them after a partial
match. There is more discussion of this facility in the
<a href="pcrepartial.html"><b>pcrepartial</b></a>
documentation.
</P>
<br><b>
Successful returns from <b>pcre_dfa_exec()</b>
</b><br>
<P>
When <b>pcre_dfa_exec()</b> succeeds, it may have matched more than one
substring in the subject. Note, however, that all the matches from one run of
the function start at the same point in the subject. The shorter matches are
all initial substrings of the longer matches.
For example, if the pattern
<pre>
  <.*>
</pre>
is matched against the string
<pre>
  This is <something> <something else> <something further> no more
</pre>
the three matched strings are
<pre>
  <something>
  <something> <something else>
  <something> <something else> <something further>
</pre>
On success, the yield of the function is a number greater than zero, which is
the number of matched substrings. The substrings themselves are returned in
<i>ovector</i>. Each string uses two elements; the first is the offset to the
start, and the second is the offset to the end. In fact, all the strings have
the same start offset. (Space could have been saved by giving this only once,
but it was decided to retain some compatibility with the way <b>pcre_exec()</b>
returns data, even though the meaning of the strings is different.)
</P>
<P>
The strings are returned in reverse order of length; that is, the longest
matching string is given first. If there were too many matches to fit into
<i>ovector</i>, the yield of the function is zero, and the vector is filled with
the longest matches.
</P>
<br><b>
Error returns from <b>pcre_dfa_exec()</b>
</b><br>
<P>
The <b>pcre_dfa_exec()</b> function returns a negative number when it fails.
Many of the errors are the same as for <b>pcre_exec()</b>, and these are
described
<a href="#errorlist">above.</a>
There are in addition the following errors that are specific to
<b>pcre_dfa_exec()</b>:
<pre>
  PCRE_ERROR_DFA_UITEM      (-16)
</pre>
This return is given if <b>pcre_dfa_exec()</b> encounters an item in the pattern
that it does not support, for instance, the use of \C or a back reference.
<pre>
  PCRE_ERROR_DFA_UCOND      (-17)
</pre>
This return is given if <b>pcre_dfa_exec()</b> encounters a condition item that
uses a back reference for the condition, or a test for recursion in a specific
group. These are not supported.
<pre>
  PCRE_ERROR_DFA_UMLIMIT    (-18)
</pre>
This return is given if <b>pcre_dfa_exec()</b> is called with an <i>extra</i>
block that contains a setting of the <i>match_limit</i> field. This is not
supported (it is meaningless).
<pre>
  PCRE_ERROR_DFA_WSSIZE     (-19)
</pre>
This return is given if <b>pcre_dfa_exec()</b> runs out of space in the
<i>workspace</i> vector.
<pre>
  PCRE_ERROR_DFA_RECURSE    (-20)
</pre>
When a recursive subpattern is processed, the matching function calls itself
recursively, using private vectors for <i>ovector</i> and <i>workspace</i>. This
error is given if the output vector is not large enough. This should be
extremely rare, as a vector of size 1000 is used.
</P>
<br><a name="SEC20" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrebuild</b>(3), <b>pcrecallout</b>(3), <b>pcrecpp</b>(3),
<b>pcrematching</b>(3), <b>pcrepartial</b>(3), <b>pcreposix</b>(3),
<b>pcreprecompile</b>(3), <b>pcresample</b>(3), <b>pcrestack</b>(3).
</P>
<br><a name="SEC21" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC22" href="#TOC1">REVISION</a><br>
<P>
Last updated: 21 November 2010
<br>
Copyright © 1997-2010 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>