<html>
<head>
<title>pcre2api specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2api man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 NATIVE API BASIC FUNCTIONS</a>
<li><a name="TOC2" href="#SEC2">PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS</a>
<li><a name="TOC3" href="#SEC3">PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS</a>
<li><a name="TOC4" href="#SEC4">PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS</a>
<li><a name="TOC5" href="#SEC5">PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS</a>
<li><a name="TOC6" href="#SEC6">PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS</a>
<li><a name="TOC7" href="#SEC7">PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION</a>
<li><a name="TOC8" href="#SEC8">PCRE2 NATIVE API JIT FUNCTIONS</a>
<li><a name="TOC9" href="#SEC9">PCRE2 NATIVE API SERIALIZATION FUNCTIONS</a>
<li><a name="TOC10" href="#SEC10">PCRE2 NATIVE API AUXILIARY FUNCTIONS</a>
<li><a name="TOC11" href="#SEC11">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a>
<li><a name="TOC12" href="#SEC12">PCRE2 API OVERVIEW</a>
<li><a name="TOC13" href="#SEC13">STRING LENGTHS AND OFFSETS</a>
<li><a name="TOC14" href="#SEC14">NEWLINES</a>
<li><a name="TOC15" href="#SEC15">MULTITHREADING</a>
<li><a name="TOC16" href="#SEC16">PCRE2 CONTEXTS</a>
<li><a name="TOC17" href="#SEC17">CHECKING BUILD-TIME OPTIONS</a>
<li><a name="TOC18" href="#SEC18">COMPILING A PATTERN</a>
<li><a name="TOC19" href="#SEC19">COMPILATION ERROR CODES</a>
<li><a name="TOC20" href="#SEC20">JUST-IN-TIME (JIT) COMPILATION</a>
<li><a name="TOC21" href="#SEC21">LOCALE SUPPORT</a>
<li><a name="TOC22" href="#SEC22">INFORMATION
ABOUT A COMPILED PATTERN</a>
<li><a name="TOC23" href="#SEC23">INFORMATION ABOUT A PATTERN'S CALLOUTS</a>
<li><a name="TOC24" href="#SEC24">SERIALIZATION AND PRECOMPILING</a>
<li><a name="TOC25" href="#SEC25">THE MATCH DATA BLOCK</a>
<li><a name="TOC26" href="#SEC26">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
<li><a name="TOC27" href="#SEC27">NEWLINE HANDLING WHEN MATCHING</a>
<li><a name="TOC28" href="#SEC28">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a>
<li><a name="TOC29" href="#SEC29">OTHER INFORMATION ABOUT A MATCH</a>
<li><a name="TOC30" href="#SEC30">ERROR RETURNS FROM <b>pcre2_match()</b></a>
<li><a name="TOC31" href="#SEC31">OBTAINING A TEXTUAL ERROR MESSAGE</a>
<li><a name="TOC32" href="#SEC32">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
<li><a name="TOC33" href="#SEC33">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a>
<li><a name="TOC34" href="#SEC34">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
<li><a name="TOC35" href="#SEC35">CREATING A NEW STRING WITH SUBSTITUTIONS</a>
<li><a name="TOC36" href="#SEC36">DUPLICATE SUBPATTERN NAMES</a>
<li><a name="TOC37" href="#SEC37">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a>
<li><a name="TOC38" href="#SEC38">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
<li><a name="TOC39" href="#SEC39">SEE ALSO</a>
<li><a name="TOC40" href="#SEC40">AUTHOR</a>
<li><a name="TOC41" href="#SEC41">REVISION</a>
</ul>
<P>
<b>#include &lt;pcre2.h&gt;</b>
<br>
<br>
PCRE2 is a new API for PCRE. This document contains a description of all its
functions. See the
<a href="pcre2.html"><b>pcre2</b></a>
document for an overview of all the PCRE2 documentation.
</P>
<br><a name="SEC1" href="#TOC1">PCRE2 NATIVE API BASIC FUNCTIONS</a><br>
<P>
<b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
<b> uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset</i>,</b>
<b> pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b>
<br>
<br>
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
<b> const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>,</b>
<b> int *<i>workspace</i>, PCRE2_SIZE <i>wscount</i>);</b>
<br>
<br>
<b>void pcre2_match_data_free(pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><a name="SEC2" href="#TOC1">PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS</a><br>
<P>
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><a name="SEC3" href="#TOC1">PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS</a><br>
<P>
<b>pcre2_general_context *pcre2_general_context_create(</b>
<b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
<br>
<br>
<b>pcre2_general_context *pcre2_general_context_copy(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><a name="SEC4" href="#TOC1">PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS</a><br>
<P>
<b>pcre2_compile_context *pcre2_compile_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_compile_context *pcre2_compile_context_copy(</b>
<b> pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>void pcre2_compile_context_free(pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
<b> const unsigned char *<i>tables</i>);</b>
<br>
<br>
<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
<b> int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
</P>
<br><a name="SEC5" href="#TOC1">PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS</a><br>
<P>
<b>pcre2_match_context *pcre2_match_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_match_context
*pcre2_match_context_copy(</b>
<b> pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>void pcre2_match_context_free(pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b> int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b>
<b> void *<i>callout_data</i>);</b>
<br>
<br>
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_recursion_memory_management(</b>
<b> pcre2_match_context *<i>mcontext</i>,</b>
<b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
</P>
<br><a name="SEC6" href="#TOC1">PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS</a><br>
<P>
<b>int pcre2_substring_copy_byname(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR *<i>buffer</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_copy_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b> PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
<br>
<br>
<b>int pcre2_substring_get_byname(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_get_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_UCHAR **<i>bufferptr</i>,</b>
<b> PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int
pcre2_substring_length_byname(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_SIZE *<i>length</i>);</b>
<br>
<br>
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
<br>
<br>
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
<br>
<br>
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>);</b>
<br>
<br>
<b>void pcre2_substring_list_free(PCRE2_SPTR *<i>list</i>);</b>
<br>
<br>
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
</P>
<br><a name="SEC7" href="#TOC1">PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION</a><br>
<P>
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>, PCRE2_SPTR <i>replacement</i>,</b>
<b> PCRE2_SIZE <i>rlength</i>, PCRE2_UCHAR *<i>outputbuffer</i>,</b>
<b> PCRE2_SIZE *<i>outlengthptr</i>);</b>
</P>
<br><a name="SEC8" href="#TOC1">PCRE2 NATIVE API JIT FUNCTIONS</a><br>
<P>
<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
<br>
<br>
<b>int pcre2_jit_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_jit_stack
*pcre2_jit_stack_create(PCRE2_SIZE <i>startsize</i>,</b>
<b> PCRE2_SIZE <i>maxsize</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_jit_stack_assign(pcre2_match_context *<i>mcontext</i>,</b>
<b> pcre2_jit_callback <i>callback_function</i>, void *<i>callback_data</i>);</b>
<br>
<br>
<b>void pcre2_jit_stack_free(pcre2_jit_stack *<i>jit_stack</i>);</b>
</P>
<br><a name="SEC9" href="#TOC1">PCRE2 NATIVE API SERIALIZATION FUNCTIONS</a><br>
<P>
<b>int32_t pcre2_serialize_decode(pcre2_code **<i>codes</i>,</b>
<b> int32_t <i>number_of_codes</i>, const uint8_t *<i>bytes</i>,</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>int32_t pcre2_serialize_encode(const pcre2_code **<i>codes</i>,</b>
<b> int32_t <i>number_of_codes</i>, uint8_t **<i>serialized_bytes</i>,</b>
<b> PCRE2_SIZE *<i>serialized_size</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_serialize_free(uint8_t *<i>bytes</i>);</b>
<br>
<br>
<b>int32_t pcre2_serialize_get_number_of_codes(const uint8_t *<i>bytes</i>);</b>
</P>
<br><a name="SEC10" href="#TOC1">PCRE2 NATIVE API AUXILIARY FUNCTIONS</a><br>
<P>
<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
<br>
<br>
<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b> PCRE2_SIZE <i>bufflen</i>);</b>
<br>
<br>
<b>const unsigned char *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, void *<i>where</i>);</b>
<br>
<br>
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
<b> int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
<b> void *<i>user_data</i>);</b>
<br>
<br>
<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b>
</P>
<br><a name="SEC11" href="#TOC1">PCRE2 8-BIT,
16-BIT, AND 32-BIT LIBRARIES</a><br>
<P>
There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit code
units, respectively. However, there is just one header file, <b>pcre2.h</b>.
This contains the function prototypes and other definitions for all three
libraries. One, two, or all three can be installed simultaneously. On Unix-like
systems the libraries are called <b>libpcre2-8</b>, <b>libpcre2-16</b>, and
<b>libpcre2-32</b>, and they can also co-exist with the original PCRE libraries.
</P>
<P>
Character strings are passed to and from a PCRE2 library as a sequence of
unsigned integers in code units of the appropriate width. Every PCRE2 function
comes in three different forms, one for each library, for example:
<pre>
  <b>pcre2_compile_8()</b>
  <b>pcre2_compile_16()</b>
  <b>pcre2_compile_32()</b>
</pre>
There are also three different sets of data types:
<pre>
  <b>PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32</b>
  <b>PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32</b>
</pre>
The UCHAR types define unsigned code units of the appropriate widths. For
example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR types are
constant pointers to the equivalent UCHAR types, that is, they are pointers to
vectors of unsigned code units.
</P>
<P>
Many applications use only one code unit width. For their convenience, macros
are defined whose names are the generic forms such as <b>pcre2_compile()</b> and
PCRE2_SPTR. These macros use the value of the macro PCRE2_CODE_UNIT_WIDTH to
generate the appropriate width-specific function and macro names.
PCRE2_CODE_UNIT_WIDTH is not defined by default. An application must define it
to be 8, 16, or 32 before including <b>pcre2.h</b> in order to make use of the
generic names.
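A minimal sketch of the single-width case is shown here. It assumes that the
8-bit library is installed and linked; <b>compiles_ok()</b> is a hypothetical
helper name chosen for illustration, not part of PCRE2.
<pre>
  #define PCRE2_CODE_UNIT_WIDTH 8   /* must come before the include */
  #include &lt;pcre2.h&gt;

  /* Hypothetical helper: returns 1 if the pattern compiles, 0 otherwise.
     With the width set to 8, the generic names used below stand for
     pcre2_compile_8(), pcre2_code_free_8(), PCRE2_SPTR8, and so on. */
  int compiles_ok(const char *pattern)
  {
    int errorcode;
    PCRE2_SIZE erroroffset;
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)pattern,
      PCRE2_ZERO_TERMINATED, 0, &amp;errorcode, &amp;erroroffset, NULL);
    if (re == NULL) return 0;
    pcre2_code_free(re);
    return 1;
  }
</pre>
With these definitions, a call such as compiles_ok("a+b") is expected to
succeed, whereas compiles_ok("(") fails because the parenthesis is unmatched.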
</P>
<P>
Applications that use more than one code unit width can be linked with more
than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to be 0 before
including <b>pcre2.h</b>, and then use the real function names. Any code that is
to be included in an environment where the value of PCRE2_CODE_UNIT_WIDTH is
unknown should also use the real function names. (Unfortunately, it is not
possible in C code to save and restore the value of a macro.)
</P>
<P>
If PCRE2_CODE_UNIT_WIDTH is not defined before including <b>pcre2.h</b>, a
compiler error occurs.
</P>
<P>
When using multiple libraries in an application, you must take care when
processing any particular pattern to use only functions from a single library.
For example, if you want to run a match using a pattern that was compiled with
<b>pcre2_compile_16()</b>, you must do so with <b>pcre2_match_16()</b>, not
<b>pcre2_match_8()</b>.
</P>
<P>
In the function summaries above, and in the rest of this document and other
PCRE2 documents, functions and data types are described using their generic
names, without the 8, 16, or 32 suffix.
</P>
<br><a name="SEC12" href="#TOC1">PCRE2 API OVERVIEW</a><br>
<P>
PCRE2 has its own native API, which is described in this document. There are
also some wrapper functions for the 8-bit library that correspond to the
POSIX regular expression API, but they do not give access to all the
functionality. They are described in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
documentation. Both these APIs define a set of C function calls.
</P>
<P>
The native API C data types, function prototypes, option values, and error
codes are defined in the header file <b>pcre2.h</b>, which contains definitions
of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release numbers for the
library. Applications can use these to include support for different releases
of PCRE2.
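One possible way of testing the release numbers is sketched here.
PCRE2_AT_LEAST and <b>header_at_least()</b> are hypothetical names introduced
for this illustration; only PCRE2_MAJOR and PCRE2_MINOR come from <b>pcre2.h</b>.
<pre>
  #define PCRE2_CODE_UNIT_WIDTH 8
  #include &lt;pcre2.h&gt;

  /* Hypothetical macro: non-zero if the header describes release
     major.minor or a later one. It can be used in #if as well as in
     ordinary expressions. */
  #define PCRE2_AT_LEAST(major, minor) \
    (PCRE2_MAJOR &gt; (major) || \
    (PCRE2_MAJOR == (major) &amp;&amp; PCRE2_MINOR &gt;= (minor)))

  #if !PCRE2_AT_LEAST(10, 0)
  #error "This code expects the PCRE2 (10.x or later) header"
  #endif

  /* Hypothetical helper: the same test, made at run time. */
  int header_at_least(int major, int minor)
  {
    return PCRE2_AT_LEAST(major, minor);
  }
</pre>
Note that this tests the header that the application was compiled with, which
is not necessarily the release of the library that is linked at run time.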
</P>
<P>
In a Windows environment, if you want to statically link an application program
against a non-dll PCRE2 library, you must define PCRE2_STATIC before including
<b>pcre2.h</b>.
</P>
<P>
The functions <b>pcre2_compile()</b> and <b>pcre2_match()</b> are used for
compiling and matching regular expressions in a Perl-compatible manner. A
sample program that demonstrates the simplest way of using them is provided in
the file called <i>pcre2demo.c</i> in the PCRE2 source distribution. A listing
of this program is given in the
<a href="pcre2demo.html"><b>pcre2demo</b></a>
documentation, and the
<a href="pcre2sample.html"><b>pcre2sample</b></a>
documentation describes how to compile and run it.
</P>
<P>
Just-in-time compiler support is an optional feature of PCRE2 that can be built
in appropriate hardware environments. It greatly speeds up the matching
performance of many patterns. Programs can request that it be used if
available, by calling <b>pcre2_jit_compile()</b> after a pattern has been
successfully compiled by <b>pcre2_compile()</b>. This does nothing if JIT
support is not available.
</P>
<P>
More complicated programs might need to make use of the specialist functions
<b>pcre2_jit_stack_create()</b>, <b>pcre2_jit_stack_free()</b>, and
<b>pcre2_jit_stack_assign()</b> in order to control the JIT code's memory usage.
</P>
<P>
JIT matching is automatically used by <b>pcre2_match()</b> if it is available,
unless the PCRE2_NO_JIT option is set. There is also a direct interface for JIT
matching, which gives improved performance. The JIT-specific functions are
discussed in the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation.
</P>
<P>
A second matching function, <b>pcre2_dfa_match()</b>, which is not
Perl-compatible, is also provided. This uses a different algorithm for the
matching.
The alternative algorithm finds all possible matches (at a given
point in the subject), and scans the subject just once (unless there are
lookbehind assertions). However, this algorithm does not return captured
substrings. A description of the two matching algorithms and their advantages
and disadvantages is given in the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
documentation. There is no JIT support for <b>pcre2_dfa_match()</b>.
</P>
<P>
In addition to the main compiling and matching functions, there are convenience
functions for extracting captured substrings from a subject string that has
been matched by <b>pcre2_match()</b>. They are:
<pre>
  <b>pcre2_substring_copy_byname()</b>
  <b>pcre2_substring_copy_bynumber()</b>
  <b>pcre2_substring_get_byname()</b>
  <b>pcre2_substring_get_bynumber()</b>
  <b>pcre2_substring_list_get()</b>
  <b>pcre2_substring_length_byname()</b>
  <b>pcre2_substring_length_bynumber()</b>
  <b>pcre2_substring_nametable_scan()</b>
  <b>pcre2_substring_number_from_name()</b>
</pre>
<b>pcre2_substring_free()</b> and <b>pcre2_substring_list_free()</b> are also
provided, to free the memory used for extracted strings.
</P>
<P>
The function <b>pcre2_substitute()</b> can be called to match a pattern and
return a copy of the subject string with substitutions for parts that were
matched.
</P>
<P>
Functions whose names begin with <b>pcre2_serialize_</b> are used for saving
compiled patterns on disc or elsewhere, and reloading them later.
</P>
<P>
Finally, there are functions for finding out information about a compiled
pattern (<b>pcre2_pattern_info()</b>) and about the configuration with which
PCRE2 was built (<b>pcre2_config()</b>).
</P>
<P>
Functions with names ending with <b>_free()</b> are used for freeing memory
blocks of various sorts.
In all cases, if one of these functions is called with
a NULL argument, it does nothing.
</P>
<br><a name="SEC13" href="#TOC1">STRING LENGTHS AND OFFSETS</a><br>
<P>
The PCRE2 API uses string lengths and offsets into strings of code units in
several places. These values are always of type PCRE2_SIZE, which is an
unsigned integer type, currently always defined as <i>size_t</i>. The largest
value that can be stored in such a type (that is ~(PCRE2_SIZE)0) is reserved
as a special indicator for zero-terminated strings and unset offsets.
Therefore, the longest string that can be handled is one less than this
maximum.
<a name="newlines"></a></P>
<br><a name="SEC14" href="#TOC1">NEWLINES</a><br>
<P>
PCRE2 supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
character, the two-character sequence CRLF, any of the three preceding, or any
Unicode newline sequence. The Unicode newline sequences are the three just
mentioned, plus the single characters VT (vertical tab, U+000B), FF (form feed,
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
(paragraph separator, U+2029).
</P>
<P>
Each of the first three conventions is used by at least one operating system as
its standard newline sequence. When PCRE2 is built, a default can be specified.
If none is specified, the default is LF, which is the Unix standard. However,
the newline convention can be changed by an application when calling
<b>pcre2_compile()</b>, or it can be specified by special text at the start of
the pattern itself; this overrides any other settings. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page for details of the special character sequences.
</P>
<P>
In the PCRE2 documentation the word "newline" is used to mean "the character or
pair of characters that indicate a line break".
The choice of newline
convention affects the handling of the dot, circumflex, and dollar
metacharacters, the handling of #-comments in /x mode, and, when CRLF is a
recognized line ending sequence, the match position advancement for a
non-anchored pattern. There is more detail about this in the
<a href="#matchoptions">section on <b>pcre2_match()</b> options</a>
below.
</P>
<P>
The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches; this has
its own separate convention.
</P>
<br><a name="SEC15" href="#TOC1">MULTITHREADING</a><br>
<P>
In a multithreaded application it is important to keep thread-specific data
separate from data that can be shared between threads. The PCRE2 library code
itself is thread-safe: it contains no static or global variables. The API is
designed to be fairly simple for non-threaded applications while at the same
time ensuring that multithreaded applications can use it.
</P>
<P>
There are several different blocks of data that are used to pass information
between the application and the PCRE2 libraries.
</P>
<br><b>
The compiled pattern
</b><br>
<P>
A pointer to the compiled form of a pattern is returned to the user when
<b>pcre2_compile()</b> is successful. The data in the compiled pattern is fixed,
and does not change when the pattern is matched. Therefore, it is thread-safe,
that is, the same compiled pattern can be used by more than one thread
simultaneously. For example, an application can compile all its patterns at the
start, before forking off multiple threads that use them. However, if the
just-in-time optimization feature is being used, it needs separate memory stack
areas for each thread. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details.
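The "compile first, then share" arrangement might be sketched as follows. This
is an illustration only: it assumes POSIX threads and the 8-bit library, does
not use JIT, and <b>match_job</b>, <b>run_job()</b>, and
<b>match_two_in_parallel()</b> are hypothetical names.
<pre>
  #define PCRE2_CODE_UNIT_WIDTH 8
  #include &lt;pcre2.h&gt;
  #include &lt;pthread.h&gt;
  #include &lt;string.h&gt;

  /* Hypothetical per-thread job: the pattern is shared, the rest is private. */
  typedef struct {
    const pcre2_code *re;    /* shared and only read */
    const char *subject;
    int rc;                  /* value returned by pcre2_match() */
  } match_job;

  static void *run_job(void *arg)
  {
    match_job *job = arg;
    /* Each thread creates and frees its own match data block. */
    pcre2_match_data *md =
      pcre2_match_data_create_from_pattern(job-&gt;re, NULL);
    if (md == NULL) { job-&gt;rc = PCRE2_ERROR_NOMEMORY; return NULL; }
    job-&gt;rc = pcre2_match(job-&gt;re, (PCRE2_SPTR)job-&gt;subject,
      strlen(job-&gt;subject), 0, 0, md, NULL);
    pcre2_match_data_free(md);
    return NULL;
  }

  /* Compiles once, then matches two subjects in parallel. Returns the
     number of subjects that matched, or -1 if something failed. */
  int match_two_in_parallel(const char *pattern, const char *s1,
    const char *s2)
  {
    int errorcode, i, started = 0, matched = 0;
    PCRE2_SIZE erroroffset;
    pthread_t threads[2];
    match_job jobs[2];
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)pattern,
      PCRE2_ZERO_TERMINATED, 0, &amp;errorcode, &amp;erroroffset, NULL);
    if (re == NULL) return -1;
    jobs[0].subject = s1;
    jobs[1].subject = s2;
    for (i = 0; i &lt; 2; i++)
      {
      jobs[i].re = re;
      if (pthread_create(&amp;threads[i], NULL, run_job, &amp;jobs[i]) != 0) break;
      started++;
      }
    for (i = 0; i &lt; started; i++)
      {
      pthread_join(threads[i], NULL);
      if (jobs[i].rc &gt; 0) matched++;
      }
    pcre2_code_free(re);     /* safe: all threads have finished */
    return (started == 2)? matched : -1;
  }
</pre>
The compiled pattern is freed only after every thread that uses it has been
joined.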
</P>
<P>
In a more complicated situation, where patterns are compiled only when they are
first needed, but are still shared between threads, pointers to compiled
patterns must be protected from simultaneous writing by multiple threads, at
least until a pattern has been compiled. The logic can be something like this:
<pre>
  Get a read-only (shared) lock (mutex) for pointer
  if (pointer == NULL)
    {
    Get a write (unique) lock for pointer
    if (pointer == NULL) pointer = pcre2_compile(...
    }
  Release the lock
  Use pointer in pcre2_match()
</pre>
The second test of the pointer is needed because another thread may have
compiled the pattern while this one was waiting for the write lock. Of course,
testing for compilation errors should also be included in the code.
</P>
<P>
If JIT is being used, but the JIT compilation is not being done immediately
(perhaps waiting to see if the pattern is used often enough), similar logic is
required. JIT compilation updates a pointer within the compiled code block, so
a thread must gain unique write access to the pointer before calling
<b>pcre2_jit_compile()</b>. Alternatively, <b>pcre2_code_copy()</b> can be used
to obtain a private copy of the compiled code.
</P>
<br><b>
Context blocks
</b><br>
<P>
The next main section below introduces the idea of "contexts" in which PCRE2
functions are called. A context is nothing more than a collection of parameters
that control the way PCRE2 operates. Grouping a number of parameters together
in a context is a convenient way of passing them to a PCRE2 function without
using lots of arguments. The parameters that are stored in contexts are in some
sense "advanced features" of the API. Many straightforward applications will
not need to use contexts.
</P>
<P>
In a multithreaded application, if the parameters in a context are values that
are never changed, the same context can be used by all the threads. However, if
any thread needs to change any value in a context, it must make its own
thread-specific copy.
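As a sketch of such a thread-specific copy, the hypothetical helper
<b>private_match_context()</b> below copies a shared match context and changes
the match limit in the copy only. It assumes that <i>shared</i> is not NULL.
<pre>
  #define PCRE2_CODE_UNIT_WIDTH 8
  #include &lt;pcre2.h&gt;
  #include &lt;stdint.h&gt;

  /* Hypothetical helper: returns a private copy of a shared match
     context with its own match limit, or NULL if memory cannot be
     obtained. The shared context itself is not modified. The caller
     frees the copy with pcre2_match_context_free(). */
  pcre2_match_context *private_match_context(pcre2_match_context *shared,
    uint32_t match_limit)
  {
    pcre2_match_context *mine = pcre2_match_context_copy(shared);
    if (mine != NULL) (void)pcre2_set_match_limit(mine, match_limit);
    return mine;
  }
</pre>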
</P>
<br><b>
Match blocks
</b><br>
<P>
The matching functions need a block of memory for working space and for storing
the results of a match. This includes details of what was matched, as well as
additional information such as the name of a (*MARK) setting. Each thread must
provide its own copy of this memory.
</P>
<br><a name="SEC16" href="#TOC1">PCRE2 CONTEXTS</a><br>
<P>
Some PCRE2 functions have a lot of parameters, many of which are used only by
specialist applications, for example, those that use custom memory management
or non-standard character tables. To keep function argument lists at a
reasonable size, and at the same time to keep the API extensible, "uncommon"
parameters are passed to certain functions in a <b>context</b> instead of
directly. A context is just a block of memory that holds the parameter values.
Applications that do not need to adjust any of the context parameters can pass
NULL when a context pointer is required.
</P>
<P>
There are three different types of context: a general context that is relevant
for several PCRE2 operations, a compile-time context, and a match-time context.
</P>
<br><b>
The general context
</b><br>
<P>
At present, this context just contains pointers to (and data for) external
memory management functions that are called from several places in the PCRE2
library. The context is named `general' rather than specifically `memory'
because in future other fields may be added. If you do not want to supply your
own custom memory management functions, you do not need to bother with a
general context.
A general context is created by:
<br>
<br>
<b>pcre2_general_context *pcre2_general_context_create(</b>
<b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
<br>
<br>
The two function pointers specify custom memory management functions, whose
prototypes are:
<pre>
  <b>void *private_malloc(PCRE2_SIZE, void *);</b>
  <b>void  private_free(void *, void *);</b>
</pre>
Whenever code in PCRE2 calls these functions, the final argument is the value
of <i>memory_data</i>. Either of the first two arguments of the creation
function may be NULL, in which case the system memory management functions
<i>malloc()</i> and <i>free()</i> are used. (This is not currently useful, as
there are no other fields in a general context, but in future there might be.)
The <i>private_malloc()</i> function is used (if supplied) to obtain memory for
storing the context, and all three values are saved as part of the context.
</P>
<P>
Whenever PCRE2 creates a data block of any kind, the block contains a pointer
to the <i>free()</i> function that matches the <i>malloc()</i> function that was
used. When the time comes to free the block, this function is called.
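A pair of custom memory functions might look like the following sketch, which
simply counts the blocks that are currently allocated. The names
<b>alloc_stats</b>, <b>counting_malloc()</b>, <b>counting_free()</b>, and
<b>general_context_round_trip()</b> are hypothetical; only the two prototypes
and the context functions come from PCRE2.
<pre>
  #define PCRE2_CODE_UNIT_WIDTH 8
  #include &lt;pcre2.h&gt;
  #include &lt;stdlib.h&gt;

  /* Hypothetical bookkeeping structure, passed as memory_data. */
  typedef struct { long live_blocks; } alloc_stats;

  static void *counting_malloc(PCRE2_SIZE size, void *data)
  {
    alloc_stats *stats = data;
    void *block = malloc(size);
    if (block != NULL) stats-&gt;live_blocks++;
    return block;
  }

  static void counting_free(void *block, void *data)
  {
    alloc_stats *stats = data;
    if (block == NULL) return;
    stats-&gt;live_blocks--;
    free(block);
  }

  /* Creates and frees a general context. Returns the number of blocks
     still allocated afterwards, or -1 if creation failed or the custom
     malloc was not used for the context. */
  long general_context_round_trip(void)
  {
    alloc_stats stats = { 0 };
    long during;
    pcre2_general_context *gc =
      pcre2_general_context_create(counting_malloc, counting_free, &amp;stats);
    if (gc == NULL) return -1;
    during = stats.live_blocks;   /* the context itself is a counted block */
    pcre2_general_context_free(gc);
    return (during &gt; 0)? stats.live_blocks : -1;
  }
</pre>
Because the context records its own freeing function, the block obtained from
counting_malloc() is released through counting_free(), and the count should
return to zero.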
</P>
<P>
A general context can be copied by calling:
<br>
<br>
<b>pcre2_general_context *pcre2_general_context_copy(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
The memory used for a general context should be freed by calling:
<br>
<br>
<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
<a name="compilecontext"></a></P>
<br><b>
The compile context
</b><br>
<P>
A compile context is required if you want to change the default values of any
of the following compile-time parameters:
<pre>
  What \R matches (Unicode newlines or CR, LF, CRLF only)
  PCRE2's character tables
  The newline character sequence
  The compile time nested parentheses limit
  The maximum length of the pattern string
  An external function for stack checking
</pre>
A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
<i>pcre2_compile()</i>.
</P>
<P>
A compile context is created, copied, and freed by the following functions:
<br>
<br>
<b>pcre2_compile_context *pcre2_compile_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_compile_context *pcre2_compile_context_copy(</b>
<b> pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>void pcre2_compile_context_free(pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
A compile context is created with default values for its parameters. These can
be changed by calling the following functions, which return 0 on success, or
PCRE2_ERROR_BADDATA if invalid data is detected.
<br>
<br>
<b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only CR, LF,
or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line
ending sequence.
The value is used by the JIT compiler and by the two 699interpreted matching functions, <i>pcre2_match()</i> and 700<i>pcre2_dfa_match()</i>. 701<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b> 702<b> const unsigned char *<i>tables</i>);</b> 703<br> 704<br> 705The value must be the result of a call to <i>pcre2_maketables()</i>, whose only 706argument is a general context. This function builds a set of character tables 707in the current locale. 708<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b> 709<b> PCRE2_SIZE <i>value</i>);</b> 710<br> 711<br> 712This sets a maximum length, in code units, for the pattern string that is to be 713compiled. If the pattern is longer, an error is generated. This facility is 714provided so that applications that accept patterns from external sources can 715limit their size. The default is the largest number that a PCRE2_SIZE variable 716can hold, which is effectively unlimited. 717<b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b> 718<b> uint32_t <i>value</i>);</b> 719<br> 720<br> 721This specifies which characters or character sequences are to be recognized as 722newlines. The value must be one of PCRE2_NEWLINE_CR (carriage return only), 723PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the two-character 724sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any of the above), or 725PCRE2_NEWLINE_ANY (any Unicode newline sequence). 726</P> 727<P> 728When a pattern is compiled with the PCRE2_EXTENDED option, the value of this 729parameter affects the recognition of white space and the end of internal 730comments starting with #. The value is saved with the compiled pattern for 731subsequent use by the JIT compiler and by the two interpreted matching 732functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>. 
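</P>
<P>
The setters described so far might be combined as in the following sketch,
which assumes the 8-bit library. The function name
<i>make_limited_compile_context()</i> and its parameter are hypothetical; the
particular choice of PCRE2_BSR_ANYCRLF and PCRE2_NEWLINE_ANYCRLF is just one
plausible configuration, not a recommendation.
<pre>
  #define PCRE2_CODE_UNIT_WIDTH 8
  #include &lt;pcre2.h&gt;
  #include &lt;stddef.h&gt;

  /* Hypothetical helper: build a compile context in which \R and newline
     recognition are restricted to CR, LF, and CRLF, and patterns longer than
     max_units code units are rejected. Returns NULL on failure. */
  static pcre2_compile_context *make_limited_compile_context(
    PCRE2_SIZE max_units)
  {
    pcre2_compile_context *ccontext = pcre2_compile_context_create(NULL);
    if (ccontext == NULL) return NULL;

    /* Each setter returns 0 on success or PCRE2_ERROR_BADDATA. */
    if (pcre2_set_bsr(ccontext, PCRE2_BSR_ANYCRLF) != 0 ||
        pcre2_set_newline(ccontext, PCRE2_NEWLINE_ANYCRLF) != 0 ||
        pcre2_set_max_pattern_length(ccontext, max_units) != 0)
      {
      pcre2_compile_context_free(ccontext);
      return NULL;
      }
    return ccontext;
  }
</pre>
The returned context is passed as the last argument of <b>pcre2_compile()</b>
and released with <b>pcre2_compile_context_free()</b> when it is no longer
needed.
</P>
<P>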
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
<b>  uint32_t <i>value</i>);</b>
<br>
<br>
This parameter adjusts the limit, set when PCRE2 is built (default 250), on the
depth of parenthesis nesting in a pattern. This limit stops rogue patterns
using up too much system stack when being compiled.
<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
<b>  int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
<br>
<br>
There is at least one application that runs PCRE2 in threads with very limited
system stack, where running out of stack is to be avoided at all costs. The
parenthesis limit above cannot take account of how much stack is actually
available. For finer control, you can supply a function that is called
whenever <b>pcre2_compile()</b> starts to compile a parenthesized part of a
pattern. This function can check the actual stack size (or anything else that
it wants to, of course).
</P>
<P>
The first argument to the guard function gives the current depth of
nesting, and the second is user data that is set up by the last argument of
<b>pcre2_set_compile_recursion_guard()</b>. The guard function should return
zero if all is well, or non-zero to force an error.
<a name="matchcontext"></a></P>
<br><b>
The match context
</b><br>
<P>
A match context is required if you want to change the default values of any
of the following match-time parameters:
<pre>
  A callout function
  The offset limit for matching an unanchored pattern
  The limit for calling <b>match()</b> (see below)
  The limit for calling <b>match()</b> recursively
</pre>
A match context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
<b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or <b>pcre2_jit_match()</b>.
</P>
<P>
A match context is created, copied, and freed by the following functions:
<b>pcre2_match_context *pcre2_match_context_create(</b>
<b>  pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_match_context *pcre2_match_context_copy(</b>
<b>  pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>void pcre2_match_context_free(pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
A match context is created with default values for its parameters. These can
be changed by calling the following functions, which return 0 on success, or
PCRE2_ERROR_BADDATA if invalid data is detected.
<b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b>  int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b>
<b>  void *<i>callout_data</i>);</b>
<br>
<br>
This sets up a "callout" function, which PCRE2 will call at specified points
during a matching operation. Details are given in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b>  PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
The <i>offset_limit</i> parameter limits how far an unanchored search can
advance in the subject string. The default value is PCRE2_UNSET. The
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> functions return
PCRE2_ERROR_NOMATCH if a match with a starting point before or at the given
offset is not found. For example, if the pattern /abc/ is matched against
"123abc" with an offset limit less than 3, the result is PCRE2_ERROR_NOMATCH.
A match can never be found if the <i>startoffset</i> argument of
<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> is greater than the offset
limit.
</P>
<P>
When using this facility, you must set PCRE2_USE_OFFSET_LIMIT when calling
<b>pcre2_compile()</b> so that when JIT is in use, different code can be
compiled.
If a match is started with a non-default offset limit when
PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
</P>
<P>
The offset limit facility can be used to track progress when searching large
subject strings. See also the PCRE2_FIRSTLINE option, which requires a match to
start within the first line of the subject. If this is set with an offset
limit, a match must occur in the first line and also within the offset limit.
In other words, whichever limit comes first is used.
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b>  uint32_t <i>value</i>);</b>
<br>
<br>
The <i>match_limit</i> parameter provides a means of preventing PCRE2 from using
up too many resources when processing patterns that are not going to match, but
which have a very large number of possibilities in their search trees. The
classic example is a pattern that uses nested unlimited repeats.
</P>
<P>
Internally, <b>pcre2_match()</b> uses a function called <b>match()</b>, which it
calls repeatedly (sometimes recursively). The limit set by <i>match_limit</i> is
imposed on the number of times this function is called during a match, which
has the effect of limiting the amount of backtracking that can take place. For
patterns that are not anchored, the count restarts from zero for each position
in the subject string. This limit is not relevant to <b>pcre2_dfa_match()</b>,
which ignores it.
</P>
<P>
When <b>pcre2_match()</b> is called with a pattern that was successfully
processed by <b>pcre2_jit_compile()</b>, the way in which matching is executed
is entirely different. However, there is still the possibility of runaway
matching that goes on for a very long time, and so the <i>match_limit</i> value
is also used in this case (but in a different way) to limit how long the
matching can continue.
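</P>
<P>
The /abc/ offset limit example given earlier for
<b>pcre2_set_offset_limit()</b> can be sketched in code as follows, assuming
the 8-bit library. The helper name <i>match_abc_with_offset_limit()</i> is
hypothetical. The sketch also uses <b>pcre2_match_data_create_from_pattern()</b>
and <b>pcre2_match_data_free()</b>, which are library functions documented
elsewhere on this page.
<pre>
  #define PCRE2_CODE_UNIT_WIDTH 8
  #include &lt;pcre2.h&gt;
  #include &lt;string.h&gt;

  /* Hypothetical helper: match /abc/ against subject with the given offset
     limit and return the result of pcre2_match(). Any failure to create one
     of the required objects is reported here as PCRE2_ERROR_NOMEMORY. */
  static int match_abc_with_offset_limit(const char *subject,
    PCRE2_SIZE offset_limit)
  {
    int errorcode;
    PCRE2_SIZE erroroffset;
    int rc = PCRE2_ERROR_NOMEMORY;
    pcre2_match_context *mcontext;
    pcre2_match_data *match_data;

    /* PCRE2_USE_OFFSET_LIMIT must be set when the pattern is compiled. */
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)"abc", PCRE2_ZERO_TERMINATED,
      PCRE2_USE_OFFSET_LIMIT, &amp;errorcode, &amp;erroroffset, NULL);
    if (re == NULL) return rc;

    mcontext = pcre2_match_context_create(NULL);
    match_data = pcre2_match_data_create_from_pattern(re, NULL);
    if (mcontext != NULL &amp;&amp; match_data != NULL &amp;&amp;
        pcre2_set_offset_limit(mcontext, offset_limit) == 0)
      {
      rc = pcre2_match(re, (PCRE2_SPTR)subject, strlen(subject), 0, 0,
        match_data, mcontext);
      }

    /* The free functions accept NULL, so no extra checks are needed. */
    pcre2_match_data_free(match_data);
    pcre2_match_context_free(mcontext);
    pcre2_code_free(re);
    return rc;
  }
</pre>
With the subject "123abc", an offset limit of 2 is expected to give
PCRE2_ERROR_NOMATCH, whereas a limit of 3 allows the match that starts at
offset 3 and gives a positive return value.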
850</P> 851<P> 852The default value for the limit can be set when PCRE2 is built; the default 853default is 10 million, which handles all but the most extreme cases. If the 854limit is exceeded, <b>pcre2_match()</b> returns PCRE2_ERROR_MATCHLIMIT. A value 855for the match limit may also be supplied by an item at the start of a pattern 856of the form 857<pre> 858 (*LIMIT_MATCH=ddd) 859</pre> 860where ddd is a decimal number. However, such a setting is ignored unless ddd is 861less than the limit set by the caller of <b>pcre2_match()</b> or, if no such 862limit is set, less than the default. 863<b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b> 864<b> uint32_t <i>value</i>);</b> 865<br> 866<br> 867The <i>recursion_limit</i> parameter is similar to <i>match_limit</i>, but 868instead of limiting the total number of times that <b>match()</b> is called, it 869limits the depth of recursion. The recursion depth is a smaller number than the 870total number of calls, because not all calls to <b>match()</b> are recursive. 871This limit is of use only if it is set smaller than <i>match_limit</i>. 872</P> 873<P> 874Limiting the recursion depth limits the amount of system stack that can be 875used, or, when PCRE2 has been compiled to use memory on the heap instead of the 876stack, the amount of heap memory that can be used. This limit is not relevant, 877and is ignored, when matching is done using JIT compiled code or by the 878<b>pcre2_dfa_match()</b> function. 879</P> 880<P> 881The default value for <i>recursion_limit</i> can be set when PCRE2 is built; the 882default default is the same value as the default for <i>match_limit</i>. If the 883limit is exceeded, <b>pcre2_match()</b> returns PCRE2_ERROR_RECURSIONLIMIT. A 884value for the recursion limit may also be supplied by an item at the start of a 885pattern of the form 886<pre> 887 (*LIMIT_RECURSION=ddd) 888</pre> 889where ddd is a decimal number. 
However, such a setting is ignored unless ddd is 890less than the limit set by the caller of <b>pcre2_match()</b> or, if no such 891limit is set, less than the default. 892<b>int pcre2_set_recursion_memory_management(</b> 893<b> pcre2_match_context *<i>mcontext</i>,</b> 894<b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b> 895<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b> 896<br> 897<br> 898This function sets up two additional custom memory management functions for use 899by <b>pcre2_match()</b> when PCRE2 is compiled to use the heap for remembering 900backtracking data, instead of recursive function calls that use the system 901stack. There is a discussion about PCRE2's stack usage in the 902<a href="pcre2stack.html"><b>pcre2stack</b></a> 903documentation. See the 904<a href="pcre2build.html"><b>pcre2build</b></a> 905documentation for details of how to build PCRE2. 906</P> 907<P> 908Using the heap for recursion is a non-standard way of building PCRE2, for use 909in environments that have limited stacks. Because of the greater use of memory 910management, <b>pcre2_match()</b> runs more slowly. Functions that are different 911to the general custom memory functions are provided so that special-purpose 912external code can be used for this case, because the memory blocks are all the 913same size. The blocks are retained by <b>pcre2_match()</b> until it is about to 914exit so that they can be re-used when possible during the match. In the absence 915of these functions, the normal custom memory management functions are used, if 916supplied, otherwise the system functions. 917</P> 918<br><a name="SEC17" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br> 919<P> 920<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b> 921</P> 922<P> 923The function <b>pcre2_config()</b> makes it possible for a PCRE2 client to 924discover which optional features have been compiled into the PCRE2 library. 
The 925<a href="pcre2build.html"><b>pcre2build</b></a> 926documentation has more details about these optional features. 927</P> 928<P> 929The first argument for <b>pcre2_config()</b> specifies which information is 930required. The second argument is a pointer to memory into which the information 931is placed. If NULL is passed, the function returns the amount of memory that is 932needed for the requested information. For calls that return numerical values, 933the value is in bytes; when requesting these values, <i>where</i> should point 934to appropriately aligned memory. For calls that return strings, the required 935length is given in code units, not counting the terminating zero. 936</P> 937<P> 938When requesting information, the returned value from <b>pcre2_config()</b> is 939non-negative on success, or the negative error code PCRE2_ERROR_BADOPTION if 940the value in the first argument is not recognized. The following information is 941available: 942<pre> 943 PCRE2_CONFIG_BSR 944</pre> 945The output is a uint32_t integer whose value indicates what character 946sequences the \R escape sequence matches by default. A value of 947PCRE2_BSR_UNICODE means that \R matches any Unicode line ending sequence; a 948value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF. The 949default can be overridden when a pattern is compiled. 950<pre> 951 PCRE2_CONFIG_JIT 952</pre> 953The output is a uint32_t integer that is set to one if support for just-in-time 954compiling is available; otherwise it is set to zero. 955<pre> 956 PCRE2_CONFIG_JITTARGET 957</pre> 958The <i>where</i> argument should point to a buffer that is at least 48 code 959units long. (The exact length required can be found by calling 960<b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with a 961string that contains the name of the architecture for which the JIT compiler is 962configured, for example "x86 32bit (little endian + unaligned)". 
If JIT support 963is not available, PCRE2_ERROR_BADOPTION is returned, otherwise the number of 964code units used is returned. This is the length of the string, plus one unit 965for the terminating zero. 966<pre> 967 PCRE2_CONFIG_LINKSIZE 968</pre> 969The output is a uint32_t integer that contains the number of bytes used for 970internal linkage in compiled regular expressions. When PCRE2 is configured, the 971value can be set to 2, 3, or 4, with the default being 2. This is the value 972that is returned by <b>pcre2_config()</b>. However, when the 16-bit library is 973compiled, a value of 3 is rounded up to 4, and when the 32-bit library is 974compiled, internal linkages always use 4 bytes, so the configured value is not 975relevant. 976</P> 977<P> 978The default value of 2 for the 8-bit and 16-bit libraries is sufficient for all 979but the most massive patterns, since it allows the size of the compiled pattern 980to be up to 64K code units. Larger values allow larger regular expressions to 981be compiled by those two libraries, but at the expense of slower matching. 982<pre> 983 PCRE2_CONFIG_MATCHLIMIT 984</pre> 985The output is a uint32_t integer that gives the default limit for the number of 986internal matching function calls in a <b>pcre2_match()</b> execution. Further 987details are given with <b>pcre2_match()</b> below. 988<pre> 989 PCRE2_CONFIG_NEWLINE 990</pre> 991The output is a uint32_t integer whose value specifies the default character 992sequence that is recognized as meaning "newline". The values are: 993<pre> 994 PCRE2_NEWLINE_CR Carriage return (CR) 995 PCRE2_NEWLINE_LF Linefeed (LF) 996 PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF) 997 PCRE2_NEWLINE_ANY Any Unicode line ending 998 PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF 999</pre> 1000The default should normally correspond to the standard sequence for your 1001operating system. 
1002<pre> 1003 PCRE2_CONFIG_PARENSLIMIT 1004</pre> 1005The output is a uint32_t integer that gives the maximum depth of nesting 1006of parentheses (of any kind) in a pattern. This limit is imposed to cap the 1007amount of system stack used when a pattern is compiled. It is specified when 1008PCRE2 is built; the default is 250. This limit does not take into account the 1009stack that may already be used by the calling application. For finer control 1010over compilation stack usage, see <b>pcre2_set_compile_recursion_guard()</b>. 1011<pre> 1012 PCRE2_CONFIG_RECURSIONLIMIT 1013</pre> 1014The output is a uint32_t integer that gives the default limit for the depth of 1015recursion when calling the internal matching function in a <b>pcre2_match()</b> 1016execution. Further details are given with <b>pcre2_match()</b> below. 1017<pre> 1018 PCRE2_CONFIG_STACKRECURSE 1019</pre> 1020The output is a uint32_t integer that is set to one if internal recursion when 1021running <b>pcre2_match()</b> is implemented by recursive function calls that use 1022the system stack to remember their state. This is the usual way that PCRE2 is 1023compiled. The output is zero if PCRE2 was compiled to use blocks of data on the 1024heap instead of recursive function calls. 1025<pre> 1026 PCRE2_CONFIG_UNICODE_VERSION 1027</pre> 1028The <i>where</i> argument should point to a buffer that is at least 24 code 1029units long. (The exact length required can be found by calling 1030<b>pcre2_config()</b> with <b>where</b> set to NULL.) If PCRE2 has been compiled 1031without Unicode support, the buffer is filled with the text "Unicode not 1032supported". Otherwise, the Unicode version string (for example, "8.0.0") is 1033inserted. The number of code units used is returned. This is the length of the 1034string plus one unit for the terminating zero. 
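</P>
<P>
The two kinds of <b>pcre2_config()</b> request described so far, numerical and
string, might be used as in the following sketch, which assumes the 8-bit
library (so that a code unit is one byte). The function names
<i>jit_available()</i> and <i>unicode_version_string()</i> are hypothetical.
One extra code unit is allocated as a precaution, in case the length reported
for a NULL <i>where</i> argument does not count the terminating zero.
<pre>
  #define PCRE2_CODE_UNIT_WIDTH 8
  #include &lt;pcre2.h&gt;
  #include &lt;stdint.h&gt;
  #include &lt;stdlib.h&gt;
  #include &lt;string.h&gt;

  /* Hypothetical helper: returns 1 if JIT support is available, 0 if it is
     not, or a negative error code from pcre2_config(). */
  static int jit_available(void)
  {
    uint32_t jit = 0;
    int rc = pcre2_config(PCRE2_CONFIG_JIT, &amp;jit);
    return (rc &lt; 0)? rc : (int)jit;
  }

  /* Hypothetical helper: returns the Unicode version text in a buffer
     obtained from malloc(), or NULL on failure. The caller frees it. */
  static char *unicode_version_string(void)
  {
    char *buffer;
    int needed = pcre2_config(PCRE2_CONFIG_UNICODE_VERSION, NULL);
    if (needed &lt; 0) return NULL;

    /* Allow one spare unit for the terminating zero. */
    buffer = malloc((size_t)needed + 1);
    if (buffer == NULL) return NULL;

    if (pcre2_config(PCRE2_CONFIG_UNICODE_VERSION, buffer) &lt; 0)
      {
      free(buffer);
      return NULL;
      }
    return buffer;
  }
</pre>
Asking for the length first avoids relying on the fixed minimum buffer sizes
quoted for the string-valued items.
</P>
<P>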
1035<pre> 1036 PCRE2_CONFIG_UNICODE 1037</pre> 1038The output is a uint32_t integer that is set to one if Unicode support is 1039available; otherwise it is set to zero. Unicode support implies UTF support. 1040<pre> 1041 PCRE2_CONFIG_VERSION 1042</pre> 1043The <i>where</i> argument should point to a buffer that is at least 12 code 1044units long. (The exact length required can be found by calling 1045<b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with 1046the PCRE2 version string, zero-terminated. The number of code units used is 1047returned. This is the length of the string plus one unit for the terminating 1048zero. 1049<a name="compiling"></a></P> 1050<br><a name="SEC18" href="#TOC1">COMPILING A PATTERN</a><br> 1051<P> 1052<b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b> 1053<b> uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset,</i></b> 1054<b> pcre2_compile_context *<i>ccontext</i>);</b> 1055<br> 1056<br> 1057<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b> 1058<br> 1059<br> 1060<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b> 1061</P> 1062<P> 1063The <b>pcre2_compile()</b> function compiles a pattern into an internal form. 1064The pattern is defined by a pointer to a string of code units and a length. If 1065the pattern is zero-terminated, the length can be specified as 1066PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that 1067contains the compiled pattern and related data, or NULL if an error occurred. 1068</P> 1069<P> 1070If the compile context argument <i>ccontext</i> is NULL, memory for the compiled 1071pattern is obtained by calling <b>malloc()</b>. Otherwise, it is obtained from 1072the same memory function that was used for the compile context. The caller must 1073free the memory by calling <b>pcre2_code_free()</b> when it is no longer needed. 
1074</P> 1075<P> 1076The function <b>pcre2_code_copy()</b> makes a copy of the compiled code in new 1077memory, using the same memory allocator as was used for the original. However, 1078if the code has been processed by the JIT compiler (see 1079<a href="#jitcompiling">below),</a> 1080the JIT information cannot be copied (because it is position-dependent). 1081The new copy can initially be used only for non-JIT matching, though it can be 1082passed to <b>pcre2_jit_compile()</b> if required. The <b>pcre2_code_copy()</b> 1083function provides a way for individual threads in a multithreaded application 1084to acquire a private copy of shared compiled code. 1085</P> 1086<P> 1087NOTE: When one of the matching functions is called, pointers to the compiled 1088pattern and the subject string are set in the match data block so that they can 1089be referenced by the substring extraction functions. After running a match, you 1090must not free a compiled pattern (or a subject string) until after all 1091operations on the 1092<a href="#matchdatablock">match data block</a> 1093have taken place. 1094</P> 1095<P> 1096The <i>options</i> argument for <b>pcre2_compile()</b> contains various bit 1097settings that affect the compilation. It should be zero if no options are 1098required. The available options are described below. Some of them (in 1099particular, those that are compatible with Perl, but some others as well) can 1100also be set and unset from within the pattern (see the detailed description in 1101the 1102<a href="pcre2pattern.html"><b>pcre2pattern</b></a> 1103documentation). 1104</P> 1105<P> 1106For those options that can be different in different parts of the pattern, the 1107contents of the <i>options</i> argument specifies their settings at the start of 1108compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK options can be set at 1109the time of matching as well as at compile time. 
1110</P> 1111<P> 1112Other, less frequently required compile-time parameters (for example, the 1113newline setting) can be provided in a compile context (as described 1114<a href="#compilecontext">above).</a> 1115</P> 1116<P> 1117If <i>errorcode</i> or <i>erroroffset</i> is NULL, <b>pcre2_compile()</b> returns 1118NULL immediately. Otherwise, the variables to which these point are set to an 1119error code and an offset (number of code units) within the pattern, 1120respectively, when <b>pcre2_compile()</b> returns NULL because a compilation 1121error has occurred. The values are not defined when compilation is successful 1122and <b>pcre2_compile()</b> returns a non-NULL value. 1123</P> 1124<P> 1125The <b>pcre2_get_error_message()</b> function (see "Obtaining a textual error 1126message" 1127<a href="#geterrormessage">below)</a> 1128provides a textual message for each error code. Compilation errors have 1129positive error codes; UTF formatting error codes are negative. For an invalid 1130UTF-8 or UTF-16 string, the offset is that of the first code unit of the 1131failing character. 1132</P> 1133<P> 1134Some errors are not detected until the whole pattern has been scanned; in these 1135cases, the offset passed back is the length of the pattern. Note that the 1136offset is in code units, not characters, even in a UTF mode. It may sometimes 1137point into the middle of a UTF-8 or UTF-16 character. 
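</P>
<P>
The error reporting described above might be handled as in the following
sketch, which assumes the 8-bit library. The function name
<i>check_pattern()</i> and its parameters are hypothetical;
<b>pcre2_get_error_message()</b> is the library function mentioned above.
<pre>
  #define PCRE2_CODE_UNIT_WIDTH 8
  #include &lt;pcre2.h&gt;
  #include &lt;stddef.h&gt;

  /* Hypothetical helper: try to compile a zero-terminated pattern with
     default options. Returns 0 on success. On failure, returns the error
     code, sets *erroroffset to the offset in code units, and copies the
     textual message into message (which holds message_units code units). */
  static int check_pattern(const char *pattern, PCRE2_UCHAR *message,
    PCRE2_SIZE message_units, PCRE2_SIZE *erroroffset)
  {
    int errorcode = 0;
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)pattern,
      PCRE2_ZERO_TERMINATED, 0, &amp;errorcode, erroroffset, NULL);

    if (re != NULL)
      {
      /* The error variables are not defined after a successful compile. */
      pcre2_code_free(re);
      return 0;
      }

    /* A negative result other than PCRE2_ERROR_BADDATA means only that the
       message was truncated, so it is still usable. */
    if (message_units &gt; 0 &amp;&amp;
        pcre2_get_error_message(errorcode, message, message_units) ==
          PCRE2_ERROR_BADDATA)
      message[0] = 0;

    return errorcode;
  }
</pre>
For a pattern such as "a(b", whose missing parenthesis cannot be detected until
the whole pattern has been scanned, the offset passed back is expected to be
the pattern length.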
</P>
<P>
This code fragment shows a typical straightforward call to
<b>pcre2_compile()</b>:
<pre>
  pcre2_code *re;
  PCRE2_SIZE erroffset;
  int errorcode;
  re = pcre2_compile(
    (PCRE2_SPTR)"^A.*Z",    /* the pattern, as a code unit pointer */
    PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
    0,                      /* default options */
    &errorcode,             /* for error code */
    &erroffset,             /* for error offset */
    NULL);                  /* no compile context */
</pre>
The following names for option bits are defined in the <b>pcre2.h</b> header
file:
<pre>
  PCRE2_ANCHORED
</pre>
If this bit is set, the pattern is forced to be "anchored", that is, it is
constrained to match only at the first matching point in the string that is
being searched (the "subject string"). This effect can also be achieved by
appropriate constructs in the pattern itself, which is the only way to do it in
Perl.
<pre>
  PCRE2_ALLOW_EMPTY_CLASS
</pre>
By default, for compatibility with Perl, a closing square bracket that
immediately follows an opening one is treated as a data character for the
class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the class, which
therefore contains no characters and so can never match.
<pre>
  PCRE2_ALT_BSUX
</pre>
This option requests alternative handling of three escape sequences, which
makes PCRE2's behaviour more like ECMAscript (aka JavaScript). When it is set:
</P>
<P>
(1) \U matches an upper case "U" character; by default \U causes a compile
time error (Perl uses \U to upper case subsequent characters).
</P>
<P>
(2) \u matches a lower case "u" character unless it is followed by four
hexadecimal digits, in which case the hexadecimal number defines the code point
to match. By default, \u causes a compile time error (Perl uses it to upper
case the following character).
1186</P> 1187<P> 1188(3) \x matches a lower case "x" character unless it is followed by two 1189hexadecimal digits, in which case the hexadecimal number defines the code point 1190to match. By default, as in Perl, a hexadecimal number is always expected after 1191\x, but it may have zero, one, or two digits (so, for example, \xz matches a 1192binary zero character followed by z). 1193<pre> 1194 PCRE2_ALT_CIRCUMFLEX 1195</pre> 1196In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter 1197matches at the start of the subject (unless PCRE2_NOTBOL is set), and also 1198after any internal newline. However, it does not match after a newline at the 1199end of the subject, for compatibility with Perl. If you want a multiline 1200circumflex also to match after a terminating newline, you must set 1201PCRE2_ALT_CIRCUMFLEX. 1202<pre> 1203 PCRE2_ALT_VERBNAMES 1204</pre> 1205By default, for compatibility with Perl, the name in any verb sequence such as 1206(*MARK:NAME) is any sequence of characters that does not include a closing 1207parenthesis. The name is not processed in any way, and it is not possible to 1208include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES 1209option is set, normal backslash processing is applied to verb names and only an 1210unescaped closing parenthesis terminates the name. A closing parenthesis can be 1211included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED 1212option is set, unescaped whitespace in verb names is skipped and #-comments are 1213recognized, exactly as in the rest of the pattern. 1214<pre> 1215 PCRE2_AUTO_CALLOUT 1216</pre> 1217If this bit is set, <b>pcre2_compile()</b> automatically inserts callout items, 1218all with number 255, before each pattern item. For discussion of the callout 1219facility, see the 1220<a href="pcre2callout.html"><b>pcre2callout</b></a> 1221documentation. 
1222<pre> 1223 PCRE2_CASELESS 1224</pre> 1225If this bit is set, letters in the pattern match both upper and lower case 1226letters in the subject. It is equivalent to Perl's /i option, and it can be 1227changed within a pattern by a (?i) option setting. 1228<pre> 1229 PCRE2_DOLLAR_ENDONLY 1230</pre> 1231If this bit is set, a dollar metacharacter in the pattern matches only at the 1232end of the subject string. Without this option, a dollar also matches 1233immediately before a newline at the end of the string (but not before any other 1234newlines). The PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is 1235set. There is no equivalent to this option in Perl, and no way to set it within 1236a pattern. 1237<pre> 1238 PCRE2_DOTALL 1239</pre> 1240If this bit is set, a dot metacharacter in the pattern matches any character, 1241including one that indicates a newline. However, it only ever matches one 1242character, even if newlines are coded as CRLF. Without this option, a dot does 1243not match when the current position in the subject is at a newline. This option 1244is equivalent to Perl's /s option, and it can be changed within a pattern by a 1245(?s) option setting. A negative class such as [^a] always matches newline 1246characters, independent of the setting of this option. 1247<pre> 1248 PCRE2_DUPNAMES 1249</pre> 1250If this bit is set, names used to identify capturing subpatterns need not be 1251unique. This can be helpful for certain types of pattern when it is known that 1252only one instance of the named subpattern can ever be matched. There are more 1253details of named subpatterns below; see also the 1254<a href="pcre2pattern.html"><b>pcre2pattern</b></a> 1255documentation. 1256<pre> 1257 PCRE2_EXTENDED 1258</pre> 1259If this bit is set, most white space characters in the pattern are totally 1260ignored except when escaped or inside a character class. 
However, white space 1261is not allowed within sequences such as (?> that introduce various 1262parenthesized subpatterns, nor within numerical quantifiers such as {1,3}. 1263Ignorable white space is permitted between an item and a following quantifier 1264and between a quantifier and a following + that indicates possessiveness. 1265</P> 1266<P> 1267PCRE2_EXTENDED also causes characters between an unescaped # outside a 1268character class and the next newline, inclusive, to be ignored, which makes it 1269possible to include comments inside complicated patterns. Note that the end of 1270this type of comment is a literal newline sequence in the pattern; escape 1271sequences that happen to represent a newline do not count. PCRE2_EXTENDED is 1272equivalent to Perl's /x option, and it can be changed within a pattern by a 1273(?x) option setting. 1274</P> 1275<P> 1276Which characters are interpreted as newlines can be specified by a setting in 1277the compile context that is passed to <b>pcre2_compile()</b> or by a special 1278sequence at the start of the pattern, as described in the section entitled 1279<a href="pcre2pattern.html#newlines">"Newline conventions"</a> 1280in the <b>pcre2pattern</b> documentation. A default is defined when PCRE2 is 1281built. 1282<pre> 1283 PCRE2_FIRSTLINE 1284</pre> 1285If this option is set, an unanchored pattern is required to match before or at 1286the first newline in the subject string, though the matched text may continue 1287over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a more 1288general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit, a 1289match must occur in the first line and also within the offset limit. In other 1290words, whichever limit comes first is used. 1291<pre> 1292 PCRE2_MATCH_UNSET_BACKREF 1293</pre> 1294If this option is set, a back reference to an unset subpattern group matches an 1295empty string (by default this causes the current matching alternative to fail). 
A pattern such as (\1)(a) succeeds when this option is set (assuming it can
find an "a" in the subject), whereas it fails by default, for Perl
compatibility. Setting this option makes PCRE2 behave more like ECMAscript (aka
JavaScript).
<pre>
  PCRE2_MULTILINE
</pre>
By default, for the purposes of matching "start of line" and "end of line",
PCRE2 treats the subject string as consisting of a single line of characters,
even if it actually contains newlines. The "start of line" metacharacter (^)
matches only at the start of the string, and the "end of line" metacharacter
($) matches only at the end of the string, or before a terminating newline
(except when PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless
PCRE2_DOTALL is set, the "any character" metacharacter (.) does not match at a
newline. This behaviour (for ^, $, and dot) is the same as Perl.
</P>
<P>
When PCRE2_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before internal newlines
in the subject string, respectively, as well as at the very start and end. This
is equivalent to Perl's /m option, and it can be changed within a pattern by a
(?m) option setting. Note that the "start of line" metacharacter does not match
after a newline at the end of the subject, for compatibility with Perl.
However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
there are no newlines in a subject string, or no occurrences of ^ or $ in a
pattern, setting PCRE2_MULTILINE has no effect.
<pre>
  PCRE2_NEVER_BACKSLASH_C
</pre>
This option locks out the use of \C in the pattern that is being compiled.
This escape can cause unpredictable behaviour in UTF-8 or UTF-16 modes, because
it may leave the current matching point in the middle of a multi-code-unit
character.
This option may be useful in applications that process patterns from
external sources. Note that there is also a build-time option that permanently
locks out the use of \C.
<pre>
  PCRE2_NEVER_UCP
</pre>
This option locks out the use of Unicode properties for handling \B, \b, \D,
\d, \S, \s, \W, \w, and some of the POSIX character classes, as described
for the PCRE2_UCP option below. In particular, it prevents the creator of the
pattern from enabling this facility by starting the pattern with (*UCP). This
option may be useful in applications that process patterns from external
sources. The option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
<pre>
  PCRE2_NEVER_UTF
</pre>
This option locks out interpretation of the pattern as UTF-8, UTF-16, or
UTF-32, depending on which library is in use. In particular, it prevents the
creator of the pattern from switching to UTF interpretation by starting the
pattern with (*UTF). This option may be useful in applications that process
patterns from external sources. The combination of PCRE2_UTF and
PCRE2_NEVER_UTF causes an error.
<pre>
  PCRE2_NO_AUTO_CAPTURE
</pre>
If this option is set, it disables the use of numbered capturing parentheses in
the pattern. Any opening parenthesis that is not followed by ? behaves as if it
were followed by ?: but named parentheses can still be used for capturing (and
they acquire numbers in the usual way). There is no equivalent of this option
in Perl. Note that, if this option is set, references to capturing groups (back
references or recursion/subroutine calls) may only refer to named groups,
though the reference can be by name or by number.
<pre>
  PCRE2_NO_AUTO_POSSESS
</pre>
If this option is set, it disables "auto-possessification", which is an
optimization that, for example, turns a+b into a++b in order to avoid
backtracks into a+ that can never be successful. However, if callouts are in
use, auto-possessification means that some callouts are never taken. You can
set this option if you want the matching functions to do a full unoptimized
search and run all the callouts, but it is mainly provided for testing
purposes.
<pre>
  PCRE2_NO_DOTSTAR_ANCHOR
</pre>
If this option is set, it disables an optimization that is applied when .* is
the first significant item in a top-level branch of a pattern, and all the
other branches also start with .* or with \A or \G or ^. The optimization is
automatically disabled for .* if it is inside an atomic group or a capturing
group that is the subject of a back reference, or if the pattern contains
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
automatically anchored if PCRE2_DOTALL is set for all the .* items and
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
must start either at the start of the subject or following a newline is
remembered. Like other optimizations, this can cause callouts to be skipped.
<pre>
  PCRE2_NO_START_OPTIMIZE
</pre>
This is an option whose main effect is at matching time. It does not change
what <b>pcre2_compile()</b> generates, but it does affect the output of the JIT
compiler.
</P>
<P>
There are a number of optimizations that may occur at the start of a match, in
order to speed up the process. For example, if it is known that an unanchored
match must start with a specific character, the matching code searches the
subject for that character, and fails immediately if it cannot find it, without
actually running the main matching function.
This means that a special item
such as (*COMMIT) at the start of a pattern is not considered until after a
suitable starting point for the match has been found. Also, when callouts or
(*MARK) items are in use, these "start-up" optimizations can cause them to be
skipped if the pattern is never actually used. The start-up optimizations are
in effect a pre-scan of the subject that takes place before the pattern is run.
</P>
<P>
The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
possibly causing performance to suffer, but ensuring that in cases where the
result is "no match", the callouts do occur, and that items such as (*COMMIT)
and (*MARK) are considered at every possible starting position in the subject
string.
</P>
<P>
Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching operation.
Consider the pattern
<pre>
  (*COMMIT)ABC
</pre>
When this is compiled, PCRE2 records the fact that a match must start with the
character "A". Suppose the subject string is "DEFABC". The start-up
optimization scans along the subject, finds "A" and runs the first match
attempt from there. The (*COMMIT) item means that the pattern must match the
current starting position, which in this case, it does. However, if the same
match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
subject string does not happen. The first match attempt is run starting from
"D" and when this fails, (*COMMIT) prevents any further matches being tried, so
the overall result is "no match". There are also other start-up optimizations.
For example, a minimum length for the subject may be recorded. Consider the
pattern
<pre>
  (*MARK:A)(X|Y)
</pre>
The minimum length for a match is one character. If the subject is "ABC", there
will be attempts to match "ABC", "BC", and "C".
An attempt to match an empty
string at the end of the subject does not take place, because PCRE2 knows that
the subject is now too short, and so the (*MARK) is never encountered. In this
case, the optimization does not affect the overall match result, which is still
"no match", but it does affect the auxiliary information that is returned.
<pre>
  PCRE2_NO_UTF_CHECK
</pre>
When PCRE2_UTF is set, the validity of the pattern as a UTF string is
automatically checked. There are discussions about the validity of
<a href="pcre2unicode.html#utf8strings">UTF-8 strings,</a>
<a href="pcre2unicode.html#utf16strings">UTF-16 strings,</a>
and
<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
document.
If an invalid UTF sequence is found, <b>pcre2_compile()</b> returns a negative
error code.
</P>
<P>
If you know that your pattern is valid, and you want to skip this check for
performance reasons, you can set the PCRE2_NO_UTF_CHECK option. When it is set,
the effect of passing an invalid UTF string as a pattern is undefined. It may
cause your program to crash or loop. Note that this option can also be passed
to <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, to suppress validity
checking of the subject string.
<pre>
  PCRE2_UCP
</pre>
This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
\w, and some of the POSIX character classes. By default, only ASCII characters
are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
classify characters. More details are given in the section on
<a href="pcre2pattern.html#genericchartypes">generic character types</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page. If you set PCRE2_UCP, matching one of the items it affects takes much
longer.
The option is available only if PCRE2 has been compiled with Unicode
support.
<pre>
  PCRE2_UNGREEDY
</pre>
This option inverts the "greediness" of the quantifiers so that they are not
greedy by default, but become greedy if followed by "?". It is not compatible
with Perl. It can also be set by a (?U) option setting within the pattern.
<pre>
  PCRE2_USE_OFFSET_LIMIT
</pre>
This option must be set for <b>pcre2_compile()</b> if
<b>pcre2_set_offset_limit()</b> is going to be used to set a non-default offset
limit in a match context for matches that use this pattern. An error is
generated if an offset limit is set without this option. For more details, see
the description of <b>pcre2_set_offset_limit()</b> in the
<a href="#matchcontext">section</a>
that describes match contexts. See also the PCRE2_FIRSTLINE
option above.
<pre>
  PCRE2_UTF
</pre>
This option causes PCRE2 to regard both the pattern and the subject strings
that are subsequently processed as strings of UTF characters instead of
single-code-unit strings. It is available when PCRE2 is built to include
Unicode support (which is the default). If Unicode support is not available,
the use of this option provokes an error. Details of how this option changes
the behaviour of PCRE2 are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
</P>
<br><a name="SEC19" href="#TOC1">COMPILATION ERROR CODES</a><br>
<P>
There are over 80 positive error codes that <b>pcre2_compile()</b> may return
(via <i>errorcode</i>) if it finds an error in the pattern. There are also some
negative error codes that are used for invalid UTF strings. These are the same
as given by <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and are described
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
The <b>pcre2_get_error_message()</b> function (see "Obtaining a textual
error message"
<a href="#geterrormessage">below)</a>
can be called to obtain a textual error message from any error code.
<a name="jitcompiling"></a></P>
<br><a name="SEC20" href="#TOC1">JUST-IN-TIME (JIT) COMPILATION</a><br>
<P>
<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
<br>
<br>
<b>int pcre2_jit_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b>  pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE <i>startsize</i>,</b>
<b>  PCRE2_SIZE <i>maxsize</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_jit_stack_assign(pcre2_match_context *<i>mcontext</i>,</b>
<b>  pcre2_jit_callback <i>callback_function</i>, void *<i>callback_data</i>);</b>
<br>
<br>
<b>void pcre2_jit_stack_free(pcre2_jit_stack *<i>jit_stack</i>);</b>
</P>
<P>
These functions provide support for JIT compilation, which, if the just-in-time
compiler is available, further processes a compiled pattern into machine code
that executes much faster than the <b>pcre2_match()</b> interpretive matching
function. Full details are given in the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation.
</P>
<P>
JIT compilation is a heavyweight optimization. It can take some time for
patterns to be analyzed, and for one-off matches and simple patterns the
benefit of faster execution might be offset by a much slower compilation time.
Most, but not all, patterns can be optimized by the JIT compiler.
<a name="localesupport"></a></P>
<br><a name="SEC21" href="#TOC1">LOCALE SUPPORT</a><br>
<P>
PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code
point. This applies only to characters whose code points are less than 256. By
default, higher-valued code points never match escapes such as \w or \d.
However, if PCRE2 is built with UTF support, all characters can be tested with
\p and \P, or, alternatively, the PCRE2_UCP option can be set when a pattern
is compiled; this causes \w and friends to use Unicode property support
instead of the built-in tables.
</P>
<P>
The use of locales with Unicode is discouraged. If you are handling characters
with code points greater than 127, you should either use Unicode support, or
use locales, but not try to mix the two.
</P>
<P>
PCRE2 contains an internal set of character tables that are used by default.
These are sufficient for many applications. Normally, the internal tables
recognize only ASCII characters. However, when PCRE2 is built, it is possible
to cause the internal tables to be rebuilt in the default "C" locale of the
local system, which may cause them to be different.
</P>
<P>
The internal tables can be overridden by tables supplied by the application
that calls PCRE2. These may be created in a different locale from the default.
As more and more applications change to using Unicode, the need for this locale
support is expected to die away.
</P>
<P>
External tables are built by calling the <b>pcre2_maketables()</b> function, in
the relevant locale. The result can be passed to <b>pcre2_compile()</b> as often
as necessary, by creating a compile context and calling
<b>pcre2_set_character_tables()</b> to set the tables pointer therein.
For
example, to build and use tables that are appropriate for the French locale
(where accented characters with values greater than 127 are treated as
letters), the following code could be used:
<pre>
  setlocale(LC_CTYPE, "fr_FR");
  tables = pcre2_maketables(NULL);
  ccontext = pcre2_compile_context_create(NULL);
  pcre2_set_character_tables(ccontext, tables);
  re = pcre2_compile(..., ccontext);
</pre>
The locale name "fr_FR" is used on Linux and other Unix-like systems; if you
are using Windows, the name for the French locale is "french". It is the
caller's responsibility to ensure that the memory containing the tables remains
available for as long as it is needed.
</P>
<P>
The pointer that is passed (via the compile context) to <b>pcre2_compile()</b>
is saved with the compiled pattern, and the same tables are used by
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>. Thus, for any single pattern,
compilation and matching both happen in the same locale, but different patterns
can be processed in different locales.
<a name="infoaboutpattern"></a></P>
<br><a name="SEC22" href="#TOC1">INFORMATION ABOUT A COMPILED PATTERN</a><br>
<P>
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
The <b>pcre2_pattern_info()</b> function returns general information about a
compiled pattern. For information about callouts, see the
<a href="#infoaboutcallouts">next section.</a>
The first argument for <b>pcre2_pattern_info()</b> is a pointer to the compiled
pattern. The second argument specifies which piece of information is required,
and the third argument is a pointer to a variable to receive the data. If the
third argument is NULL, the first argument is ignored, and the function returns
the size in bytes of the variable that is required for the information
requested.
Otherwise, the yield of the function is zero for success, or one of
the following negative numbers:
<pre>
  PCRE2_ERROR_NULL       the argument <i>code</i> was NULL
  PCRE2_ERROR_BADMAGIC   the "magic number" was not found
  PCRE2_ERROR_BADOPTION  the value of <i>what</i> was invalid
  PCRE2_ERROR_UNSET      the requested field is not set
</pre>
The "magic number" is placed at the start of each compiled pattern as a simple
check against passing an arbitrary memory pointer. Here is a typical call of
<b>pcre2_pattern_info()</b>, to obtain the length of the compiled pattern:
<pre>
  int rc;
  size_t length;
  rc = pcre2_pattern_info(
    re,               /* result of pcre2_compile() */
    PCRE2_INFO_SIZE,  /* what is required */
    &amp;length);         /* where to put the data */
</pre>
The possible values for the second argument are defined in <b>pcre2.h</b>, and
are as follows:
<pre>
  PCRE2_INFO_ALLOPTIONS
  PCRE2_INFO_ARGOPTIONS
</pre>
Return a copy of the pattern's options. The third argument should point to a
<b>uint32_t</b> variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
were passed to <b>pcre2_compile()</b>, whereas PCRE2_INFO_ALLOPTIONS returns
the compile options as modified by any top-level (*XXX) option settings such as
(*UTF) at the start of the pattern itself.
</P>
<P>
For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED and PCRE2_UTF.
Option settings such as (?i) that can change within a pattern do not affect the
result of PCRE2_INFO_ALLOPTIONS, even if they appear right at the start of the
pattern. (This was different in some earlier releases.)
</P>
<P>
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
the first significant item in every top-level branch is one of the following:
<pre>
  ^     unless PCRE2_MULTILINE is set
  \A    always
  \G    always
  .*    sometimes - see below
</pre>
When .* is the first significant item, anchoring is possible only when all the
following are true:
<pre>
  .* is not in an atomic group
  .* is not in a capturing group that is the subject of a back reference
  PCRE2_DOTALL is in force for .*
  Neither (*PRUNE) nor (*SKIP) appears in the pattern.
  PCRE2_NO_DOTSTAR_ANCHOR is not set.
</pre>
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
options returned for PCRE2_INFO_ALLOPTIONS.
<pre>
  PCRE2_INFO_BACKREFMAX
</pre>
Return the number of the highest back reference in the pattern. The third
argument should point to a <b>uint32_t</b> variable. Named subpatterns acquire
numbers as well as names, and these count towards the highest back reference.
Back references such as \4 or \g{12} match the captured characters of the
given group, but in addition, the check that a capturing group is set in a
conditional subpattern such as (?(3)a|b) is also a back reference. Zero is
returned if there are no back references.
<pre>
  PCRE2_INFO_BSR
</pre>
The output is a uint32_t whose value indicates what character sequences the \R
escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R matches
any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \R
matches only CR, LF, or CRLF.
<pre>
  PCRE2_INFO_CAPTURECOUNT
</pre>
Return the highest capturing subpattern number in the pattern. In patterns
where (?| is not used, this is also the total number of capturing subpatterns.
The third argument should point to a <b>uint32_t</b> variable.
<pre>
  PCRE2_INFO_FIRSTBITMAP
</pre>
In the absence of a single first code unit for a non-anchored pattern,
<b>pcre2_compile()</b> may construct a 256-bit table that defines a fixed set of
values for the first code unit in any match. For example, a pattern that starts
with [abc] results in a table with three bits set. When code unit values
greater than 255 are supported, the flag bit for 255 means "any code unit of
value 255 or above". If such a table was constructed, a pointer to it is
returned. Otherwise NULL is returned. The third argument should point to a
<b>const uint8_t *</b> variable.
<pre>
  PCRE2_INFO_FIRSTCODETYPE
</pre>
Return information about the first code unit of any matched string, for a
non-anchored pattern. The third argument should point to a <b>uint32_t</b>
variable. If there is a fixed first value, for example, the letter "c" from a
pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
it is known that a match can occur only at the start of the subject or
following a newline in the subject, 2 is returned. Otherwise, and for anchored
patterns, 0 is returned.
<pre>
  PCRE2_INFO_FIRSTCODEUNIT
</pre>
Return the value of the first code unit of any matched string in the situation
where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. The third
argument should point to a <b>uint32_t</b> variable. In the 8-bit library, the
value is always less than 256. In the 16-bit library the value can be up to
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
and up to 0xffffffff when not using UTF-32 mode.
<pre>
  PCRE2_INFO_HASBACKSLASHC
</pre>
Return 1 if the pattern contains any instances of \C, otherwise 0. The third
argument should point to a <b>uint32_t</b> variable.
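</P>
<P>
The 256-bit table described under PCRE2_INFO_FIRSTBITMAP above can be tested
one code unit at a time. The sketch below does not call PCRE2; it builds a
sample table for [abc] by hand. The bit layout used here (bit c modulo 8 of
byte c divided by 8) is an assumption that this page does not state, and all of
the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return 1 if code_unit may start a match according to a 32-byte bitmap.
   Assumed layout: bit (c & 7) of byte (c / 8). Values above 255 share the
   flag bit for 255, which means "any code unit of value 255 or above". */
static int may_start_match(const uint8_t *bitmap, uint32_t code_unit)
{
    assert(bitmap != NULL);
    uint32_t c = (code_unit > 255) ? 255 : code_unit;
    return (bitmap[c / 8] >> (c & 7)) & 1;
}

/* Build a sample table equivalent to a pattern that starts with [abc]. */
static void make_abc_bitmap(uint8_t bitmap[32])
{
    memset(bitmap, 0, 32);
    for (const char *p = "abc"; *p != '\0'; p++) {
        uint8_t c = (uint8_t)*p;
        bitmap[c / 8] |= (uint8_t)(1u << (c & 7));
    }
}

/* Count how many of the 256 flag bits are set. */
static unsigned count_candidates(const uint8_t *bitmap)
{
    unsigned n = 0;
    for (uint32_t c = 0; c < 256; c++)
        n += (unsigned)may_start_match(bitmap, c);
    return n;
}
```

In a real program the table pointer would come from
<b>pcre2_pattern_info()</b> with PCRE2_INFO_FIRSTBITMAP, and a NULL result
would mean that no table was constructed.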
<pre>
  PCRE2_INFO_HASCRORLF
</pre>
Return 1 if the pattern contains any explicit matches for CR or LF characters,
otherwise 0. The third argument should point to a <b>uint32_t</b> variable. An
explicit match is either a literal CR or LF character, or \r or \n.
<pre>
  PCRE2_INFO_JCHANGED
</pre>
Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
0. The third argument should point to a <b>uint32_t</b> variable. (?J) and
(?-J) set and unset the local PCRE2_DUPNAMES option, respectively.
<pre>
  PCRE2_INFO_JITSIZE
</pre>
If the compiled pattern was successfully processed by
<b>pcre2_jit_compile()</b>, return the size of the JIT compiled code, otherwise
return zero. The third argument should point to a <b>size_t</b> variable.
<pre>
  PCRE2_INFO_LASTCODETYPE
</pre>
Return 1 if there is a rightmost literal code unit that must exist in any
matched string, other than at its start. The third argument should point to a
<b>uint32_t</b> variable. If there is no such value, 0 is returned. When 1 is
returned, the code unit value itself can be retrieved using
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
recorded only if it follows something of variable length. For example, for the
pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned from
PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
<pre>
  PCRE2_INFO_LASTCODEUNIT
</pre>
Return the value of the rightmost literal code unit that must exist in any
matched string, other than at its start, if such a value has been recorded. The
third argument should point to a <b>uint32_t</b> variable. If there is no such
value, 0 is returned.
<pre>
  PCRE2_INFO_MATCHEMPTY
</pre>
Return 1 if the pattern might match an empty string, otherwise 0.
The third
argument should point to a <b>uint32_t</b> variable. When a pattern contains
recursive subroutine calls it is not always possible to determine whether or
not it can match an empty string. PCRE2 takes a cautious approach and returns 1
in such cases.
<pre>
  PCRE2_INFO_MATCHLIMIT
</pre>
If the pattern set a match limit by including an item of the form
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
should point to an unsigned 32-bit integer. If no such value has been set, the
call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET.
<pre>
  PCRE2_INFO_MAXLOOKBEHIND
</pre>
Return the number of characters (not code units) in the longest lookbehind
assertion in the pattern. The third argument should point to an unsigned 32-bit
integer. This information is useful when doing multi-segment matching using the
partial matching facilities. Note that the simple assertions \b and \B
require a one-character lookbehind. \A also registers a one-character
lookbehind, though it does not actually inspect the previous character. This is
to ensure that at least one character from the old segment is retained when a
new segment is processed. Otherwise, if there are no lookbehinds in the
pattern, \A might match incorrectly at the start of a new segment.
<pre>
  PCRE2_INFO_MINLENGTH
</pre>
If a minimum length for matching subject strings was computed, its value is
returned. Otherwise the returned value is 0. The value is a number of
characters, which in UTF mode may be different from the number of code units.
The third argument should point to a <b>uint32_t</b> variable. The value is a
lower bound to the length of any matching string. There may not be any strings
of that length that do actually match, but every string that does match is at
least that long.
<pre>
  PCRE2_INFO_NAMECOUNT
  PCRE2_INFO_NAMEENTRYSIZE
  PCRE2_INFO_NAMETABLE
</pre>
PCRE2 supports the use of named as well as numbered capturing parentheses. The
names are just an additional way of identifying the parentheses, which still
acquire numbers. Several convenience functions such as
<b>pcre2_substring_get_byname()</b> are provided for extracting captured
substrings by name. It is also possible to extract the data directly, by first
converting the name to a number in order to access the correct pointers in the
output vector (described with <b>pcre2_match()</b> below). To do the conversion,
you need to use the name-to-number map, which is described by these three
values.
</P>
<P>
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
entry in code units; both of these return a <b>uint32_t</b> value. The entry
size depends on the length of the longest name.
</P>
<P>
PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first
two bytes of each entry are the number of the capturing parenthesis, most
significant byte first. In the 16-bit library, the pointer points to 16-bit
code units, the first of which contains the parenthesis number. In the 32-bit
library, the pointer points to 32-bit code units, the first of which contains
the parenthesis number. The rest of the entry is the corresponding name, zero
terminated.
</P>
<P>
The names are in alphabetical order.
If (?| is used to create multiple groups
with the same number, as described in the
<a href="pcre2pattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page, the groups may be given the same name, but there is only one entry in the
table. Different names for groups of the same number are not permitted.
</P>
<P>
Duplicate names for subpatterns with different numbers are permitted, but only
if PCRE2_DUPNAMES is set. They appear in the table in the order in which they
were found in the pattern. In the absence of (?| this is the order of
increasing number; when (?| is used this is not necessarily the case because
later subpatterns may have lower numbers.
</P>
<P>
As a simple example of the name/number table, consider the following pattern
after compilation by the 8-bit library (assume PCRE2_EXTENDED is set, so white
space - including newlines - is ignored):
<pre>
  (?&lt;date&gt; (?&lt;year&gt;(\d\d)?\d\d) - (?&lt;month&gt;\d\d) - (?&lt;day&gt;\d\d) )
</pre>
There are four named subpatterns, so the table has four entries, and each entry
in the table is eight bytes long. The table is as follows, with non-printing
bytes shown in hexadecimal, and undefined bytes shown as ??:
<pre>
  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??
</pre>
When writing code to extract data from named subpatterns using the
name-to-number map, remember that the length of the entries is likely to be
different for each compiled pattern.
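</P>
<P>
The 8-bit table layout described above can be decoded with a few lines of C.
The following sketch does not call PCRE2; it works on a hand-built copy of the
example table, with the undefined bytes filled with zeros. The names
<b>lookup_group_number()</b> and <b>example_table</b> are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Look up a group name in an 8-bit-library name table. Each entry is
   entry_size bytes long: a two-byte group number, most significant byte
   first, followed by the zero-terminated name. Returns the number of the
   first entry with that name, or -1 if the name is not in the table. */
static int lookup_group_number(const uint8_t *table, uint32_t name_count,
                               uint32_t entry_size, const char *name)
{
    assert(table != NULL && name != NULL && entry_size > 2);
    for (uint32_t i = 0; i < name_count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The four eight-byte entries from the example; ?? bytes are zero here. */
static const uint8_t example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00,
};
```

In a real program the table pointer, entry count, and entry size would be
obtained with PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and
PCRE2_INFO_NAMEENTRYSIZE rather than being hard-coded.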
<pre>
  PCRE2_INFO_NEWLINE
</pre>
The output is a <b>uint32_t</b> with one of the following values:
<pre>
  PCRE2_NEWLINE_CR        Carriage return (CR)
  PCRE2_NEWLINE_LF        Linefeed (LF)
  PCRE2_NEWLINE_CRLF      Carriage return, linefeed (CRLF)
  PCRE2_NEWLINE_ANY       Any Unicode line ending
  PCRE2_NEWLINE_ANYCRLF   Any of CR, LF, or CRLF
</pre>
This specifies the default character sequence that will be recognized as
meaning "newline" while matching.
<pre>
  PCRE2_INFO_RECURSIONLIMIT
</pre>
If the pattern set a recursion limit by including an item of the form
(*LIMIT_RECURSION=nnnn) at the start, the value is returned. The third
argument should point to an unsigned 32-bit integer. If no such value has been
set, the call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET.
<pre>
  PCRE2_INFO_SIZE
</pre>
Return the size of the compiled pattern in bytes (for all three libraries). The
third argument should point to a <b>size_t</b> variable. This value includes the
size of the general data block that precedes the code units of the compiled
pattern itself. The value that is used when <b>pcre2_compile()</b> is getting
memory in which to place the compiled pattern may be slightly larger than the
value returned by this option, because there are cases where the code that
calculates the size has to over-estimate. Processing a pattern with the JIT
compiler does not alter the value returned by this option.
<a name="infoaboutcallouts"></a></P>
<br><a name="SEC23" href="#TOC1">INFORMATION ABOUT A PATTERN'S CALLOUTS</a><br>
<P>
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
<b>  int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
<b>  void *<i>user_data</i>);</b>
<br>
<br>
A script language that supports the use of string arguments in callouts might
like to scan all the callouts in a pattern before running the match. This can
be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
pointer to a compiled pattern, the second points to a callback function, and
the third is arbitrary user data. The callback function is called for every
callout in the pattern in the order in which they appear. Its first argument is
a pointer to a callout enumeration block, and its second argument is the
<i>user_data</i> value that was passed to <b>pcre2_callout_enumerate()</b>. The
contents of the callout enumeration block are described in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation, which also gives further details about callouts.
</P>
<br><a name="SEC24" href="#TOC1">SERIALIZATION AND PRECOMPILING</a><br>
<P>
It is possible to save compiled patterns on disc or elsewhere, and reload them
later, subject to a number of restrictions. The functions whose names begin
with <b>pcre2_serialize_</b> are used for this purpose. They are described in
the
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
documentation.
<a name="matchdatablock"></a></P>
<br><a name="SEC25" href="#TOC1">THE MATCH DATA BLOCK</a><br>
<P>
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
<b>  pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
<b>  const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_match_data_free(pcre2_match_data *<i>match_data</i>);</b>
</P>
<P>
Information about a successful or unsuccessful match is placed in a match
data block, which is an opaque structure that is accessed by function calls. In
particular, the match data block contains a vector of offsets into the subject
string that define the matched part of the subject and any substrings that were
captured. This is known as the <i>ovector</i>.
</P>
<P>
Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
<b>pcre2_jit_match()</b> you must create a match data block by calling one of
the creation functions above. For <b>pcre2_match_data_create()</b>, the first
argument is the number of pairs of offsets in the <i>ovector</i>. One pair of
offsets is required to identify the string that matched the whole pattern, with
another pair for each captured substring. For example, a value of 4 creates
enough space to record the matched portion of the subject plus three captured
substrings. A minimum of 1 pair is imposed by
<b>pcre2_match_data_create()</b>, so it is always possible to return the overall
matched string.
</P>
<P>
The second argument of <b>pcre2_match_data_create()</b> is a pointer to a
general context, which can specify custom memory management for obtaining the
memory for the match data block. If you are not using custom memory management,
pass NULL, which causes <b>malloc()</b> to be used.
1970</P> 1971<P> 1972For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a 1973pointer to a compiled pattern. The ovector is created to be exactly the right 1974size to hold all the substrings a pattern might capture. The second argument is 1975again a pointer to a general context, but in this case if NULL is passed, the 1976memory is obtained using the same allocator that was used for the compiled 1977pattern (custom or default). 1978</P> 1979<P> 1980A match data block can be used many times, with the same or different compiled 1981patterns. You can extract information from a match data block after a match 1982operation has finished, using functions that are described in the sections on 1983<a href="#matchedstrings">matched strings</a> 1984and 1985<a href="#matchotherdata">other match data</a> 1986below. 1987</P> 1988<P> 1989When a call of <b>pcre2_match()</b> fails, valid data is available in the match 1990block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ERROR_PARTIAL, or one 1991of the error codes for an invalid UTF string. Exactly what is available depends 1992on the error, and is detailed below. 1993</P> 1994<P> 1995When one of the matching functions is called, pointers to the compiled pattern 1996and the subject string are set in the match data block so that they can be 1997referenced by the extraction functions. After running a match, you must not 1998free a compiled pattern or a subject string until after all operations on the 1999match data block (for that match) have taken place. 2000</P> 2001<P> 2002When a match data block itself is no longer needed, it should be freed by 2003calling <b>pcre2_match_data_free()</b>. 
2004</P> 2005<br><a name="SEC26" href="#TOC1">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a><br> 2006<P> 2007<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b> 2008<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b> 2009<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b> 2010<b> pcre2_match_context *<i>mcontext</i>);</b> 2011</P> 2012<P> 2013The function <b>pcre2_match()</b> is called to match a subject string against a 2014compiled pattern, which is passed in the <i>code</i> argument. You can call 2015<b>pcre2_match()</b> with the same <i>code</i> argument as many times as you 2016like, in order to find multiple matches in the subject string or to match 2017different subject strings with the same pattern. 2018</P> 2019<P> 2020This function is the main matching facility of the library, and it operates in 2021a Perl-like manner. For specialist use there is also an alternative matching 2022function, which is described 2023<a href="#dfamatch">below</a> 2024in the section about the <b>pcre2_dfa_match()</b> function. 2025</P> 2026<P> 2027Here is an example of a simple call to <b>pcre2_match()</b>: 2028<pre> 2029 pcre2_match_data *md = pcre2_match_data_create(4, NULL); 2030 int rc = pcre2_match( 2031 re, /* result of pcre2_compile() */ 2032 "some string", /* the subject string */ 2033 11, /* the length of the subject string */ 2034 0, /* start at offset 0 in the subject */ 2035 0, /* default options */ 2036 md, /* the match data block */ 2037 NULL); /* a match context; NULL means use defaults */ 2038</pre> 2039If the subject string is zero-terminated, the length can be given as 2040PCRE2_ZERO_TERMINATED. A match context must be provided if certain less common 2041matching parameters are to be changed. For details, see the section on 2042<a href="#matchcontext">the match context</a> 2043above. 
2044</P> 2045<br><b> 2046The string to be matched by <b>pcre2_match()</b> 2047</b><br> 2048<P> 2049The subject string is passed to <b>pcre2_match()</b> as a pointer in 2050<i>subject</i>, a length in <i>length</i>, and a starting offset in 2051<i>startoffset</i>. The length and offset are in code units, not characters. 2052That is, they are in bytes for the 8-bit library, 16-bit code units for the 205316-bit library, and 32-bit code units for the 32-bit library, whether or not 2054UTF processing is enabled. 2055</P> 2056<P> 2057If <i>startoffset</i> is greater than the length of the subject, 2058<b>pcre2_match()</b> returns PCRE2_ERROR_BADOFFSET. When the starting offset is 2059zero, the search for a match starts at the beginning of the subject, and this 2060is by far the most common case. In UTF-8 or UTF-16 mode, the starting offset 2061must point to the start of a character, or to the end of the subject (in UTF-32 2062mode, one code unit equals one character, so all offsets are valid). Like the 2063pattern string, the subject may contain binary zeroes. 2064</P> 2065<P> 2066A non-zero starting offset is useful when searching for another match in the 2067same subject by calling <b>pcre2_match()</b> again after a previous success. 2068Setting <i>startoffset</i> differs from passing over a shortened string and 2069setting PCRE2_NOTBOL in the case of a pattern that begins with any kind of 2070lookbehind. For example, consider the pattern 2071<pre> 2072 \Biss\B 2073</pre> 2074which finds occurrences of "iss" in the middle of words. (\B matches only if 2075the current position in the subject is not a word boundary.) When applied to 2076the string "Mississippi" the first call to <b>pcre2_match()</b> finds the first 2077occurrence. If <b>pcre2_match()</b> is called again with just the remainder of 2078the subject, namely "issippi", it does not match, because \B is always false at 2079the start of the subject, which is deemed to be a word boundary. 
However, if 2080<b>pcre2_match()</b> is passed the entire string again, but with 2081<i>startoffset</i> set to 4, it finds the second occurrence of "iss" because it 2082is able to look behind the starting point to discover that it is preceded by a 2083letter. 2084</P> 2085<P> 2086Finding all the matches in a subject is tricky when the pattern can match an 2087empty string. It is possible to emulate Perl's /g behaviour by first trying the 2088match again at the same offset, with the PCRE2_NOTEMPTY_ATSTART and 2089PCRE2_ANCHORED options, and then if that fails, advancing the starting offset 2090and trying an ordinary match again. There is some code that demonstrates how to 2091do this in the 2092<a href="pcre2demo.html"><b>pcre2demo</b></a> 2093sample program. In the most general case, you have to check to see if the 2094newline convention recognizes CRLF as a newline, and if so, and the current 2095character is CR followed by LF, advance the starting offset by two characters 2096instead of one. 2097</P> 2098<P> 2099If a non-zero starting offset is passed when the pattern is anchored, one 2100attempt to match at the given offset is made. This can only succeed if the 2101pattern does not require the match to be at the start of the subject. 2102<a name="matchoptions"></a></P> 2103<br><b> 2104Option bits for <b>pcre2_match()</b> 2105</b><br> 2106<P> 2107The unused bits of the <i>options</i> argument for <b>pcre2_match()</b> must be 2108zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL, 2109PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, 2110PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is 2111described below. 2112</P> 2113<P> 2114Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT) 2115compiler. If it is set, JIT matching is disabled and the normal interpretive 2116code in <b>pcre2_match()</b> is run. 
Apart from PCRE2_NO_JIT (obviously), the 2117remaining options are supported for JIT matching. 2118<pre> 2119 PCRE2_ANCHORED 2120</pre> 2121The PCRE2_ANCHORED option limits <b>pcre2_match()</b> to matching at the first 2122matching position. If a pattern was compiled with PCRE2_ANCHORED, or turned out 2123to be anchored by virtue of its contents, it cannot be made unanchored at 2124matching time. Note that setting the option at match time disables JIT 2125matching. 2126<pre> 2127 PCRE2_NOTBOL 2128</pre> 2129This option specifies that the first character of the subject string is not the 2130beginning of a line, so the circumflex metacharacter should not match before 2131it. Setting this without having set PCRE2_MULTILINE at compile time causes 2132circumflex never to match. This option affects only the behaviour of the 2133circumflex metacharacter. It does not affect \A. 2134<pre> 2135 PCRE2_NOTEOL 2136</pre> 2137This option specifies that the end of the subject string is not the end of a 2138line, so the dollar metacharacter should not match it nor (except in multiline 2139mode) a newline immediately before it. Setting this without having set 2140PCRE2_MULTILINE at compile time causes dollar never to match. This option 2141affects only the behaviour of the dollar metacharacter. It does not affect \Z 2142or \z. 2143<pre> 2144 PCRE2_NOTEMPTY 2145</pre> 2146An empty string is not considered to be a valid match if this option is set. If 2147there are alternatives in the pattern, they are tried. If all the alternatives 2148match the empty string, the entire match fails. For example, if the pattern 2149<pre> 2150 a?b? 2151</pre> 2152is applied to a string not beginning with "a" or "b", it matches an empty 2153string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not 2154valid, so <b>pcre2_match()</b> searches further into the string for occurrences 2155of "a" or "b". 
2156<pre> 2157 PCRE2_NOTEMPTY_ATSTART 2158</pre> 2159This is like PCRE2_NOTEMPTY, except that it locks out an empty string match 2160only at the first matching position, that is, at the start of the subject plus 2161the starting offset. An empty string match later in the subject is permitted. 2162If the pattern is anchored, such a match can occur only if the pattern contains 2163\K. 2164<pre> 2165 PCRE2_NO_JIT 2166</pre> 2167By default, if a pattern has been successfully processed by 2168<b>pcre2_jit_compile()</b>, JIT is automatically used when <b>pcre2_match()</b> 2169is called with options that JIT supports. Setting PCRE2_NO_JIT disables the use 2170of JIT; it forces matching to be done by the interpreter. 2171<pre> 2172 PCRE2_NO_UTF_CHECK 2173</pre> 2174When PCRE2_UTF is set at compile time, the validity of the subject as a UTF 2175string is checked by default when <b>pcre2_match()</b> is subsequently called. 2176If a non-zero starting offset is given, the check is applied only to that part 2177of the subject that could be inspected during matching, and there is a check 2178that the starting offset points to the first code unit of a character or to the 2179end of the subject. If there are no lookbehind assertions in the pattern, the 2180check starts at the starting offset. Otherwise, it starts at the length of the 2181longest lookbehind before the starting offset, or at the start of the subject 2182if there are not that many characters before the starting offset. Note that the 2183sequences \b and \B are one-character lookbehinds. 2184</P> 2185<P> 2186The check is carried out before any other processing takes place, and a 2187negative error code is returned if the check fails. There are several UTF error 2188codes for each code unit width, corresponding to different problems with the 2189code unit sequence. 
There are discussions about the validity of 2190<a href="pcre2unicode.html#utf8strings">UTF-8 strings,</a> 2191<a href="pcre2unicode.html#utf16strings">UTF-16 strings,</a> 2192and 2193<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a> 2194in the 2195<a href="pcre2unicode.html"><b>pcre2unicode</b></a> 2196page. 2197</P> 2198<P> 2199If you know that your subject is valid, and you want to skip these checks for 2200performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling 2201<b>pcre2_match()</b>. You might want to do this for the second and subsequent 2202calls to <b>pcre2_match()</b> if you are making repeated calls to find all the 2203matches in a single subject string. 2204</P> 2205<P> 2206NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid string 2207as a subject, or an invalid value of <i>startoffset</i>, is undefined. Your 2208program may crash or loop indefinitely. 2209<pre> 2210 PCRE2_PARTIAL_HARD 2211 PCRE2_PARTIAL_SOFT 2212</pre> 2213These options turn on the partial matching feature. A partial match occurs if 2214the end of the subject string is reached successfully, but there are not enough 2215subject characters to complete the match. If this happens when 2216PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by 2217testing any remaining alternatives. Only if no complete match can be found is 2218PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words, 2219PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial 2220match, but only if no complete match can be found. 2221</P> 2222<P> 2223If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if 2224a partial match is found, <b>pcre2_match()</b> immediately returns 2225PCRE2_ERROR_PARTIAL, without considering any other alternatives. In other 2226words, when PCRE2_PARTIAL_HARD is set, a partial match is considered to be more 2227important than an alternative complete match. 
2228</P> 2229<P> 2230There is a more detailed discussion of partial and multi-segment matching, with 2231examples, in the 2232<a href="pcre2partial.html"><b>pcre2partial</b></a> 2233documentation. 2234</P> 2235<br><a name="SEC27" href="#TOC1">NEWLINE HANDLING WHEN MATCHING</a><br> 2236<P> 2237When PCRE2 is built, a default newline convention is set; this is usually the 2238standard convention for the operating system. The default can be overridden in 2239a 2240<a href="#compilecontext">compile context</a> 2241by calling <b>pcre2_set_newline()</b>. It can also be overridden by starting a 2242pattern string with, for example, (*CRLF), as described in the 2243<a href="pcre2pattern.html#newlines">section on newline conventions</a> 2244in the 2245<a href="pcre2pattern.html"><b>pcre2pattern</b></a> 2246page. During matching, the newline choice affects the behaviour of the dot, 2247circumflex, and dollar metacharacters. It may also alter the way the match 2248starting position is advanced after a match failure for an unanchored pattern. 2249</P> 2250<P> 2251When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as 2252the newline convention, and a match attempt for an unanchored pattern fails 2253when the current starting position is at a CRLF sequence, and the pattern 2254contains no explicit matches for CR or LF characters, the match position is 2255advanced by two characters instead of one, in other words, to after the CRLF. 2256</P> 2257<P> 2258The above rule is a compromise that makes the most common cases work as 2259expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is 2260not set), it does not match the string "\r\nA" because, after failing at the 2261start, it skips both the CR and the LF before retrying. However, the pattern 2262[\r\n]A does match that string, because it contains an explicit CR or LF 2263reference, and so advances only by one character after the first failure. 
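</P>
<P>
The advance rule above can be sketched in plain C without calling the library. The function <b>next_start_after_failure()</b> below is a hypothetical helper, not part of PCRE2. It assumes the 8-bit library without UTF, so that one character is one code unit, and it takes the two conditions from the text as flags supplied by the caller rather than deriving them from a compiled pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. Returns the offset at which the
   next match attempt would start after an unanchored attempt failed at 'pos'.
   crlf_is_newline:   non-zero when the convention is CRLF, ANYCRLF, or ANY.
   explicit_cr_or_lf: non-zero when the pattern contains a literal CR or LF,
                      or a \r or \n escape. */
size_t next_start_after_failure(const char *subject, size_t length,
                                size_t pos, int crlf_is_newline,
                                int explicit_cr_or_lf)
{
    assert(pos < length);  /* this sketch only covers a retry inside the subject */

    if (crlf_is_newline && !explicit_cr_or_lf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;    /* skip over the whole CRLF sequence */

    return pos + 1;        /* ordinary advance by one character */
}
```

For the subject "\r\nA" with a convention that recognizes CRLF, a failure at offset 0 gives 2 when the pattern has no explicit CR or LF (the .+A case) and 1 when it does (the [\r\n]A case).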
2264</P> 2265<P> 2266An explicit match for CR or LF is either a literal appearance of one of those 2267characters in the pattern, or one of the \r or \n escape sequences. Implicit 2268matches such as [^X] do not count, nor does \s, even though it includes CR and 2269LF in the characters that it matches. 2270</P> 2271<P> 2272Notwithstanding the above, anomalous effects may still occur when CRLF is a 2273valid newline sequence and explicit \r or \n escapes appear in the pattern. 2274<a name="matchedstrings"></a></P> 2275<br><a name="SEC28" href="#TOC1">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a><br> 2276<P> 2277<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b> 2278<br> 2279<br> 2280<b>PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *<i>match_data</i>);</b> 2281</P> 2282<P> 2283In general, a pattern matches a certain portion of the subject, and in 2284addition, further substrings from the subject may be picked out by 2285parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's 2286book, this is called "capturing" in what follows, and the phrase "capturing 2287subpattern" or "capturing group" is used for a fragment of a pattern that picks 2288out a substring. PCRE2 supports several other kinds of parenthesized subpattern 2289that do not cause substrings to be captured. The <b>pcre2_pattern_info()</b> 2290function can be used to find out how many capturing subpatterns there are in a 2291compiled pattern. 2292</P> 2293<P> 2294You can use auxiliary functions for accessing captured substrings 2295<a href="#extractbynumber">by number</a> 2296or 2297<a href="#extractbyname">by name,</a> 2298as described in sections below. 2299</P> 2300<P> 2301Alternatively, you can make direct use of the vector of PCRE2_SIZE values, 2302called the <b>ovector</b>, which contains the offsets of captured strings. 
It is 2303part of the 2304<a href="#matchdatablock">match data block.</a> 2305The function <b>pcre2_get_ovector_pointer()</b> returns the address of the 2306ovector, and <b>pcre2_get_ovector_count()</b> returns the number of pairs of 2307values it contains. 2308</P> 2309<P> 2310Within the ovector, the first in each pair of values is set to the offset of 2311the first code unit of a substring, and the second is set to the offset of the 2312first code unit after the end of a substring. These values are always code unit 2313offsets, not character offsets. That is, they are byte offsets in the 8-bit 2314library, 16-bit offsets in the 16-bit library, and 32-bit offsets in the 32-bit 2315library. 2316</P> 2317<P> 2318After a partial match (error return PCRE2_ERROR_PARTIAL), only the first pair 2319of offsets (that is, <i>ovector[0]</i> and <i>ovector[1]</i>) are set. They 2320identify the part of the subject that was partially matched. See the 2321<a href="pcre2partial.html"><b>pcre2partial</b></a> 2322documentation for details of partial matching. 2323</P> 2324<P> 2325After a successful match, the first pair of offsets identifies the portion of 2326the subject string that was matched by the entire pattern. The next pair is 2327used for the first capturing subpattern, and so on. The value returned by 2328<b>pcre2_match()</b> is one more than the highest numbered pair that has been 2329set. For example, if two substrings have been captured, the returned value is 23303. If there are no capturing subpatterns, the return value from a successful 2331match is 1, indicating that just the first pair of offsets has been set. 2332</P> 2333<P> 2334If a pattern uses the \K escape sequence within a positive assertion, the 2335reported start of a successful match can be greater than the end of the match. 2336For example, if the pattern (?=ab\K) is matched against "ab", the start and 2337end offset values for the match are 2 and 0. 
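</P>
<P>
As an illustration of how the pairs are read, here is a minimal sketch that does not call PCRE2. It uses <b>size_t</b> as a stand-in for PCRE2_SIZE, and the array passed in plays the role of the pointer returned by <b>pcre2_get_ovector_pointer()</b>. Both <b>pair_length()</b> and <b>highest_pair()</b> are hypothetical helpers. Treating a pair whose start is greater than its end (the \K case above) as an empty string is an assumption about what the caller wants.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PCRE2_SIZE so that the sketch compiles without pcre2.h. */
typedef size_t offset_t;

/* Hypothetical helper: length in code units of pair number n, where the
   ovector is laid out as start0, end0, start1, end1, ...
   A reversed pair (start > end) is reported as length 0. */
size_t pair_length(const offset_t *ovector, unsigned n)
{
    offset_t start = ovector[2 * n];
    offset_t end = ovector[2 * n + 1];
    return (start > end) ? 0 : (size_t)(end - start);
}

/* Hypothetical helper: number of the highest pair that was set, derived from
   a positive pcre2_match() return value, which is one more than that number. */
unsigned highest_pair(int rc)
{
    assert(rc > 0);   /* zero and negative returns are not handled here */
    return (unsigned)(rc - 1);
}
```

With a return value of 3, <b>highest_pair()</b> gives 2, so pairs 0, 1, and 2 may be read. For the pair {2, 0} produced by (?=ab\K) against "ab", <b>pair_length()</b> gives 0.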
2338</P> 2339<P> 2340If a capturing subpattern group is matched repeatedly within a single match 2341operation, it is the last portion of the subject that it matched that is 2342returned. 2343</P> 2344<P> 2345If the ovector is too small to hold all the captured substring offsets, as much 2346as possible is filled in, and the function returns a value of zero. If captured 2347substrings are not of interest, <b>pcre2_match()</b> may be called with a match 2348data block whose ovector is of minimum length (that is, one pair). However, if 2349the pattern contains back references and the <i>ovector</i> is not big enough to 2350remember the related substrings, PCRE2 has to get additional memory for use 2351during matching. Thus it is usually advisable to set up a match data block 2352containing an ovector of reasonable size. 2353</P> 2354<P> 2355It is possible for capturing subpattern number <i>n+1</i> to match some part of 2356the subject when subpattern <i>n</i> has not been used at all. For example, if 2357the string "abc" is matched against the pattern (a|(z))(bc) the return from the 2358function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this 2359happens, both values in the offset pairs corresponding to unused subpatterns 2360are set to PCRE2_UNSET. 2361</P> 2362<P> 2363Offset values that correspond to unused subpatterns at the end of the 2364expression are also set to PCRE2_UNSET. For example, if the string "abc" is 2365matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. 2366The return from the function is 2, because the highest used capturing 2367subpattern number is 1. The offsets for the second and third capturing 2368subpatterns (assuming the vector is large enough, of course) are set to 2369PCRE2_UNSET. 2370</P> 2371<P> 2372Elements in the ovector that do not correspond to capturing parentheses in the 2373pattern are never changed. 
That is, if a pattern contains <i>n</i> capturing 2374parentheses, no more than <i>ovector[0]</i> to <i>ovector[2n+1]</i> are set by 2375<b>pcre2_match()</b>. The other elements retain whatever values they previously 2376had. 2377<a name="matchotherdata"></a></P> 2378<br><a name="SEC29" href="#TOC1">OTHER INFORMATION ABOUT A MATCH</a><br> 2379<P> 2380<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b> 2381<br> 2382<br> 2383<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b> 2384</P> 2385<P> 2386As well as the offsets in the ovector, other information about a match is 2387retained in the match data block and can be retrieved by the above functions in 2388appropriate circumstances. If they are called at other times, the result is 2389undefined. 2390</P> 2391<P> 2392After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure 2393to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and 2394<b>pcre2_get_mark()</b> can be called. It returns a pointer to the 2395zero-terminated name, which is within the compiled pattern. Otherwise NULL is 2396returned. The length of the (*MARK) name (excluding the terminating zero) is 2397stored in the code unit that precedes the name. You should use this instead of 2398relying on the terminating zero if the (*MARK) name might contain a binary 2399zero. 2400</P> 2401<P> 2402After a successful match, the (*MARK) name that is returned is the 2403last one encountered on the matching path through the pattern. After a "no 2404match" or a partial match, the last encountered (*MARK) name is returned. For 2405example, consider this pattern: 2406<pre> 2407 ^(*MARK:A)((*MARK:B)a|b)c 2408</pre> 2409When it matches "bc", the returned mark is A. The B mark is "seen" in the first 2410branch of the group, but it is not on the matching path. On the other hand, 2411when this pattern fails to match "bx", the returned mark is B. 
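</P>
<P>
The earlier remark about the length of a (*MARK) name can be illustrated without calling the library. In this sketch, <b>mark_name_length()</b> is a hypothetical helper, and its argument stands for the pointer returned by <b>pcre2_get_mark()</b> in the 8-bit library, where a code unit is one byte. It relies only on the layout described above, in which the length sits in the code unit just before the name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. 'mark' stands for the value
   returned by pcre2_get_mark() in the 8-bit library, or NULL when no mark
   is available. The length is read from the code unit before the name, so
   it stays correct even if the name contains a binary zero. */
size_t mark_name_length(const unsigned char *mark)
{
    if (mark == NULL)
        return 0;
    return (size_t)mark[-1];
}
```

For a name consisting of the three code units 'a', zero, 'b', a call to <b>strlen()</b> would stop at the embedded zero and report 1, whereas the preceding code unit gives the full length of 3.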
2412</P> 2413<P> 2414After a successful match, a partial match, or one of the invalid UTF errors 2415(for example, PCRE2_ERROR_UTF8_ERR5), <b>pcre2_get_startchar()</b> can be 2416called. After a successful or partial match it returns the code unit offset of 2417the character at which the match started. For a non-partial match, this can be 2418different to the value of <i>ovector[0]</i> if the pattern contains the \K 2419escape sequence. After a partial match, however, this value is always the same 2420as <i>ovector[0]</i> because \K does not affect the result of a partial match. 2421</P> 2422<P> 2423After a UTF check failure, <b>pcre2_get_startchar()</b> can be used to obtain 2424the code unit offset of the invalid UTF character. Details are given in the 2425<a href="pcre2unicode.html"><b>pcre2unicode</b></a> 2426page. 2427<a name="errorlist"></a></P> 2428<br><a name="SEC30" href="#TOC1">ERROR RETURNS FROM <b>pcre2_match()</b></a><br> 2429<P> 2430If <b>pcre2_match()</b> fails, it returns a negative number. This can be 2431converted to a text string by calling the <b>pcre2_get_error_message()</b> 2432function (see "Obtaining a textual error message" 2433<a href="#geterrormessage">below).</a> 2434Negative error codes are also returned by other functions, and are documented 2435with them. The codes are given names in the header file. If UTF checking is in 2436force and an invalid UTF subject string is detected, one of a number of 2437UTF-specific negative error codes is returned. Details are given in the 2438<a href="pcre2unicode.html"><b>pcre2unicode</b></a> 2439page. The following are the other errors that may be returned by 2440<b>pcre2_match()</b>: 2441<pre> 2442 PCRE2_ERROR_NOMATCH 2443</pre> 2444The subject string did not match the pattern. 2445<pre> 2446 PCRE2_ERROR_PARTIAL 2447</pre> 2448The subject string did not match, but it did match partially. 
See the 2449<a href="pcre2partial.html"><b>pcre2partial</b></a> 2450documentation for details of partial matching. 2451<pre> 2452 PCRE2_ERROR_BADMAGIC 2453</pre> 2454PCRE2 stores a 4-byte "magic number" at the start of the compiled code, to 2455catch the case when it is passed a junk pointer. This is the error that is 2456returned when the magic number is not present. 2457<pre> 2458 PCRE2_ERROR_BADMODE 2459</pre> 2460This error is given when a pattern that was compiled by the 8-bit library is 2461passed to a 16-bit or 32-bit library function, or vice versa. 2462<pre> 2463 PCRE2_ERROR_BADOFFSET 2464</pre> 2465The value of <i>startoffset</i> was greater than the length of the subject. 2466<pre> 2467 PCRE2_ERROR_BADOPTION 2468</pre> 2469An unrecognized bit was set in the <i>options</i> argument. 2470<pre> 2471 PCRE2_ERROR_BADUTFOFFSET 2472</pre> 2473The UTF code unit sequence that was passed as a subject was checked and found 2474to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the value of 2475<i>startoffset</i> did not point to the beginning of a UTF character or the end 2476of the subject. 2477<pre> 2478 PCRE2_ERROR_CALLOUT 2479</pre> 2480This error is never generated by <b>pcre2_match()</b> itself. It is provided for 2481use by callout functions that want to cause <b>pcre2_match()</b> or 2482<b>pcre2_callout_enumerate()</b> to return a distinctive error code. See the 2483<a href="pcre2callout.html"><b>pcre2callout</b></a> 2484documentation for details. 2485<pre> 2486 PCRE2_ERROR_INTERNAL 2487</pre> 2488An unexpected internal error has occurred. This error could be caused by a bug 2489in PCRE2 or by overwriting of the compiled pattern. 2490<pre> 2491 PCRE2_ERROR_JIT_BADOPTION 2492</pre> 2493This error is returned when a pattern that was successfully studied using JIT 2494is being matched, but the matching mode (partial or complete match) does not 2495correspond to any JIT compilation mode. 
When the JIT fast path function is 2496used, this error may also be given for invalid options. See the 2497<a href="pcre2jit.html"><b>pcre2jit</b></a> 2498documentation for more details. 2499<pre> 2500 PCRE2_ERROR_JIT_STACKLIMIT 2501</pre> 2502This error is returned when a pattern that was successfully studied using JIT 2503is being matched, but the memory available for the just-in-time processing 2504stack is not large enough. See the 2505<a href="pcre2jit.html"><b>pcre2jit</b></a> 2506documentation for more details. 2507<pre> 2508 PCRE2_ERROR_MATCHLIMIT 2509</pre> 2510The backtracking limit was reached. 2511<pre> 2512 PCRE2_ERROR_NOMEMORY 2513</pre> 2514If a pattern contains back references, but the ovector is not big enough to 2515remember the referenced substrings, PCRE2 gets a block of memory at the start 2516of matching to use for this purpose. There are some other special cases where 2517extra memory is needed during matching. This error is given when memory cannot 2518be obtained. 2519<pre> 2520 PCRE2_ERROR_NULL 2521</pre> 2522Either the <i>code</i>, <i>subject</i>, or <i>match_data</i> argument was passed 2523as NULL. 2524<pre> 2525 PCRE2_ERROR_RECURSELOOP 2526</pre> 2527This error is returned when <b>pcre2_match()</b> detects a recursion loop within 2528the pattern. Specifically, it means that either the whole pattern or a 2529subpattern has been called recursively for the second time at the same position 2530in the subject string. Some simple patterns that might do this are detected and 2531faulted at compile time, but more complicated cases, in particular mutual 2532recursions between two different subpatterns, cannot be detected until matching 2533is attempted. 2534<pre> 2535 PCRE2_ERROR_RECURSIONLIMIT 2536</pre> 2537The internal recursion limit was reached. 
2538<a name="geterrormessage"></a></P> 2539<br><a name="SEC31" href="#TOC1">OBTAINING A TEXTUAL ERROR MESSAGE</a><br> 2540<P> 2541<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b> 2542<b> PCRE2_SIZE <i>bufflen</i>);</b> 2543</P> 2544<P> 2545A text message for an error code from any PCRE2 function (compile, match, or 2546auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code 2547is passed as the first argument, with the remaining two arguments specifying a 2548code unit buffer and its length, into which the text message is placed. Note 2549that the message is returned in code units of the appropriate width for the 2550library that is being used. 2551</P> 2552<P> 2553The returned message is terminated with a trailing zero, and the function 2554returns the number of code units used, excluding the trailing zero. If the 2555error number is unknown, the negative error code PCRE2_ERROR_BADDATA is 2556returned. If the buffer is too small, the message is truncated (but still with 2557a trailing zero), and the negative error code PCRE2_ERROR_NOMEMORY is returned. 2558None of the messages are very long; a buffer size of 120 code units is ample. 
2559<a name="extractbynumber"></a></P> 2560<br><a name="SEC32" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br> 2561<P> 2562<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b> 2563<b> uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b> 2564<br> 2565<br> 2566<b>int pcre2_substring_copy_bynumber(pcre2_match_data *<i>match_data</i>,</b> 2567<b> uint32_t <i>number</i>, PCRE2_UCHAR *<i>buffer</i>,</b> 2568<b> PCRE2_SIZE *<i>bufflen</i>);</b> 2569<br> 2570<br> 2571<b>int pcre2_substring_get_bynumber(pcre2_match_data *<i>match_data</i>,</b> 2572<b> uint32_t <i>number</i>, PCRE2_UCHAR **<i>bufferptr</i>,</b> 2573<b> PCRE2_SIZE *<i>bufflen</i>);</b> 2574<br> 2575<br> 2576<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b> 2577</P> 2578<P> 2579Captured substrings can be accessed directly by using the ovector as described 2580<a href="#matchedstrings">above.</a> 2581For convenience, auxiliary functions are provided for extracting captured 2582substrings as new, separate, zero-terminated strings. A substring that contains 2583a binary zero is correctly extracted and has a further zero added on the end, 2584but the result is not, of course, a C string. 2585</P> 2586<P> 2587The functions in this section identify substrings by number. The number zero 2588refers to the entire matched substring, with higher numbers referring to 2589substrings captured by parenthesized groups. After a partial match, only 2590substring zero is available. An attempt to extract any other substring gives 2591the error PCRE2_ERROR_PARTIAL. The next section describes similar functions for 2592extracting captured substrings by name. 2593</P> 2594<P> 2595If a pattern uses the \K escape sequence within a positive assertion, the 2596reported start of a successful match can be greater than the end of the match. 2597For example, if the pattern (?=ab\K) is matched against "ab", the start and 2598end offset values for the match are 2 and 0. 
In this situation, calling these 2599functions with a zero substring number extracts a zero-length empty string. 2600</P> 2601<P> 2602You can find the length in code units of a captured substring without 2603extracting it by calling <b>pcre2_substring_length_bynumber()</b>. The first 2604argument is a pointer to the match data block, the second is the group number, 2605and the third is a pointer to a variable into which the length is placed. If 2606you just want to know whether or not the substring has been captured, you can 2607pass the third argument as NULL. 2608</P> 2609<P> 2610The <b>pcre2_substring_copy_bynumber()</b> function copies a captured substring 2611into a supplied buffer, whereas <b>pcre2_substring_get_bynumber()</b> copies it 2612into new memory, obtained using the same memory allocation function that was 2613used for the match data block. The first two arguments of these functions are a 2614pointer to the match data block and a capturing group number. 2615</P> 2616<P> 2617The final arguments of <b>pcre2_substring_copy_bynumber()</b> are a pointer to 2618the buffer and a pointer to a variable that contains its length in code units. 2619This is updated to contain the actual number of code units used for the 2620extracted substring, excluding the terminating zero. 2621</P> 2622<P> 2623For <b>pcre2_substring_get_bynumber()</b> the third and fourth arguments point 2624to variables that are updated with a pointer to the new memory and the number 2625of code units that comprise the substring, again excluding the terminating 2626zero. When the substring is no longer needed, the memory should be freed by 2627calling <b>pcre2_substring_free()</b>. 2628</P> 2629<P> 2630The return value from all these functions is zero for success, or a negative 2631error code. If the pattern match failed, the match failure code is returned. 2632If a substring number greater than zero is used after a partial match, 2633PCRE2_ERROR_PARTIAL is returned. 
Other possible error codes are:
<pre>
  PCRE2_ERROR_NOMEMORY
</pre>
The buffer was too small for <b>pcre2_substring_copy_bynumber()</b>, or the
attempt to get memory failed for <b>pcre2_substring_get_bynumber()</b>.
<pre>
  PCRE2_ERROR_NOSUBSTRING
</pre>
There is no substring with that number in the pattern, that is, the number is
greater than the number of capturing parentheses.
<pre>
  PCRE2_ERROR_UNAVAILABLE
</pre>
The substring number, though not greater than the number of captures in the
pattern, is greater than the number of slots in the ovector, so the substring
could not be captured.
<pre>
  PCRE2_ERROR_UNSET
</pre>
The substring did not participate in the match. For example, if the pattern is
(abc)|(def) and the subject is "def", and the ovector contains at least two
capturing slots, substring number 1 is unset.
</P>
<br><a name="SEC33" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br>
<P>
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
<br>
<br>
<b>void pcre2_substring_list_free(PCRE2_SPTR *<i>list</i>);</b>
</P>
<P>
The <b>pcre2_substring_list_get()</b> function extracts all available substrings
and builds a list of pointers to them. It also (optionally) builds a second
list that contains their lengths (in code units), excluding a terminating zero
that is added to each of them. All this is done in a single block of memory
that is obtained using the same memory allocation function that was used to get
the match data block.
</P>
<P>
This function must be called only after a successful match. If called after a
partial match, the error code PCRE2_ERROR_PARTIAL is returned.
</P>
<P>
The address of the memory block is returned via <i>listptr</i>, which is also
the start of the list of string pointers. The end of the list is marked by a
NULL pointer. The address of the list of lengths is returned via
<i>lengthsptr</i>. If your strings do not contain binary zeros and you do not
therefore need the lengths, you may supply NULL as the <i>lengthsptr</i>
argument to disable the creation of a list of lengths. The yield of the
function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the memory block
could not be obtained. When the list is no longer needed, it should be freed by
calling <b>pcre2_substring_list_free()</b>.
</P>
<P>
If this function encounters a substring that is unset, which can happen when
capturing subpattern number <i>n+1</i> matches some part of the subject, but
subpattern <i>n</i> has not been used at all, it returns an empty string. This
can be distinguished from a genuine zero-length substring by inspecting the
appropriate offset in the ovector, which contains PCRE2_UNSET for unset
substrings, or by calling <b>pcre2_substring_length_bynumber()</b>.
<a name="extractbyname"></a></P>
<br><a name="SEC34" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
<P>
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
<b>  PCRE2_SPTR <i>name</i>);</b>
<br>
<br>
<b>int pcre2_substring_length_byname(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_SPTR <i>name</i>, PCRE2_SIZE *<i>length</i>);</b>
<br>
<br>
<b>int pcre2_substring_copy_byname(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_SPTR <i>name</i>, PCRE2_UCHAR *<i>buffer</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_get_byname(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_SPTR <i>name</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
</P>
<P>
To extract a substring by name, you first have to find the associated number.
For example, for this pattern:
<pre>
  (a+)b(?&lt;xxx&gt;\d+)...
</pre>
the number of the subpattern called "xxx" is 2. If the name is known to be
unique (PCRE2_DUPNAMES was not set), you can find the number from the name by
calling <b>pcre2_substring_number_from_name()</b>. The first argument is the
compiled pattern, and the second is the name. The yield of the function is the
subpattern number, PCRE2_ERROR_NOSUBSTRING if there is no subpattern of that
name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one subpattern of
that name. Given the number, you can extract the substring directly, or use one
of the functions described above.
</P>
<P>
For convenience, there are also "byname" functions that correspond to the
"bynumber" functions, the only difference being that the second argument is a
name instead of a number.
If PCRE2_DUPNAMES is set and there are duplicate
names, these functions scan all the groups with the given name, and return the
first named string that is set.
</P>
<P>
If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
returned. If all groups with the name have numbers that are greater than the
number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is returned. If there
is at least one group with a slot in the ovector, but no group is found to be
set, PCRE2_ERROR_UNSET is returned.
</P>
<P>
<b>Warning:</b> If the pattern uses the (?| feature to set up multiple
subpatterns with the same number, as described in the
<a href="pcre2pattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page, you cannot use names to distinguish the different subpatterns, because
names are not included in the compiled code. The matching process uses only
numbers. For this reason, the use of different names for subpatterns of the
same number causes an error at compile time.
</P>
<br><a name="SEC35" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br>
<P>
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b>  pcre2_match_context *<i>mcontext</i>, PCRE2_SPTR <i>replacement</i>,</b>
<b>  PCRE2_SIZE <i>rlength</i>, PCRE2_UCHAR *<i>outputbuffer</i>,</b>
<b>  PCRE2_SIZE *<i>outlengthptr</i>);</b>
</P>
<P>
This function calls <b>pcre2_match()</b> and then makes a copy of the subject
string in <i>outputbuffer</i>, replacing the part that was matched with the
<i>replacement</i> string, whose length is supplied in <i>rlength</i>.
This can
be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
which a \K item in a lookahead in the pattern causes the match to end before
it starts are not supported, and give rise to an error return.
</P>
<P>
The first seven arguments of <b>pcre2_substitute()</b> are the same as for
<b>pcre2_match()</b>, except that the partial matching options are not
permitted, and <i>match_data</i> may be passed as NULL, in which case a match
data block is obtained and freed within this function, using memory management
functions from the match context, if provided, or else those that were used to
allocate memory for the compiled code.
</P>
<P>
The <i>outlengthptr</i> argument must point to a variable that contains the
length, in code units, of the output buffer. If the function is successful, the
value is updated to contain the length of the new string, excluding the
trailing zero that is automatically added.
</P>
<P>
If the function is not successful, the value set via <i>outlengthptr</i> depends
on the type of error. For syntax errors in the replacement string, the value is
the offset in the replacement string where the error was detected. For other
errors, the value is PCRE2_UNSET by default. This includes the case of the
output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set
(see below), in which case the value is the minimum length needed, including
space for the trailing zero. Note that in order to compute the required length,
<b>pcre2_substitute()</b> has to simulate all the matching and copying, instead
of giving an error return as soon as the buffer overflows. Note also that the
length is in code units, not bytes.
</P>
<P>
In the replacement string, which is interpreted as a UTF string in UTF mode,
and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
dollar character is an escape character that can specify the insertion of
characters from capturing groups or (*MARK) items in the pattern. The following
forms are always recognized:
<pre>
  $$                  insert a dollar character
  $&lt;n&gt; or ${&lt;n&gt;}      insert the contents of group &lt;n&gt;
  $*MARK or ${*MARK}  insert the name of the last (*MARK) encountered
</pre>
Either a group number or a group name can be given for &lt;n&gt;. Curly brackets are
required only if the following character would be interpreted as part of the
number or name. The number may be zero to include the entire matched string.
For example, if the pattern a(b)c is matched with "=abc=" and the replacement
string "+$1$0$1+", the result is "=+babcb+=".
</P>
<P>
The facility for inserting a (*MARK) name can be used to perform simple
simultaneous substitutions, as this <b>pcre2test</b> example shows:
<pre>
  /(*:pear)apple|(*:orange)lemon/g,replace=${*MARK}
      apple lemon
   2: pear orange
</pre>
As well as the usual options for <b>pcre2_match()</b>, a number of additional
options can be set in the <i>options</i> argument.
</P>
<P>
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
replacing every matching substring. If this is not set, only the first matching
substring is replaced. If any matched substring has zero length, after the
substitution has happened, an attempt to find a non-empty match at the same
position is performed. If this is not successful, the current position is
advanced by one character except when CRLF is a valid newline sequence and the
next two characters are CR, LF. In this case, the current position is advanced
by two characters.
</P>
<P>
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
this option is set, however, <b>pcre2_substitute()</b> continues to go through
the motions of matching and substituting (without, of course, writing anything)
in order to compute the size of buffer that is needed. This value is passed
back via the <i>outlengthptr</i> variable, with the result of the function still
being PCRE2_ERROR_NOMEMORY.
</P>
<P>
Passing a buffer size of zero is a permitted way of finding out how much memory
is needed for a given substitution. However, this does mean that the entire
operation is carried out twice. Depending on the application, it may be more
efficient to allocate a large buffer and free the excess afterwards, instead of
using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
</P>
<P>
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups that do
not appear in the pattern to be treated as unset groups. This option should be
used with care, because it means that a typo in a group name or number no
longer causes the PCRE2_ERROR_NOSUBSTRING error.
</P>
<P>
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including unknown
groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty
strings when inserted as described above. If this option is not set, an attempt
to insert an unset group causes the PCRE2_ERROR_UNSET error. This option does
not influence the extended substitution syntax described below.
</P>
<P>
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
replacement string. Without this option, only the dollar character is special,
and only the group insertion forms listed above are valid.
When
PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
</P>
<P>
Firstly, backslash in a replacement string is interpreted as an escape
character. The usual forms such as \n or \x{ddd} can be used to specify
particular character codes, and backslash followed by any non-alphanumeric
character quotes that character. Extended quoting can be coded using \Q...\E,
exactly as in pattern strings.
</P>
<P>
There are also four escape sequences for forcing the case of inserted letters.
The insertion mechanism has three states: no case forcing, force upper case,
and force lower case. The escape sequences change the current state: \U and
\L change to upper or lower case forcing, respectively, and \E (when not
terminating a \Q quoted sequence) reverts to no case forcing. The sequences
\u and \l force the next character (if it is a letter) to upper or lower
case, respectively, and then the state automatically reverts to no case
forcing. Case forcing applies to all inserted characters, including those from
captured groups and letters within \Q...\E quoted sequences.
</P>
<P>
Note that case forcing sequences such as \U...\E do not nest. For example,
the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final \E has no
effect.
</P>
<P>
The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
flexibility to group substitution. The syntax is similar to that used by Bash:
<pre>
  ${&lt;n&gt;:-&lt;string&gt;}
  ${&lt;n&gt;:+&lt;string1&gt;:&lt;string2&gt;}
</pre>
As before, &lt;n&gt; may be a group number or a name. The first form specifies a
default value. If group &lt;n&gt; is set, its value is inserted; if not, &lt;string&gt; is
expanded and the result inserted. The second form specifies strings that are
expanded and inserted when group &lt;n&gt; is set or unset, respectively.
The first
form is just a convenient shorthand for
<pre>
  ${&lt;n&gt;:+${&lt;n&gt;}:&lt;string&gt;}
</pre>
Backslash can be used to escape colons and closing curly brackets in the
replacement strings. A change of the case forcing state within a replacement
string remains in force afterwards, as shown in this <b>pcre2test</b> example:
<pre>
  /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
      body
   1: hello
      somebody
   1: HELLO
</pre>
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause unknown
groups in the extended syntax forms to be treated as unset.
</P>
<P>
If successful, <b>pcre2_substitute()</b> returns the number of replacements that
were made. This may be zero if no matches were found, and is never greater than
1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
</P>
<P>
In the event of an error, a negative error code is returned. Except for
PCRE2_ERROR_NOMATCH (which is never returned), errors from <b>pcre2_match()</b>
are passed straight back.
</P>
<P>
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring insertion,
unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
</P>
<P>
PCRE2_ERROR_UNSET is returned for an unset substring insertion (including an
unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple
(non-extended) syntax is used and PCRE2_SUBSTITUTE_UNSET_EMPTY is not set.
</P>
<P>
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is
needed is returned via <i>outlengthptr</i>. Note that this does not happen by
default.
</P>
<P>
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
replacement string, with more particular errors being PCRE2_ERROR_BADREPESCAPE
(invalid escape sequence), PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket
not found), PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group
substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it
started, which can happen if \K is used in an assertion).
</P>
<P>
As for all PCRE2 errors, a text message that describes the error can be
obtained by calling the <b>pcre2_get_error_message()</b> function (see
"Obtaining a textual error message"
<a href="#geterrormessage">above).</a>
</P>
<br><a name="SEC36" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
<P>
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
<b>  PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
</P>
<P>
When a pattern is compiled with the PCRE2_DUPNAMES option, names for
subpatterns are not required to be unique. Duplicate names are always allowed
for subpatterns with the same number, created by using the (?| feature. Indeed,
if such subpatterns are named, they are required to use the same names.
</P>
<P>
Normally, patterns with duplicate names are such that in any one match, only
one of the named subpatterns participates. An example is shown in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
</P>
<P>
When duplicates are present, <b>pcre2_substring_copy_byname()</b> and
<b>pcre2_substring_get_byname()</b> return the first substring corresponding to
the given name that is set. Only if none are set is PCRE2_ERROR_UNSET
returned. The <b>pcre2_substring_number_from_name()</b> function returns the
error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate names.
</P>
<P>
If you want to get full details of all captured substrings for a given name,
you must use the <b>pcre2_substring_nametable_scan()</b> function. The first
argument is the compiled pattern, and the second is the name. If the third and
fourth arguments are NULL, the function returns a group number for a unique
name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
</P>
<P>
When the third and fourth arguments are not NULL, they must be pointers to
variables that are updated by the function. After it has run, they point to the
first and last entries in the name-to-number table for the given name, and the
function returns the length of each entry in code units. In both cases,
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
</P>
<P>
The format of the name table is described
<a href="#infoaboutpattern">above</a>
in the section entitled <i>Information about a pattern</i>. Given all the
relevant entries for the name, you can extract each of their numbers, and hence
the captured data.
</P>
<br><a name="SEC37" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
<P>
The traditional matching function uses a similar algorithm to Perl, which stops
when it finds the first match at a given point in the subject. If you want to
find all possible matches, or the longest possible match at a given position,
consider using the alternative matching function (see below) instead. If you
cannot use the alternative function, you can kludge it up by making use of the
callout facility, which is described in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
</P>
<P>
What you have to do is to insert a callout right at the end of the pattern.
When your callout function is called, extract and save the current matched
substring.
Then return 1, which forces <b>pcre2_match()</b> to backtrack and try
other alternatives. Ultimately, when it runs out of matches,
<b>pcre2_match()</b> will yield PCRE2_ERROR_NOMATCH.
<a name="dfamatch"></a></P>
<br><a name="SEC38" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
<P>
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b>  pcre2_match_context *<i>mcontext</i>,</b>
<b>  int *<i>workspace</i>, PCRE2_SIZE <i>wscount</i>);</b>
</P>
<P>
The function <b>pcre2_dfa_match()</b> is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the subject
string just once, and does not backtrack. This has different characteristics to
the normal algorithm, and is not compatible with Perl. Some of the features of
PCRE2 patterns are not supported. Nevertheless, there are times when this kind
of matching can be useful. For a discussion of the two matching algorithms, and
a list of features that <b>pcre2_dfa_match()</b> does not support, see the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
documentation.
</P>
<P>
The arguments for the <b>pcre2_dfa_match()</b> function are the same as for
<b>pcre2_match()</b>, plus two extras. The ovector within the match data block
is used in a different way, and this is described below. The other common
arguments are used in the same way as for <b>pcre2_match()</b>, so their
description is not repeated here.
</P>
<P>
The two additional arguments provide workspace for the function. The workspace
vector should contain at least 20 elements. It is used for keeping track of
multiple paths through the pattern tree.
More workspace is needed for patterns
and subjects where there are a lot of potential matches.
</P>
<P>
Here is an example of a simple call to <b>pcre2_dfa_match()</b>:
<pre>
  int wspace[20];
  pcre2_match_data *md = pcre2_match_data_create(4, NULL);
  int rc = pcre2_dfa_match(
    re,             /* result of pcre2_compile() */
    "some string",  /* the subject string */
    11,             /* the length of the subject string */
    0,              /* start at offset 0 in the subject */
    0,              /* default options */
    md,             /* the match data block */
    NULL,           /* a match context; NULL means use defaults */
    wspace,         /* working space vector */
    20);            /* number of elements (NOT size in bytes) */
</pre>
</P>
<br><b>
Option bits for <b>pcre2_dfa_match()</b>
</b><br>
<P>
The unused bits of the <i>options</i> argument for <b>pcre2_dfa_match()</b> must
be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
PCRE2_DFA_RESTART. All but the last four of these are exactly the same as for
<b>pcre2_match()</b>, so their description is not repeated here.
<pre>
  PCRE2_PARTIAL_HARD
  PCRE2_PARTIAL_SOFT
</pre>
These have the same general effect as they do for <b>pcre2_match()</b>, but the
details are slightly different. When PCRE2_PARTIAL_HARD is set for
<b>pcre2_dfa_match()</b>, it returns PCRE2_ERROR_PARTIAL if the end of the
subject is reached and there is still at least one matching possibility that
requires additional characters. This happens even if some complete matches have
already been found.
When PCRE2_PARTIAL_SOFT is set, the return code
PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL if the end of the
subject is reached, there have been no complete matches, but there is still at
least one matching possibility. The portion of the string that was inspected
when the longest partial match was found is set as the first matching string in
both cases. There is a more detailed discussion of partial and multi-segment
matching, with examples, in the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
<pre>
  PCRE2_DFA_SHORTEST
</pre>
Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to stop as
soon as it has found one match. Because of the way the alternative algorithm
works, this is necessarily the shortest possible match at the first possible
matching point in the subject string.
<pre>
  PCRE2_DFA_RESTART
</pre>
When <b>pcre2_dfa_match()</b> returns a partial match, it is possible to call it
again, with additional subject characters, and have it continue with the same
match. The PCRE2_DFA_RESTART option requests this action; when it is set, the
<i>workspace</i> and <i>wscount</i> arguments must reference the same vector as
before because data about the match so far is left in them after a partial
match. There is more discussion of this facility in the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
</P>
<br><b>
Successful returns from <b>pcre2_dfa_match()</b>
</b><br>
<P>
When <b>pcre2_dfa_match()</b> succeeds, it may have matched more than one
substring in the subject. Note, however, that all the matches from one run of
the function start at the same point in the subject. The shorter matches are
all initial substrings of the longer matches.
For example, if the pattern
<pre>
  &lt;.*&gt;
</pre>
is matched against the string
<pre>
  This is &lt;something&gt; &lt;something else&gt; &lt;something further&gt; no more
</pre>
the three matched strings are
<pre>
  &lt;something&gt; &lt;something else&gt; &lt;something further&gt;
  &lt;something&gt; &lt;something else&gt;
  &lt;something&gt;
</pre>
On success, the yield of the function is a number greater than zero, which is
the number of matched substrings. The offsets of the substrings are returned in
the ovector, and can be extracted by number in the same way as for
<b>pcre2_match()</b>, but the numbers bear no relation to any capturing groups
that may exist in the pattern, because DFA matching does not support group
capture.
</P>
<P>
Calls to the convenience functions that extract substrings by name
return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a
DFA match. The convenience functions that extract substrings by number never
return PCRE2_ERROR_NOSUBSTRING, and the meanings of some other errors are
slightly different:
<pre>
  PCRE2_ERROR_UNAVAILABLE
</pre>
The ovector is not big enough to include a slot for the given substring number.
<pre>
  PCRE2_ERROR_UNSET
</pre>
There is a slot in the ovector for this substring, but there were insufficient
matches to fill it.
</P>
<P>
The matched strings are stored in the ovector in reverse order of length; that
is, the longest matching string is first. If there were too many matches to fit
into the ovector, the yield of the function is zero, and the vector is filled
with the longest matches.
</P>
<P>
NOTE: PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the
pattern "a\d+" is compiled as if it were "a\d++". For DFA matching, this
means that only one possible match is found.
If you really do want multiple
matches in such cases, either use an ungreedy repeat such as "a\d+?" or set
the PCRE2_NO_AUTO_POSSESS option when compiling.
</P>
<br><b>
Error returns from <b>pcre2_dfa_match()</b>
</b><br>
<P>
The <b>pcre2_dfa_match()</b> function returns a negative number when it fails.
Many of the errors are the same as for <b>pcre2_match()</b>, as described
<a href="#errorlist">above.</a>
There are in addition the following errors that are specific to
<b>pcre2_dfa_match()</b>:
<pre>
  PCRE2_ERROR_DFA_UITEM
</pre>
This return is given if <b>pcre2_dfa_match()</b> encounters an item in the
pattern that it does not support, for instance, the use of \C in a UTF mode or
a back reference.
<pre>
  PCRE2_ERROR_DFA_UCOND
</pre>
This return is given if <b>pcre2_dfa_match()</b> encounters a condition item
that uses a back reference for the condition, or a test for recursion in a
specific group. These are not supported.
<pre>
  PCRE2_ERROR_DFA_WSSIZE
</pre>
This return is given if <b>pcre2_dfa_match()</b> runs out of space in the
<i>workspace</i> vector.
<pre>
  PCRE2_ERROR_DFA_RECURSE
</pre>
When a recursive subpattern is processed, the matching function calls itself
recursively, using private memory for the ovector and <i>workspace</i>. This
error is given if the internal ovector is not large enough. This should be
extremely rare, as a vector of size 1000 is used.
<pre>
  PCRE2_ERROR_DFA_BADRESTART
</pre>
When <b>pcre2_dfa_match()</b> is called with the <b>PCRE2_DFA_RESTART</b> option,
some plausibility checks are made on the contents of the workspace, which
should contain data about the previous partial match. If any of these checks
fail, this error is given.
</P>
<br><a name="SEC39" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2build</b>(3), <b>pcre2callout</b>(3), <b>pcre2demo</b>(3),
<b>pcre2matching</b>(3), <b>pcre2partial</b>(3), <b>pcre2posix</b>(3),
<b>pcre2sample</b>(3), <b>pcre2stack</b>(3), <b>pcre2unicode</b>(3).
</P>
<br><a name="SEC40" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC41" href="#TOC1">REVISION</a><br>
<P>
Last updated: 17 June 2016
<br>
Copyright © 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>