1<html>
2<head>
3<title>pcre2api specification</title>
4</head>
5<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6<h1>pcre2api man page</h1>
7<p>
8Return to the <a href="index.html">PCRE2 index page</a>.
9</p>
10<p>
11This page is part of the PCRE2 HTML documentation. It was generated
12automatically from the original man page. If there is any nonsense in it,
13please consult the man page, in case the conversion went wrong.
14<br>
15<ul>
16<li><a name="TOC1" href="#SEC1">PCRE2 NATIVE API BASIC FUNCTIONS</a>
17<li><a name="TOC2" href="#SEC2">PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS</a>
18<li><a name="TOC3" href="#SEC3">PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS</a>
19<li><a name="TOC4" href="#SEC4">PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS</a>
20<li><a name="TOC5" href="#SEC5">PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS</a>
21<li><a name="TOC6" href="#SEC6">PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS</a>
22<li><a name="TOC7" href="#SEC7">PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION</a>
23<li><a name="TOC8" href="#SEC8">PCRE2 NATIVE API JIT FUNCTIONS</a>
24<li><a name="TOC9" href="#SEC9">PCRE2 NATIVE API SERIALIZATION FUNCTIONS</a>
25<li><a name="TOC10" href="#SEC10">PCRE2 NATIVE API AUXILIARY FUNCTIONS</a>
26<li><a name="TOC11" href="#SEC11">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a>
27<li><a name="TOC12" href="#SEC12">PCRE2 API OVERVIEW</a>
28<li><a name="TOC13" href="#SEC13">STRING LENGTHS AND OFFSETS</a>
29<li><a name="TOC14" href="#SEC14">NEWLINES</a>
30<li><a name="TOC15" href="#SEC15">MULTITHREADING</a>
31<li><a name="TOC16" href="#SEC16">PCRE2 CONTEXTS</a>
32<li><a name="TOC17" href="#SEC17">CHECKING BUILD-TIME OPTIONS</a>
33<li><a name="TOC18" href="#SEC18">COMPILING A PATTERN</a>
34<li><a name="TOC19" href="#SEC19">COMPILATION ERROR CODES</a>
35<li><a name="TOC20" href="#SEC20">JUST-IN-TIME (JIT) COMPILATION</a>
36<li><a name="TOC21" href="#SEC21">LOCALE SUPPORT</a>
37<li><a name="TOC22" href="#SEC22">INFORMATION ABOUT A COMPILED PATTERN</a>
38<li><a name="TOC23" href="#SEC23">INFORMATION ABOUT A PATTERN'S CALLOUTS</a>
39<li><a name="TOC24" href="#SEC24">SERIALIZATION AND PRECOMPILING</a>
40<li><a name="TOC25" href="#SEC25">THE MATCH DATA BLOCK</a>
41<li><a name="TOC26" href="#SEC26">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
42<li><a name="TOC27" href="#SEC27">NEWLINE HANDLING WHEN MATCHING</a>
43<li><a name="TOC28" href="#SEC28">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a>
44<li><a name="TOC29" href="#SEC29">OTHER INFORMATION ABOUT A MATCH</a>
45<li><a name="TOC30" href="#SEC30">ERROR RETURNS FROM <b>pcre2_match()</b></a>
46<li><a name="TOC31" href="#SEC31">OBTAINING A TEXTUAL ERROR MESSAGE</a>
47<li><a name="TOC32" href="#SEC32">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
48<li><a name="TOC33" href="#SEC33">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a>
49<li><a name="TOC34" href="#SEC34">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
50<li><a name="TOC35" href="#SEC35">CREATING A NEW STRING WITH SUBSTITUTIONS</a>
51<li><a name="TOC36" href="#SEC36">DUPLICATE SUBPATTERN NAMES</a>
52<li><a name="TOC37" href="#SEC37">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a>
53<li><a name="TOC38" href="#SEC38">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
54<li><a name="TOC39" href="#SEC39">SEE ALSO</a>
55<li><a name="TOC40" href="#SEC40">AUTHOR</a>
56<li><a name="TOC41" href="#SEC41">REVISION</a>
57</ul>
58<P>
59<b>#include &#60;pcre2.h&#62;</b>
60<br>
61<br>
62PCRE2 is a new API for PCRE. This document contains a description of all its
63functions. See the
64<a href="pcre2.html"><b>pcre2</b></a>
65document for an overview of all the PCRE2 documentation.
66</P>
67<br><a name="SEC1" href="#TOC1">PCRE2 NATIVE API BASIC FUNCTIONS</a><br>
68<P>
69<b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
70<b>  uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset,</i></b>
71<b>  pcre2_compile_context *<i>ccontext</i>);</b>
72<br>
73<br>
74<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b>
75<br>
76<br>
77<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
78<b>  pcre2_general_context *<i>gcontext</i>);</b>
79<br>
80<br>
81<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
82<b>  const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
83<br>
84<br>
85<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
86<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
87<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
88<b>  pcre2_match_context *<i>mcontext</i>);</b>
89<br>
90<br>
91<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
92<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
93<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
94<b>  pcre2_match_context *<i>mcontext</i>,</b>
95<b>  int *<i>workspace</i>, PCRE2_SIZE <i>wscount</i>);</b>
96<br>
97<br>
98<b>void pcre2_match_data_free(pcre2_match_data *<i>match_data</i>);</b>
99</P>
100<br><a name="SEC2" href="#TOC1">PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS</a><br>
101<P>
102<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
103<br>
104<br>
105<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b>
106<br>
107<br>
108<b>PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *<i>match_data</i>);</b>
109<br>
110<br>
111<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
112</P>
113<br><a name="SEC3" href="#TOC1">PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS</a><br>
114<P>
115<b>pcre2_general_context *pcre2_general_context_create(</b>
116<b>  void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
117<b>  void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
118<br>
119<br>
120<b>pcre2_general_context *pcre2_general_context_copy(</b>
121<b>  pcre2_general_context *<i>gcontext</i>);</b>
122<br>
123<br>
124<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
125</P>
126<br><a name="SEC4" href="#TOC1">PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS</a><br>
127<P>
128<b>pcre2_compile_context *pcre2_compile_context_create(</b>
129<b>  pcre2_general_context *<i>gcontext</i>);</b>
130<br>
131<br>
132<b>pcre2_compile_context *pcre2_compile_context_copy(</b>
133<b>  pcre2_compile_context *<i>ccontext</i>);</b>
134<br>
135<br>
136<b>void pcre2_compile_context_free(pcre2_compile_context *<i>ccontext</i>);</b>
137<br>
138<br>
139<b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b>
140<b>  uint32_t <i>value</i>);</b>
141<br>
142<br>
143<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
144<b>  const unsigned char *<i>tables</i>);</b>
145<br>
146<br>
147<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
148<b>  PCRE2_SIZE <i>value</i>);</b>
149<br>
150<br>
151<b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b>
152<b>  uint32_t <i>value</i>);</b>
153<br>
154<br>
155<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
156<b>  uint32_t <i>value</i>);</b>
157<br>
158<br>
159<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
160<b>  int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
161</P>
162<br><a name="SEC5" href="#TOC1">PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS</a><br>
163<P>
164<b>pcre2_match_context *pcre2_match_context_create(</b>
165<b>  pcre2_general_context *<i>gcontext</i>);</b>
166<br>
167<br>
168<b>pcre2_match_context *pcre2_match_context_copy(</b>
169<b>  pcre2_match_context *<i>mcontext</i>);</b>
170<br>
171<br>
172<b>void pcre2_match_context_free(pcre2_match_context *<i>mcontext</i>);</b>
173<br>
174<br>
175<b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b>
176<b>  int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b>
177<b>  void *<i>callout_data</i>);</b>
178<br>
179<br>
180<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
181<b>  uint32_t <i>value</i>);</b>
182<br>
183<br>
184<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
185<b>  PCRE2_SIZE <i>value</i>);</b>
186<br>
187<br>
188<b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b>
189<b>  uint32_t <i>value</i>);</b>
190<br>
191<br>
192<b>int pcre2_set_recursion_memory_management(</b>
193<b>  pcre2_match_context *<i>mcontext</i>,</b>
194<b>  void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
195<b>  void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
196</P>
197<br><a name="SEC6" href="#TOC1">PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS</a><br>
198<P>
199<b>int pcre2_substring_copy_byname(pcre2_match_data *<i>match_data</i>,</b>
200<b>  PCRE2_SPTR <i>name</i>, PCRE2_UCHAR *<i>buffer</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
201<br>
202<br>
203<b>int pcre2_substring_copy_bynumber(pcre2_match_data *<i>match_data</i>,</b>
204<b>  uint32_t <i>number</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
205<b>  PCRE2_SIZE *<i>bufflen</i>);</b>
206<br>
207<br>
208<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
209<br>
210<br>
211<b>int pcre2_substring_get_byname(pcre2_match_data *<i>match_data</i>,</b>
212<b>  PCRE2_SPTR <i>name</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
213<br>
214<br>
215<b>int pcre2_substring_get_bynumber(pcre2_match_data *<i>match_data</i>,</b>
216<b>  uint32_t <i>number</i>, PCRE2_UCHAR **<i>bufferptr</i>,</b>
217<b>  PCRE2_SIZE *<i>bufflen</i>);</b>
218<br>
219<br>
220<b>int pcre2_substring_length_byname(pcre2_match_data *<i>match_data</i>,</b>
221<b>  PCRE2_SPTR <i>name</i>, PCRE2_SIZE *<i>length</i>);</b>
222<br>
223<br>
224<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
225<b>  uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
226<br>
227<br>
228<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
229<b>  PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
230<br>
231<br>
232<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
233<b>  PCRE2_SPTR <i>name</i>);</b>
234<br>
235<br>
236<b>void pcre2_substring_list_free(PCRE2_SPTR *<i>list</i>);</b>
237<br>
238<br>
239<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
241</P>
242<br><a name="SEC7" href="#TOC1">PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION</a><br>
243<P>
244<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
245<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
246<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b>  pcre2_match_context *<i>mcontext</i>, PCRE2_SPTR <i>replacement</i>,</b>
248<b>  PCRE2_SIZE <i>rlength</i>, PCRE2_UCHAR *<i>outputbuffer</i>,</b>
249<b>  PCRE2_SIZE *<i>outlengthptr</i>);</b>
250</P>
251<br><a name="SEC8" href="#TOC1">PCRE2 NATIVE API JIT FUNCTIONS</a><br>
252<P>
253<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
254<br>
255<br>
256<b>int pcre2_jit_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
257<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
258<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
259<b>  pcre2_match_context *<i>mcontext</i>);</b>
260<br>
261<br>
262<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
263<br>
264<br>
265<b>pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE <i>startsize</i>,</b>
266<b>  PCRE2_SIZE <i>maxsize</i>, pcre2_general_context *<i>gcontext</i>);</b>
267<br>
268<br>
269<b>void pcre2_jit_stack_assign(pcre2_match_context *<i>mcontext</i>,</b>
270<b>  pcre2_jit_callback <i>callback_function</i>, void *<i>callback_data</i>);</b>
271<br>
272<br>
273<b>void pcre2_jit_stack_free(pcre2_jit_stack *<i>jit_stack</i>);</b>
274</P>
275<br><a name="SEC9" href="#TOC1">PCRE2 NATIVE API SERIALIZATION FUNCTIONS</a><br>
276<P>
277<b>int32_t pcre2_serialize_decode(pcre2_code **<i>codes</i>,</b>
278<b>  int32_t <i>number_of_codes</i>, const uint8_t *<i>bytes</i>,</b>
279<b>  pcre2_general_context *<i>gcontext</i>);</b>
280<br>
281<br>
282<b>int32_t pcre2_serialize_encode(const pcre2_code **<i>codes</i>,</b>
283<b>  int32_t <i>number_of_codes</i>, uint8_t **<i>serialized_bytes</i>,</b>
284<b>  PCRE2_SIZE *<i>serialized_size</i>, pcre2_general_context *<i>gcontext</i>);</b>
285<br>
286<br>
287<b>void pcre2_serialize_free(uint8_t *<i>bytes</i>);</b>
288<br>
289<br>
290<b>int32_t pcre2_serialize_get_number_of_codes(const uint8_t *<i>bytes</i>);</b>
291</P>
292<br><a name="SEC10" href="#TOC1">PCRE2 NATIVE API AUXILIARY FUNCTIONS</a><br>
293<P>
294<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
295<br>
296<br>
297<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
298<b>  PCRE2_SIZE <i>bufflen</i>);</b>
299<br>
300<br>
301<b>const unsigned char *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
302<br>
303<br>
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, void *<i>where</i>);</b>
305<br>
306<br>
307<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
308<b>  int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
309<b>  void *<i>user_data</i>);</b>
310<br>
311<br>
312<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b>
313</P>
314<br><a name="SEC11" href="#TOC1">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a><br>
315<P>
316There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit code
317units, respectively. However, there is just one header file, <b>pcre2.h</b>.
318This contains the function prototypes and other definitions for all three
319libraries. One, two, or all three can be installed simultaneously. On Unix-like
320systems the libraries are called <b>libpcre2-8</b>, <b>libpcre2-16</b>, and
321<b>libpcre2-32</b>, and they can also co-exist with the original PCRE libraries.
322</P>
323<P>
324Character strings are passed to and from a PCRE2 library as a sequence of
325unsigned integers in code units of the appropriate width. Every PCRE2 function
326comes in three different forms, one for each library, for example:
327<pre>
328  <b>pcre2_compile_8()</b>
329  <b>pcre2_compile_16()</b>
330  <b>pcre2_compile_32()</b>
331</pre>
332There are also three different sets of data types:
333<pre>
334  <b>PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32</b>
335  <b>PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32</b>
336</pre>
337The UCHAR types define unsigned code units of the appropriate widths. For
example, PCRE2_UCHAR16 is usually defined as <i>uint16_t</i>. The SPTR types are
339constant pointers to the equivalent UCHAR types, that is, they are pointers to
340vectors of unsigned code units.
341</P>
342<P>
343Many applications use only one code unit width. For their convenience, macros
344are defined whose names are the generic forms such as <b>pcre2_compile()</b> and
345PCRE2_SPTR. These macros use the value of the macro PCRE2_CODE_UNIT_WIDTH to
346generate the appropriate width-specific function and macro names.
347PCRE2_CODE_UNIT_WIDTH is not defined by default. An application must define it
348to be 8, 16, or 32 before including <b>pcre2.h</b> in order to make use of the
349generic names.
350</P>
351<P>
352Applications that use more than one code unit width can be linked with more
353than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to be 0 before
354including <b>pcre2.h</b>, and then use the real function names. Any code that is
355to be included in an environment where the value of PCRE2_CODE_UNIT_WIDTH is
356unknown should also use the real function names. (Unfortunately, it is not
357possible in C code to save and restore the value of a macro.)
358</P>
359<P>
360If PCRE2_CODE_UNIT_WIDTH is not defined before including <b>pcre2.h</b>, a
361compiler error occurs.
362</P>
363<P>
364When using multiple libraries in an application, you must take care when
365processing any particular pattern to use only functions from a single library.
366For example, if you want to run a match using a pattern that was compiled with
367<b>pcre2_compile_16()</b>, you must do so with <b>pcre2_match_16()</b>, not
368<b>pcre2_match_8()</b>.
369</P>
370<P>
371In the function summaries above, and in the rest of this document and other
372PCRE2 documents, functions and data types are described using their generic
373names, without the 8, 16, or 32 suffix.
374</P>
375<br><a name="SEC12" href="#TOC1">PCRE2 API OVERVIEW</a><br>
376<P>
377PCRE2 has its own native API, which is described in this document. There are
378also some wrapper functions for the 8-bit library that correspond to the
379POSIX regular expression API, but they do not give access to all the
380functionality. They are described in the
381<a href="pcre2posix.html"><b>pcre2posix</b></a>
382documentation. Both these APIs define a set of C function calls.
383</P>
384<P>
385The native API C data types, function prototypes, option values, and error
386codes are defined in the header file <b>pcre2.h</b>, which contains definitions
387of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release numbers for the
388library. Applications can use these to include support for different releases
389of PCRE2.
390</P>
391<P>
392In a Windows environment, if you want to statically link an application program
393against a non-dll PCRE2 library, you must define PCRE2_STATIC before including
394<b>pcre2.h</b>.
395</P>
396<P>
The functions <b>pcre2_compile()</b> and <b>pcre2_match()</b> are used for
398compiling and matching regular expressions in a Perl-compatible manner. A
399sample program that demonstrates the simplest way of using them is provided in
400the file called <i>pcre2demo.c</i> in the PCRE2 source distribution. A listing
401of this program is given in the
402<a href="pcre2demo.html"><b>pcre2demo</b></a>
403documentation, and the
404<a href="pcre2sample.html"><b>pcre2sample</b></a>
405documentation describes how to compile and run it.
406</P>
407<P>
408Just-in-time compiler support is an optional feature of PCRE2 that can be built
409in appropriate hardware environments. It greatly speeds up the matching
410performance of many patterns. Programs can request that it be used if
411available, by calling <b>pcre2_jit_compile()</b> after a pattern has been
412successfully compiled by <b>pcre2_compile()</b>. This does nothing if JIT
413support is not available.
414</P>
415<P>
416More complicated programs might need to make use of the specialist functions
417<b>pcre2_jit_stack_create()</b>, <b>pcre2_jit_stack_free()</b>, and
418<b>pcre2_jit_stack_assign()</b> in order to control the JIT code's memory usage.
419</P>
420<P>
421JIT matching is automatically used by <b>pcre2_match()</b> if it is available,
422unless the PCRE2_NO_JIT option is set. There is also a direct interface for JIT
423matching, which gives improved performance. The JIT-specific functions are
424discussed in the
425<a href="pcre2jit.html"><b>pcre2jit</b></a>
426documentation.
427</P>
428<P>
429A second matching function, <b>pcre2_dfa_match()</b>, which is not
430Perl-compatible, is also provided. This uses a different algorithm for the
431matching. The alternative algorithm finds all possible matches (at a given
432point in the subject), and scans the subject just once (unless there are
433lookbehind assertions). However, this algorithm does not return captured
434substrings. A description of the two matching algorithms and their advantages
435and disadvantages is given in the
436<a href="pcre2matching.html"><b>pcre2matching</b></a>
437documentation. There is no JIT support for <b>pcre2_dfa_match()</b>.
438</P>
439<P>
440In addition to the main compiling and matching functions, there are convenience
441functions for extracting captured substrings from a subject string that has
442been matched by <b>pcre2_match()</b>. They are:
443<pre>
444  <b>pcre2_substring_copy_byname()</b>
445  <b>pcre2_substring_copy_bynumber()</b>
446  <b>pcre2_substring_get_byname()</b>
447  <b>pcre2_substring_get_bynumber()</b>
448  <b>pcre2_substring_list_get()</b>
449  <b>pcre2_substring_length_byname()</b>
450  <b>pcre2_substring_length_bynumber()</b>
451  <b>pcre2_substring_nametable_scan()</b>
452  <b>pcre2_substring_number_from_name()</b>
453</pre>
454<b>pcre2_substring_free()</b> and <b>pcre2_substring_list_free()</b> are also
455provided, to free the memory used for extracted strings.
456</P>
457<P>
458The function <b>pcre2_substitute()</b> can be called to match a pattern and
459return a copy of the subject string with substitutions for parts that were
460matched.
461</P>
462<P>
463Functions whose names begin with <b>pcre2_serialize_</b> are used for saving
464compiled patterns on disc or elsewhere, and reloading them later.
465</P>
466<P>
467Finally, there are functions for finding out information about a compiled
468pattern (<b>pcre2_pattern_info()</b>) and about the configuration with which
469PCRE2 was built (<b>pcre2_config()</b>).
470</P>
471<P>
472Functions with names ending with <b>_free()</b> are used for freeing memory
473blocks of various sorts. In all cases, if one of these functions is called with
474a NULL argument, it does nothing.
475</P>
476<br><a name="SEC13" href="#TOC1">STRING LENGTHS AND OFFSETS</a><br>
477<P>
478The PCRE2 API uses string lengths and offsets into strings of code units in
479several places. These values are always of type PCRE2_SIZE, which is an
480unsigned integer type, currently always defined as <i>size_t</i>. The largest
481value that can be stored in such a type (that is ~(PCRE2_SIZE)0) is reserved
482as a special indicator for zero-terminated strings and unset offsets.
483Therefore, the longest string that can be handled is one less than this
484maximum.
485<a name="newlines"></a></P>
486<br><a name="SEC14" href="#TOC1">NEWLINES</a><br>
487<P>
488PCRE2 supports five different conventions for indicating line breaks in
489strings: a single CR (carriage return) character, a single LF (linefeed)
490character, the two-character sequence CRLF, any of the three preceding, or any
491Unicode newline sequence. The Unicode newline sequences are the three just
492mentioned, plus the single characters VT (vertical tab, U+000B), FF (form feed,
493U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
494(paragraph separator, U+2029).
495</P>
496<P>
497Each of the first three conventions is used by at least one operating system as
498its standard newline sequence. When PCRE2 is built, a default can be specified.
If none is specified, the default is LF, which is the Unix standard. However, the newline
500convention can be changed by an application when calling <b>pcre2_compile()</b>,
501or it can be specified by special text at the start of the pattern itself; this
502overrides any other settings. See the
503<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
504page for details of the special character sequences.
505</P>
506<P>
507In the PCRE2 documentation the word "newline" is used to mean "the character or
508pair of characters that indicate a line break". The choice of newline
509convention affects the handling of the dot, circumflex, and dollar
510metacharacters, the handling of #-comments in /x mode, and, when CRLF is a
511recognized line ending sequence, the match position advancement for a
512non-anchored pattern. There is more detail about this in the
513<a href="#matchoptions">section on <b>pcre2_match()</b> options</a>
514below.
515</P>
516<P>
517The choice of newline convention does not affect the interpretation of
518the \n or \r escape sequences, nor does it affect what \R matches; this has
519its own separate convention.
520</P>
521<br><a name="SEC15" href="#TOC1">MULTITHREADING</a><br>
522<P>
523In a multithreaded application it is important to keep thread-specific data
524separate from data that can be shared between threads. The PCRE2 library code
525itself is thread-safe: it contains no static or global variables. The API is
526designed to be fairly simple for non-threaded applications while at the same
527time ensuring that multithreaded applications can use it.
528</P>
529<P>
530There are several different blocks of data that are used to pass information
531between the application and the PCRE2 libraries.
532</P>
533<br><b>
534The compiled pattern
535</b><br>
536<P>
537A pointer to the compiled form of a pattern is returned to the user when
538<b>pcre2_compile()</b> is successful. The data in the compiled pattern is fixed,
539and does not change when the pattern is matched. Therefore, it is thread-safe,
540that is, the same compiled pattern can be used by more than one thread
541simultaneously. For example, an application can compile all its patterns at the
542start, before forking off multiple threads that use them. However, if the
543just-in-time optimization feature is being used, it needs separate memory stack
544areas for each thread. See the
545<a href="pcre2jit.html"><b>pcre2jit</b></a>
546documentation for more details.
547</P>
548<P>
549In a more complicated situation, where patterns are compiled only when they are
550first needed, but are still shared between threads, pointers to compiled
551patterns must be protected from simultaneous writing by multiple threads, at
552least until a pattern has been compiled. The logic can be something like this:
553<pre>
  Get a read-only (shared) lock (mutex) for pointer
  if (pointer == NULL)
    {
    Get a write (unique) lock for pointer
    if (pointer == NULL) pointer = pcre2_compile(...
    }
  Release the lock
  Use pointer in pcre2_match()
562</pre>
563Of course, testing for compilation errors should also be included in the code.
564</P>
565<P>
If JIT is being used, but the JIT compilation is not being done immediately
(perhaps waiting to see if the pattern is used often enough), similar logic is
568required. JIT compilation updates a pointer within the compiled code block, so
569a thread must gain unique write access to the pointer before calling
570<b>pcre2_jit_compile()</b>. Alternatively, <b>pcre2_code_copy()</b> can be used
571to obtain a private copy of the compiled code.
572</P>
573<br><b>
574Context blocks
575</b><br>
576<P>
577The next main section below introduces the idea of "contexts" in which PCRE2
578functions are called. A context is nothing more than a collection of parameters
579that control the way PCRE2 operates. Grouping a number of parameters together
580in a context is a convenient way of passing them to a PCRE2 function without
581using lots of arguments. The parameters that are stored in contexts are in some
582sense "advanced features" of the API. Many straightforward applications will
583not need to use contexts.
584</P>
585<P>
586In a multithreaded application, if the parameters in a context are values that
587are never changed, the same context can be used by all the threads. However, if
588any thread needs to change any value in a context, it must make its own
589thread-specific copy.
590</P>
591<br><b>
592Match blocks
593</b><br>
594<P>
595The matching functions need a block of memory for working space and for storing
596the results of a match. This includes details of what was matched, as well as
597additional information such as the name of a (*MARK) setting. Each thread must
598provide its own copy of this memory.
599</P>
600<br><a name="SEC16" href="#TOC1">PCRE2 CONTEXTS</a><br>
601<P>
602Some PCRE2 functions have a lot of parameters, many of which are used only by
603specialist applications, for example, those that use custom memory management
604or non-standard character tables. To keep function argument lists at a
605reasonable size, and at the same time to keep the API extensible, "uncommon"
606parameters are passed to certain functions in a <b>context</b> instead of
607directly. A context is just a block of memory that holds the parameter values.
608Applications that do not need to adjust any of the context parameters can pass
609NULL when a context pointer is required.
610</P>
611<P>
612There are three different types of context: a general context that is relevant
613for several PCRE2 operations, a compile-time context, and a match-time context.
614</P>
615<br><b>
616The general context
617</b><br>
618<P>
619At present, this context just contains pointers to (and data for) external
620memory management functions that are called from several places in the PCRE2
library. The context is named "general" rather than specifically "memory"
because in future other fields may be added. If you do not want to supply your
own custom memory management functions, you do not need to bother with a
general context. A general context is created by:
<br>
<br>
625<b>pcre2_general_context *pcre2_general_context_create(</b>
626<b>  void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
627<b>  void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
628<br>
629<br>
630The two function pointers specify custom memory management functions, whose
631prototypes are:
632<pre>
633  <b>void *private_malloc(PCRE2_SIZE, void *);</b>
634  <b>void  private_free(void *, void *);</b>
635</pre>
636Whenever code in PCRE2 calls these functions, the final argument is the value
637of <i>memory_data</i>. Either of the first two arguments of the creation
638function may be NULL, in which case the system memory management functions
639<i>malloc()</i> and <i>free()</i> are used. (This is not currently useful, as
640there are no other fields in a general context, but in future there might be.)
641The <i>private_malloc()</i> function is used (if supplied) to obtain memory for
642storing the context, and all three values are saved as part of the context.
643</P>
644<P>
645Whenever PCRE2 creates a data block of any kind, the block contains a pointer
646to the <i>free()</i> function that matches the <i>malloc()</i> function that was
647used. When the time comes to free the block, this function is called.
648</P>
649<P>
A general context can be copied by calling:
<br>
<br>
651<b>pcre2_general_context *pcre2_general_context_copy(</b>
652<b>  pcre2_general_context *<i>gcontext</i>);</b>
653<br>
654<br>
The memory used for a general context should be freed by calling:
<br>
<br>
656<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
657<a name="compilecontext"></a></P>
658<br><b>
659The compile context
660</b><br>
661<P>
662A compile context is required if you want to change the default values of any
663of the following compile-time parameters:
664<pre>
665  What \R matches (Unicode newlines or CR, LF, CRLF only)
666  PCRE2's character tables
667  The newline character sequence
668  The compile time nested parentheses limit
669  The maximum length of the pattern string
670  An external function for stack checking
671</pre>
672A compile context is also required if you are using custom memory management.
673If none of these apply, just pass NULL as the context argument of
674<i>pcre2_compile()</i>.
675</P>
676<P>
A compile context is created, copied, and freed by the following functions:
<br>
<br>
678<b>pcre2_compile_context *pcre2_compile_context_create(</b>
679<b>  pcre2_general_context *<i>gcontext</i>);</b>
680<br>
681<br>
682<b>pcre2_compile_context *pcre2_compile_context_copy(</b>
683<b>  pcre2_compile_context *<i>ccontext</i>);</b>
684<br>
685<br>
686<b>void pcre2_compile_context_free(pcre2_compile_context *<i>ccontext</i>);</b>
687<br>
688<br>
689A compile context is created with default values for its parameters. These can
690be changed by calling the following functions, which return 0 on success, or
PCRE2_ERROR_BADDATA if invalid data is detected.
<br>
<br>
692<b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b>
693<b>  uint32_t <i>value</i>);</b>
694<br>
695<br>
696The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only CR, LF,
697or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line
698ending sequence. The value is used by the JIT compiler and by the two
699interpreted matching functions, <i>pcre2_match()</i> and
700<i>pcre2_dfa_match()</i>.
701<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
702<b>  const unsigned char *<i>tables</i>);</b>
703<br>
704<br>
705The value must be the result of a call to <i>pcre2_maketables()</i>, whose only
706argument is a general context. This function builds a set of character tables
707in the current locale.
708<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
709<b>  PCRE2_SIZE <i>value</i>);</b>
710<br>
711<br>
712This sets a maximum length, in code units, for the pattern string that is to be
713compiled. If the pattern is longer, an error is generated. This facility is
714provided so that applications that accept patterns from external sources can
715limit their size. The default is the largest number that a PCRE2_SIZE variable
716can hold, which is effectively unlimited.
717<b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b>
718<b>  uint32_t <i>value</i>);</b>
719<br>
720<br>
721This specifies which characters or character sequences are to be recognized as
722newlines. The value must be one of PCRE2_NEWLINE_CR (carriage return only),
723PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the two-character
724sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any of the above), or
725PCRE2_NEWLINE_ANY (any Unicode newline sequence).
726</P>
727<P>
728When a pattern is compiled with the PCRE2_EXTENDED option, the value of this
729parameter affects the recognition of white space and the end of internal
730comments starting with #. The value is saved with the compiled pattern for
731subsequent use by the JIT compiler and by the two interpreted matching
732functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
733<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
734<b>  uint32_t <i>value</i>);</b>
735<br>
736<br>
This parameter adjusts the limit, set when PCRE2 is built (default 250), on the
738depth of parenthesis nesting in a pattern. This limit stops rogue patterns
739using up too much system stack when being compiled.
740<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
741<b>  int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
742<br>
743<br>
744There is at least one application that runs PCRE2 in threads with very limited
745system stack, where running out of stack is to be avoided at all costs. The
746parenthesis limit above cannot take account of how much stack is actually
747available. For a finer control, you can supply a function that is called
748whenever <b>pcre2_compile()</b> starts to compile a parenthesized part of a
749pattern. This function can check the actual stack size (or anything else that
750it wants to, of course).
751</P>
752<P>
The first argument to the guard function gives the current depth of
nesting, and the second is user data that is set up by the last argument of
<b>pcre2_set_compile_recursion_guard()</b>. The guard function should return
zero if all is well, or non-zero to force an error.
757<a name="matchcontext"></a></P>
758<br><b>
759The match context
760</b><br>
761<P>
762A match context is required if you want to change the default values of any
763of the following match-time parameters:
764<pre>
765  A callout function
766  The offset limit for matching an unanchored pattern
767  The limit for calling <b>match()</b> (see below)
768  The limit for calling <b>match()</b> recursively
769</pre>
770A match context is also required if you are using custom memory management.
771If none of these apply, just pass NULL as the context argument of
772<b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or <b>pcre2_jit_match()</b>.
773</P>
774<P>
775A match context is created, copied, and freed by the following functions:
776<b>pcre2_match_context *pcre2_match_context_create(</b>
777<b>  pcre2_general_context *<i>gcontext</i>);</b>
778<br>
779<br>
780<b>pcre2_match_context *pcre2_match_context_copy(</b>
781<b>  pcre2_match_context *<i>mcontext</i>);</b>
782<br>
783<br>
784<b>void pcre2_match_context_free(pcre2_match_context *<i>mcontext</i>);</b>
785<br>
786<br>
787A match context is created with default values for its parameters. These can
788be changed by calling the following functions, which return 0 on success, or
789PCRE2_ERROR_BADDATA if invalid data is detected.
790<b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b>
791<b>  int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b>
792<b>  void *<i>callout_data</i>);</b>
793<br>
794<br>
795This sets up a "callout" function, which PCRE2 will call at specified points
796during a matching operation. Details are given in the
797<a href="pcre2callout.html"><b>pcre2callout</b></a>
798documentation.
799<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
800<b>  PCRE2_SIZE <i>value</i>);</b>
801<br>
802<br>
803The <i>offset_limit</i> parameter limits how far an unanchored search can
804advance in the subject string. The default value is PCRE2_UNSET. The
805<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> functions return
806PCRE2_ERROR_NOMATCH if a match with a starting point before or at the given
offset is not found. For example, if the pattern /abc/ is matched against
"123abc" with an offset limit less than 3, the result is PCRE2_ERROR_NOMATCH.
809A match can never be found if the <i>startoffset</i> argument of
810<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> is greater than the offset
811limit.
812</P>
813<P>
814When using this facility, you must set PCRE2_USE_OFFSET_LIMIT when calling
815<b>pcre2_compile()</b> so that when JIT is in use, different code can be
compiled. If a match is started with a non-default offset limit when
PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
818</P>
819<P>
820The offset limit facility can be used to track progress when searching large
821subject strings. See also the PCRE2_FIRSTLINE option, which requires a match to
822start within the first line of the subject. If this is set with an offset
823limit, a match must occur in the first line and also within the offset limit.
824In other words, whichever limit comes first is used.
825<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
826<b>  uint32_t <i>value</i>);</b>
827<br>
828<br>
829The <i>match_limit</i> parameter provides a means of preventing PCRE2 from using
830up too many resources when processing patterns that are not going to match, but
831which have a very large number of possibilities in their search trees. The
832classic example is a pattern that uses nested unlimited repeats.
833</P>
834<P>
835Internally, <b>pcre2_match()</b> uses a function called <b>match()</b>, which it
836calls repeatedly (sometimes recursively). The limit set by <i>match_limit</i> is
837imposed on the number of times this function is called during a match, which
838has the effect of limiting the amount of backtracking that can take place. For
839patterns that are not anchored, the count restarts from zero for each position
840in the subject string. This limit is not relevant to <b>pcre2_dfa_match()</b>,
841which ignores it.
842</P>
843<P>
844When <b>pcre2_match()</b> is called with a pattern that was successfully
845processed by <b>pcre2_jit_compile()</b>, the way in which matching is executed
846is entirely different. However, there is still the possibility of runaway
847matching that goes on for a very long time, and so the <i>match_limit</i> value
848is also used in this case (but in a different way) to limit how long the
849matching can continue.
850</P>
851<P>
The default value for the limit can be set when PCRE2 is built; if it is not
set then, the default is 10 million, which handles all but the most extreme cases. If the
854limit is exceeded, <b>pcre2_match()</b> returns PCRE2_ERROR_MATCHLIMIT. A value
855for the match limit may also be supplied by an item at the start of a pattern
856of the form
857<pre>
858  (*LIMIT_MATCH=ddd)
859</pre>
860where ddd is a decimal number. However, such a setting is ignored unless ddd is
861less than the limit set by the caller of <b>pcre2_match()</b> or, if no such
862limit is set, less than the default.
863<b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b>
864<b>  uint32_t <i>value</i>);</b>
865<br>
866<br>
867The <i>recursion_limit</i> parameter is similar to <i>match_limit</i>, but
868instead of limiting the total number of times that <b>match()</b> is called, it
869limits the depth of recursion. The recursion depth is a smaller number than the
870total number of calls, because not all calls to <b>match()</b> are recursive.
871This limit is of use only if it is set smaller than <i>match_limit</i>.
872</P>
873<P>
874Limiting the recursion depth limits the amount of system stack that can be
875used, or, when PCRE2 has been compiled to use memory on the heap instead of the
876stack, the amount of heap memory that can be used. This limit is not relevant,
877and is ignored, when matching is done using JIT compiled code or by the
878<b>pcre2_dfa_match()</b> function.
879</P>
880<P>
The default value for <i>recursion_limit</i> can be set when PCRE2 is built; if
it is not set then, it is the same value as the default for <i>match_limit</i>. If the
883limit is exceeded, <b>pcre2_match()</b> returns PCRE2_ERROR_RECURSIONLIMIT. A
884value for the recursion limit may also be supplied by an item at the start of a
885pattern of the form
886<pre>
887  (*LIMIT_RECURSION=ddd)
888</pre>
889where ddd is a decimal number. However, such a setting is ignored unless ddd is
890less than the limit set by the caller of <b>pcre2_match()</b> or, if no such
891limit is set, less than the default.
892<b>int pcre2_set_recursion_memory_management(</b>
893<b>  pcre2_match_context *<i>mcontext</i>,</b>
894<b>  void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
895<b>  void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
896<br>
897<br>
898This function sets up two additional custom memory management functions for use
899by <b>pcre2_match()</b> when PCRE2 is compiled to use the heap for remembering
900backtracking data, instead of recursive function calls that use the system
901stack. There is a discussion about PCRE2's stack usage in the
902<a href="pcre2stack.html"><b>pcre2stack</b></a>
903documentation. See the
904<a href="pcre2build.html"><b>pcre2build</b></a>
905documentation for details of how to build PCRE2.
906</P>
907<P>
908Using the heap for recursion is a non-standard way of building PCRE2, for use
909in environments that have limited stacks. Because of the greater use of memory
910management, <b>pcre2_match()</b> runs more slowly. Functions that are different
911to the general custom memory functions are provided so that special-purpose
912external code can be used for this case, because the memory blocks are all the
913same size. The blocks are retained by <b>pcre2_match()</b> until it is about to
914exit so that they can be re-used when possible during the match. In the absence
915of these functions, the normal custom memory management functions are used, if
916supplied, otherwise the system functions.
917</P>
918<br><a name="SEC17" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br>
919<P>
920<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b>
921</P>
922<P>
923The function <b>pcre2_config()</b> makes it possible for a PCRE2 client to
924discover which optional features have been compiled into the PCRE2 library. The
925<a href="pcre2build.html"><b>pcre2build</b></a>
926documentation has more details about these optional features.
927</P>
928<P>
929The first argument for <b>pcre2_config()</b> specifies which information is
930required. The second argument is a pointer to memory into which the information
931is placed. If NULL is passed, the function returns the amount of memory that is
932needed for the requested information. For calls that return numerical values,
933the value is in bytes; when requesting these values, <i>where</i> should point
934to appropriately aligned memory. For calls that return strings, the required
935length is given in code units, not counting the terminating zero.
936</P>
937<P>
938When requesting information, the returned value from <b>pcre2_config()</b> is
939non-negative on success, or the negative error code PCRE2_ERROR_BADOPTION if
940the value in the first argument is not recognized. The following information is
941available:
942<pre>
943  PCRE2_CONFIG_BSR
944</pre>
945The output is a uint32_t integer whose value indicates what character
946sequences the \R escape sequence matches by default. A value of
947PCRE2_BSR_UNICODE means that \R matches any Unicode line ending sequence; a
948value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF. The
949default can be overridden when a pattern is compiled.
950<pre>
951  PCRE2_CONFIG_JIT
952</pre>
953The output is a uint32_t integer that is set to one if support for just-in-time
954compiling is available; otherwise it is set to zero.
955<pre>
956  PCRE2_CONFIG_JITTARGET
957</pre>
958The <i>where</i> argument should point to a buffer that is at least 48 code
959units long. (The exact length required can be found by calling
960<b>pcre2_config()</b> with <b>where</b> set to NULL.) The buffer is filled with a
961string that contains the name of the architecture for which the JIT compiler is
962configured, for example "x86 32bit (little endian + unaligned)". If JIT support
963is not available, PCRE2_ERROR_BADOPTION is returned, otherwise the number of
964code units used is returned. This is the length of the string, plus one unit
965for the terminating zero.
966<pre>
967  PCRE2_CONFIG_LINKSIZE
968</pre>
969The output is a uint32_t integer that contains the number of bytes used for
970internal linkage in compiled regular expressions. When PCRE2 is configured, the
971value can be set to 2, 3, or 4, with the default being 2. This is the value
972that is returned by <b>pcre2_config()</b>. However, when the 16-bit library is
973compiled, a value of 3 is rounded up to 4, and when the 32-bit library is
974compiled, internal linkages always use 4 bytes, so the configured value is not
975relevant.
976</P>
977<P>
978The default value of 2 for the 8-bit and 16-bit libraries is sufficient for all
979but the most massive patterns, since it allows the size of the compiled pattern
980to be up to 64K code units. Larger values allow larger regular expressions to
981be compiled by those two libraries, but at the expense of slower matching.
982<pre>
983  PCRE2_CONFIG_MATCHLIMIT
984</pre>
985The output is a uint32_t integer that gives the default limit for the number of
986internal matching function calls in a <b>pcre2_match()</b> execution. Further
987details are given with <b>pcre2_match()</b> below.
988<pre>
989  PCRE2_CONFIG_NEWLINE
990</pre>
991The output is a uint32_t integer whose value specifies the default character
992sequence that is recognized as meaning "newline". The values are:
993<pre>
994  PCRE2_NEWLINE_CR       Carriage return (CR)
995  PCRE2_NEWLINE_LF       Linefeed (LF)
996  PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
997  PCRE2_NEWLINE_ANY      Any Unicode line ending
998  PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
999</pre>
1000The default should normally correspond to the standard sequence for your
1001operating system.
1002<pre>
1003  PCRE2_CONFIG_PARENSLIMIT
1004</pre>
1005The output is a uint32_t integer that gives the maximum depth of nesting
1006of parentheses (of any kind) in a pattern. This limit is imposed to cap the
1007amount of system stack used when a pattern is compiled. It is specified when
1008PCRE2 is built; the default is 250. This limit does not take into account the
1009stack that may already be used by the calling application. For finer control
1010over compilation stack usage, see <b>pcre2_set_compile_recursion_guard()</b>.
1011<pre>
1012  PCRE2_CONFIG_RECURSIONLIMIT
1013</pre>
1014The output is a uint32_t integer that gives the default limit for the depth of
1015recursion when calling the internal matching function in a <b>pcre2_match()</b>
1016execution. Further details are given with <b>pcre2_match()</b> below.
1017<pre>
1018  PCRE2_CONFIG_STACKRECURSE
1019</pre>
1020The output is a uint32_t integer that is set to one if internal recursion when
1021running <b>pcre2_match()</b> is implemented by recursive function calls that use
1022the system stack to remember their state. This is the usual way that PCRE2 is
1023compiled. The output is zero if PCRE2 was compiled to use blocks of data on the
1024heap instead of recursive function calls.
1025<pre>
1026  PCRE2_CONFIG_UNICODE_VERSION
1027</pre>
1028The <i>where</i> argument should point to a buffer that is at least 24 code
units long. (The exact length required can be found by calling
<b>pcre2_config()</b> with <i>where</i> set to NULL.) If PCRE2 has been compiled
1031without Unicode support, the buffer is filled with the text "Unicode not
1032supported". Otherwise, the Unicode version string (for example, "8.0.0") is
1033inserted. The number of code units used is returned. This is the length of the
1034string plus one unit for the terminating zero.
1035<pre>
1036  PCRE2_CONFIG_UNICODE
1037</pre>
1038The output is a uint32_t integer that is set to one if Unicode support is
1039available; otherwise it is set to zero. Unicode support implies UTF support.
1040<pre>
1041  PCRE2_CONFIG_VERSION
1042</pre>
1043The <i>where</i> argument should point to a buffer that is at least 12 code
units long. (The exact length required can be found by calling
<b>pcre2_config()</b> with <i>where</i> set to NULL.) The buffer is filled with
1046the PCRE2 version string, zero-terminated. The number of code units used is
1047returned. This is the length of the string plus one unit for the terminating
1048zero.
1049<a name="compiling"></a></P>
1050<br><a name="SEC18" href="#TOC1">COMPILING A PATTERN</a><br>
1051<P>
1052<b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
1053<b>  uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset,</i></b>
1054<b>  pcre2_compile_context *<i>ccontext</i>);</b>
1055<br>
1056<br>
1057<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b>
1058<br>
1059<br>
1060<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
1061</P>
1062<P>
1063The <b>pcre2_compile()</b> function compiles a pattern into an internal form.
1064The pattern is defined by a pointer to a string of code units and a length. If
1065the pattern is zero-terminated, the length can be specified as
1066PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of memory that
1067contains the compiled pattern and related data, or NULL if an error occurred.
1068</P>
1069<P>
1070If the compile context argument <i>ccontext</i> is NULL, memory for the compiled
1071pattern is obtained by calling <b>malloc()</b>. Otherwise, it is obtained from
1072the same memory function that was used for the compile context. The caller must
1073free the memory by calling <b>pcre2_code_free()</b> when it is no longer needed.
1074</P>
1075<P>
1076The function <b>pcre2_code_copy()</b> makes a copy of the compiled code in new
1077memory, using the same memory allocator as was used for the original. However,
1078if the code has been processed by the JIT compiler (see
1079<a href="#jitcompiling">below),</a>
1080the JIT information cannot be copied (because it is position-dependent).
1081The new copy can initially be used only for non-JIT matching, though it can be
1082passed to <b>pcre2_jit_compile()</b> if required. The <b>pcre2_code_copy()</b>
1083function provides a way for individual threads in a multithreaded application
1084to acquire a private copy of shared compiled code.
1085</P>
1086<P>
1087NOTE: When one of the matching functions is called, pointers to the compiled
1088pattern and the subject string are set in the match data block so that they can
1089be referenced by the substring extraction functions. After running a match, you
1090must not free a compiled pattern (or a subject string) until after all
1091operations on the
1092<a href="#matchdatablock">match data block</a>
1093have taken place.
1094</P>
1095<P>
1096The <i>options</i> argument for <b>pcre2_compile()</b> contains various bit
1097settings that affect the compilation. It should be zero if no options are
1098required. The available options are described below. Some of them (in
1099particular, those that are compatible with Perl, but some others as well) can
1100also be set and unset from within the pattern (see the detailed description in
1101the
1102<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
1103documentation).
1104</P>
1105<P>
1106For those options that can be different in different parts of the pattern, the
contents of the <i>options</i> argument specify their settings at the start of
1108compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK options can be set at
1109the time of matching as well as at compile time.
1110</P>
1111<P>
1112Other, less frequently required compile-time parameters (for example, the
1113newline setting) can be provided in a compile context (as described
1114<a href="#compilecontext">above).</a>
1115</P>
1116<P>
1117If <i>errorcode</i> or <i>erroroffset</i> is NULL, <b>pcre2_compile()</b> returns
1118NULL immediately. Otherwise, the variables to which these point are set to an
1119error code and an offset (number of code units) within the pattern,
1120respectively, when <b>pcre2_compile()</b> returns NULL because a compilation
1121error has occurred. The values are not defined when compilation is successful
1122and <b>pcre2_compile()</b> returns a non-NULL value.
1123</P>
1124<P>
1125The <b>pcre2_get_error_message()</b> function (see "Obtaining a textual error
1126message"
1127<a href="#geterrormessage">below)</a>
1128provides a textual message for each error code. Compilation errors have
1129positive error codes; UTF formatting error codes are negative. For an invalid
1130UTF-8 or UTF-16 string, the offset is that of the first code unit of the
1131failing character.
1132</P>
1133<P>
1134Some errors are not detected until the whole pattern has been scanned; in these
1135cases, the offset passed back is the length of the pattern. Note that the
1136offset is in code units, not characters, even in a UTF mode. It may sometimes
1137point into the middle of a UTF-8 or UTF-16 character.
1138</P>
1139<P>
1140This code fragment shows a typical straightforward call to
1141<b>pcre2_compile()</b>:
1142<pre>
  pcre2_code *re;
  PCRE2_SIZE erroffset;
  int errorcode;
  re = pcre2_compile(
    (PCRE2_SPTR)"^A.*Z",    /* the pattern */
    PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
    0,                      /* default options */
    &errorcode,             /* for error code */
    &erroffset,             /* for error offset */
    NULL);                  /* no compile context */
1153</pre>
1154The following names for option bits are defined in the <b>pcre2.h</b> header
1155file:
1156<pre>
1157  PCRE2_ANCHORED
1158</pre>
1159If this bit is set, the pattern is forced to be "anchored", that is, it is
1160constrained to match only at the first matching point in the string that is
1161being searched (the "subject string"). This effect can also be achieved by
1162appropriate constructs in the pattern itself, which is the only way to do it in
1163Perl.
1164<pre>
1165  PCRE2_ALLOW_EMPTY_CLASS
1166</pre>
1167By default, for compatibility with Perl, a closing square bracket that
1168immediately follows an opening one is treated as a data character for the
1169class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the class, which
1170therefore contains no characters and so can never match.
1171<pre>
1172  PCRE2_ALT_BSUX
1173</pre>
This option requests alternative handling of three escape sequences, which
makes PCRE2's behaviour more like ECMAScript (aka JavaScript). When it is set:
1176</P>
1177<P>
1178(1) \U matches an upper case "U" character; by default \U causes a compile
1179time error (Perl uses \U to upper case subsequent characters).
1180</P>
1181<P>
1182(2) \u matches a lower case "u" character unless it is followed by four
1183hexadecimal digits, in which case the hexadecimal number defines the code point
1184to match. By default, \u causes a compile time error (Perl uses it to upper
1185case the following character).
1186</P>
1187<P>
1188(3) \x matches a lower case "x" character unless it is followed by two
1189hexadecimal digits, in which case the hexadecimal number defines the code point
1190to match. By default, as in Perl, a hexadecimal number is always expected after
1191\x, but it may have zero, one, or two digits (so, for example, \xz matches a
1192binary zero character followed by z).
1193<pre>
1194  PCRE2_ALT_CIRCUMFLEX
1195</pre>
1196In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter
1197matches at the start of the subject (unless PCRE2_NOTBOL is set), and also
1198after any internal newline. However, it does not match after a newline at the
1199end of the subject, for compatibility with Perl. If you want a multiline
1200circumflex also to match after a terminating newline, you must set
1201PCRE2_ALT_CIRCUMFLEX.
1202<pre>
1203  PCRE2_ALT_VERBNAMES
1204</pre>
1205By default, for compatibility with Perl, the name in any verb sequence such as
1206(*MARK:NAME) is any sequence of characters that does not include a closing
1207parenthesis. The name is not processed in any way, and it is not possible to
1208include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
1209option is set, normal backslash processing is applied to verb names and only an
1210unescaped closing parenthesis terminates the name. A closing parenthesis can be
1211included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
1212option is set, unescaped whitespace in verb names is skipped and #-comments are
1213recognized, exactly as in the rest of the pattern.
1214<pre>
1215  PCRE2_AUTO_CALLOUT
1216</pre>
1217If this bit is set, <b>pcre2_compile()</b> automatically inserts callout items,
1218all with number 255, before each pattern item. For discussion of the callout
1219facility, see the
1220<a href="pcre2callout.html"><b>pcre2callout</b></a>
1221documentation.
1222<pre>
1223  PCRE2_CASELESS
1224</pre>
1225If this bit is set, letters in the pattern match both upper and lower case
1226letters in the subject. It is equivalent to Perl's /i option, and it can be
1227changed within a pattern by a (?i) option setting.
1228<pre>
1229  PCRE2_DOLLAR_ENDONLY
1230</pre>
1231If this bit is set, a dollar metacharacter in the pattern matches only at the
1232end of the subject string. Without this option, a dollar also matches
1233immediately before a newline at the end of the string (but not before any other
1234newlines). The PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is
1235set. There is no equivalent to this option in Perl, and no way to set it within
1236a pattern.
1237<pre>
1238  PCRE2_DOTALL
1239</pre>
1240If this bit is set, a dot metacharacter in the pattern matches any character,
1241including one that indicates a newline. However, it only ever matches one
1242character, even if newlines are coded as CRLF. Without this option, a dot does
1243not match when the current position in the subject is at a newline. This option
1244is equivalent to Perl's /s option, and it can be changed within a pattern by a
1245(?s) option setting. A negative class such as [^a] always matches newline
1246characters, independent of the setting of this option.
1247<pre>
1248  PCRE2_DUPNAMES
1249</pre>
1250If this bit is set, names used to identify capturing subpatterns need not be
1251unique. This can be helpful for certain types of pattern when it is known that
1252only one instance of the named subpattern can ever be matched. There are more
1253details of named subpatterns below; see also the
1254<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
1255documentation.
1256<pre>
1257  PCRE2_EXTENDED
1258</pre>
1259If this bit is set, most white space characters in the pattern are totally
1260ignored except when escaped or inside a character class. However, white space
1261is not allowed within sequences such as (?&#62; that introduce various
1262parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
1263Ignorable white space is permitted between an item and a following quantifier
1264and between a quantifier and a following + that indicates possessiveness.
1265</P>
1266<P>
1267PCRE2_EXTENDED also causes characters between an unescaped # outside a
1268character class and the next newline, inclusive, to be ignored, which makes it
1269possible to include comments inside complicated patterns. Note that the end of
1270this type of comment is a literal newline sequence in the pattern; escape
1271sequences that happen to represent a newline do not count. PCRE2_EXTENDED is
1272equivalent to Perl's /x option, and it can be changed within a pattern by a
1273(?x) option setting.
1274</P>
1275<P>
1276Which characters are interpreted as newlines can be specified by a setting in
1277the compile context that is passed to <b>pcre2_compile()</b> or by a special
1278sequence at the start of the pattern, as described in the section entitled
1279<a href="pcre2pattern.html#newlines">"Newline conventions"</a>
1280in the <b>pcre2pattern</b> documentation. A default is defined when PCRE2 is
1281built.
1282<pre>
1283  PCRE2_FIRSTLINE
1284</pre>
1285If this option is set, an unanchored pattern is required to match before or at
1286the first newline in the subject string, though the matched text may continue
1287over the newline. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
1288general limiting facility. If PCRE2_FIRSTLINE is set with an offset limit, a
1289match must occur in the first line and also within the offset limit. In other
1290words, whichever limit comes first is used.
1291<pre>
1292  PCRE2_MATCH_UNSET_BACKREF
1293</pre>
1294If this option is set, a back reference to an unset subpattern group matches an
1295empty string (by default this causes the current matching alternative to fail).
1296A pattern such as (\1)(a) succeeds when this option is set (assuming it can
1297find an "a" in the subject), whereas it fails by default, for Perl
compatibility. Setting this option makes PCRE2 behave more like ECMAScript (aka
JavaScript).
1300<pre>
1301  PCRE2_MULTILINE
1302</pre>
1303By default, for the purposes of matching "start of line" and "end of line",
1304PCRE2 treats the subject string as consisting of a single line of characters,
1305even if it actually contains newlines. The "start of line" metacharacter (^)
1306matches only at the start of the string, and the "end of line" metacharacter
1307($) matches only at the end of the string, or before a terminating newline
1308(except when PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless
1309PCRE2_DOTALL is set, the "any character" metacharacter (.) does not match at a
1310newline. This behaviour (for ^, $, and dot) is the same as Perl.
1311</P>
1312<P>
When PCRE2_MULTILINE is set, the "start of line" and "end of line"
1314constructs match immediately following or immediately before internal newlines
1315in the subject string, respectively, as well as at the very start and end. This
1316is equivalent to Perl's /m option, and it can be changed within a pattern by a
1317(?m) option setting. Note that the "start of line" metacharacter does not match
1318after a newline at the end of the subject, for compatibility with Perl.
1319However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
1320there are no newlines in a subject string, or no occurrences of ^ or $ in a
1321pattern, setting PCRE2_MULTILINE has no effect.
1322<pre>
1323  PCRE2_NEVER_BACKSLASH_C
1324</pre>
1325This option locks out the use of \C in the pattern that is being compiled.
1326This escape can cause unpredictable behaviour in UTF-8 or UTF-16 modes, because
1327it may leave the current matching point in the middle of a multi-code-unit
1328character. This option may be useful in applications that process patterns from
1329external sources. Note that there is also a build-time option that permanently
1330locks out the use of \C.
1331<pre>
1332  PCRE2_NEVER_UCP
1333</pre>
1334This option locks out the use of Unicode properties for handling \B, \b, \D,
1335\d, \S, \s, \W, \w, and some of the POSIX character classes, as described
1336for the PCRE2_UCP option below. In particular, it prevents the creator of the
1337pattern from enabling this facility by starting the pattern with (*UCP). This
1338option may be useful in applications that process patterns from external
sources. The option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
1340<pre>
1341  PCRE2_NEVER_UTF
1342</pre>
1343This option locks out interpretation of the pattern as UTF-8, UTF-16, or
1344UTF-32, depending on which library is in use. In particular, it prevents the
1345creator of the pattern from switching to UTF interpretation by starting the
1346pattern with (*UTF). This option may be useful in applications that process
1347patterns from external sources. The combination of PCRE2_UTF and
1348PCRE2_NEVER_UTF causes an error.
1349<pre>
1350  PCRE2_NO_AUTO_CAPTURE
1351</pre>
1352If this option is set, it disables the use of numbered capturing parentheses in
1353the pattern. Any opening parenthesis that is not followed by ? behaves as if it
1354were followed by ?: but named parentheses can still be used for capturing (and
1355they acquire numbers in the usual way). There is no equivalent of this option
1356in Perl. Note that, if this option is set, references to capturing groups (back
1357references or recursion/subroutine calls) may only refer to named groups,
1358though the reference can be by name or by number.
1359<pre>
1360  PCRE2_NO_AUTO_POSSESS
1361</pre>
1362If this option is set, it disables "auto-possessification", which is an
1363optimization that, for example, turns a+b into a++b in order to avoid
1364backtracks into a+ that can never be successful. However, if callouts are in
1365use, auto-possessification means that some callouts are never taken. You can
1366set this option if you want the matching functions to do a full unoptimized
1367search and run all the callouts, but it is mainly provided for testing
1368purposes.
1369<pre>
1370  PCRE2_NO_DOTSTAR_ANCHOR
1371</pre>
1372If this option is set, it disables an optimization that is applied when .* is
1373the first significant item in a top-level branch of a pattern, and all the
1374other branches also start with .* or with \A or \G or ^. The optimization is
1375automatically disabled for .* if it is inside an atomic group or a capturing
1376group that is the subject of a back reference, or if the pattern contains
1377(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
1378automatically anchored if PCRE2_DOTALL is set for all the .* items and
1379PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
1380must start either at the start of the subject or following a newline is
1381remembered. Like other optimizations, this can cause callouts to be skipped.
1382<pre>
1383  PCRE2_NO_START_OPTIMIZE
1384</pre>
1385This is an option whose main effect is at matching time. It does not change
1386what <b>pcre2_compile()</b> generates, but it does affect the output of the JIT
1387compiler.
1388</P>
1389<P>
1390There are a number of optimizations that may occur at the start of a match, in
1391order to speed up the process. For example, if it is known that an unanchored
1392match must start with a specific character, the matching code searches the
1393subject for that character, and fails immediately if it cannot find it, without
1394actually running the main matching function. This means that a special item
1395such as (*COMMIT) at the start of a pattern is not considered until after a
1396suitable starting point for the match has been found. Also, when callouts or
1397(*MARK) items are in use, these "start-up" optimizations can cause them to be
1398skipped if the pattern is never actually used. The start-up optimizations are
1399in effect a pre-scan of the subject that takes place before the pattern is run.
1400</P>
1401<P>
1402The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1403possibly causing performance to suffer, but ensuring that in cases where the
1404result is "no match", the callouts do occur, and that items such as (*COMMIT)
1405and (*MARK) are considered at every possible starting position in the subject
1406string.
1407</P>
1408<P>
1409Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching operation.
1410Consider the pattern
<pre>
  (*COMMIT)ABC
</pre>
1414When this is compiled, PCRE2 records the fact that a match must start with the
1415character "A". Suppose the subject string is "DEFABC". The start-up
1416optimization scans along the subject, finds "A" and runs the first match
1417attempt from there. The (*COMMIT) item means that the pattern must match the
1418current starting position, which in this case, it does. However, if the same
1419match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
1420subject string does not happen. The first match attempt is run starting from
1421"D" and when this fails, (*COMMIT) prevents any further matches being tried, so
1422the overall result is "no match". There are also other start-up optimizations.
1423For example, a minimum length for the subject may be recorded. Consider the
1424pattern
<pre>
  (*MARK:A)(X|Y)
</pre>
1428The minimum length for a match is one character. If the subject is "ABC", there
1429will be attempts to match "ABC", "BC", and "C". An attempt to match an empty
1430string at the end of the subject does not take place, because PCRE2 knows that
1431the subject is now too short, and so the (*MARK) is never encountered. In this
1432case, the optimization does not affect the overall match result, which is still
1433"no match", but it does affect the auxiliary information that is returned.
1434<pre>
1435  PCRE2_NO_UTF_CHECK
1436</pre>
1437When PCRE2_UTF is set, the validity of the pattern as a UTF string is
1438automatically checked. There are discussions about the validity of
1439<a href="pcre2unicode.html#utf8strings">UTF-8 strings,</a>
1440<a href="pcre2unicode.html#utf16strings">UTF-16 strings,</a>
1441and
1442<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
1443in the
1444<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
1445document.
1446If an invalid UTF sequence is found, <b>pcre2_compile()</b> returns a negative
1447error code.
1448</P>
1449<P>
1450If you know that your pattern is valid, and you want to skip this check for
1451performance reasons, you can set the PCRE2_NO_UTF_CHECK option. When it is set,
1452the effect of passing an invalid UTF string as a pattern is undefined. It may
1453cause your program to crash or loop. Note that this option can also be passed
to <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, to suppress validity
1455checking of the subject string.
1456<pre>
1457  PCRE2_UCP
1458</pre>
1459This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
1460\w, and some of the POSIX character classes. By default, only ASCII characters
1461are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
1462classify characters. More details are given in the section on
1463<a href="pcre2pattern.html#genericchartypes">generic character types</a>
1464in the
1465<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
1466page. If you set PCRE2_UCP, matching one of the items it affects takes much
1467longer. The option is available only if PCRE2 has been compiled with Unicode
1468support.
1469<pre>
1470  PCRE2_UNGREEDY
1471</pre>
1472This option inverts the "greediness" of the quantifiers so that they are not
1473greedy by default, but become greedy if followed by "?". It is not compatible
1474with Perl. It can also be set by a (?U) option setting within the pattern.
1475<pre>
1476  PCRE2_USE_OFFSET_LIMIT
1477</pre>
1478This option must be set for <b>pcre2_compile()</b> if
1479<b>pcre2_set_offset_limit()</b> is going to be used to set a non-default offset
1480limit in a match context for matches that use this pattern. An error is
1481generated if an offset limit is set without this option. For more details, see
1482the description of <b>pcre2_set_offset_limit()</b> in the
1483<a href="#matchcontext">section</a>
1484that describes match contexts. See also the PCRE2_FIRSTLINE
1485option above.
1486<pre>
1487  PCRE2_UTF
1488</pre>
1489This option causes PCRE2 to regard both the pattern and the subject strings
1490that are subsequently processed as strings of UTF characters instead of
1491single-code-unit strings. It is available when PCRE2 is built to include
1492Unicode support (which is the default). If Unicode support is not available,
1493the use of this option provokes an error. Details of how this option changes
1494the behaviour of PCRE2 are given in the
1495<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
1496page.
1497</P>
1498<br><a name="SEC19" href="#TOC1">COMPILATION ERROR CODES</a><br>
1499<P>
1500There are over 80 positive error codes that <b>pcre2_compile()</b> may return
1501(via <i>errorcode</i>) if it finds an error in the pattern. There are also some
1502negative error codes that are used for invalid UTF strings. These are the same
1503as given by <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and are described
1504in the
1505<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
1506page. The <b>pcre2_get_error_message()</b> function (see "Obtaining a textual
1507error message"
1508<a href="#geterrormessage">below)</a>
1509can be called to obtain a textual error message from any error code.
1510<a name="jitcompiling"></a></P>
1511<br><a name="SEC20" href="#TOC1">JUST-IN-TIME (JIT) COMPILATION</a><br>
1512<P>
1513<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
1514<br>
1515<br>
1516<b>int pcre2_jit_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
1517<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
1518<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
1519<b>  pcre2_match_context *<i>mcontext</i>);</b>
1520<br>
1521<br>
1522<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
1523<br>
1524<br>
1525<b>pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE <i>startsize</i>,</b>
1526<b>  PCRE2_SIZE <i>maxsize</i>, pcre2_general_context *<i>gcontext</i>);</b>
1527<br>
1528<br>
1529<b>void pcre2_jit_stack_assign(pcre2_match_context *<i>mcontext</i>,</b>
1530<b>  pcre2_jit_callback <i>callback_function</i>, void *<i>callback_data</i>);</b>
1531<br>
1532<br>
1533<b>void pcre2_jit_stack_free(pcre2_jit_stack *<i>jit_stack</i>);</b>
1534</P>
1535<P>
1536These functions provide support for JIT compilation, which, if the just-in-time
1537compiler is available, further processes a compiled pattern into machine code
1538that executes much faster than the <b>pcre2_match()</b> interpretive matching
1539function. Full details are given in the
1540<a href="pcre2jit.html"><b>pcre2jit</b></a>
1541documentation.
1542</P>
1543<P>
1544JIT compilation is a heavyweight optimization. It can take some time for
1545patterns to be analyzed, and for one-off matches and simple patterns the
1546benefit of faster execution might be offset by a much slower compilation time.
Most, but not all, patterns can be optimized by the JIT compiler.
1548<a name="localesupport"></a></P>
1549<br><a name="SEC21" href="#TOC1">LOCALE SUPPORT</a><br>
1550<P>
1551PCRE2 handles caseless matching, and determines whether characters are letters,
1552digits, or whatever, by reference to a set of tables, indexed by character code
1553point. This applies only to characters whose code points are less than 256. By
1554default, higher-valued code points never match escapes such as \w or \d.
1555However, if PCRE2 is built with UTF support, all characters can be tested with
1556\p and \P, or, alternatively, the PCRE2_UCP option can be set when a pattern
1557is compiled; this causes \w and friends to use Unicode property support
1558instead of the built-in tables.
1559</P>
1560<P>
1561The use of locales with Unicode is discouraged. If you are handling characters
1562with code points greater than 128, you should either use Unicode support, or
1563use locales, but not try to mix the two.
1564</P>
1565<P>
1566PCRE2 contains an internal set of character tables that are used by default.
1567These are sufficient for many applications. Normally, the internal tables
1568recognize only ASCII characters. However, when PCRE2 is built, it is possible
1569to cause the internal tables to be rebuilt in the default "C" locale of the
1570local system, which may cause them to be different.
1571</P>
1572<P>
1573The internal tables can be overridden by tables supplied by the application
1574that calls PCRE2. These may be created in a different locale from the default.
1575As more and more applications change to using Unicode, the need for this locale
1576support is expected to die away.
1577</P>
1578<P>
1579External tables are built by calling the <b>pcre2_maketables()</b> function, in
1580the relevant locale. The result can be passed to <b>pcre2_compile()</b> as often
1581as necessary, by creating a compile context and calling
1582<b>pcre2_set_character_tables()</b> to set the tables pointer therein. For
1583example, to build and use tables that are appropriate for the French locale
1584(where accented characters with values greater than 128 are treated as
1585letters), the following code could be used:
<pre>
  setlocale(LC_CTYPE, "fr_FR");
  tables = pcre2_maketables(NULL);
  ccontext = pcre2_compile_context_create(NULL);
  pcre2_set_character_tables(ccontext, tables);
  re = pcre2_compile(..., ccontext);
</pre>
1593The locale name "fr_FR" is used on Linux and other Unix-like systems; if you
1594are using Windows, the name for the French locale is "french". It is the
1595caller's responsibility to ensure that the memory containing the tables remains
1596available for as long as it is needed.
1597</P>
1598<P>
1599The pointer that is passed (via the compile context) to <b>pcre2_compile()</b>
1600is saved with the compiled pattern, and the same tables are used by
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>. Thus, for any single pattern,
compilation and matching both happen in the same locale, but different patterns
1603can be processed in different locales.
1604<a name="infoaboutpattern"></a></P>
1605<br><a name="SEC22" href="#TOC1">INFORMATION ABOUT A COMPILED PATTERN</a><br>
1606<P>
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, void *<i>where</i>);</b>
1608</P>
1609<P>
1610The <b>pcre2_pattern_info()</b> function returns general information about a
1611compiled pattern. For information about callouts, see the
<a href="#infoaboutcallouts">next section.</a>
1613The first argument for <b>pcre2_pattern_info()</b> is a pointer to the compiled
1614pattern. The second argument specifies which piece of information is required,
1615and the third argument is a pointer to a variable to receive the data. If the
1616third argument is NULL, the first argument is ignored, and the function returns
1617the size in bytes of the variable that is required for the information
requested. Otherwise, the yield of the function is zero for success, or one of
1619the following negative numbers:
<pre>
  PCRE2_ERROR_NULL           the argument <i>code</i> was NULL
  PCRE2_ERROR_BADMAGIC       the "magic number" was not found
  PCRE2_ERROR_BADOPTION      the value of <i>what</i> was invalid
  PCRE2_ERROR_UNSET          the requested field is not set
</pre>
The "magic number" is placed at the start of each compiled pattern as a simple
1627check against passing an arbitrary memory pointer. Here is a typical call of
1628<b>pcre2_pattern_info()</b>, to obtain the length of the compiled pattern:
<pre>
  int rc;
  size_t length;
  rc = pcre2_pattern_info(
    re,               /* result of pcre2_compile() */
    PCRE2_INFO_SIZE,  /* what is required */
    &length);         /* where to put the data */
</pre>
1637The possible values for the second argument are defined in <b>pcre2.h</b>, and
1638are as follows:
1639<pre>
1640  PCRE2_INFO_ALLOPTIONS
1641  PCRE2_INFO_ARGOPTIONS
1642</pre>
1643Return a copy of the pattern's options. The third argument should point to a
1644<b>uint32_t</b> variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
1645were passed to <b>pcre2_compile()</b>, whereas PCRE2_INFO_ALLOPTIONS returns
1646the compile options as modified by any top-level (*XXX) option settings such as
1647(*UTF) at the start of the pattern itself.
1648</P>
1649<P>
1650For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
1651option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED and PCRE2_UTF.
1652Option settings such as (?i) that can change within a pattern do not affect the
1653result of PCRE2_INFO_ALLOPTIONS, even if they appear right at the start of the
1654pattern. (This was different in some earlier releases.)
1655</P>
1656<P>
1657A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
1658the first significant item in every top-level branch is one of the following:
<pre>
  ^     unless PCRE2_MULTILINE is set
  \A    always
  \G    always
  .*    sometimes - see below
</pre>
1665When .* is the first significant item, anchoring is possible only when all the
1666following are true:
<pre>
  .* is not in an atomic group
  .* is not in a capturing group that is the subject of a back reference
  PCRE2_DOTALL is in force for .*
  Neither (*PRUNE) nor (*SKIP) appears in the pattern.
  PCRE2_NO_DOTSTAR_ANCHOR is not set.
</pre>
1674For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
1675options returned for PCRE2_INFO_ALLOPTIONS.
1676<pre>
1677  PCRE2_INFO_BACKREFMAX
1678</pre>
1679Return the number of the highest back reference in the pattern. The third
argument should point to a <b>uint32_t</b> variable. Named subpatterns acquire
1681numbers as well as names, and these count towards the highest back reference.
1682Back references such as \4 or \g{12} match the captured characters of the
1683given group, but in addition, the check that a capturing group is set in a
1684conditional subpattern such as (?(3)a|b) is also a back reference. Zero is
1685returned if there are no back references.
1686<pre>
1687  PCRE2_INFO_BSR
1688</pre>
1689The output is a uint32_t whose value indicates what character sequences the \R
1690escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R matches
1691any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means that \R
1692matches only CR, LF, or CRLF.
1693<pre>
1694  PCRE2_INFO_CAPTURECOUNT
1695</pre>
1696Return the highest capturing subpattern number in the pattern. In patterns
1697where (?| is not used, this is also the total number of capturing subpatterns.
The third argument should point to a <b>uint32_t</b> variable.
1699<pre>
1700  PCRE2_INFO_FIRSTBITMAP
1701</pre>
1702In the absence of a single first code unit for a non-anchored pattern,
1703<b>pcre2_compile()</b> may construct a 256-bit table that defines a fixed set of
1704values for the first code unit in any match. For example, a pattern that starts
1705with [abc] results in a table with three bits set. When code unit values
1706greater than 255 are supported, the flag bit for 255 means "any code unit of
1707value 255 or above". If such a table was constructed, a pointer to it is
returned. Otherwise NULL is returned. The third argument should point to a
1709<b>const uint8_t *</b> variable.
1710<pre>
1711  PCRE2_INFO_FIRSTCODETYPE
1712</pre>
1713Return information about the first code unit of any matched string, for a
non-anchored pattern. The third argument should point to a <b>uint32_t</b>
1715variable. If there is a fixed first value, for example, the letter "c" from a
1716pattern such as (cat|cow|coyote), 1 is returned, and the character value can be
1717retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but
1718it is known that a match can occur only at the start of the subject or
1719following a newline in the subject, 2 is returned. Otherwise, and for anchored
1720patterns, 0 is returned.
1721<pre>
1722  PCRE2_INFO_FIRSTCODEUNIT
1723</pre>
1724Return the value of the first code unit of any matched string in the situation
1725where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. The third
argument should point to a <b>uint32_t</b> variable. In the 8-bit library, the
value is always less than 256. In the 16-bit library the value can be up to
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
and up to 0xffffffff when not using UTF-32 mode.
1730<pre>
1731  PCRE2_INFO_HASBACKSLASHC
1732</pre>
1733Return 1 if the pattern contains any instances of \C, otherwise 0. The third
argument should point to a <b>uint32_t</b> variable.
1735<pre>
1736  PCRE2_INFO_HASCRORLF
1737</pre>
1738Return 1 if the pattern contains any explicit matches for CR or LF characters,
otherwise 0. The third argument should point to a <b>uint32_t</b> variable. An
1740explicit match is either a literal CR or LF character, or \r or \n.
1741<pre>
1742  PCRE2_INFO_JCHANGED
1743</pre>
Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
0. The third argument should point to a <b>uint32_t</b> variable. (?J) and
1746(?-J) set and unset the local PCRE2_DUPNAMES option, respectively.
1747<pre>
1748  PCRE2_INFO_JITSIZE
1749</pre>
1750If the compiled pattern was successfully processed by
1751<b>pcre2_jit_compile()</b>, return the size of the JIT compiled code, otherwise
1752return zero. The third argument should point to a <b>size_t</b> variable.
1753<pre>
1754  PCRE2_INFO_LASTCODETYPE
1755</pre>
Return 1 if there is a rightmost literal code unit that must exist in any
matched string, other than at its start. The third argument should point to a
1758<b>uint32_t</b> variable. If there is no such value, 0 is returned. When 1 is
1759returned, the code unit value itself can be retrieved using
1760PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
1761recorded only if it follows something of variable length. For example, for the
1762pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned from
1763PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
1764<pre>
1765  PCRE2_INFO_LASTCODEUNIT
1766</pre>
Return the value of the rightmost literal code unit that must exist in any
matched string, other than at its start, if such a value has been recorded. The
third argument should point to a <b>uint32_t</b> variable. If there is no such
1770value, 0 is returned.
1771<pre>
1772  PCRE2_INFO_MATCHEMPTY
1773</pre>
1774Return 1 if the pattern might match an empty string, otherwise 0. The third
argument should point to a <b>uint32_t</b> variable. When a pattern contains
1776recursive subroutine calls it is not always possible to determine whether or
1777not it can match an empty string. PCRE2 takes a cautious approach and returns 1
1778in such cases.
1779<pre>
1780  PCRE2_INFO_MATCHLIMIT
1781</pre>
1782If the pattern set a match limit by including an item of the form
1783(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
1784should point to an unsigned 32-bit integer. If no such value has been set, the
1785call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET.
1786<pre>
1787  PCRE2_INFO_MAXLOOKBEHIND
1788</pre>
1789Return the number of characters (not code units) in the longest lookbehind
1790assertion in the pattern. The third argument should point to an unsigned 32-bit
1791integer. This information is useful when doing multi-segment matching using the
1792partial matching facilities. Note that the simple assertions \b and \B
1793require a one-character lookbehind. \A also registers a one-character
1794lookbehind, though it does not actually inspect the previous character. This is
1795to ensure that at least one character from the old segment is retained when a
1796new segment is processed. Otherwise, if there are no lookbehinds in the
1797pattern, \A might match incorrectly at the start of a new segment.
1798<pre>
1799  PCRE2_INFO_MINLENGTH
1800</pre>
1801If a minimum length for matching subject strings was computed, its value is
1802returned. Otherwise the returned value is 0. The value is a number of
1803characters, which in UTF mode may be different from the number of code units.
The third argument should point to a <b>uint32_t</b> variable. The value is a
1805lower bound to the length of any matching string. There may not be any strings
1806of that length that do actually match, but every string that does match is at
1807least that long.
1808<pre>
1809  PCRE2_INFO_NAMECOUNT
1810  PCRE2_INFO_NAMEENTRYSIZE
1811  PCRE2_INFO_NAMETABLE
1812</pre>
1813PCRE2 supports the use of named as well as numbered capturing parentheses. The
1814names are just an additional way of identifying the parentheses, which still
1815acquire numbers. Several convenience functions such as
1816<b>pcre2_substring_get_byname()</b> are provided for extracting captured
1817substrings by name. It is also possible to extract the data directly, by first
1818converting the name to a number in order to access the correct pointers in the
1819output vector (described with <b>pcre2_match()</b> below). To do the conversion,
1820you need to use the name-to-number map, which is described by these three
1821values.
1822</P>
1823<P>
1824The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
1825the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
1826entry in code units; both of these return a <b>uint32_t</b> value. The entry
1827size depends on the length of the longest name.
1828</P>
1829<P>
1830PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
1831a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first
1832two bytes of each entry are the number of the capturing parenthesis, most
1833significant byte first. In the 16-bit library, the pointer points to 16-bit
1834code units, the first of which contains the parenthesis number. In the 32-bit
1835library, the pointer points to 32-bit code units, the first of which contains
1836the parenthesis number. The rest of the entry is the corresponding name, zero
1837terminated.
1838</P>
1839<P>
1840The names are in alphabetical order. If (?| is used to create multiple groups
1841with the same number, as described in the
1842<a href="pcre2pattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a>
1843in the
1844<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
1845page, the groups may be given the same name, but there is only one entry in the
1846table. Different names for groups of the same number are not permitted.
1847</P>
1848<P>
1849Duplicate names for subpatterns with different numbers are permitted, but only
1850if PCRE2_DUPNAMES is set. They appear in the table in the order in which they
1851were found in the pattern. In the absence of (?| this is the order of
1852increasing number; when (?| is used this is not necessarily the case because
1853later subpatterns may have lower numbers.
1854</P>
1855<P>
1856As a simple example of the name/number table, consider the following pattern
1857after compilation by the 8-bit library (assume PCRE2_EXTENDED is set, so white
1858space - including newlines - is ignored):
<pre>
  (?&#60;date&#62; (?&#60;year&#62;(\d\d)?\d\d) - (?&#60;month&#62;\d\d) - (?&#60;day&#62;\d\d) )
</pre>
1862There are four named subpatterns, so the table has four entries, and each entry
1863in the table is eight bytes long. The table is as follows, with non-printing
bytes shown in hexadecimal, and undefined bytes shown as ??:
<pre>
  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??
</pre>
1871When writing code to extract data from named subpatterns using the
1872name-to-number map, remember that the length of the entries is likely to be
1873different for each compiled pattern.
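A lookup over the 8-bit table layout described above might be sketched as follows. In real use the table pointer, entry count, and entry size would come from <b>pcre2_pattern_info()</b> with PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and PCRE2_INFO_NAMEENTRYSIZE. Here a copy of the example table is hard-coded, with the undefined bytes set to zero, so that the sketch stands alone. The names example_table and group_number_from_table() are invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The example table from the text: four entries of eight bytes each. Bytes
   shown as ?? in the text are undefined; they are zero here only for the
   purposes of the illustration. */
static const uint8_t example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};

/* Scan an 8-bit name table for "name" and return the group number held in
   the first two bytes of the matching entry (most significant byte first),
   or -1 if the name is not present. The entry size is passed in rather than
   assumed, because it differs from pattern to pattern. */
static int group_number_from_table(const uint8_t *table, uint32_t name_count,
                                   uint32_t entry_size, const char *name)
{
    for (uint32_t i = 0; i < name_count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

With the example table, looking up "month" gives 4 and looking up "day" gives 5. If duplicate names are in use, this simple scan returns only the first entry found.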
1874<pre>
1875  PCRE2_INFO_NEWLINE
1876</pre>
1877The output is a <b>uint32_t</b> with one of the following values:
<pre>
  PCRE2_NEWLINE_CR       Carriage return (CR)
  PCRE2_NEWLINE_LF       Linefeed (LF)
  PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
  PCRE2_NEWLINE_ANY      Any Unicode line ending
  PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
</pre>
1885This specifies the default character sequence that will be recognized as
1886meaning "newline" while matching.
1887<pre>
1888  PCRE2_INFO_RECURSIONLIMIT
1889</pre>
1890If the pattern set a recursion limit by including an item of the form
1891(*LIMIT_RECURSION=nnnn) at the start, the value is returned. The third
1892argument should point to an unsigned 32-bit integer. If no such value has been
1893set, the call to <b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET.
1894<pre>
1895  PCRE2_INFO_SIZE
1896</pre>
1897Return the size of the compiled pattern in bytes (for all three libraries). The
1898third argument should point to a <b>size_t</b> variable. This value includes the
1899size of the general data block that precedes the code units of the compiled
1900pattern itself. The value that is used when <b>pcre2_compile()</b> is getting
1901memory in which to place the compiled pattern may be slightly larger than the
1902value returned by this option, because there are cases where the code that
1903calculates the size has to over-estimate. Processing a pattern with the JIT
1904compiler does not alter the value returned by this option.
1905<a name="infoaboutcallouts"></a></P>
1906<br><a name="SEC23" href="#TOC1">INFORMATION ABOUT A PATTERN'S CALLOUTS</a><br>
1907<P>
1908<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
1909<b>  int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
1910<b>  void *<i>user_data</i>);</b>
1911<br>
1912<br>
1913A script language that supports the use of string arguments in callouts might
1914like to scan all the callouts in a pattern before running the match. This can
1915be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
1916pointer to a compiled pattern, the second points to a callback function, and
1917the third is arbitrary user data. The callback function is called for every
1918callout in the pattern in the order in which they appear. Its first argument is
1919a pointer to a callout enumeration block, and its second argument is the
1920<i>user_data</i> value that was passed to <b>pcre2_callout_enumerate()</b>. The
1921contents of the callout enumeration block are described in the
1922<a href="pcre2callout.html"><b>pcre2callout</b></a>
1923documentation, which also gives further details about callouts.
1924</P>
1925<br><a name="SEC24" href="#TOC1">SERIALIZATION AND PRECOMPILING</a><br>
1926<P>
1927It is possible to save compiled patterns on disc or elsewhere, and reload them
1928later, subject to a number of restrictions. The functions whose names begin
1929with <b>pcre2_serialize_</b> are used for this purpose. They are described in
1930the
1931<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
1932documentation.
1933<a name="matchdatablock"></a></P>
1934<br><a name="SEC25" href="#TOC1">THE MATCH DATA BLOCK</a><br>
1935<P>
1936<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
1937<b>  pcre2_general_context *<i>gcontext</i>);</b>
1938<br>
1939<br>
1940<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
1941<b>  const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
1942<br>
1943<br>
1944<b>void pcre2_match_data_free(pcre2_match_data *<i>match_data</i>);</b>
1945</P>
<P>
Information about a successful or unsuccessful match is placed in a match
data block, which is an opaque structure that is accessed by function calls. In
particular, the match data block contains a vector of offsets into the subject
string that define the matched part of the subject and any substrings that were
captured. This is known as the <i>ovector</i>.
</P>
<P>
Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
<b>pcre2_jit_match()</b> you must create a match data block by calling one of
the creation functions above. For <b>pcre2_match_data_create()</b>, the first
argument is the number of pairs of offsets in the <i>ovector</i>. One pair of
offsets is required to identify the string that matched the whole pattern, with
another pair for each captured substring. For example, a value of 4 creates
enough space to record the matched portion of the subject plus three captured
substrings. A minimum of 1 pair is imposed by
<b>pcre2_match_data_create()</b>, so it is always possible to return the overall
matched string.
</P>
1965<P>
1966The second argument of <b>pcre2_match_data_create()</b> is a pointer to a
1967general context, which can specify custom memory management for obtaining the
1968memory for the match data block. If you are not using custom memory management,
1969pass NULL, which causes <b>malloc()</b> to be used.
1970</P>
1971<P>
1972For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a
1973pointer to a compiled pattern. The ovector is created to be exactly the right
1974size to hold all the substrings a pattern might capture. The second argument is
1975again a pointer to a general context, but in this case if NULL is passed, the
1976memory is obtained using the same allocator that was used for the compiled
1977pattern (custom or default).
1978</P>
1979<P>
1980A match data block can be used many times, with the same or different compiled
1981patterns. You can extract information from a match data block after a match
1982operation has finished, using functions that are described in the sections on
1983<a href="#matchedstrings">matched strings</a>
1984and
1985<a href="#matchotherdata">other match data</a>
1986below.
1987</P>
1988<P>
1989When a call of <b>pcre2_match()</b> fails, valid data is available in the match
1990block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ERROR_PARTIAL, or one
1991of the error codes for an invalid UTF string. Exactly what is available depends
1992on the error, and is detailed below.
1993</P>
1994<P>
1995When one of the matching functions is called, pointers to the compiled pattern
1996and the subject string are set in the match data block so that they can be
1997referenced by the extraction functions. After running a match, you must not
1998free a compiled pattern or a subject string until after all operations on the
1999match data block (for that match) have taken place.
2000</P>
2001<P>
2002When a match data block itself is no longer needed, it should be freed by
2003calling <b>pcre2_match_data_free()</b>.
2004</P>
2005<br><a name="SEC26" href="#TOC1">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a><br>
2006<P>
2007<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
2008<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
2009<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
2010<b>  pcre2_match_context *<i>mcontext</i>);</b>
2011</P>
2012<P>
2013The function <b>pcre2_match()</b> is called to match a subject string against a
2014compiled pattern, which is passed in the <i>code</i> argument. You can call
2015<b>pcre2_match()</b> with the same <i>code</i> argument as many times as you
2016like, in order to find multiple matches in the subject string or to match
2017different subject strings with the same pattern.
2018</P>
2019<P>
2020This function is the main matching facility of the library, and it operates in
2021a Perl-like manner. For specialist use there is also an alternative matching
2022function, which is described
2023<a href="#dfamatch">below</a>
2024in the section about the <b>pcre2_dfa_match()</b> function.
2025</P>
<P>
Here is an example of a simple call to <b>pcre2_match()</b>:
<pre>
  pcre2_match_data *md = pcre2_match_data_create(4, NULL);
  int rc = pcre2_match(
    re,                         /* result of pcre2_compile() */
    (PCRE2_SPTR)"some string",  /* the subject string */
    11,                         /* the length of the subject string */
    0,                          /* start at offset 0 in the subject */
    0,                          /* default options */
    md,                         /* the match data block */
    NULL);                      /* a match context; NULL means use defaults */
</pre>
If the subject string is zero-terminated, the length can be given as
PCRE2_ZERO_TERMINATED. A match context must be provided if certain less common
matching parameters are to be changed. For details, see the section on
<a href="#matchcontext">the match context</a>
above.
</P>
2045<br><b>
2046The string to be matched by <b>pcre2_match()</b>
2047</b><br>
<P>
The subject string is passed to <b>pcre2_match()</b> as a pointer in
<i>subject</i>, a length in <i>length</i>, and a starting offset in
<i>startoffset</i>. The length and offset are in code units, not characters.
That is, they are in bytes for the 8-bit library, 16-bit code units for the
16-bit library, and 32-bit code units for the 32-bit library, whether or not
UTF processing is enabled.
</P>
2056<P>
2057If <i>startoffset</i> is greater than the length of the subject,
2058<b>pcre2_match()</b> returns PCRE2_ERROR_BADOFFSET. When the starting offset is
2059zero, the search for a match starts at the beginning of the subject, and this
2060is by far the most common case. In UTF-8 or UTF-16 mode, the starting offset
2061must point to the start of a character, or to the end of the subject (in UTF-32
2062mode, one code unit equals one character, so all offsets are valid). Like the
2063pattern string, the subject may contain binary zeroes.
2064</P>
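<P>
For a UTF-8 subject, a caller that computes its own starting offsets may want
to check them before calling <b>pcre2_match()</b>. The following sketch shows
one way of doing so. The function name is hypothetical, and the code assumes
that the subject is already known to be valid UTF-8.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if offset is acceptable as a starting
   offset in a valid UTF-8 subject of the given length (in bytes), that is,
   if it is the end of the subject or does not point into the middle of a
   character. Returns 0 otherwise. */
int utf8_offset_is_char_start(const uint8_t *subject, size_t length,
                              size_t offset)
{
  assert(subject != NULL);
  if (offset > length) return 0;    /* pcre2_match() gives BADOFFSET */
  if (offset == length) return 1;   /* the end of the subject is allowed */
  /* UTF-8 continuation bytes have the form 10xxxxxx. */
  return (subject[offset] & 0xC0) != 0x80;
}
```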
<P>
A non-zero starting offset is useful when searching for another match in the
same subject by calling <b>pcre2_match()</b> again after a previous success.
Setting <i>startoffset</i> differs from passing over a shortened string and
setting PCRE2_NOTBOL in the case of a pattern that begins with any kind of
lookbehind. For example, consider the pattern
<pre>
  \Biss\B
</pre>
which finds occurrences of "iss" in the middle of words. (\B matches only if
the current position in the subject is not a word boundary.) When applied to
the string "Mississippi" the first call to <b>pcre2_match()</b> finds the first
occurrence. If <b>pcre2_match()</b> is called again with just the remainder of
the subject, namely "issippi", it does not match, because \B is always false at
the start of the subject, which is deemed to be a word boundary. However, if
<b>pcre2_match()</b> is passed the entire string again, but with
<i>startoffset</i> set to 4, it finds the second occurrence of "iss" because it
is able to look behind the starting point to discover that it is preceded by a
letter.
</P>
2085<P>
2086Finding all the matches in a subject is tricky when the pattern can match an
2087empty string. It is possible to emulate Perl's /g behaviour by first trying the
2088match again at the same offset, with the PCRE2_NOTEMPTY_ATSTART and
2089PCRE2_ANCHORED options, and then if that fails, advancing the starting offset
2090and trying an ordinary match again. There is some code that demonstrates how to
2091do this in the
2092<a href="pcre2demo.html"><b>pcre2demo</b></a>
2093sample program. In the most general case, you have to check to see if the
2094newline convention recognizes CRLF as a newline, and if so, and the current
2095character is CR followed by LF, advance the starting offset by two characters
2096instead of one.
2097</P>
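<P>
The step of advancing the starting offset after the anchored, non-empty retry
has failed might be written as shown below. This is a sketch for the 8-bit
library, modelled on the approach used in the <b>pcre2demo</b> program. The
function name and its two flag arguments are assumptions of this example: the
caller is expected to have worked out whether the newline convention recognizes
CRLF and whether the pattern was compiled with PCRE2_UTF.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: given that an empty string was matched at offset and
   the retry at the same offset failed, return the offset at which the next
   ordinary match attempt should start. The caller must ensure that
   offset < length; at the end of the subject the search is over. */
size_t advance_start_offset(const uint8_t *subject, size_t length,
                            size_t offset, int crlf_is_newline, int utf8)
{
  assert(subject != NULL && offset < length);

  /* Step over CRLF as a single unit when it is a valid newline. */
  if (crlf_is_newline && offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return offset + 2;

  offset++;

  /* In UTF-8 mode, skip continuation bytes (10xxxxxx) so that the new
     offset points to the start of a character. */
  if (utf8)
    while (offset < length && (subject[offset] & 0xC0) == 0x80) offset++;

  return offset;
}
```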
2098<P>
2099If a non-zero starting offset is passed when the pattern is anchored, one
2100attempt to match at the given offset is made. This can only succeed if the
2101pattern does not require the match to be at the start of the subject.
2102<a name="matchoptions"></a></P>
2103<br><b>
2104Option bits for <b>pcre2_match()</b>
2105</b><br>
2106<P>
2107The unused bits of the <i>options</i> argument for <b>pcre2_match()</b> must be
2108zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
2109PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT,
2110PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is
2111described below.
2112</P>
2113<P>
2114Setting PCRE2_ANCHORED at match time is not supported by the just-in-time (JIT)
2115compiler. If it is set, JIT matching is disabled and the normal interpretive
2116code in <b>pcre2_match()</b> is run. Apart from PCRE2_NO_JIT (obviously), the
2117remaining options are supported for JIT matching.
<pre>
  PCRE2_ANCHORED
</pre>
The PCRE2_ANCHORED option limits <b>pcre2_match()</b> to matching at the first
matching position. If a pattern was compiled with PCRE2_ANCHORED, or turned out
to be anchored by virtue of its contents, it cannot be made unanchored at
matching time. Note that setting the option at match time disables JIT
matching.
<pre>
  PCRE2_NOTBOL
</pre>
This option specifies that the first character of the subject string is not the
beginning of a line, so the circumflex metacharacter should not match before
it. Setting this without having set PCRE2_MULTILINE at compile time causes
circumflex never to match. This option affects only the behaviour of the
circumflex metacharacter. It does not affect \A.
2134<pre>
2135  PCRE2_NOTEOL
2136</pre>
2137This option specifies that the end of the subject string is not the end of a
2138line, so the dollar metacharacter should not match it nor (except in multiline
2139mode) a newline immediately before it. Setting this without having set
2140PCRE2_MULTILINE at compile time causes dollar never to match. This option
2141affects only the behaviour of the dollar metacharacter. It does not affect \Z
2142or \z.
2143<pre>
2144  PCRE2_NOTEMPTY
2145</pre>
2146An empty string is not considered to be a valid match if this option is set. If
2147there are alternatives in the pattern, they are tried. If all the alternatives
2148match the empty string, the entire match fails. For example, if the pattern
2149<pre>
2150  a?b?
2151</pre>
2152is applied to a string not beginning with "a" or "b", it matches an empty
2153string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
2154valid, so <b>pcre2_match()</b> searches further into the string for occurrences
2155of "a" or "b".
2156<pre>
2157  PCRE2_NOTEMPTY_ATSTART
2158</pre>
2159This is like PCRE2_NOTEMPTY, except that it locks out an empty string match
2160only at the first matching position, that is, at the start of the subject plus
2161the starting offset. An empty string match later in the subject is permitted.
2162If the pattern is anchored, such a match can occur only if the pattern contains
2163\K.
2164<pre>
2165  PCRE2_NO_JIT
2166</pre>
2167By default, if a pattern has been successfully processed by
2168<b>pcre2_jit_compile()</b>, JIT is automatically used when <b>pcre2_match()</b>
2169is called with options that JIT supports. Setting PCRE2_NO_JIT disables the use
2170of JIT; it forces matching to be done by the interpreter.
2171<pre>
2172  PCRE2_NO_UTF_CHECK
2173</pre>
2174When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
2175string is checked by default when <b>pcre2_match()</b> is subsequently called.
2176If a non-zero starting offset is given, the check is applied only to that part
2177of the subject that could be inspected during matching, and there is a check
2178that the starting offset points to the first code unit of a character or to the
2179end of the subject. If there are no lookbehind assertions in the pattern, the
2180check starts at the starting offset. Otherwise, it starts at the length of the
2181longest lookbehind before the starting offset, or at the start of the subject
2182if there are not that many characters before the starting offset. Note that the
2183sequences \b and \B are one-character lookbehinds.
2184</P>
2185<P>
2186The check is carried out before any other processing takes place, and a
2187negative error code is returned if the check fails. There are several UTF error
2188codes for each code unit width, corresponding to different problems with the
2189code unit sequence. There are discussions about the validity of
2190<a href="pcre2unicode.html#utf8strings">UTF-8 strings,</a>
2191<a href="pcre2unicode.html#utf16strings">UTF-16 strings,</a>
2192and
2193<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
2194in the
2195<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
2196page.
2197</P>
2198<P>
2199If you know that your subject is valid, and you want to skip these checks for
2200performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
2201<b>pcre2_match()</b>. You might want to do this for the second and subsequent
2202calls to <b>pcre2_match()</b> if you are making repeated calls to find all the
2203matches in a single subject string.
2204</P>
2205<P>
2206NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid string
2207as a subject, or an invalid value of <i>startoffset</i>, is undefined. Your
2208program may crash or loop indefinitely.
2209<pre>
2210  PCRE2_PARTIAL_HARD
2211  PCRE2_PARTIAL_SOFT
2212</pre>
2213These options turn on the partial matching feature. A partial match occurs if
2214the end of the subject string is reached successfully, but there are not enough
2215subject characters to complete the match. If this happens when
2216PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, matching continues by
2217testing any remaining alternatives. Only if no complete match can be found is
2218PCRE2_ERROR_PARTIAL returned instead of PCRE2_ERROR_NOMATCH. In other words,
2219PCRE2_PARTIAL_SOFT specifies that the caller is prepared to handle a partial
2220match, but only if no complete match can be found.
2221</P>
<P>
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
a partial match is found, <b>pcre2_match()</b> immediately returns
PCRE2_ERROR_PARTIAL, without considering any other alternatives. In other
words, when PCRE2_PARTIAL_HARD is set, a partial match is considered to be more
important than an alternative complete match.
</P>
2229<P>
2230There is a more detailed discussion of partial and multi-segment matching, with
2231examples, in the
2232<a href="pcre2partial.html"><b>pcre2partial</b></a>
2233documentation.
2234</P>
2235<br><a name="SEC27" href="#TOC1">NEWLINE HANDLING WHEN MATCHING</a><br>
2236<P>
2237When PCRE2 is built, a default newline convention is set; this is usually the
2238standard convention for the operating system. The default can be overridden in
2239a
2240<a href="#compilecontext">compile context</a>
2241by calling <b>pcre2_set_newline()</b>. It can also be overridden by starting a
2242pattern string with, for example, (*CRLF), as described in the
2243<a href="pcre2pattern.html#newlines">section on newline conventions</a>
2244in the
2245<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
2246page. During matching, the newline choice affects the behaviour of the dot,
2247circumflex, and dollar metacharacters. It may also alter the way the match
2248starting position is advanced after a match failure for an unanchored pattern.
2249</P>
2250<P>
2251When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as
2252the newline convention, and a match attempt for an unanchored pattern fails
2253when the current starting position is at a CRLF sequence, and the pattern
2254contains no explicit matches for CR or LF characters, the match position is
2255advanced by two characters instead of one, in other words, to after the CRLF.
2256</P>
2257<P>
2258The above rule is a compromise that makes the most common cases work as
2259expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is
2260not set), it does not match the string "\r\nA" because, after failing at the
2261start, it skips both the CR and the LF before retrying. However, the pattern
2262[\r\n]A does match that string, because it contains an explicit CR or LF
2263reference, and so advances only by one character after the first failure.
2264</P>
<P>
An explicit match for CR or LF is either a literal appearance of one of those
characters in the pattern, or one of the \r or \n escape sequences. Implicit
matches such as [^X] do not count, nor does \s, even though it includes CR and
LF in the characters that it matches.
</P>
2271<P>
2272Notwithstanding the above, anomalous effects may still occur when CRLF is a
2273valid newline sequence and explicit \r or \n escapes appear in the pattern.
2274<a name="matchedstrings"></a></P>
2275<br><a name="SEC28" href="#TOC1">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a><br>
2276<P>
2277<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b>
2278<br>
2279<br>
2280<b>PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *<i>match_data</i>);</b>
2281</P>
2282<P>
2283In general, a pattern matches a certain portion of the subject, and in
2284addition, further substrings from the subject may be picked out by
2285parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
2286book, this is called "capturing" in what follows, and the phrase "capturing
2287subpattern" or "capturing group" is used for a fragment of a pattern that picks
2288out a substring. PCRE2 supports several other kinds of parenthesized subpattern
2289that do not cause substrings to be captured. The <b>pcre2_pattern_info()</b>
2290function can be used to find out how many capturing subpatterns there are in a
2291compiled pattern.
2292</P>
2293<P>
2294You can use auxiliary functions for accessing captured substrings
2295<a href="#extractbynumber">by number</a>
2296or
2297<a href="#extractbyname">by name,</a>
2298as described in sections below.
2299</P>
2300<P>
2301Alternatively, you can make direct use of the vector of PCRE2_SIZE values,
2302called the <b>ovector</b>, which contains the offsets of captured strings. It is
2303part of the
2304<a href="#matchdatablock">match data block.</a>
2305The function <b>pcre2_get_ovector_pointer()</b> returns the address of the
2306ovector, and <b>pcre2_get_ovector_count()</b> returns the number of pairs of
2307values it contains.
2308</P>
2309<P>
2310Within the ovector, the first in each pair of values is set to the offset of
2311the first code unit of a substring, and the second is set to the offset of the
2312first code unit after the end of a substring. These values are always code unit
2313offsets, not character offsets. That is, they are byte offsets in the 8-bit
2314library, 16-bit offsets in the 16-bit library, and 32-bit offsets in the 32-bit
2315library.
2316</P>
2317<P>
2318After a partial match (error return PCRE2_ERROR_PARTIAL), only the first pair
2319of offsets (that is, <i>ovector[0]</i> and <i>ovector[1]</i>) are set. They
2320identify the part of the subject that was partially matched. See the
2321<a href="pcre2partial.html"><b>pcre2partial</b></a>
2322documentation for details of partial matching.
2323</P>
<P>
After a successful match, the first pair of offsets identifies the portion of
the subject string that was matched by the entire pattern. The next pair is
used for the first capturing subpattern, and so on. The value returned by
<b>pcre2_match()</b> is one more than the highest numbered pair that has been
set. For example, if two substrings have been captured, the returned value is
3. If there are no capturing subpatterns, the return value from a successful
match is 1, indicating that just the first pair of offsets has been set.
</P>
2333<P>
2334If a pattern uses the \K escape sequence within a positive assertion, the
2335reported start of a successful match can be greater than the end of the match.
2336For example, if the pattern (?=ab\K) is matched against "ab", the start and
2337end offset values for the match are 2 and 0.
2338</P>
2339<P>
2340If a capturing subpattern group is matched repeatedly within a single match
2341operation, it is the last portion of the subject that it matched that is
2342returned.
2343</P>
2344<P>
2345If the ovector is too small to hold all the captured substring offsets, as much
2346as possible is filled in, and the function returns a value of zero. If captured
2347substrings are not of interest, <b>pcre2_match()</b> may be called with a match
2348data block whose ovector is of minimum length (that is, one pair). However, if
2349the pattern contains back references and the <i>ovector</i> is not big enough to
2350remember the related substrings, PCRE2 has to get additional memory for use
2351during matching. Thus it is usually advisable to set up a match data block
2352containing an ovector of reasonable size.
2353</P>
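<P>
The rules above for interpreting the value returned by <b>pcre2_match()</b> can
be summarized in a small helper. The function below is hypothetical and is
given only as a sketch. Its second argument is assumed to be the value obtained
from <b>pcre2_get_ovector_count()</b>. It deliberately ignores the partial
match case, in which only the first pair is set.
</P>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given the return value rc of pcre2_match() and the
   number of pairs in the ovector, return how many leading pairs may hold
   offsets from this match. Individual pairs within that range can still be
   PCRE2_UNSET if their group did not take part in the match. */
uint32_t usable_pairs(int rc, uint32_t ovector_count)
{
  /* Success: rc is one more than the highest numbered pair that was set. */
  if (rc > 0) return (uint32_t)rc;

  /* Zero: the ovector was too small, and as much as possible was filled. */
  if (rc == 0) return ovector_count;

  /* Negative: an error code such as PCRE2_ERROR_NOMATCH. */
  return 0;
}
```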
2354<P>
2355It is possible for capturing subpattern number <i>n+1</i> to match some part of
2356the subject when subpattern <i>n</i> has not been used at all. For example, if
2357the string "abc" is matched against the pattern (a|(z))(bc) the return from the
2358function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this
2359happens, both values in the offset pairs corresponding to unused subpatterns
2360are set to PCRE2_UNSET.
2361</P>
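<P>
Code that reads the ovector directly therefore has to test for unset pairs
before using them. The sketch below shows one way of copying a group out of the
subject. To keep it self-contained it does not include <b>pcre2.h</b>. The
macro OFFSET_UNSET is a local stand-in for PCRE2_UNSET, the offsets are declared
as <b>size_t</b> (which is what PCRE2_SIZE is), and the function name and return
values are inventions of this example.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for PCRE2_UNSET, which pcre2.h defines as all bits set. */
#define OFFSET_UNSET (~(size_t)0)

/* Hypothetical helper: copy capture group n from subject into out as a
   zero-terminated string, using the offset pairs in ovector. Returns 1 if
   the group was copied, 0 if the group is unset, and -1 if the group cannot
   be copied (buffer too small, or start offset greater than end offset). */
int copy_group(const char *subject, const size_t *ovector, unsigned int n,
               char *out, size_t outsize)
{
  size_t start, end, len;

  assert(subject != NULL && ovector != NULL && out != NULL && outsize > 0);

  start = ovector[2 * (size_t)n];
  end = ovector[2 * (size_t)n + 1];

  if (start == OFFSET_UNSET || end == OFFSET_UNSET) return 0;

  /* This can happen when \K is used within a positive assertion. */
  if (start > end) return -1;

  len = end - start;
  if (len + 1 > outsize) return -1;

  memcpy(out, subject + start, len);
  out[len] = '\0';
  return 1;
}
```

<P>
For the example of "abc" matched against (a|(z))(bc), group 2 is reported as
unset while groups 1 and 3 are copied as "a" and "bc".
</P>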
<P>
Offset values that correspond to unused subpatterns at the end of the
expression are also set to PCRE2_UNSET. For example, if the string "abc" is
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched.
The return from the function is 2, because the highest used capturing
subpattern number is 1. The offsets for the second and third capturing
subpatterns (assuming the vector is large enough, of course) are set to
PCRE2_UNSET.
</P>
2371<P>
2372Elements in the ovector that do not correspond to capturing parentheses in the
2373pattern are never changed. That is, if a pattern contains <i>n</i> capturing
2374parentheses, no more than <i>ovector[0]</i> to <i>ovector[2n+1]</i> are set by
2375<b>pcre2_match()</b>. The other elements retain whatever values they previously
2376had.
2377<a name="matchotherdata"></a></P>
2378<br><a name="SEC29" href="#TOC1">OTHER INFORMATION ABOUT A MATCH</a><br>
2379<P>
2380<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
2381<br>
2382<br>
2383<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
2384</P>
2385<P>
2386As well as the offsets in the ovector, other information about a match is
2387retained in the match data block and can be retrieved by the above functions in
2388appropriate circumstances. If they are called at other times, the result is
2389undefined.
2390</P>
<P>
After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be available, and
<b>pcre2_get_mark()</b> can be called. It returns a pointer to the
zero-terminated name, which is within the compiled pattern. Otherwise NULL is
returned. The length of the (*MARK) name (excluding the terminating zero) is
stored in the code unit that precedes the name. You should use this instead of
relying on the terminating zero if the (*MARK) name might contain a binary
zero.
</P>
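<P>
Reading the stored length might look like this in the 8-bit library, where a
code unit is one byte. The sketch avoids including <b>pcre2.h</b> by declaring
the pointer as <b>const uint8_t *</b>, which is what PCRE2_SPTR amounts to in
the 8-bit library. The function name is hypothetical.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: mark is a non-NULL pointer as returned by
   pcre2_get_mark() in the 8-bit library. The length of the name, excluding
   the terminating zero, is held in the code unit just before the name. */
size_t mark_name_length(const uint8_t *mark)
{
  assert(mark != NULL);
  return (size_t)mark[-1];
}
```

<P>
Unlike <b>strlen()</b>, this gives the full length even when the name contains
a binary zero.
</P>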
2401<P>
2402After a successful match, the (*MARK) name that is returned is the
2403last one encountered on the matching path through the pattern. After a "no
2404match" or a partial match, the last encountered (*MARK) name is returned. For
2405example, consider this pattern:
2406<pre>
2407  ^(*MARK:A)((*MARK:B)a|b)c
2408</pre>
2409When it matches "bc", the returned mark is A. The B mark is "seen" in the first
2410branch of the group, but it is not on the matching path. On the other hand,
2411when this pattern fails to match "bx", the returned mark is B.
2412</P>
2413<P>
2414After a successful match, a partial match, or one of the invalid UTF errors
2415(for example, PCRE2_ERROR_UTF8_ERR5), <b>pcre2_get_startchar()</b> can be
2416called. After a successful or partial match it returns the code unit offset of
2417the character at which the match started. For a non-partial match, this can be
2418different to the value of <i>ovector[0]</i> if the pattern contains the \K
2419escape sequence. After a partial match, however, this value is always the same
2420as <i>ovector[0]</i> because \K does not affect the result of a partial match.
2421</P>
2422<P>
2423After a UTF check failure, <b>pcre2_get_startchar()</b> can be used to obtain
2424the code unit offset of the invalid UTF character. Details are given in the
2425<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
2426page.
2427<a name="errorlist"></a></P>
2428<br><a name="SEC30" href="#TOC1">ERROR RETURNS FROM <b>pcre2_match()</b></a><br>
2429<P>
2430If <b>pcre2_match()</b> fails, it returns a negative number. This can be
2431converted to a text string by calling the <b>pcre2_get_error_message()</b>
2432function (see "Obtaining a textual error message"
2433<a href="#geterrormessage">below).</a>
2434Negative error codes are also returned by other functions, and are documented
2435with them. The codes are given names in the header file. If UTF checking is in
2436force and an invalid UTF subject string is detected, one of a number of
2437UTF-specific negative error codes is returned. Details are given in the
2438<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
2439page. The following are the other errors that may be returned by
2440<b>pcre2_match()</b>:
2441<pre>
2442  PCRE2_ERROR_NOMATCH
2443</pre>
2444The subject string did not match the pattern.
2445<pre>
2446  PCRE2_ERROR_PARTIAL
2447</pre>
2448The subject string did not match, but it did match partially. See the
2449<a href="pcre2partial.html"><b>pcre2partial</b></a>
2450documentation for details of partial matching.
2451<pre>
2452  PCRE2_ERROR_BADMAGIC
2453</pre>
2454PCRE2 stores a 4-byte "magic number" at the start of the compiled code, to
2455catch the case when it is passed a junk pointer. This is the error that is
2456returned when the magic number is not present.
2457<pre>
2458  PCRE2_ERROR_BADMODE
2459</pre>
2460This error is given when a pattern that was compiled by the 8-bit library is
2461passed to a 16-bit or 32-bit library function, or vice versa.
2462<pre>
2463  PCRE2_ERROR_BADOFFSET
2464</pre>
2465The value of <i>startoffset</i> was greater than the length of the subject.
2466<pre>
2467  PCRE2_ERROR_BADOPTION
2468</pre>
2469An unrecognized bit was set in the <i>options</i> argument.
2470<pre>
2471  PCRE2_ERROR_BADUTFOFFSET
2472</pre>
2473The UTF code unit sequence that was passed as a subject was checked and found
2474to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the value of
2475<i>startoffset</i> did not point to the beginning of a UTF character or the end
2476of the subject.
2477<pre>
2478  PCRE2_ERROR_CALLOUT
2479</pre>
2480This error is never generated by <b>pcre2_match()</b> itself. It is provided for
2481use by callout functions that want to cause <b>pcre2_match()</b> or
2482<b>pcre2_callout_enumerate()</b> to return a distinctive error code. See the
2483<a href="pcre2callout.html"><b>pcre2callout</b></a>
2484documentation for details.
2485<pre>
2486  PCRE2_ERROR_INTERNAL
2487</pre>
2488An unexpected internal error has occurred. This error could be caused by a bug
2489in PCRE2 or by overwriting of the compiled pattern.
2490<pre>
2491  PCRE2_ERROR_JIT_BADOPTION
2492</pre>
This error is returned when a pattern that was successfully processed by
<b>pcre2_jit_compile()</b> is being matched, but the matching mode (partial or
complete match) does not correspond to any JIT compilation mode. When the JIT
fast path function is used, this error may also be given for invalid options.
See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details.
2499<pre>
2500  PCRE2_ERROR_JIT_STACKLIMIT
2501</pre>
This error is returned when a pattern that was successfully processed by
<b>pcre2_jit_compile()</b> is being matched, but the memory available for the
just-in-time processing stack is not large enough. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details.
2507<pre>
2508  PCRE2_ERROR_MATCHLIMIT
2509</pre>
2510The backtracking limit was reached.
2511<pre>
2512  PCRE2_ERROR_NOMEMORY
2513</pre>
2514If a pattern contains back references, but the ovector is not big enough to
2515remember the referenced substrings, PCRE2 gets a block of memory at the start
2516of matching to use for this purpose. There are some other special cases where
2517extra memory is needed during matching. This error is given when memory cannot
2518be obtained.
2519<pre>
2520  PCRE2_ERROR_NULL
2521</pre>
2522Either the <i>code</i>, <i>subject</i>, or <i>match_data</i> argument was passed
2523as NULL.
2524<pre>
2525  PCRE2_ERROR_RECURSELOOP
2526</pre>
2527This error is returned when <b>pcre2_match()</b> detects a recursion loop within
2528the pattern. Specifically, it means that either the whole pattern or a
2529subpattern has been called recursively for the second time at the same position
2530in the subject string. Some simple patterns that might do this are detected and
2531faulted at compile time, but more complicated cases, in particular mutual
2532recursions between two different subpatterns, cannot be detected until matching
2533is attempted.
2534<pre>
2535  PCRE2_ERROR_RECURSIONLIMIT
2536</pre>
2537The internal recursion limit was reached.
2538<a name="geterrormessage"></a></P>
2539<br><a name="SEC31" href="#TOC1">OBTAINING A TEXTUAL ERROR MESSAGE</a><br>
2540<P>
2541<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
2542<b>  PCRE2_SIZE <i>bufflen</i>);</b>
2543</P>
2544<P>
2545A text message for an error code from any PCRE2 function (compile, match, or
2546auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
2547is passed as the first argument, with the remaining two arguments specifying a
2548code unit buffer and its length, into which the text message is placed. Note
2549that the message is returned in code units of the appropriate width for the
2550library that is being used.
2551</P>
2552<P>
2553The returned message is terminated with a trailing zero, and the function
2554returns the number of code units used, excluding the trailing zero. If the
2555error number is unknown, the negative error code PCRE2_ERROR_BADDATA is
2556returned. If the buffer is too small, the message is truncated (but still with
2557a trailing zero), and the negative error code PCRE2_ERROR_NOMEMORY is returned.
2558None of the messages are very long; a buffer size of 120 code units is ample.
2559<a name="extractbynumber"></a></P>
2560<br><a name="SEC32" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
2561<P>
2562<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
2563<b>  uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
2564<br>
2565<br>
2566<b>int pcre2_substring_copy_bynumber(pcre2_match_data *<i>match_data</i>,</b>
2567<b>  uint32_t <i>number</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
2568<b>  PCRE2_SIZE *<i>bufflen</i>);</b>
2569<br>
2570<br>
2571<b>int pcre2_substring_get_bynumber(pcre2_match_data *<i>match_data</i>,</b>
2572<b>  uint32_t <i>number</i>, PCRE2_UCHAR **<i>bufferptr</i>,</b>
2573<b>  PCRE2_SIZE *<i>bufflen</i>);</b>
2574<br>
2575<br>
2576<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
2577</P>
2578<P>
2579Captured substrings can be accessed directly by using the ovector as described
2580<a href="#matchedstrings">above.</a>
2581For convenience, auxiliary functions are provided for extracting captured
2582substrings as new, separate, zero-terminated strings. A substring that contains
2583a binary zero is correctly extracted and has a further zero added on the end,
2584but the result is not, of course, a C string.
2585</P>
2586<P>
2587The functions in this section identify substrings by number. The number zero
2588refers to the entire matched substring, with higher numbers referring to
2589substrings captured by parenthesized groups. After a partial match, only
2590substring zero is available. An attempt to extract any other substring gives
2591the error PCRE2_ERROR_PARTIAL. The next section describes similar functions for
2592extracting captured substrings by name.
2593</P>
2594<P>
2595If a pattern uses the \K escape sequence within a positive assertion, the
2596reported start of a successful match can be greater than the end of the match.
2597For example, if the pattern (?=ab\K) is matched against "ab", the start and
2598end offset values for the match are 2 and 0. In this situation, calling these
2599functions with a zero substring number extracts a zero-length empty string.
2600</P>
2601<P>
2602You can find the length in code units of a captured substring without
2603extracting it by calling <b>pcre2_substring_length_bynumber()</b>. The first
2604argument is a pointer to the match data block, the second is the group number,
2605and the third is a pointer to a variable into which the length is placed. If
2606you just want to know whether or not the substring has been captured, you can
2607pass the third argument as NULL.
2608</P>
2609<P>
2610The <b>pcre2_substring_copy_bynumber()</b> function copies a captured substring
2611into a supplied buffer, whereas <b>pcre2_substring_get_bynumber()</b> copies it
2612into new memory, obtained using the same memory allocation function that was
2613used for the match data block. The first two arguments of these functions are a
2614pointer to the match data block and a capturing group number.
2615</P>
2616<P>
2617The final arguments of <b>pcre2_substring_copy_bynumber()</b> are a pointer to
2618the buffer and a pointer to a variable that contains its length in code units.
2619This is updated to contain the actual number of code units used for the
2620extracted substring, excluding the terminating zero.
2621</P>
2622<P>
2623For <b>pcre2_substring_get_bynumber()</b> the third and fourth arguments point
2624to variables that are updated with a pointer to the new memory and the number
2625of code units that comprise the substring, again excluding the terminating
2626zero. When the substring is no longer needed, the memory should be freed by
2627calling <b>pcre2_substring_free()</b>.
2628</P>
2629<P>
2630The return value from all these functions is zero for success, or a negative
2631error code. If the pattern match failed, the match failure code is returned.
2632If a substring number greater than zero is used after a partial match,
2633PCRE2_ERROR_PARTIAL is returned. Other possible error codes are:
2634<pre>
2635  PCRE2_ERROR_NOMEMORY
2636</pre>
2637The buffer was too small for <b>pcre2_substring_copy_bynumber()</b>, or the
2638attempt to get memory failed for <b>pcre2_substring_get_bynumber()</b>.
2639<pre>
2640  PCRE2_ERROR_NOSUBSTRING
2641</pre>
2642There is no substring with that number in the pattern, that is, the number is
2643greater than the number of capturing parentheses.
2644<pre>
2645  PCRE2_ERROR_UNAVAILABLE
2646</pre>
2647The substring number, though not greater than the number of captures in the
2648pattern, is greater than the number of slots in the ovector, so the substring
2649could not be captured.
2650<pre>
2651  PCRE2_ERROR_UNSET
2652</pre>
2653The substring did not participate in the match. For example, if the pattern is
2654(abc)|(def) and the subject is "def", and the ovector contains at least two
2655capturing slots, substring number 1 is unset.
2656</P>
2657<br><a name="SEC33" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br>
2658<P>
2659<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
2661<br>
2662<br>
2663<b>void pcre2_substring_list_free(PCRE2_SPTR *<i>list</i>);</b>
2664</P>
2665<P>
2666The <b>pcre2_substring_list_get()</b> function extracts all available substrings
2667and builds a list of pointers to them. It also (optionally) builds a second
2668list that contains their lengths (in code units), excluding a terminating zero
2669that is added to each of them. All this is done in a single block of memory
2670that is obtained using the same memory allocation function that was used to get
2671the match data block.
2672</P>
2673<P>
2674This function must be called only after a successful match. If called after a
2675partial match, the error code PCRE2_ERROR_PARTIAL is returned.
2676</P>
2677<P>
2678The address of the memory block is returned via <i>listptr</i>, which is also
2679the start of the list of string pointers. The end of the list is marked by a
2680NULL pointer. The address of the list of lengths is returned via
2681<i>lengthsptr</i>. If your strings do not contain binary zeros and you do not
therefore need the lengths, you may supply NULL as the <i>lengthsptr</i>
2683argument to disable the creation of a list of lengths. The yield of the
2684function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the memory block
2685could not be obtained. When the list is no longer needed, it should be freed by
2686calling <b>pcre2_substring_list_free()</b>.
2687</P>
2688<P>
2689If this function encounters a substring that is unset, which can happen when
2690capturing subpattern number <i>n+1</i> matches some part of the subject, but
2691subpattern <i>n</i> has not been used at all, it returns an empty string. This
2692can be distinguished from a genuine zero-length substring by inspecting the
appropriate offset in the ovector, which contains PCRE2_UNSET for unset
2694substrings, or by calling <b>pcre2_substring_length_bynumber()</b>.
2695<a name="extractbyname"></a></P>
2696<br><a name="SEC34" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
2697<P>
2698<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
2699<b>  PCRE2_SPTR <i>name</i>);</b>
2700<br>
2701<br>
2702<b>int pcre2_substring_length_byname(pcre2_match_data *<i>match_data</i>,</b>
2703<b>  PCRE2_SPTR <i>name</i>, PCRE2_SIZE *<i>length</i>);</b>
2704<br>
2705<br>
2706<b>int pcre2_substring_copy_byname(pcre2_match_data *<i>match_data</i>,</b>
2707<b>  PCRE2_SPTR <i>name</i>, PCRE2_UCHAR *<i>buffer</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
2708<br>
2709<br>
2710<b>int pcre2_substring_get_byname(pcre2_match_data *<i>match_data</i>,</b>
2711<b>  PCRE2_SPTR <i>name</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
2712<br>
2713<br>
2714<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
2715</P>
2716<P>
To extract a substring by name, you first have to find the associated number.
2718For example, for this pattern:
2719<pre>
2720  (a+)b(?&#60;xxx&#62;\d+)...
2721</pre>
2722the number of the subpattern called "xxx" is 2. If the name is known to be
2723unique (PCRE2_DUPNAMES was not set), you can find the number from the name by
2724calling <b>pcre2_substring_number_from_name()</b>. The first argument is the
2725compiled pattern, and the second is the name. The yield of the function is the
2726subpattern number, PCRE2_ERROR_NOSUBSTRING if there is no subpattern of that
2727name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one subpattern of
2728that name. Given the number, you can extract the substring directly, or use one
2729of the functions described above.
2730</P>
2731<P>
2732For convenience, there are also "byname" functions that correspond to the
2733"bynumber" functions, the only difference being that the second argument is a
2734name instead of a number. If PCRE2_DUPNAMES is set and there are duplicate
2735names, these functions scan all the groups with the given name, and return the
2736first named string that is set.
2737</P>
2738<P>
2739If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
2740returned. If all groups with the name have numbers that are greater than the
2741number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is returned. If there
2742is at least one group with a slot in the ovector, but no group is found to be
2743set, PCRE2_ERROR_UNSET is returned.
2744</P>
2745<P>
2746<b>Warning:</b> If the pattern uses the (?| feature to set up multiple
2747subpatterns with the same number, as described in the
2748<a href="pcre2pattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a>
2749in the
2750<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
2751page, you cannot use names to distinguish the different subpatterns, because
2752names are not included in the compiled code. The matching process uses only
2753numbers. For this reason, the use of different names for subpatterns of the
2754same number causes an error at compile time.
2755</P>
2756<br><a name="SEC35" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br>
2757<P>
2758<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
2759<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
2760<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
2761<b>  pcre2_match_context *<i>mcontext</i>, PCRE2_SPTR <i>replacement</i>,</b>
<b>  PCRE2_SIZE <i>rlength</i>, PCRE2_UCHAR *<i>outputbuffer</i>,</b>
2763<b>  PCRE2_SIZE *<i>outlengthptr</i>);</b>
2764</P>
2765<P>
2766This function calls <b>pcre2_match()</b> and then makes a copy of the subject
2767string in <i>outputbuffer</i>, replacing the part that was matched with the
<i>replacement</i> string, whose length is supplied in <i>rlength</i>. This can
2769be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
2770which a \K item in a lookahead in the pattern causes the match to end before
2771it starts are not supported, and give rise to an error return.
2772</P>
2773<P>
2774The first seven arguments of <b>pcre2_substitute()</b> are the same as for
2775<b>pcre2_match()</b>, except that the partial matching options are not
2776permitted, and <i>match_data</i> may be passed as NULL, in which case a match
2777data block is obtained and freed within this function, using memory management
2778functions from the match context, if provided, or else those that were used to
2779allocate memory for the compiled code.
2780</P>
2781<P>
2782The <i>outlengthptr</i> argument must point to a variable that contains the
2783length, in code units, of the output buffer. If the function is successful, the
2784value is updated to contain the length of the new string, excluding the
2785trailing zero that is automatically added.
2786</P>
2787<P>
2788If the function is not successful, the value set via <i>outlengthptr</i> depends
2789on the type of error. For syntax errors in the replacement string, the value is
2790the offset in the replacement string where the error was detected. For other
2791errors, the value is PCRE2_UNSET by default. This includes the case of the
2792output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set
2793(see below), in which case the value is the minimum length needed, including
2794space for the trailing zero. Note that in order to compute the required length,
2795<b>pcre2_substitute()</b> has to simulate all the matching and copying, instead
2796of giving an error return as soon as the buffer overflows. Note also that the
2797length is in code units, not bytes.
2798</P>
2799<P>
2800In the replacement string, which is interpreted as a UTF string in UTF mode,
2801and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK option is set, a
2802dollar character is an escape character that can specify the insertion of
2803characters from capturing groups or (*MARK) items in the pattern. The following
2804forms are always recognized:
2805<pre>
2806  $$                  insert a dollar character
2807  $&#60;n&#62; or ${&#60;n&#62;}      insert the contents of group &#60;n&#62;
2808  $*MARK or ${*MARK}  insert the name of the last (*MARK) encountered
2809</pre>
2810Either a group number or a group name can be given for &#60;n&#62;. Curly brackets are
2811required only if the following character would be interpreted as part of the
2812number or name. The number may be zero to include the entire matched string.
2813For example, if the pattern a(b)c is matched with "=abc=" and the replacement
2814string "+$1$0$1+", the result is "=+babcb+=".
2815</P>
2816<P>
2817The facility for inserting a (*MARK) name can be used to perform simple
2818simultaneous substitutions, as this <b>pcre2test</b> example shows:
2819<pre>
2820  /(*:pear)apple|(*:orange)lemon/g,replace=${*MARK}
2821      apple lemon
2822   2: pear orange
2823</pre>
2824As well as the usual options for <b>pcre2_match()</b>, a number of additional
2825options can be set in the <i>options</i> argument.
2826</P>
2827<P>
2828PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
2829replacing every matching substring. If this is not set, only the first matching
2830substring is replaced. If any matched substring has zero length, after the
2831substitution has happened, an attempt to find a non-empty match at the same
2832position is performed. If this is not successful, the current position is
2833advanced by one character except when CRLF is a valid newline sequence and the
2834next two characters are CR, LF. In this case, the current position is advanced
2835by two characters.
2836</P>
2837<P>
2838PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
2839too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
2840this option is set, however, <b>pcre2_substitute()</b> continues to go through
2841the motions of matching and substituting (without, of course, writing anything)
2842in order to compute the size of buffer that is needed. This value is passed
2843back via the <i>outlengthptr</i> variable, with the result of the function still
2844being PCRE2_ERROR_NOMEMORY.
2845</P>
2846<P>
2847Passing a buffer size of zero is a permitted way of finding out how much memory
is needed for a given substitution. However, this does mean that the entire
2849operation is carried out twice. Depending on the application, it may be more
2850efficient to allocate a large buffer and free the excess afterwards, instead of
2851using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
2852</P>
2853<P>
2854PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups that do
2855not appear in the pattern to be treated as unset groups. This option should be
2856used with care, because it means that a typo in a group name or number no
2857longer causes the PCRE2_ERROR_NOSUBSTRING error.
2858</P>
2859<P>
2860PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including unknown
2861groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty
2862strings when inserted as described above. If this option is not set, an attempt
2863to insert an unset group causes the PCRE2_ERROR_UNSET error. This option does
2864not influence the extended substitution syntax described below.
2865</P>
2866<P>
2867PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
2868replacement string. Without this option, only the dollar character is special,
2869and only the group insertion forms listed above are valid. When
2870PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
2871</P>
2872<P>
2873Firstly, backslash in a replacement string is interpreted as an escape
2874character. The usual forms such as \n or \x{ddd} can be used to specify
2875particular character codes, and backslash followed by any non-alphanumeric
2876character quotes that character. Extended quoting can be coded using \Q...\E,
2877exactly as in pattern strings.
2878</P>
2879<P>
2880There are also four escape sequences for forcing the case of inserted letters.
2881The insertion mechanism has three states: no case forcing, force upper case,
2882and force lower case. The escape sequences change the current state: \U and
2883\L change to upper or lower case forcing, respectively, and \E (when not
2884terminating a \Q quoted sequence) reverts to no case forcing. The sequences
2885\u and \l force the next character (if it is a letter) to upper or lower
2886case, respectively, and then the state automatically reverts to no case
forcing. Case forcing applies to all inserted characters, including those from
2888captured groups and letters within \Q...\E quoted sequences.
2889</P>
2890<P>
2891Note that case forcing sequences such as \U...\E do not nest. For example,
2892the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final \E has no
2893effect.
2894</P>
2895<P>
2896The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
2897flexibility to group substitution. The syntax is similar to that used by Bash:
2898<pre>
2899  ${&#60;n&#62;:-&#60;string&#62;}
2900  ${&#60;n&#62;:+&#60;string1&#62;:&#60;string2&#62;}
2901</pre>
2902As before, &#60;n&#62; may be a group number or a name. The first form specifies a
2903default value. If group &#60;n&#62; is set, its value is inserted; if not, &#60;string&#62; is
2904expanded and the result inserted. The second form specifies strings that are
2905expanded and inserted when group &#60;n&#62; is set or unset, respectively. The first
2906form is just a convenient shorthand for
2907<pre>
2908  ${&#60;n&#62;:+${&#60;n&#62;}:&#60;string&#62;}
2909</pre>
2910Backslash can be used to escape colons and closing curly brackets in the
2911replacement strings. A change of the case forcing state within a replacement
2912string remains in force afterwards, as shown in this <b>pcre2test</b> example:
2913<pre>
2914  /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
2915      body
2916   1: hello
2917      somebody
2918   1: HELLO
2919</pre>
2920The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
2921substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause unknown
2922groups in the extended syntax forms to be treated as unset.
2923</P>
2924<P>
2925If successful, <b>pcre2_substitute()</b> returns the number of replacements that
2926were made. This may be zero if no matches were found, and is never greater than
29271 unless PCRE2_SUBSTITUTE_GLOBAL is set.
2928</P>
2929<P>
2930In the event of an error, a negative error code is returned. Except for
2931PCRE2_ERROR_NOMATCH (which is never returned), errors from <b>pcre2_match()</b>
2932are passed straight back.
2933</P>
2934<P>
2935PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring insertion,
2936unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
2937</P>
2938<P>
2939PCRE2_ERROR_UNSET is returned for an unset substring insertion (including an
2940unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple
2941(non-extended) syntax is used and PCRE2_SUBSTITUTE_UNSET_EMPTY is not set.
2942</P>
2943<P>
2944PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the
2945PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is
2946needed is returned via <i>outlengthptr</i>. Note that this does not happen by
2947default.
2948</P>
2949<P>
2950PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
2951replacement string, with more particular errors being PCRE2_ERROR_BADREPESCAPE
2952(invalid escape sequence), PCRE2_ERROR_REPMISSING_BRACE (closing curly bracket
not found), PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group
substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it
2955started, which can happen if \K is used in an assertion).
2956</P>
2957<P>
2958As for all PCRE2 errors, a text message that describes the error can be
2959obtained by calling the <b>pcre2_get_error_message()</b> function (see
2960"Obtaining a textual error message"
2961<a href="#geterrormessage">above).</a>
2962</P>
2963<br><a name="SEC36" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
2964<P>
2965<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
2966<b>  PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
2967</P>
2968<P>
2969When a pattern is compiled with the PCRE2_DUPNAMES option, names for
2970subpatterns are not required to be unique. Duplicate names are always allowed
2971for subpatterns with the same number, created by using the (?| feature. Indeed,
2972if such subpatterns are named, they are required to use the same names.
2973</P>
2974<P>
2975Normally, patterns with duplicate names are such that in any one match, only
2976one of the named subpatterns participates. An example is shown in the
2977<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
2978documentation.
2979</P>
2980<P>
2981When duplicates are present, <b>pcre2_substring_copy_byname()</b> and
2982<b>pcre2_substring_get_byname()</b> return the first substring corresponding to
the given name that is set. Only if none are set is PCRE2_ERROR_UNSET
2984returned. The <b>pcre2_substring_number_from_name()</b> function returns the
2985error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate names.
2986</P>
2987<P>
2988If you want to get full details of all captured substrings for a given name,
2989you must use the <b>pcre2_substring_nametable_scan()</b> function. The first
2990argument is the compiled pattern, and the second is the name. If the third and
2991fourth arguments are NULL, the function returns a group number for a unique
2992name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
2993</P>
2994<P>
2995When the third and fourth arguments are not NULL, they must be pointers to
2996variables that are updated by the function. After it has run, they point to the
2997first and last entries in the name-to-number table for the given name, and the
2998function returns the length of each entry in code units. In both cases,
2999PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
3000</P>
3001<P>
3002The format of the name table is described
3003<a href="#infoaboutpattern">above</a>
3004in the section entitled <i>Information about a pattern</i>. Given all the
3005relevant entries for the name, you can extract each of their numbers, and hence
3006the captured data.
3007</P>
3008<br><a name="SEC37" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
3009<P>
3010The traditional matching function uses a similar algorithm to Perl, which stops
3011when it finds the first match at a given point in the subject. If you want to
3012find all possible matches, or the longest possible match at a given position,
3013consider using the alternative matching function (see below) instead. If you
3014cannot use the alternative function, you can kludge it up by making use of the
3015callout facility, which is described in the
3016<a href="pcre2callout.html"><b>pcre2callout</b></a>
3017documentation.
3018</P>
3019<P>
3020What you have to do is to insert a callout right at the end of the pattern.
3021When your callout function is called, extract and save the current matched
3022substring. Then return 1, which forces <b>pcre2_match()</b> to backtrack and try
3023other alternatives. Ultimately, when it runs out of matches,
3024<b>pcre2_match()</b> will yield PCRE2_ERROR_NOMATCH.
3025<a name="dfamatch"></a></P>
3026<br><a name="SEC38" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
3027<P>
3028<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
3029<b>  PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
3030<b>  uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
3031<b>  pcre2_match_context *<i>mcontext</i>,</b>
3032<b>  int *<i>workspace</i>, PCRE2_SIZE <i>wscount</i>);</b>
3033</P>
3034<P>
3035The function <b>pcre2_dfa_match()</b> is called to match a subject string
3036against a compiled pattern, using a matching algorithm that scans the subject
3037string just once, and does not backtrack. This has different characteristics to
3038the normal algorithm, and is not compatible with Perl. Some of the features of
3039PCRE2 patterns are not supported. Nevertheless, there are times when this kind
3040of matching can be useful. For a discussion of the two matching algorithms, and
3041a list of features that <b>pcre2_dfa_match()</b> does not support, see the
3042<a href="pcre2matching.html"><b>pcre2matching</b></a>
3043documentation.
3044</P>
3045<P>
3046The arguments for the <b>pcre2_dfa_match()</b> function are the same as for
3047<b>pcre2_match()</b>, plus two extras. The ovector within the match data block
3048is used in a different way, and this is described below. The other common
3049arguments are used in the same way as for <b>pcre2_match()</b>, so their
3050description is not repeated here.
3051</P>
3052<P>
3053The two additional arguments provide workspace for the function. The workspace
3054vector should contain at least 20 elements. It is used for keeping track of
3055multiple paths through the pattern tree. More workspace is needed for patterns
3056and subjects where there are a lot of potential matches.
3057</P>
3058<P>
3059Here is an example of a simple call to <b>pcre2_dfa_match()</b>:
3060<pre>
  int wspace[20];
  pcre2_match_data *match_data = pcre2_match_data_create(4, NULL);
  int rc = pcre2_dfa_match(
    re,                        /* result of pcre2_compile() */
    (PCRE2_SPTR)"some string", /* the subject string */
    11,                        /* the length of the subject string */
    0,                         /* start at offset 0 in the subject */
    0,                         /* default options */
    match_data,                /* the match data block */
    NULL,                      /* a match context; NULL means use defaults */
    wspace,                    /* working space vector */
    20);                       /* number of elements (NOT size in bytes) */
3073</PRE>
3074</P>
3075<br><b>
Option bits for <b>pcre2_dfa_match()</b>
3077</b><br>
3078<P>
3079The unused bits of the <i>options</i> argument for <b>pcre2_dfa_match()</b> must
3080be zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL,
3081PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK,
3082PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and
3083PCRE2_DFA_RESTART. All but the last four of these are exactly the same as for
3084<b>pcre2_match()</b>, so their description is not repeated here.
3085<pre>
3086  PCRE2_PARTIAL_HARD
3087  PCRE2_PARTIAL_SOFT
3088</pre>
3089These have the same general effect as they do for <b>pcre2_match()</b>, but the
3090details are slightly different. When PCRE2_PARTIAL_HARD is set for
3091<b>pcre2_dfa_match()</b>, it returns PCRE2_ERROR_PARTIAL if the end of the
3092subject is reached and there is still at least one matching possibility that
3093requires additional characters. This happens even if some complete matches have
3094already been found. When PCRE2_PARTIAL_SOFT is set, the return code
3095PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL if the end of the
3096subject is reached, there have been no complete matches, but there is still at
3097least one matching possibility. The portion of the string that was inspected
3098when the longest partial match was found is set as the first matching string in
3099both cases. There is a more detailed discussion of partial and multi-segment
3100matching, with examples, in the
3101<a href="pcre2partial.html"><b>pcre2partial</b></a>
3102documentation.
3103<pre>
3104  PCRE2_DFA_SHORTEST
3105</pre>
3106Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to stop as
3107soon as it has found one match. Because of the way the alternative algorithm
3108works, this is necessarily the shortest possible match at the first possible
3109matching point in the subject string.
3110<pre>
3111  PCRE2_DFA_RESTART
3112</pre>
3113When <b>pcre2_dfa_match()</b> returns a partial match, it is possible to call it
3114again, with additional subject characters, and have it continue with the same
3115match. The PCRE2_DFA_RESTART option requests this action; when it is set, the
<i>workspace</i> and <i>wscount</i> arguments must reference the same vector as
3117before because data about the match so far is left in them after a partial
3118match. There is more discussion of this facility in the
3119<a href="pcre2partial.html"><b>pcre2partial</b></a>
3120documentation.
3121</P>
3122<br><b>
3123Successful returns from <b>pcre2_dfa_match()</b>
3124</b><br>
3125<P>
3126When <b>pcre2_dfa_match()</b> succeeds, it may have matched more than one
3127substring in the subject. Note, however, that all the matches from one run of
3128the function start at the same point in the subject. The shorter matches are
3129all initial substrings of the longer matches. For example, if the pattern
3130<pre>
3131  &#60;.*&#62;
3132</pre>
3133is matched against the string
3134<pre>
3135  This is &#60;something&#62; &#60;something else&#62; &#60;something further&#62; no more
3136</pre>
3137the three matched strings are
3138<pre>
3139  &#60;something&#62; &#60;something else&#62; &#60;something further&#62;
3140  &#60;something&#62; &#60;something else&#62;
3141  &#60;something&#62;
3142</pre>
3143On success, the yield of the function is a number greater than zero, which is
3144the number of matched substrings. The offsets of the substrings are returned in
3145the ovector, and can be extracted by number in the same way as for
3146<b>pcre2_match()</b>, but the numbers bear no relation to any capturing groups
3147that may exist in the pattern, because DFA matching does not support group
3148capture.
3149</P>
3150<P>
3151Calls to the convenience functions that extract substrings by name
3152return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a
3153DFA match. The convenience functions that extract substrings by number never
3154return PCRE2_ERROR_NOSUBSTRING, and the meanings of some other errors are
3155slightly different:
3156<pre>
3157  PCRE2_ERROR_UNAVAILABLE
3158</pre>
3159The ovector is not big enough to include a slot for the given substring number.
3160<pre>
3161  PCRE2_ERROR_UNSET
3162</pre>
3163There is a slot in the ovector for this substring, but there were insufficient
3164matches to fill it.
3165</P>
<P>
The matched strings are stored in the ovector in decreasing order of length;
that is, the longest matching string is first. If there were too many matches
to fit into the ovector, the yield of the function is zero, and the vector is
filled with the longest matches.
</P>
<P>
NOTE: PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the
pattern "a\d+" is compiled as if it were "a\d++". For DFA matching, this
means that only one possible match is found. If you really do want multiple
matches in such cases, either use an ungreedy repeat such as "a\d+?" or set
the PCRE2_NO_AUTO_POSSESS option when compiling.
</P>
<br><b>
Error returns from <b>pcre2_dfa_match()</b>
</b><br>
<P>
The <b>pcre2_dfa_match()</b> function returns a negative number when it fails.
Many of the errors are the same as for <b>pcre2_match()</b>, as described
<a href="#errorlist">above.</a>
There are in addition the following errors that are specific to
<b>pcre2_dfa_match()</b>:
<pre>
  PCRE2_ERROR_DFA_UITEM
</pre>
This return is given if <b>pcre2_dfa_match()</b> encounters an item in the
pattern that it does not support, for instance, the use of \C in a UTF mode or
a back reference.
<pre>
  PCRE2_ERROR_DFA_UCOND
</pre>
This return is given if <b>pcre2_dfa_match()</b> encounters a condition item
that uses a back reference for the condition, or a test for recursion in a
specific group. These are not supported.
<pre>
  PCRE2_ERROR_DFA_WSSIZE
</pre>
This return is given if <b>pcre2_dfa_match()</b> runs out of space in the
<i>workspace</i> vector.
<pre>
  PCRE2_ERROR_DFA_RECURSE
</pre>
When a recursive subpattern is processed, the matching function calls itself
recursively, using private memory for the ovector and <i>workspace</i>. This
error is given if the internal ovector is not large enough. This should be
extremely rare, as a vector of size 1000 is used.
<pre>
  PCRE2_ERROR_DFA_BADRESTART
</pre>
When <b>pcre2_dfa_match()</b> is called with the <b>PCRE2_DFA_RESTART</b> option,
some plausibility checks are made on the contents of the workspace, which
should contain data about the previous partial match. If any of these checks
fail, this error is given.
</P>
<br><a name="SEC39" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2build</b>(3), <b>pcre2callout</b>(3), <b>pcre2demo</b>(3),
<b>pcre2matching</b>(3), <b>pcre2partial</b>(3), <b>pcre2posix</b>(3),
<b>pcre2sample</b>(3), <b>pcre2stack</b>(3), <b>pcre2unicode</b>(3).
</P>
<br><a name="SEC40" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC41" href="#TOC1">REVISION</a><br>
<P>
Last updated: 17 June 2016
<br>
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>