-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------


PCRE2(3)                   Library Functions Manual                   PCRE2(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

INTRODUCTION

       PCRE2 is the name used for a revised API for the PCRE library, which is
       a set of functions, written in C,  that  implement  regular  expression
       pattern matching using the same syntax and semantics as Perl, with just
       a few differences. After nearly two decades,  the  limitations  of  the
       original  API  were  making development increasingly difficult. The new
       API is more extensible, and it was simplified by abolishing  the  sepa-
       rate  "study" optimizing function; in PCRE2, patterns are automatically
       optimized where possible. Since forking from PCRE1, the code  has  been
       extensively  refactored and new features introduced. The old library is
       now obsolete and is no longer maintained.

       As well as Perl-style regular expression patterns, some  features  that
       appeared  in  Python and the original PCRE before they appeared in Perl
       are available using the Python syntax. There is also some  support  for
       one  or  two .NET and Oniguruma syntax items, and there are options for
       requesting some minor changes that give better  ECMAScript  (aka  Java-
       Script) compatibility.

       The  source code for PCRE2 can be compiled to support strings of 8-bit,
       16-bit, or 32-bit code units, which means that up to three separate li-
       braries may be installed, one for each code unit size. The size of code
       unit is not related to the bit size of the underlying  hardware.  In  a
       64-bit  environment that also supports 32-bit applications, versions of
       PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.

       The original work to extend PCRE to 16-bit and 32-bit  code  units  was
       done by Zoltan Herczeg and Christian Persch, respectively. In all three
       cases, strings can be interpreted either  as  one  character  per  code
       unit, or as UTF-encoded Unicode, with support for Unicode general cate-
       gory properties. Unicode support is optional at build time (but is  the
       default). However, processing strings as UTF code units must be enabled
       explicitly at run time. The version of Unicode in use can be discovered
       by running

         pcre2test -C

       The  three  libraries  contain  identical sets of functions, with names
       ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
       pile_8()).  However,  by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
       32, a program that uses just one code unit width can be  written  using
       generic names such as pcre2_compile(), and the documentation is written
       assuming that this is the case.

       In addition to the Perl-compatible matching function, PCRE2 contains an
       alternative  function that matches the same compiled patterns in a dif-
       ferent way. In certain circumstances, the alternative function has some
       advantages.   For  a discussion of the two matching algorithms, see the
       pcre2matching page.

       Details of exactly which Perl regular expression features are  and  are
       not  supported  by  PCRE2  are  given  in  separate  documents. See the
       pcre2pattern and pcre2compat pages. There is a syntax  summary  in  the
       pcre2syntax page.

       Some  features  of PCRE2 can be included, excluded, or changed when the
       library is built. The pcre2_config() function makes it possible  for  a
       client  to  discover  which  features are available. The features them-
       selves are described in the pcre2build page. Documentation about build-
       ing  PCRE2 for various operating systems can be found in the README and
       NON-AUTOTOOLS_BUILD files in the source distribution.

       The libraries contain a number of undocumented internal  functions  and
       data  tables  that  are  used by more than one of the exported external
       functions, but which are not intended  for  use  by  external  callers.
       Their  names  all begin with "_pcre2", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external  symbols  are  exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


SECURITY CONSIDERATIONS

       If you are using PCRE2 in a non-UTF application that permits  users  to
       supply  arbitrary  patterns  for  compilation, you should be aware of a
       feature that allows users to turn on UTF support from within a pattern.
       For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
       mode, which interprets patterns and subjects as strings of  UTF-8  code
       units instead of individual 8-bit characters. This causes both the pat-
       tern and any data against which it is matched to be checked  for  UTF-8
       validity.  If the data string is very long, such a check might use suf-
       ficiently many resources as to cause your application to  lose  perfor-
       mance.

       One  way  of guarding against this possibility is to use the pcre2_pat-
       tern_info() function  to  check  the  compiled  pattern's  options  for
       PCRE2_UTF.  Alternatively,  you can set the PCRE2_NEVER_UTF option when
       calling pcre2_compile(). This causes a compile time error if  the  pat-
       tern contains a UTF-setting sequence.

       The  use  of Unicode properties for character types such as \d can also
       be enabled from within the pattern, by specifying "(*UCP)".  This  fea-
       ture can be disallowed by setting the PCRE2_NEVER_UCP option.

       If  your  application  is one that supports UTF, be aware that validity
       checking can take time. If the same data string is to be  matched  many
       times,  you  can  use  the PCRE2_NO_UTF_CHECK option for the second and
       subsequent matches to avoid running redundant checks.

       The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
       to  problems,  because  it  may leave the current matching point in the
       middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C  op-
       tion can be used by an application to lock out the use of \C, causing a
       compile-time error if it is encountered. It is also possible  to  build
       PCRE2 with the use of \C permanently disabled.

       Another  way  that  performance can be hit is by running a pattern that
       has a very large search tree against a string that  will  never  match.
       Nested  unlimited repeats in a pattern are a common example. PCRE2 pro-
       vides some protection against  this:  see  the  pcre2_set_match_limit()
       function  in  the  pcre2api  page.  There  is a similar function called
       pcre2_set_depth_limit() that can be used to restrict the amount of mem-
       ory that is used.


USER DOCUMENTATION

       The  user  documentation for PCRE2 comprises a number of different sec-
       tions. In the "man" format, each of these is a separate "man page".  In
       the  HTML  format, each is a separate page, linked from the index page.
       In the plain  text  format,  the  descriptions  of  the  pcre2grep  and
       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
       respectively. The remaining sections, except for the pcre2demo  section
       (which  is a program listing), and the short pages for individual func-
       tions, are concatenated in pcre2.txt, for ease of searching.  The  sec-
       tions are as follows:

         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
         pcre2callout       details of the pattern callout feature
         pcre2compat        discussion of Perl compatibility
         pcre2convert       details of pattern conversion functions
         pcre2demo          a demonstration C program that uses PCRE2
         pcre2grep          description of the pcre2grep command (8-bit only)
         pcre2jit           discussion of just-in-time optimization support
         pcre2limits        details of size and other limits
         pcre2matching      discussion of the two matching algorithms
         pcre2partial       details of the partial matching facility
         pcre2pattern       syntax and semantics of supported regular
                              expression patterns
         pcre2perform       discussion of performance issues
         pcre2posix         the POSIX-compatible C API for the 8-bit library
         pcre2sample        discussion of the pcre2demo program
         pcre2serialize     details of pattern serialization
         pcre2syntax        quick syntax reference
         pcre2test          description of the pcre2test command
         pcre2unicode       discussion of Unicode and UTF support

       In  the  "man"  and HTML formats, there is also a short page for each C
       library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.

       Putting an actual email address here is a spam magnet. If you  want  to
       email me, use my two names separated by a dot at gmail.com.


REVISION

       Last updated: 27 August 2021
       Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------


PCRE2API(3)                Library Functions Manual                PCRE2API(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

       #include <pcre2.h>

       PCRE2  is  a  new API for PCRE, starting at release 10.0. This document
       contains a description of all its native functions. See the pcre2 docu-
       ment for an overview of all the PCRE2 documentation.


PCRE2 NATIVE API BASIC FUNCTIONS

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       void pcre2_match_data_free(pcre2_match_data *match_data);


PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);


PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       void pcre2_general_context_free(pcre2_general_context *gcontext);


PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const uint8_t *tables);

       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
         uint32_t extra_options);

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
         uint32_t value);


PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       void pcre2_substring_list_free(PCRE2_SPTR *list);

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);


PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);


PCRE2 NATIVE API JIT FUNCTIONS

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);


PCRE2 NATIVE API SERIALIZATION FUNCTIONS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(const pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);


PCRE2 NATIVE API AUXILIARY FUNCTIONS

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);

       void pcre2_maketables_free(pcre2_general_context *gcontext,
         const uint8_t *tables);

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       int pcre2_config(uint32_t what, void *where);


PCRE2 NATIVE API OBSOLETE FUNCTIONS

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       These  functions became obsolete at release 10.30 and are retained only
       for backward compatibility. They should not be used in  new  code.  The
       first  is  replaced by pcre2_set_depth_limit(); the second is no longer
       needed and has no effect (it always returns zero).


PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS

       pcre2_convert_context *pcre2_convert_context_create(
         pcre2_general_context *gcontext);

       pcre2_convert_context *pcre2_convert_context_copy(
         pcre2_convert_context *cvcontext);

       void pcre2_convert_context_free(pcre2_convert_context *cvcontext);

       int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
         uint32_t escape_char);

       int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
         uint32_t separator_char);

       int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, PCRE2_UCHAR **buffer,
         PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);

       void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);

       These functions provide a way of  converting  non-PCRE2  patterns  into
       patterns that can be processed by pcre2_compile(). This facility is ex-
       perimental and may be changed in future releases. At  present,  "globs"
       and  POSIX  basic  and  extended patterns can be converted. Details are
       given in the pcre2convert documentation.


PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

       There are three PCRE2 libraries, supporting 8-bit, 16-bit,  and  32-bit
       code  units,  respectively.  However,  there  is  just one header file,
       pcre2.h.  This contains the function prototypes and  other  definitions
       for all three libraries. One, two, or all three can be installed simul-
       taneously. On Unix-like systems the libraries  are  called  libpcre2-8,
       libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
       inal PCRE libraries.

       Character strings are passed to and from a PCRE2 library as a  sequence
       of  unsigned  integers  in  code  units of the appropriate width. Every
       PCRE2 function comes in three different forms, one  for  each  library,
       for example:

         pcre2_compile_8()
         pcre2_compile_16()
         pcre2_compile_32()

       There are also three different sets of data types:

         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32

       The  UCHAR  types define unsigned code units of the appropriate widths.
       For example, PCRE2_UCHAR16 is usually defined as `uint16_t'.  The  SPTR
       types  are  constant  pointers  to the equivalent UCHAR types, that is,
       they are pointers to vectors of unsigned code units.

       Many applications use only one code unit width. For their  convenience,
       macros are defined whose names are the generic forms such as pcre2_com-
       pile() and  PCRE2_SPTR.  These  macros  use  the  value  of  the  macro
       PCRE2_CODE_UNIT_WIDTH  to generate the appropriate width-specific func-
       tion and macro names.  PCRE2_CODE_UNIT_WIDTH is not defined by default.
       An  application  must  define  it  to  be 8, 16, or 32 before including
       pcre2.h in order to make use of the generic names.

       Applications that use more than one code unit width can be linked  with
       more  than  one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
       be 0 before including pcre2.h, and then use the  real  function  names.
       Any  code  that  is to be included in an environment where the value of
       PCRE2_CODE_UNIT_WIDTH is unknown should  also  use  the  real  function
       names. (Unfortunately, it is not possible in C code to save and restore
       the value of a macro.)

       If PCRE2_CODE_UNIT_WIDTH is not defined  before  including  pcre2.h,  a
       compiler error occurs.

       When  using  multiple  libraries  in an application, you must take care
       when processing any particular pattern to use  only  functions  from  a
       single  library.   For example, if you want to run a match using a pat-
       tern that was compiled with pcre2_compile_16(), you  must  do  so  with
       pcre2_match_16(), not pcre2_match_8() or pcre2_match_32().

       In  the  function summaries above, and in the rest of this document and
       other PCRE2 documents, functions and data  types  are  described  using
       their generic names, without the _8, _16, or _32 suffix.


PCRE2 API OVERVIEW

       PCRE2  has  its  own  native  API, which is described in this document.
       There are also some wrapper functions for the 8-bit library that corre-
       spond  to the POSIX regular expression API, but they do not give access
       to all the functionality of PCRE2. They are described in the pcre2posix
       documentation. Both these APIs define a set of C function calls.

       The  native  API  C data types, function prototypes, option values, and
       error codes are defined in the header file pcre2.h, which also contains
       definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
       numbers for the library. Applications can use these to include  support
       for different releases of PCRE2.

       In a Windows environment, if you want to statically link an application
       program against a non-dll PCRE2 library, you must  define  PCRE2_STATIC
       before including pcre2.h.

       The  functions pcre2_compile() and pcre2_match() are used for compiling
       and matching regular expressions in a Perl-compatible manner. A  sample
       program that demonstrates the simplest way of using them is provided in
       the file called pcre2demo.c in the PCRE2 source distribution. A listing
       of  this  program  is  given  in  the  pcre2demo documentation, and the
       pcre2sample documentation describes how to compile and run it.

       The compiling and matching functions recognize various options that are
       passed as bits in an options argument. There are also some more compli-
       cated parameters such as custom memory  management  functions  and  re-
       source  limits  that  are  passed  in "contexts" (which are just memory
       blocks, described below). Simple applications do not need to  make  use
       of contexts.

       Just-in-time  (JIT)  compiler  support  is an optional feature of PCRE2
       that can be built in  appropriate  hardware  environments.  It  greatly
       speeds  up  the matching performance of many patterns. Programs can re-
       quest that it be used if available by calling pcre2_jit_compile() after
       a  pattern has been successfully compiled by pcre2_compile(). This does
       nothing if JIT support is not available.

       More complicated programs might need to  make  use  of  the  specialist
       functions    pcre2_jit_stack_create(),    pcre2_jit_stack_free(),   and
       pcre2_jit_stack_assign() in order to control the JIT code's memory  us-
       age.

       JIT matching is automatically used by pcre2_match() if it is available,
       unless the PCRE2_NO_JIT option is set. There is also a direct interface
       for  JIT  matching,  which gives improved performance at the expense of
       less sanity checking. The JIT-specific functions are discussed  in  the
       pcre2jit documentation.

       A  second  matching function, pcre2_dfa_match(), which is not Perl-com-
       patible, is also provided. This uses  a  different  algorithm  for  the
       matching.  The  alternative  algorithm finds all possible matches (at a
       given point in the subject), and scans the subject  just  once  (unless
       there  are lookaround assertions). However, this algorithm does not re-
       turn captured substrings. A description of the two matching  algorithms
       and  their  advantages  and disadvantages is given in the pcre2matching
       documentation. There is no JIT support for pcre2_dfa_match().

       In addition to the main compiling and  matching  functions,  there  are
       convenience functions for extracting captured substrings from a subject
       string that has been matched by pcre2_match(). They are:

         pcre2_substring_copy_byname()
         pcre2_substring_copy_bynumber()
         pcre2_substring_get_byname()
         pcre2_substring_get_bynumber()
         pcre2_substring_list_get()
         pcre2_substring_length_byname()
         pcre2_substring_length_bynumber()
         pcre2_substring_nametable_scan()
         pcre2_substring_number_from_name()

       pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
       vided,  to  free  memory used for extracted strings. If either of these
       functions is called with a NULL argument, the function returns  immedi-
       ately without doing anything.

       The  function  pcre2_substitute()  can be called to match a pattern and
       return a copy of the subject string with substitutions for  parts  that
       were matched.

       Functions  whose  names begin with pcre2_serialize_ are used for saving
       compiled patterns on disc or elsewhere, and reloading them later.

       Finally, there are functions for finding out information about  a  com-
       piled  pattern  (pcre2_pattern_info()) and about the configuration with
       which PCRE2 was built (pcre2_config()).

       Functions with names ending with _free() are used  for  freeing  memory
       blocks  of  various  sorts.  In all cases, if one of these functions is
       called with a NULL argument, it does nothing.


STRING LENGTHS AND OFFSETS

       The PCRE2 API uses string lengths and  offsets  into  strings  of  code
       units  in  several  places. These values are always of type PCRE2_SIZE,
       which is an unsigned integer type, currently always defined as  size_t.
       The  largest  value  that  can  be  stored  in  such  a  type  (that is
       ~(PCRE2_SIZE)0) is reserved as a special indicator for  zero-terminated
       strings  and  unset offsets.  Therefore, the longest string that can be
       handled is one less than this maximum.


NEWLINES

       PCRE2 supports five different conventions for indicating line breaks in
       strings:  a  single  CR (carriage return) character, a single LF (line-
       feed) character, the two-character sequence CRLF, any of the three pre-
       ceding,  or any Unicode newline sequence. The Unicode newline sequences
       are the three just mentioned, plus the single characters  VT  (vertical
       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
       separator, U+2028), and PS (paragraph separator, U+2029).

       Each of the first three conventions is used by at least  one  operating
       system as its standard newline sequence. When PCRE2 is built, a default
       can be specified.  If it is not, the default is set to LF, which is the
       Unix standard. However, the newline convention can be changed by an ap-
       plication when calling pcre2_compile(), or it can be specified by  spe-
       cial  text at the start of the pattern itself; this overrides any other
       settings. See the pcre2pattern page for details of the special  charac-
       ter sequences.

       In  the  PCRE2  documentation  the  word "newline" is used to mean "the
       character or pair of characters that indicate a line break". The choice
       of  newline convention affects the handling of the dot, circumflex, and
       dollar metacharacters, the handling of #-comments in /x mode, and, when
       CRLF  is a recognized line ending sequence, the match position advance-
       ment for a non-anchored pattern. There is more detail about this in the
       section on pcre2_match() options below.

       The  choice of newline convention does not affect the interpretation of
       the \n or \r escape sequences, nor does it affect what \R matches; this
       has its own separate convention.
662
663
664MULTITHREADING
665
666       In  a multithreaded application it is important to keep thread-specific
667       data separate from data that can be shared between threads.  The  PCRE2
668       library  code  itself  is  thread-safe: it contains no static or global
669       variables. The API is designed to be fairly simple for non-threaded ap-
670       plications  while at the same time ensuring that multithreaded applica-
671       tions can use it.
672
673       There are several different blocks of data that are used to pass infor-
674       mation between the application and the PCRE2 libraries.
675
676   The compiled pattern
677
678       A  pointer  to  the  compiled form of a pattern is returned to the user
679       when pcre2_compile() is successful. The data in the compiled pattern is
680       fixed,  and  does not change when the pattern is matched. Therefore, it
681       is thread-safe, that is, the same compiled pattern can be used by  more
682       than one thread simultaneously. For example, an application can compile
683       all its patterns at the start, before forking off multiple threads that
684       use  them.  However,  if the just-in-time (JIT) optimization feature is
685       being used, it needs separate memory stack areas for each  thread.  See
686       the pcre2jit documentation for more details.
687
688       In  a more complicated situation, where patterns are compiled only when
689       they are first needed, but are still shared between  threads,  pointers
690       to  compiled  patterns  must  be protected from simultaneous writing by
691       multiple threads. This is somewhat tricky to do correctly. If you  know
692       that  writing  to  a pointer is atomic in your environment, you can use
693       logic like this:
694
695         Get a read-only (shared) lock (mutex) for pointer
696         if (pointer == NULL)
697           {
698           Get a write (unique) lock for pointer
699           if (pointer == NULL) pointer = pcre2_compile(...
700           }
701         Release the lock
702         Use pointer in pcre2_match()
703
704       Of course, testing for compilation errors should also  be  included  in
705       the code.
706
707       The  reason  for checking the pointer a second time is as follows: Sev-
708       eral threads may have acquired the shared lock and tested  the  pointer
709       for being NULL, but only one of them will be given the write lock, with
710       the rest kept waiting. The winning thread will compile the pattern  and
711       store  the  result.  After this thread releases the write lock, another
712       thread will get it, and if it does not retest pointer for  being  NULL,
713       will recompile the pattern and overwrite the pointer, creating a memory
714       leak and possibly causing other issues.
715
716       In an environment where writing to a pointer may  not  be  atomic,  the
717       above  logic  is not sufficient. The thread that is doing the compiling
718       may be descheduled after writing only part of the pointer, which  could
719       cause  other  threads  to use an invalid value. Instead of checking the
720       pointer itself, a separate "pointer is valid" flag (that can be updated
721       atomically) must be used:
722
723         Get a read-only (shared) lock (mutex) for pointer
724         if (!pointer_is_valid)
725           {
726           Get a write (unique) lock for pointer
727           if (!pointer_is_valid)
728             {
729             pointer = pcre2_compile(...
730             pointer_is_valid = TRUE
731             }
732           }
733         Release the lock
734         Use pointer in pcre2_match()
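
       One possible C rendering of the "pointer is valid" logic is
       sketched below. It is not taken from PCRE2 and it swaps in a
       different mechanism: a C11 atomic flag is tested without any lock,
       and a POSIX mutex plays the part of the write lock. The function
       compile_pattern_stub() is a hypothetical stand-in for
       pcre2_compile(), and error handling is omitted.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for pcre2_compile(): returns a non-NULL dummy
   "compiled pattern" and counts how often it is called. */
int compile_calls = 0;
static int dummy_code;
void *compile_pattern_stub(void) { compile_calls++; return &dummy_code; }

void *shared_pattern = NULL;
atomic_bool pattern_is_valid = false;
pthread_mutex_t pattern_lock = PTHREAD_MUTEX_INITIALIZER;

void *get_shared_pattern(void)
{
    /* First check: an atomic load, taken without the lock. */
    if (!atomic_load_explicit(&pattern_is_valid, memory_order_acquire)) {
        pthread_mutex_lock(&pattern_lock);
        /* Second check: another thread may have compiled the pattern
           while this one was waiting for the lock. */
        if (!atomic_load_explicit(&pattern_is_valid, memory_order_relaxed)) {
            /* A real version would test for compilation errors here. */
            shared_pattern = compile_pattern_stub();
            /* Release ordering publishes the pointer before the flag. */
            atomic_store_explicit(&pattern_is_valid, true,
                                  memory_order_release);
        }
        pthread_mutex_unlock(&pattern_lock);
    }
    return shared_pattern;
}
```

       Each thread would call get_shared_pattern() and pass the result to
       pcre2_match(). The second test of the flag is what prevents a
       waiting thread from compiling the pattern again.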
735
       If JIT is being used, but the JIT compilation is not being done
       immediately (perhaps waiting to see if the pattern is used often
       enough), similar logic is required. JIT compilation updates a value
       within the compiled code block, so a thread must gain unique write
       access to the pointer before calling pcre2_jit_compile().
       Alternatively, pcre2_code_copy() or pcre2_code_copy_with_tables()
       can be used to obtain a private copy of the compiled code before
       calling the JIT compiler.
744
745   Context blocks
746
747       The next main section below introduces the idea of "contexts" in  which
748       PCRE2 functions are called. A context is nothing more than a collection
749       of parameters that control the way PCRE2 operates. Grouping a number of
750       parameters together in a context is a convenient way of passing them to
751       a PCRE2 function without using lots of arguments. The  parameters  that
752       are  stored  in  contexts  are in some sense "advanced features" of the
753       API. Many straightforward applications will not need to use contexts.
754
755       In a multithreaded application, if the parameters in a context are val-
756       ues  that  are  never  changed, the same context can be used by all the
757       threads. However, if any thread needs to change any value in a context,
758       it must make its own thread-specific copy.
759
760   Match blocks
761
762       The  matching  functions need a block of memory for storing the results
763       of a match. This includes details of what was matched, as well as addi-
764       tional  information  such as the name of a (*MARK) setting. Each thread
765       must provide its own copy of this memory.
766
767
768PCRE2 CONTEXTS
769
770       Some PCRE2 functions have a lot of parameters, many of which  are  used
771       only  by  specialist  applications,  for example, those that use custom
772       memory management or non-standard character tables.  To  keep  function
773       argument  lists  at a reasonable size, and at the same time to keep the
774       API extensible, "uncommon" parameters are passed to  certain  functions
775       in  a  context instead of directly. A context is just a block of memory
776       that holds the parameter values.  Applications that do not need to  ad-
777       just any of the context parameters can pass NULL when a context pointer
778       is required.
779
780       There are three different types of context: a general context  that  is
781       relevant  for  several  PCRE2 operations, a compile-time context, and a
782       match-time context.
783
784   The general context
785
       At present, this context just contains pointers to (and data for)
       external memory management functions that are called from several
       places in the PCRE2 library. The context is named "general" rather
       than specifically "memory" because in future other fields may be
       added. If you do not want to supply your own custom memory
       management functions, you do not need to bother with a general
       context. A general context is created by:
793
794       pcre2_general_context *pcre2_general_context_create(
795         void *(*private_malloc)(PCRE2_SIZE, void *),
796         void (*private_free)(void *, void *), void *memory_data);
797
798       The two function pointers specify custom memory  management  functions,
799       whose prototypes are:
800
801         void *private_malloc(PCRE2_SIZE, void *);
802         void  private_free(void *, void *);
803
804       Whenever code in PCRE2 calls these functions, the final argument is the
805       value of memory_data. Either of the first two arguments of the creation
806       function  may be NULL, in which case the system memory management func-
807       tions malloc() and free() are used. (This is not currently  useful,  as
808       there  are  no  other  fields in a general context, but in future there
809       might be.)  The private_malloc() function is used (if supplied) to  ob-
810       tain  memory for storing the context, and all three values are saved as
811       part of the context.
812
813       Whenever PCRE2 creates a data block of any kind, the block  contains  a
814       pointer  to the free() function that matches the malloc() function that
815       was used. When the time comes to  free  the  block,  this  function  is
816       called.
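
       A pair of custom memory management functions matching the
       prototypes above might look like the following sketch. The names
       alloc_stats, counting_malloc, and counting_free are hypothetical,
       and size_t is written in place of PCRE2_SIZE so that the fragment
       stands alone (the two are currently the same type).

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping record, passed as memory_data. */
typedef struct {
    size_t allocations;
    size_t frees;
} alloc_stats;

/* Matches: void *private_malloc(PCRE2_SIZE, void *); */
void *counting_malloc(size_t size, void *memory_data)
{
    alloc_stats *stats = memory_data;
    void *block = malloc(size);
    if (block != NULL) stats->allocations++;
    return block;
}

/* Matches: void private_free(void *, void *); */
void counting_free(void *block, void *memory_data)
{
    alloc_stats *stats = memory_data;
    if (block == NULL) return;
    stats->frees++;
    free(block);
}
```

       With PCRE2 available, these would be passed to
       pcre2_general_context_create() together with the address of an
       alloc_stats variable as memory_data.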
817
818       A general context can be copied by calling:
819
820       pcre2_general_context *pcre2_general_context_copy(
821         pcre2_general_context *gcontext);
822
823       The memory used for a general context should be freed by calling:
824
825       void pcre2_general_context_free(pcre2_general_context *gcontext);
826
827       If  this  function  is  passed  a NULL argument, it returns immediately
828       without doing anything.
829
830   The compile context
831
832       A compile context is required if you want to provide an external  func-
833       tion  for  stack  checking  during compilation or to change the default
834       values of any of the following compile-time parameters:
835
836         What \R matches (Unicode newlines or CR, LF, CRLF only)
837         PCRE2's character tables
838         The newline character sequence
839         The compile time nested parentheses limit
840         The maximum length of the pattern string
841         The extra options bits (none set by default)
842
843       A compile context is also required if you are using custom memory  man-
844       agement.   If  none of these apply, just pass NULL as the context argu-
845       ment of pcre2_compile().
846
847       A compile context is created, copied, and freed by the following  func-
848       tions:
849
850       pcre2_compile_context *pcre2_compile_context_create(
851         pcre2_general_context *gcontext);
852
853       pcre2_compile_context *pcre2_compile_context_copy(
854         pcre2_compile_context *ccontext);
855
856       void pcre2_compile_context_free(pcre2_compile_context *ccontext);
857
858       A  compile  context  is created with default values for its parameters.
859       These can be changed by calling the following functions, which return 0
860       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
861
862       int pcre2_set_bsr(pcre2_compile_context *ccontext,
863         uint32_t value);
864
865       The  value  must  be PCRE2_BSR_ANYCRLF, to specify that \R matches only
866       CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R  matches  any
867       Unicode line ending sequence. The value is used by the JIT compiler and
868       by  the  two  interpreted   matching   functions,   pcre2_match()   and
869       pcre2_dfa_match().
870
871       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
872         const uint8_t *tables);
873
874       The  value  must  be  the result of a call to pcre2_maketables(), whose
875       only argument is a general context. This function builds a set of char-
876       acter tables in the current locale.
877
878       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
879         uint32_t extra_options);
880
       As PCRE2 has developed, almost all the 32 option bits that are
       available in the options argument of pcre2_compile() have been used
       up. To avoid running out, the compile context contains a set of
       extra option bits which are used for some newer, assumed rarer,
       options. This function sets those bits. It always sets all the bits
       (either on or off): the value passed replaces the existing setting
       rather than being combined with it. The available options are
       defined in the section entitled "Extra compile options" below.
888
889       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
890         PCRE2_SIZE value);
891
892       This  sets a maximum length, in code units, for any pattern string that
893       is compiled with this context. If the pattern is longer,  an  error  is
894       generated.   This facility is provided so that applications that accept
895       patterns from external sources can limit their size. The default is the
896       largest  number  that  a  PCRE2_SIZE variable can hold, which is effec-
897       tively unlimited.
898
899       int pcre2_set_newline(pcre2_compile_context *ccontext,
900         uint32_t value);
901
902       This specifies which characters or character sequences are to be recog-
903       nized  as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage
904       return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
905       two-character  sequence  CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
906       of the above), PCRE2_NEWLINE_ANY (any  Unicode  newline  sequence),  or
907       PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).
908
909       A pattern can override the value set in the compile context by starting
910       with a sequence such as (*CRLF). See the pcre2pattern page for details.
911
       When a pattern is compiled with the PCRE2_EXTENDED or
       PCRE2_EXTENDED_MORE option, the newline convention affects the
       recognition of the end of internal comments starting with #. The
       value is saved with the compiled pattern for subsequent use by the
       JIT compiler and by the two interpreted matching functions,
       pcre2_match() and pcre2_dfa_match().
918
919       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
920         uint32_t value);
921
922       This  parameter  adjusts  the  limit,  set when PCRE2 is built (default
923       250), on the depth of parenthesis nesting  in  a  pattern.  This  limit
924       stops  rogue  patterns  using  up too much system stack when being com-
925       piled. The limit applies to parentheses of all kinds, not just  captur-
926       ing parentheses.
927
928       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
929         int (*guard_function)(uint32_t, void *), void *user_data);
930
931       There  is at least one application that runs PCRE2 in threads with very
932       limited system stack, where running out of stack is to  be  avoided  at
933       all  costs. The parenthesis limit above cannot take account of how much
934       stack is actually available during compilation. For  a  finer  control,
935       you  can  supply  a  function  that  is called whenever pcre2_compile()
936       starts to compile a parenthesized part of a pattern. This function  can
937       check  the  actual  stack  size  (or anything else that it wants to, of
938       course).
939
       The first argument to the guard function gives the current depth of
       nesting, and the second is user data that is set up by the last
       argument of pcre2_set_compile_recursion_guard(). The guard function
       should return zero if all is well, or non-zero to force an error.
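
       A function of the required shape could be as simple as the
       following sketch, which rejects nesting beyond a depth chosen by the
       application. The names guard_limits and depth_guard are
       hypothetical. A function intended for a small thread stack would
       more likely measure the remaining stack than compare depths.

```c
#include <stdint.h>

/* Hypothetical user data: the deepest nesting the application accepts. */
typedef struct {
    uint32_t max_depth;
} guard_limits;

/* Matches: int (*guard_function)(uint32_t, void *).
   Returns zero to let compilation continue, non-zero to force an error. */
int depth_guard(uint32_t depth, void *user_data)
{
    const guard_limits *limits = user_data;
    return (depth > limits->max_depth) ? 1 : 0;
}
```

       With PCRE2 available, depth_guard and the address of a guard_limits
       variable would be the second and third arguments of
       pcre2_set_compile_recursion_guard().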
944
945   The match context
946
947       A match context is required if you want to:
948
949         Set up a callout function
950         Set an offset limit for matching an unanchored pattern
951         Change the limit on the amount of heap used when matching
952         Change the backtracking match limit
953         Change the backtracking depth limit
954         Set custom memory management specifically for the match
955
956       If  none  of  these  apply,  just  pass NULL as the context argument of
957       pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().
958
959       A match context is created, copied, and freed by  the  following  func-
960       tions:
961
962       pcre2_match_context *pcre2_match_context_create(
963         pcre2_general_context *gcontext);
964
965       pcre2_match_context *pcre2_match_context_copy(
966         pcre2_match_context *mcontext);
967
968       void pcre2_match_context_free(pcre2_match_context *mcontext);
969
970       A  match  context  is  created  with default values for its parameters.
971       These can be changed by calling the following functions, which return 0
972       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
973
974       int pcre2_set_callout(pcre2_match_context *mcontext,
975         int (*callout_function)(pcre2_callout_block *, void *),
976         void *callout_data);
977
978       This  sets  up a callout function for PCRE2 to call at specified points
979       during a matching operation. Details are given in the pcre2callout doc-
980       umentation.
981
982       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
983         int (*callout_function)(pcre2_substitute_callout_block *, void *),
984         void *callout_data);
985
986       This  sets up a callout function for PCRE2 to call after each substitu-
987       tion made by pcre2_substitute(). Details are given in the section enti-
988       tled "Creating a new string with substitutions" below.
989
990       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
991         PCRE2_SIZE value);
992
       The offset_limit parameter limits how far an unanchored search can
       advance in the subject string. The default value is PCRE2_UNSET. The
       pcre2_match() and pcre2_dfa_match() functions return
       PCRE2_ERROR_NOMATCH if a match with a starting point before or at
       the given offset is not found. The pcre2_substitute() function makes
       no more substitutions.
998
999       For  example,  if the pattern /abc/ is matched against "123abc" with an
1000       offset limit less than 3, the result is  PCRE2_ERROR_NOMATCH.  A  match
1001       can  never  be  found  if  the  startoffset  argument of pcre2_match(),
1002       pcre2_dfa_match(), or pcre2_substitute() is  greater  than  the  offset
1003       limit set in the match context.
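
       The effect of the limit on the /abc/ example can be modelled with a
       toy search for a literal string. This is not how PCRE2 matches, and
       the names find_literal_limited and NO_MATCH are hypothetical. The
       sketch shows only that a starting position beyond the limit is
       never tried.

```c
#include <stddef.h>
#include <string.h>

#define NO_MATCH ((size_t)-1)   /* hypothetical "not found" marker */

/* Toy unanchored search for a literal: try each starting position from
   start_offset onwards, but never one beyond offset_limit. */
size_t find_literal_limited(const char *subject, const char *literal,
                            size_t start_offset, size_t offset_limit)
{
    size_t subject_len = strlen(subject);
    size_t literal_len = strlen(literal);

    for (size_t pos = start_offset;
         pos <= offset_limit && pos + literal_len <= subject_len;
         pos++) {
        if (memcmp(subject + pos, literal, literal_len) == 0) return pos;
    }
    return NO_MATCH;
}
```

       Searching "123abc" for "abc" with a limit of 2 finds nothing, while
       a limit of 3 finds the match at offset 3. A start offset greater
       than the limit can never succeed.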
1004
       When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT
       option when calling pcre2_compile() so that when JIT is in use,
       different code can be compiled. If a match is started with a
       non-default offset limit when PCRE2_USE_OFFSET_LIMIT is not set, an
       error is generated.
1009
1010       The offset limit facility can be used to track progress when  searching
1011       large  subject  strings or to limit the extent of global substitutions.
1012       See also the PCRE2_FIRSTLINE option, which requires a  match  to  start
1013       before  or  at  the first newline that follows the start of matching in
1014       the subject. If this is set with an offset limit, a match must occur in
1015       the first line and also within the offset limit. In other words, which-
1016       ever limit comes first is used.
1017
1018       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
1019         uint32_t value);
1020
1021       The heap_limit parameter specifies, in units of kibibytes (1024 bytes),
1022       the  maximum  amount  of heap memory that pcre2_match() may use to hold
1023       backtracking information when running an interpretive match. This limit
1024       also applies to pcre2_dfa_match(), which may use the heap when process-
1025       ing patterns with a lot of nested pattern recursion or  lookarounds  or
1026       atomic groups. This limit does not apply to matching with the JIT opti-
1027       mization, which has  its  own  memory  control  arrangements  (see  the
1028       pcre2jit  documentation for more details). If the limit is reached, the
1029       negative error code  PCRE2_ERROR_HEAPLIMIT  is  returned.  The  default
1030       limit  can be set when PCRE2 is built; if it is not, the default is set
1031       very large and is essentially "unlimited".
1032
1033       A value for the heap limit may also be supplied by an item at the start
1034       of a pattern of the form
1035
1036         (*LIMIT_HEAP=ddd)
1037
1038       where  ddd  is a decimal number. However, such a setting is ignored un-
1039       less ddd is less than the limit set by the caller of pcre2_match()  or,
1040       if no such limit is set, less than the default.
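
       The rule for an in-pattern limit can be written out as a small
       helper. This is a sketch of the rule as described here, not code
       from PCRE2, and the name effective_limit and its parameters are
       hypothetical.

```c
#include <stdint.h>

/* Choose the limit that applies: the caller's value if one was set,
   otherwise the build-time default. A limit given at the start of the
   pattern is honoured only when it lowers that value. */
uint32_t effective_limit(int caller_set, uint32_t caller_limit,
                         uint32_t build_default,
                         int pattern_set, uint32_t pattern_limit)
{
    uint32_t limit = caller_set ? caller_limit : build_default;
    if (pattern_set && pattern_limit < limit) limit = pattern_limit;
    return limit;
}
```

       The consequence is that a pattern can tighten a limit but can never
       relax one that the application has imposed.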
1041
1042       The  pcre2_match() function starts out using a 20KiB vector on the sys-
1043       tem stack for recording backtracking points. The more nested backtrack-
1044       ing  points  there  are (that is, the deeper the search tree), the more
1045       memory is needed.  Heap memory is used only if the  initial  vector  is
1046       too small. If the heap limit is set to a value less than 21 (in partic-
1047       ular, zero) no heap memory will be used. In this  case,  only  patterns
1048       that  do not have a lot of nested backtracking can be successfully pro-
1049       cessed.
1050
1051       Similarly, for pcre2_dfa_match(), a vector on the system stack is  used
1052       when  processing pattern recursions, lookarounds, or atomic groups, and
1053       only if this is not big enough is heap memory used. In this case,  too,
1054       setting a value of zero disables the use of the heap.
1055
1056       int pcre2_set_match_limit(pcre2_match_context *mcontext,
1057         uint32_t value);
1058
1059       The match_limit parameter provides a means of preventing PCRE2 from us-
1060       ing up too many computing resources when processing patterns  that  are
1061       not going to match, but which have a very large number of possibilities
1062       in their search trees. The classic  example  is  a  pattern  that  uses
1063       nested unlimited repeats.
1064
1065       There  is an internal counter in pcre2_match() that is incremented each
1066       time round its main matching loop. If  this  value  reaches  the  match
1067       limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
1068       This has the effect of limiting the amount  of  backtracking  that  can
1069       take place. For patterns that are not anchored, the count restarts from
1070       zero for each position in the subject string. This limit  also  applies
1071       to pcre2_dfa_match(), though the counting is done in a different way.
1072
1073       When  pcre2_match() is called with a pattern that was successfully pro-
1074       cessed by pcre2_jit_compile(), the way in which matching is executed is
1075       entirely  different. However, there is still the possibility of runaway
1076       matching that goes on for a very long  time,  and  so  the  match_limit
1077       value  is  also used in this case (but in a different way) to limit how
1078       long the matching can continue.
1079
       The default value for the limit can be set when PCRE2 is built; if
       it is not, the default is 10 million, which handles all but the
       most extreme cases. A value for the match limit may also be supplied
       by an item at the start of a pattern of the form
1084
1085         (*LIMIT_MATCH=ddd)
1086
1087       where  ddd  is a decimal number. However, such a setting is ignored un-
1088       less ddd is less than the limit set by the caller of  pcre2_match()  or
1089       pcre2_dfa_match() or, if no such limit is set, less than the default.
1090
1091       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
1092         uint32_t value);
1093
1094       This   parameter   limits   the   depth   of   nested  backtracking  in
1095       pcre2_match().  Each time a nested backtracking point is passed, a  new
1096       memory "frame" is used to remember the state of matching at that point.
1097       Thus, this parameter indirectly limits the amount  of  memory  that  is
1098       used  in  a match. However, because the size of each memory "frame" de-
1099       pends on the number of capturing parentheses, the actual  memory  limit
1100       varies  from pattern to pattern. This limit was more useful in versions
1101       before 10.30, where function recursion was used for backtracking.
1102
1103       The depth limit is not relevant, and is ignored, when matching is  done
1104       using JIT compiled code. However, it is supported by pcre2_dfa_match(),
1105       which uses it to limit the depth of nested internal recursive  function
1106       calls  that implement atomic groups, lookaround assertions, and pattern
1107       recursions. This limits, indirectly, the amount of system stack that is
1108       used.  It  was  more useful in versions before 10.32, when stack memory
1109       was used for local workspace vectors for recursive function calls. From
1110       version  10.32,  only local variables are allocated on the stack and as
1111       each call uses only a few hundred bytes, even a small stack can support
1112       quite a lot of recursion.
1113
1114       If  the depth of internal recursive function calls is great enough, lo-
1115       cal workspace vectors are allocated on the heap from version 10.32  on-
1116       wards,  so  the  depth  limit also indirectly limits the amount of heap
1117       memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when
1118       matched  to a very long string using pcre2_dfa_match(), can use a great
1119       deal of memory. However, it is probably better to limit heap usage  di-
1120       rectly by calling pcre2_set_heap_limit().
1121
1122       The  default  value for the depth limit can be set when PCRE2 is built;
1123       if it is not, the default is set to the same value as the  default  for
1124       the   match   limit.   If  the  limit  is  exceeded,  pcre2_match()  or
1125       pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth
1126       limit  may also be supplied by an item at the start of a pattern of the
1127       form
1128
1129         (*LIMIT_DEPTH=ddd)
1130
1131       where ddd is a decimal number. However, such a setting is  ignored  un-
1132       less  ddd  is less than the limit set by the caller of pcre2_match() or
1133       pcre2_dfa_match() or, if no such limit is set, less than the default.
1134
1135
1136CHECKING BUILD-TIME OPTIONS
1137
1138       int pcre2_config(uint32_t what, void *where);
1139
1140       The function pcre2_config() makes it possible for  a  PCRE2  client  to
1141       find  the  value  of  certain  configuration parameters and to discover
1142       which optional features have been compiled into the PCRE2 library.  The
1143       pcre2build documentation has more details about these features.
1144
1145       The  first  argument  for pcre2_config() specifies which information is
1146       required. The second argument is a pointer to memory into which the in-
1147       formation is placed. If NULL is passed, the function returns the amount
1148       of memory that is needed for the requested information. For calls  that
1149       return  numerical  values, the value is in bytes; when requesting these
1150       values, where should point to appropriately aligned memory.  For  calls
1151       that  return  strings,  the required length is given in code units, not
1152       counting the terminating zero.
1153
       When requesting information, the returned value from pcre2_config()
       is non-negative on success, or the negative error code
       PCRE2_ERROR_BADOPTION if the value in the first argument is not
       recognized. The following information is available:
1158
1159         PCRE2_CONFIG_BSR
1160
1161       The  output  is a uint32_t integer whose value indicates what character
1162       sequences the \R  escape  sequence  matches  by  default.  A  value  of
1163       PCRE2_BSR_UNICODE  means  that  \R  matches any Unicode line ending se-
1164       quence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF,
1165       or CRLF. The default can be overridden when a pattern is compiled.
1166
1167         PCRE2_CONFIG_COMPILED_WIDTHS
1168
1169       The  output  is a uint32_t integer whose lower bits indicate which code
1170       unit widths were selected when PCRE2 was  built.  The  1-bit  indicates
1171       8-bit  support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
1172       port, respectively.
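
       Decoding the returned bits might look like the following sketch.
       The names width_support and decode_compiled_widths are
       hypothetical, and the argument is assumed to be the value that
       pcre2_config() stored for PCRE2_CONFIG_COMPILED_WIDTHS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of which code unit widths were built. */
typedef struct {
    bool has8;
    bool has16;
    bool has32;
} width_support;

/* The 1-bit means 8-bit support, the 2-bit 16-bit support, and the
   4-bit 32-bit support. */
width_support decode_compiled_widths(uint32_t widths)
{
    width_support s;
    s.has8  = (widths & 1u) != 0;
    s.has16 = (widths & 2u) != 0;
    s.has32 = (widths & 4u) != 0;
    return s;
}
```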
1173
1174         PCRE2_CONFIG_DEPTHLIMIT
1175
1176       The output is a uint32_t integer that gives the default limit  for  the
1177       depth  of  nested  backtracking in pcre2_match() or the depth of nested
1178       recursions, lookarounds, and atomic groups in  pcre2_dfa_match().  Fur-
1179       ther details are given with pcre2_set_depth_limit() above.
1180
1181         PCRE2_CONFIG_HEAPLIMIT
1182
1183       The  output is a uint32_t integer that gives, in kibibytes, the default
1184       limit  for  the  amount  of  heap  memory  used  by  pcre2_match()   or
1185       pcre2_dfa_match().      Further      details     are     given     with
1186       pcre2_set_heap_limit() above.
1187
1188         PCRE2_CONFIG_JIT
1189
1190       The output is a uint32_t integer that is set  to  one  if  support  for
1191       just-in-time compiling is available; otherwise it is set to zero.
1192
1193         PCRE2_CONFIG_JITTARGET
1194
       The where argument should point to a buffer that is at least 48 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) The buffer is filled with a
       string that contains the name of the architecture for which the JIT
       compiler is configured, for example
       "x86 32bit (little endian + unaligned)". If JIT support is not
       available, PCRE2_ERROR_BADOPTION is returned, otherwise the number
       of code units used is returned. This is the length of the string,
       plus one unit for the terminating zero.
1203
1204         PCRE2_CONFIG_LINKSIZE
1205
1206       The output is a uint32_t integer that contains the number of bytes used
1207       for  internal  linkage  in  compiled regular expressions. When PCRE2 is
1208       configured, the value can be set to 2, 3, or 4, with the default  being
1209       2.  This is the value that is returned by pcre2_config(). However, when
1210       the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
1211       when  the  32-bit  library  is compiled, internal linkages always use 4
1212       bytes, so the configured value is not relevant.
1213
1214       The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1215       for  all but the most massive patterns, since it allows the size of the
1216       compiled pattern to be up to 65535  code  units.  Larger  values  allow
1217       larger  regular  expressions to be compiled by those two libraries, but
1218       at the expense of slower matching.
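
       The relationship between the configured value and the link size
       actually used by each library can be sketched as follows. The
       function name effective_link_size is hypothetical, and the code
       simply restates the rounding rules given above.

```c
#include <stdint.h>

/* configured is 2, 3, or 4; code_unit_width is 8, 16, or 32.
   Returns the number of bytes used for internal linkage. */
uint32_t effective_link_size(uint32_t configured, uint32_t code_unit_width)
{
    if (code_unit_width == 32) return 4;                    /* always 4 */
    if (code_unit_width == 16 && configured == 3) return 4; /* rounded up */
    return configured;
}
```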
1219
1220         PCRE2_CONFIG_MATCHLIMIT
1221
1222       The output is a uint32_t integer that gives the default match limit for
1223       pcre2_match().  Further  details are given with pcre2_set_match_limit()
1224       above.
1225
1226         PCRE2_CONFIG_NEWLINE
1227
1228       The output is a uint32_t integer  whose  value  specifies  the  default
1229       character  sequence that is recognized as meaning "newline". The values
1230       are:
1231
1232         PCRE2_NEWLINE_CR       Carriage return (CR)
1233         PCRE2_NEWLINE_LF       Linefeed (LF)
1234         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
1235         PCRE2_NEWLINE_ANY      Any Unicode line ending
1236         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
1237         PCRE2_NEWLINE_NUL      The NUL character (binary zero)
1238
1239       The default should normally correspond to  the  standard  sequence  for
1240       your operating system.
1241
1242         PCRE2_CONFIG_NEVER_BACKSLASH_C
1243
1244       The  output  is  a uint32_t integer that is set to one if the use of \C
1245       was permanently disabled when PCRE2 was built; otherwise it is  set  to
1246       zero.
1247
1248         PCRE2_CONFIG_PARENSLIMIT
1249
1250       The  output is a uint32_t integer that gives the maximum depth of nest-
1251       ing of parentheses (of any kind) in a pattern. This limit is imposed to
1252       cap  the  amount of system stack used when a pattern is compiled. It is
1253       specified when PCRE2 is built; the default is 250. This limit does  not
1254       take into account the stack that may already be used by the calling ap-
1255       plication.  For  finer  control  over  compilation  stack  usage,   see
1256       pcre2_set_compile_recursion_guard().
1257
1258         PCRE2_CONFIG_STACKRECURSE
1259
1260       This parameter is obsolete and should not be used in new code. The out-
1261       put is a uint32_t integer that is always set to zero.
1262
1263         PCRE2_CONFIG_TABLES_LENGTH
1264
1265       The output is a uint32_t integer that gives the length of PCRE2's char-
1266       acter  processing  tables in bytes. For details of these tables see the
1267       section on locale support below.
1268
1269         PCRE2_CONFIG_UNICODE_VERSION
1270
1271       The where argument should point to a buffer that is at  least  24  code
1272       units  long.  (The  exact  length  required  can  be  found  by calling
1273       pcre2_config() with where set to NULL.)  If  PCRE2  has  been  compiled
1274       without  Unicode  support,  the buffer is filled with the text "Unicode
1275       not supported". Otherwise, the Unicode  version  string  (for  example,
1276       "8.0.0")  is  inserted. The number of code units used is returned. This
1277       is the length of the string plus one unit for the terminating zero.
1278
1279         PCRE2_CONFIG_UNICODE
1280
1281       The output is a uint32_t integer that is set to one if Unicode  support
1282       is  available; otherwise it is set to zero. Unicode support implies UTF
1283       support.
1284
1285         PCRE2_CONFIG_VERSION
1286
1287       The where argument should point to a buffer that is at  least  24  code
1288       units  long.  (The  exact  length  required  can  be  found  by calling
1289       pcre2_config() with where set to NULL.) The buffer is filled  with  the
1290       PCRE2 version string, zero-terminated. The number of code units used is
1291       returned. This is the length of the string plus one unit for the termi-
1292       nating zero.
1293
1294
1295COMPILING A PATTERN
1296
1297       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
1298         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
1299         pcre2_compile_context *ccontext);
1300
1301       void pcre2_code_free(pcre2_code *code);
1302
1303       pcre2_code *pcre2_code_copy(const pcre2_code *code);
1304
1305       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
1306
1307       The  pcre2_compile() function compiles a pattern into an internal form.
1308       The pattern is defined by a pointer to a string of  code  units  and  a
1309       length  (in  code units). If the pattern is zero-terminated, the length
1310       can be specified  as  PCRE2_ZERO_TERMINATED.  The  function  returns  a
1311       pointer to a block of memory that contains the compiled pattern and re-
1312       lated data, or NULL if an error occurred.
1313
1314       If the compile context argument ccontext is NULL, memory for  the  com-
1315       piled  pattern  is  obtained  by calling malloc(). Otherwise, it is ob-
1316       tained from the same memory function that was used for the compile con-
1317       text. The caller must free the memory by calling pcre2_code_free() when
1318       it is no longer needed.  If pcre2_code_free() is called with a NULL ar-
1319       gument, it returns immediately, without doing anything.
1320
1321       The function pcre2_code_copy() makes a copy of the compiled code in new
1322       memory, using the same memory allocator as was used for  the  original.
1323       However,  if  the  code has been processed by the JIT compiler (see be-
1324       low), the JIT information cannot be copied (because it is  position-de-
1325       pendent).   The  new copy can initially be used only for non-JIT match-
1326       ing, though it can be passed to  pcre2_jit_compile()  if  required.  If
1327       pcre2_code_copy() is called with a NULL argument, it returns NULL.
1328
1329       The pcre2_code_copy() function provides a way for individual threads in
1330       a multithreaded application to acquire a private copy  of  shared  com-
1331       piled  code.   However, it does not make a copy of the character tables
1332       used by the compiled pattern; the new pattern code points to  the  same
1333       tables  as  the original code.  (See "Locale Support" below for details
1334       of these character tables.) In many applications the  same  tables  are
1335       used  throughout, so this behaviour is appropriate. Nevertheless, there
1336       are occasions when a copy of a compiled pattern and the relevant tables
1337       are  needed;  pcre2_code_copy_with_tables()  provides  this  facility.
1338       Copies of both the code and the tables are  made,  with  the  new  code
1339       pointing  to the new tables. The memory for the new tables is automati-
1340       cally freed when pcre2_code_free() is called for the new  copy  of  the
1341       compiled  code.  If pcre2_code_copy_with_tables() is called with a NULL
1342       argument, it returns NULL.
1343
1344       NOTE: When one of the matching functions is  called,  pointers  to  the
1345       compiled pattern and the subject string are set in the match data block
1346       so that they can be referenced by the  substring  extraction  functions
1347       after  a  successful match.  After running a match, you must not free a
1348       compiled pattern or a subject string until after all operations on  the
1349       match  data  block have taken place, unless, in the case of the subject
1350       string, you have used the PCRE2_COPY_MATCHED_SUBJECT option,  which  is
1351       described  in  the section entitled "Option bits for pcre2_match()" be-
1352       low.
1353
1354       The options argument for pcre2_compile() contains various bit  settings
1355       that  affect the compilation. It should be zero if none of them are re-
1356       quired. The available options are described below.  Some  of  them  (in
1357       particular,  those  that  are  compatible with Perl, but some others as
1358       well) can also be set and unset from within the pattern  (see  the  de-
1359       tailed description in the pcre2pattern documentation).
1360
1361       For  those options that can be different in different parts of the pat-
1362       tern, the contents of the options argument specifies their settings  at
1363       the  start  of  compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
1364       PCRE2_NO_UTF_CHECK options can be set at the time of matching  as  well
1365       as at compile time.
1366
1367       Some  additional  options and less frequently required compile-time pa-
1368       rameters (for example, the newline setting) can be provided in  a  com-
1369       pile context (as described above).
1370
1371       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1372       diately. Otherwise, the variables to which these point are  set  to  an
1373       error code and an offset (number of code units) within the pattern, re-
1374       spectively, when pcre2_compile() returns NULL because a compilation er-
1375       ror  has  occurred. The values are not defined when compilation is suc-
1376       cessful and pcre2_compile() returns a non-NULL value.
1377
1378       There are nearly 100 positive error codes that pcre2_compile() may  re-
1379       turn  if it finds an error in the pattern. There are also some negative
1380       error codes that are used for invalid UTF strings when validity  check-
1381       ing  is  in  force.  These  are  the same as given by pcre2_match() and
1382       pcre2_dfa_match(), and are described in the pcre2unicode documentation.
1383       There  is  no  separate documentation for the positive error codes, be-
1384       cause the textual error messages  that  are  obtained  by  calling  the
1385       pcre2_get_error_message() function (see "Obtaining a textual error mes-
1386       sage" below) should be  self-explanatory.  Macro  names  starting  with
1387       PCRE2_ERROR_  are defined for both positive and negative error codes in
1388       pcre2.h.
1389
1390       The value returned in erroroffset is an indication of where in the pat-
1391       tern  the  error  occurred. It is not necessarily the furthest point in
1392       the pattern that was read. For example, after the error "lookbehind as-
1393       sertion  is  not fixed length", the error offset points to the start of
1394       the failing assertion. For an invalid UTF-8 or UTF-16 string, the  off-
1395       set is that of the first code unit of the failing character.
1396
1397       Some  errors are not detected until the whole pattern has been scanned;
1398       in these cases, the offset passed back is the length  of  the  pattern.
1399       Note  that  the  offset is in code units, not characters, even in a UTF
1400       mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1401       acter.
1402
1403       This  code  fragment shows a typical straightforward call to pcre2_com-
1404       pile():
1405
1406         pcre2_code *re;
1407         PCRE2_SIZE erroffset;
1408         int errorcode;
1409         re = pcre2_compile(
1410           (PCRE2_SPTR)"^A.*Z",    /* the pattern */
1411           PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
1412           0,                      /* default options */
1413           &errorcode,             /* for error code */
1414           &erroffset,             /* for error offset */
1415           NULL);                  /* no compile context */
1416
1417
1418   Main compile options
1419
1420       The following names for option bits are defined in the  pcre2.h  header
1421       file:
1422
1423         PCRE2_ANCHORED
1424
1425       If this bit is set, the pattern is forced to be "anchored", that is, it
1426       is constrained to match only at the first matching point in the  string
1427       that  is being searched (the "subject string"). This effect can also be
1428       achieved by appropriate constructs in the pattern itself, which is  the
1429       only way to do it in Perl.
1430
1431         PCRE2_ALLOW_EMPTY_CLASS
1432
1433       By  default, for compatibility with Perl, a closing square bracket that
1434       immediately follows an opening one is treated as a data  character  for
1435       the  class.  When  PCRE2_ALLOW_EMPTY_CLASS  is  set,  it terminates the
1436       class, which therefore contains no characters and so can never match.
1437
1438         PCRE2_ALT_BSUX
1439
1440       This option requests alternative handling of three  escape  sequences,
1441       which  makes  PCRE2's  behaviour more like ECMAscript (aka JavaScript).
1442       When it is set:
1443
1444       (1) \U matches an upper case "U" character; by default \U causes a com-
1445       pile time error (Perl uses \U to upper case subsequent characters).
1446
1447       (2) \u matches a lower case "u" character unless it is followed by four
1448       hexadecimal digits, in which case the hexadecimal  number  defines  the
1449       code  point  to match. By default, \u causes a compile time error (Perl
1450       uses it to upper case the following character).
1451
1452       (3) \x matches a lower case "x" character unless it is followed by  two
1453       hexadecimal  digits,  in  which case the hexadecimal number defines the
1454       code point to match. By default, as in Perl, a  hexadecimal  number  is
1455       always expected after \x, but it may have zero, one, or two digits (so,
1456       for example, \xz matches a binary zero character followed by z).
1457
1458       ECMAscript 6 added additional functionality to \u. This can be accessed
1459       using  the  PCRE2_EXTRA_ALT_BSUX  extra  option (see "Extra compile op-
1460       tions" below).  Note that this alternative escape handling applies only
1461       to  patterns.  Neither  of  these options affects the processing of re-
1462       placement strings passed to pcre2_substitute().
1463
1464         PCRE2_ALT_CIRCUMFLEX
1465
1466       In  multiline  mode  (when  PCRE2_MULTILINE  is  set),  the  circumflex
1467       metacharacter  matches at the start of the subject (unless PCRE2_NOTBOL
1468       is set), and also after any internal  newline.  However,  it  does  not
1469       match after a newline at the end of the subject, for compatibility with
1470       Perl. If you want a multiline circumflex also to match after  a  termi-
1471       nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
1472
1473         PCRE2_ALT_VERBNAMES
1474
1475       By  default, for compatibility with Perl, the name in any verb sequence
1476       such as (*MARK:NAME) is any sequence of characters that  does  not  in-
1477       clude  a closing parenthesis. The name is not processed in any way, and
1478       it is not possible to include a closing parenthesis in the  name.  How-
1479       ever,  if  the PCRE2_ALT_VERBNAMES option is set, normal backslash pro-
1480       cessing is applied to verb names and only an unescaped  closing  paren-
1481       thesis  terminates the name. A closing parenthesis can be included in a
1482       name either as \) or between  \Q  and  \E.  If  the  PCRE2_EXTENDED  or
1483       PCRE2_EXTENDED_MORE  option  is set with PCRE2_ALT_VERBNAMES, unescaped
1484       whitespace in verb names is skipped and #-comments are recognized,  ex-
1485       actly as in the rest of the pattern.
1486
1487         PCRE2_AUTO_CALLOUT
1488
1489       If  this  bit  is  set,  pcre2_compile()  automatically inserts callout
1490       items, all with number 255, before each pattern  item,  except  immedi-
1491       ately  before  or after an explicit callout in the pattern. For discus-
1492       sion of the callout facility, see the pcre2callout documentation.
1493
1494         PCRE2_CASELESS
1495
1496       If this bit is set, letters in the pattern match both upper  and  lower
1497       case  letters in the subject. It is equivalent to Perl's /i option, and
1498       it can be changed within a pattern by a (?i) option setting. If  either
1499       PCRE2_UTF  or  PCRE2_UCP  is  set,  Unicode properties are used for all
1500       characters with more than one other case, and for all characters  whose
1501       code  points  are  greater  than  U+007F. Note that there are two ASCII
1502       characters, K and S, that, in addition to their lower case ASCII equiv-
1503       alents,  are case-equivalent with U+212A (Kelvin sign) and U+017F (long
1504       S) respectively. For lower valued characters with only one other  case,
1505       a  lookup table is used for speed. When neither PCRE2_UTF nor PCRE2_UCP
1506       is set, a lookup table is used for all code points less than  256,  and
1507       higher  code  points  (available  only  in  16-bit  or 32-bit mode) are
1508       treated as not having another case.
1509
1510         PCRE2_DOLLAR_ENDONLY
1511
1512       If this bit is set, a dollar metacharacter in the pattern matches  only
1513       at  the  end  of the subject string. Without this option, a dollar also
1514       matches immediately before a newline at the end of the string (but  not
1515       before  any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
1516       if PCRE2_MULTILINE is set. There is no equivalent  to  this  option  in
1517       Perl, and no way to set it within a pattern.
1518
1519         PCRE2_DOTALL
1520
1521       If  this  bit  is  set,  a dot metacharacter in the pattern matches any
1522       character, including one that indicates a  newline.  However,  it  only
1523       ever matches one character, even if newlines are coded as CRLF. Without
1524       this option, a dot does not match when the current position in the sub-
1525       ject  is  at  a newline. This option is equivalent to Perl's /s option,
1526       and it can be changed within a pattern by a (?s) option setting. A neg-
1527       ative  class such as [^a] always matches newline characters, and the \N
1528       escape sequence always matches a non-newline character, independent  of
1529       the setting of PCRE2_DOTALL.
1530
1531         PCRE2_DUPNAMES
1532
1533       If  this  bit is set, names used to identify capture groups need not be
1534       unique.  This can be helpful for certain types of pattern  when  it  is
1535       known  that  only  one instance of the named group can ever be matched.
1536       There are more details of named capture  groups  below;  see  also  the
1537       pcre2pattern documentation.
1538
1539         PCRE2_ENDANCHORED
1540
1541       If  this  bit is set, the end of any pattern match must be right at the
1542       end of the string being searched (the "subject string"). If the pattern
1543       match succeeds by reaching (*ACCEPT), but does not reach the end of the
1544       subject, the match fails at the current starting point. For  unanchored
1545       patterns,  a  new  match is then tried at the next starting point. How-
1546       ever, if the match succeeds by reaching the end of the pattern, but not
1547       the  end  of  the subject, backtracking occurs and an alternative match
1548       may be found. Consider these two patterns:
1549
1550         .(*ACCEPT)|..
1551         .|..
1552
1553       If matched against "abc" with PCRE2_ENDANCHORED set, the first  matches
1554       "c"  whereas  the  second matches "bc". The effect of PCRE2_ENDANCHORED
1555       can also be achieved by appropriate constructs in the  pattern  itself,
1556       which is the only way to do it in Perl.
1557
1558       For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
1559       to the first (that is, the  longest)  matched  string.  Other  parallel
1560       matches,  which are necessarily substrings of the first one, must obvi-
1561       ously end before the end of the subject.
1562
1563         PCRE2_EXTENDED
1564
1565       If this bit is set, most white space characters in the pattern are  to-
1566       tally ignored except when escaped or inside a character class. However,
1567       white space is not allowed within sequences such as (?> that  introduce
1568       various  parenthesized groups, nor within numerical quantifiers such as
1569       {1,3}. Ignorable white space is permitted between an item and a follow-
1570       ing  quantifier  and  between a quantifier and a following + that indi-
1571       cates possessiveness. PCRE2_EXTENDED is equivalent to Perl's /x option,
1572       and it can be changed within a pattern by a (?x) option setting.
1573
1574       When  PCRE2  is compiled without Unicode support, PCRE2_EXTENDED recog-
1575       nizes as white space only those characters with code points  less  than
1576       256 that are flagged as white space in its low-character table. The ta-
1577       ble is normally created by pcre2_maketables(), which uses the isspace()
1578       function  to identify space characters. In most ASCII environments, the
1579       relevant characters are those with code  points  0x0009  (tab),  0x000A
1580       (linefeed),  0x000B (vertical tab), 0x000C (formfeed), 0x000D (carriage
1581       return), and 0x0020 (space).
1582
1583       When PCRE2 is compiled with Unicode support, in addition to these char-
1584       acters,  five  more Unicode "Pattern White Space" characters are recog-
1585       nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1586       right  mark), U+200F (right-to-left mark), U+2028 (line separator), and
1587       U+2029 (paragraph separator). This set of characters  is  the  same  as
1588       recognized  by  Perl's /x option. Note that the horizontal and vertical
1589       space characters that are matched by the \h and \v escapes in  patterns
1590       are a much bigger set.
1591
1592       As  well as ignoring most white space, PCRE2_EXTENDED also causes char-
1593       acters between an unescaped # outside a character class  and  the  next
1594       newline,  inclusive,  to be ignored, which makes it possible to include
1595       comments inside complicated patterns. Note that the end of this type of
1596       comment  is a literal newline sequence in the pattern; escape sequences
1597       that happen to represent a newline do not count.
1598
1599       Which characters are interpreted as newlines can be specified by a set-
1600       ting  in  the compile context that is passed to pcre2_compile() or by a
1601       special sequence at the start of the pattern, as described in the  sec-
1602       tion  entitled "Newline conventions" in the pcre2pattern documentation.
1603       A default is defined when PCRE2 is built.
1604
1605         PCRE2_EXTENDED_MORE
1606
1607       This option has the effect of PCRE2_EXTENDED,  but,  in  addition,  un-
1608       escaped  space and horizontal tab characters are ignored inside a char-
1609       acter class. Note: only these two characters are ignored, not the  full
1610       set  of pattern white space characters that are ignored outside a char-
1611       acter class. PCRE2_EXTENDED_MORE is equivalent to  Perl's  /xx  option,
1612       and it can be changed within a pattern by a (?xx) option setting.
1613
1614         PCRE2_FIRSTLINE
1615
1616       If this option is set, the start of an unanchored pattern match must be
1617       before or at the first newline in  the  subject  string  following  the
1618       start  of  matching, though the matched text may continue over the new-
1619       line. If startoffset is non-zero, the limiting newline is not necessar-
1620       ily  the  first  newline  in  the  subject. For example, if the subject
1621       string is "abc\nxyz" (where \n represents a single-character newline) a
1622       pattern  match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is
1623       greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a  more
1624       general  limiting  facility.  If  PCRE2_FIRSTLINE is set with an offset
1625       limit, a match must occur in the first line and also within the  offset
1626       limit. In other words, whichever limit comes first is used.
1627
1628         PCRE2_LITERAL
1629
1630       If this option is set, all meta-characters in the pattern are disabled,
1631       and it is treated as a literal string. Matching literal strings with  a
1632       regular expression engine is not the most efficient way of doing it. If
1633       you are doing a lot of literal matching and  are  worried  about  effi-
1634       ciency, you should consider using other approaches. The only other main
1635       options  that  are  allowed  with  PCRE2_LITERAL  are:  PCRE2_ANCHORED,
1636       PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
1637       PCRE2_MATCH_INVALID_UTF,  PCRE2_NO_START_OPTIMIZE,  PCRE2_NO_UTF_CHECK,
1638       PCRE2_UTF,  and  PCRE2_USE_OFFSET_LIMIT.  The  extra  options PCRE2_EX-
1639       TRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also supported. Any other
1640       options cause an error.
1641
1642         PCRE2_MATCH_INVALID_UTF
1643
1644       This  option  forces PCRE2_UTF (see below) and also enables support for
1645       matching by pcre2_match() in subject strings that contain  invalid  UTF
1646       sequences.   This  facility  is not supported for DFA matching. For de-
1647       tails, see the pcre2unicode documentation.
1648
1649         PCRE2_MATCH_UNSET_BACKREF
1650
1651       If this option is set,  a  backreference  to  an  unset  capture  group
1652       matches  an  empty  string (by default this causes the current matching
1653       alternative to fail).  A pattern such as (\1)(a) succeeds when this op-
1654       tion  is  set  (assuming it can find an "a" in the subject), whereas it
1655       fails by default, for Perl compatibility.  Setting  this  option  makes
1656       PCRE2 behave more like ECMAscript (aka JavaScript).
1657
1658         PCRE2_MULTILINE
1659
1660       By  default,  for  the purposes of matching "start of line" and "end of
1661       line", PCRE2 treats the subject string as consisting of a  single  line
1662       of  characters,  even  if  it actually contains newlines. The "start of
1663       line" metacharacter (^) matches only at the start of  the  string,  and
1664       the  "end  of  line"  metacharacter  ($) matches only at the end of the
1665       string, or before a terminating newline (except  when  PCRE2_DOLLAR_EN-
1666       DONLY is set). Note, however, that unless PCRE2_DOTALL is set, the "any
1667       character" metacharacter (.) does not match at a newline.  This  behav-
1668       iour (for ^, $, and dot) is the same as in Perl.
1669
1670       When  PCRE2_MULTILINE  is  set,  the "start of line" and "end of line"
1671       constructs match immediately following or immediately  before  internal
1672       newlines  in  the  subject string, respectively, as well as at the very
1673       start and end. This is equivalent to Perl's /m option, and  it  can  be
1674       changed within a pattern by a (?m) option setting. Note that the "start
1675       of line" metacharacter does not match after a newline at the end of the
1676       subject,  for compatibility with Perl.  However, you can change this by
1677       setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in  a
1678       subject  string,  or  no  occurrences  of  ^ or $ in a pattern, setting
1679       PCRE2_MULTILINE has no effect.
1680
1681         PCRE2_NEVER_BACKSLASH_C
1682
1683       This option locks out the use of \C in the pattern that is  being  com-
1684       piled.   This  escape  can  cause  unpredictable  behaviour in UTF-8 or
1685       UTF-16 modes, because it may leave the current matching  point  in  the
1686       middle of a multi-code-unit character. This option may be useful in ap-
1687       plications that process patterns from external sources. Note that there
1688       is also a build-time option that permanently locks out the use of \C.
1689
1690         PCRE2_NEVER_UCP
1691
1692       This  option  locks  out the use of Unicode properties for handling \B,
1693       \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
1694       described  for  the  PCRE2_UCP option below. In particular, it prevents
1695       the creator of the pattern from enabling this facility by starting  the
1696       pattern  with  (*UCP).  This  option may be useful in applications that
1697       process patterns from external sources. The combination  of  PCRE2_UCP
1698       and PCRE2_NEVER_UCP causes an error.
1699
1700         PCRE2_NEVER_UTF
1701
1702       This  option  locks out interpretation of the pattern as UTF-8, UTF-16,
1703       or UTF-32, depending on which library is in use. In particular, it pre-
1704       vents  the  creator of the pattern from switching to UTF interpretation
1705       by starting the pattern with (*UTF). This option may be useful  in  ap-
1706       plications that process patterns from external sources. The combination
1707       of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
1708
1709         PCRE2_NO_AUTO_CAPTURE
1710
1711       If this option is set, it disables the use of numbered capturing paren-
1712       theses  in the pattern. Any opening parenthesis that is not followed by
1713       ? behaves as if it were followed by ?: but named parentheses can  still
1714       be used for capturing (and they acquire numbers in the usual way). This
1715       is the same as Perl's /n option.  Note that, when this option  is  set,
1716       references  to  capture  groups (backreferences or recursion/subroutine
1717       calls) may only refer to named groups, though the reference can  be  by
1718       name or by number.
1719
1720         PCRE2_NO_AUTO_POSSESS
1721
1722       If this option is set, it disables "auto-possessification", which is an
1723       optimization that, for example, turns a+b into a++b in order  to  avoid
1724       backtracks  into  a+ that can never be successful. However, if callouts
1725       are in use, auto-possessification means that some  callouts  are  never
1726       taken. You can set this option if you want the matching functions to do
1727       a full unoptimized search and run all the callouts, but  it  is  mainly
1728       provided for testing purposes.
1729
1730         PCRE2_NO_DOTSTAR_ANCHOR
1731
1732       If this option is set, it disables an optimization that is applied when
1733       .* is the first significant item in a top-level branch  of  a  pattern,
1734       and  all  the  other branches also start with .* or with \A or \G or ^.
1735       The optimization is automatically disabled for .* if it  is  inside  an
1736       atomic group or a capture group that is the subject of a backreference,
1737       or if the pattern contains (*PRUNE) or (*SKIP). When  the  optimization
1738       is   not   disabled,  such  a  pattern  is  automatically  anchored  if
1739       PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
1740       for  any  ^ items. Otherwise, the fact that any match must start either
1741       at the start of the subject or following a newline is remembered.  Like
1742       other optimizations, this can cause callouts to be skipped.
1743
1744         PCRE2_NO_START_OPTIMIZE
1745
1746       This  is  an  option whose main effect is at matching time. It does not
1747       change what pcre2_compile() generates, but it does affect the output of
1748       the JIT compiler.
1749
1750       There  are  a  number of optimizations that may occur at the start of a
1751       match, in order to speed up the process. For example, if  it  is  known
1752       that  an  unanchored  match must start with a specific code unit value,
1753       the matching code searches the subject for that value, and fails  imme-
1754       diately  if it cannot find it, without actually running the main match-
1755       ing function. This means that a special item such as (*COMMIT)  at  the
1756       start  of  a  pattern is not considered until after a suitable starting
1757       point for the match has been found.  Also,  when  callouts  or  (*MARK)
1758       items  are  in use, these "start-up" optimizations can cause them to be
1759       skipped if the pattern is never actually used. The  start-up  optimiza-
1760       tions  are  in effect a pre-scan of the subject that takes place before
1761       the pattern is run.
1762
1763       The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1764       possibly  causing  performance  to  suffer,  but ensuring that in cases
1765       where the result is "no match", the callouts do occur, and  that  items
1766       such as (*COMMIT) and (*MARK) are considered at every possible starting
1767       position in the subject string.
1768
1769       Setting PCRE2_NO_START_OPTIMIZE may change the outcome  of  a  matching
1770       operation.  Consider the pattern
1771
1772         (*COMMIT)ABC
1773
1774       When  this  is compiled, PCRE2 records the fact that a match must start
1775       with the character "A". Suppose the subject  string  is  "DEFABC".  The
1776       start-up  optimization  scans along the subject, finds "A" and runs the
1777       first match attempt from there. The (*COMMIT) item means that the  pat-
1778       tern  must  match the current starting position, which in this case, it
1779       does. However, if the same match is  run  with  PCRE2_NO_START_OPTIMIZE
1780       set,  the  initial  scan  along the subject string does not happen. The
1781       first match attempt is run starting  from  "D"  and  when  this  fails,
1782       (*COMMIT)  prevents any further matches being tried, so the overall re-
1783       sult is "no match".
1784
1785       Another start-up optimization makes use of a  minimum  length  for  a
1786       matching subject, which is recorded when possible. Consider the pattern
1787
1788         (*MARK:1)B(*MARK:2)(X|Y)
1789
1790       The  minimum  length  for  a match is two characters. If the subject is
1791       "XXBB", the "starting character" optimization skips "XX", then tries to
1792       match  "BB", which is long enough. In the process, (*MARK:2) is encoun-
1793       tered and remembered. When the match attempt fails,  the  next  "B"  is
1794       found,  but  there is only one character left, so there are no more at-
1795       tempts, and "no match" is returned with the "last  mark  seen"  set  to
1796       "2". If PCRE2_NO_START_OPTIMIZE is set, however, matches are tried at every
1797       possible starting position, including at the end of the subject,  where
1798       (*MARK:1)  is encountered, but there is no "B", so the "last mark seen"
1799       that is returned is "1". In this case, the optimizations do not  affect
1800       the overall match result, which is still "no match", but they do affect
1801       the auxiliary information that is returned.
1802
1803         PCRE2_NO_UTF_CHECK
1804
1805       When PCRE2_UTF is set, the validity of the pattern as a UTF  string  is
1806       automatically  checked.  There  are  discussions  about the validity of
1807       UTF-8 strings, UTF-16 strings, and UTF-32 strings in  the  pcre2unicode
1808       document.  If an invalid UTF sequence is found, pcre2_compile() returns
1809       a negative error code.
1810
1811       If you know that your pattern is a valid UTF string, and  you  want  to
1812       skip   this   check   for   performance   reasons,   you  can  set  the
1813       PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in-
1814       valid  UTF  string as a pattern is undefined. It may cause your program
1815       to crash or loop.
1816
1817       Note  that  this  option  can  also  be  passed  to  pcre2_match()  and
1818       pcre2_dfa_match(),  to  suppress  UTF  validity checking of the subject
1819       string.
1820
1821       Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1822       able  the error that is given if an escape sequence for an invalid Uni-
1823       code code point is encountered in the pattern. In particular,  the  so-
1824       called  "surrogate"  code points (0xd800 to 0xdfff) are invalid. If you
1825       want to allow escape  sequences  such  as  \x{d800}  you  can  set  the
1826       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  extra  option, as described in the
1827       section entitled "Extra compile options" below.  However, this is  pos-
1828       sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1829       resentable in UTF-16.
1830
1831         PCRE2_UCP
1832
1833       This option has two effects. Firstly, it changes the way PCRE2 processes
1834       \B,  \b,  \D,  \d,  \S,  \s,  \W,  \w,  and some of the POSIX character
1835       classes. By default, only  ASCII  characters  are  recognized,  but  if
1836       PCRE2_UCP is set, Unicode properties are used instead to classify char-
1837       acters. More details are given in  the  section  on  generic  character
1838       types  in  the pcre2pattern page. If you set PCRE2_UCP, matching one of
1839       the items it affects takes much longer.
1840
1841       The second effect of PCRE2_UCP is to force the use of  Unicode  proper-
1842       ties  for  upper/lower casing operations on characters with code points
1843       greater than 127, even when PCRE2_UTF is not set. This makes it  possi-
1844       ble, for example, to process strings in the 16-bit UCS-2 code. This op-
1845       tion is available only if PCRE2 has been compiled with Unicode  support
1846       (which is the default).
1847
1848         PCRE2_UNGREEDY
1849
1850       This  option  inverts  the "greediness" of the quantifiers so that they
1851       are not greedy by default, but become greedy if followed by "?". It  is
1852       not  compatible  with Perl. It can also be set by a (?U) option setting
1853       within the pattern.
1854
1855         PCRE2_USE_OFFSET_LIMIT
1856
1857       This option must be set for pcre2_compile() if pcre2_set_offset_limit()
1858       is  going  to be used to set a non-default offset limit in a match con-
1859       text for matches that use this pattern. An error  is  generated  if  an
1860       offset  limit is set without this option. For more details, see the de-
1861       scription of pcre2_set_offset_limit() in  the  section  that  describes
1862       match contexts. See also the PCRE2_FIRSTLINE option above.
1863
1864         PCRE2_UTF
1865
1866       This  option  causes  PCRE2  to regard both the pattern and the subject
1867       strings that are subsequently processed as strings  of  UTF  characters
1868       instead  of  single-code-unit  strings.  It  is available when PCRE2 is
1869       built to include Unicode support (which is  the  default).  If  Unicode
1870       support is not available, the use of this option provokes an error. De-
1871       tails of how PCRE2_UTF changes the behaviour of PCRE2 are given in  the
1872       pcre2unicode  page.  In  particular,  note  that  it  changes  the  way
1873       PCRE2_CASELESS handles characters with code points greater than 127.
1874
1875   Extra compile options
1876
1877       The option bits that can be set in a compile  context  by  calling  the
1878       pcre2_set_compile_extra_options() function are as follows:
1879
1880         PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
1881
1882       Since release 10.38 PCRE2 has forbidden the use of \K within lookaround
1883       assertions, following Perl's lead. This option is provided to re-enable
1884       the previous behaviour (act in positive lookarounds, ignore in negative
1885       ones) in case anybody is relying on it.
1886
1887         PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
1888
1889       This option applies when compiling a pattern in UTF-8 or  UTF-32  mode.
1890       It  is  forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
1891       "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
1892       in  UTF-16  to  encode  code points with values in the range 0x10000 to
1893       0x10ffff. The surrogates cannot therefore  be  represented  in  UTF-16.
1894       They can be represented in UTF-8 and UTF-32, but are defined as invalid
1895       code points, and cause errors if  encountered  in  a  UTF-8  or  UTF-32
1896       string that is being checked for validity by PCRE2.
1897
1898       These  values also cause errors if encountered in escape sequences such
1899       as \x{d912} within a pattern. However, it seems that some applications,
1900       when using PCRE2 to check for unwanted characters in UTF-8 strings, ex-
1901       plicitly  test  for  the  surrogates  using   escape   sequences.   The
1902       PCRE2_NO_UTF_CHECK  option  does not disable the error that occurs, be-
1903       cause it applies only to the testing of input strings for UTF validity.
1904
1905       If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set,  surro-
1906       gate  code  point values in UTF-8 and UTF-32 patterns no longer provoke
1907       errors and are incorporated in the compiled pattern. However, they  can
1908       only  match  subject characters if the matching function is called with
1909       PCRE2_NO_UTF_CHECK set.
1910
1911         PCRE2_EXTRA_ALT_BSUX
1912
1913       The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u,  and
1914       \x  in  the way that ECMAScript (aka JavaScript) does. Additional
1915       functionality was defined by ECMAScript 6; setting PCRE2_EXTRA_ALT_BSUX has
1916       the  effect  of PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..}
1917       as a hexadecimal character code, where hhh.. is any number of hexadeci-
1918       mal digits.
1919
1920         PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
1921
1922       This  is a dangerous option. Use with care. By default, an unrecognized
1923       escape such as \j or a malformed one such as \x{2z} causes  a  compile-
1924       time error when detected by pcre2_compile(). Perl is somewhat inconsis-
1925       tent in handling such items: for example, \j is treated  as  a  literal
1926       "j",  and non-hexadecimal digits in \x{} are just ignored, though warn-
1927       ings are given in both cases if Perl's warning switch is enabled.  How-
1928       ever,  a  malformed  octal  number  after \o{ always causes an error in
1929       Perl.
1930
1931       If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL  extra  option  is  passed  to
1932       pcre2_compile(),  all  unrecognized  or  malformed escape sequences are
1933       treated as single-character escapes. For example, \j is a  literal  "j"
1934       and  \x{2z}  is treated as the literal string "x{2z}". Setting this op-
1935       tion means that typos in patterns may go undetected and have unexpected
1936       results.  Also  note  that a sequence such as [\N{] is interpreted as a
1937       malformed attempt at [\N{...}] and so is treated as [N{]  whereas  [\N]
1938       gives an error because an unqualified \N is a valid escape sequence but
1939       is not supported in a character class. To reiterate: this is a  danger-
1940       ous option. Use with great care.
1941
1942         PCRE2_EXTRA_ESCAPED_CR_IS_LF
1943
1944       There  are  some  legacy applications where the escape sequence \r in a
1945       pattern is expected to match a newline. If this option is set, \r in  a
1946       pattern  is  converted to \n so that it matches a LF (linefeed) instead
1947       of a CR (carriage return) character. The option does not affect a  lit-
1948       eral  CR in the pattern, nor does it affect CR specified as an explicit
1949       code point such as \x{0D}.
1950
1951         PCRE2_EXTRA_MATCH_LINE
1952
1953       This option is provided for use by  the  -x  option  of  pcre2grep.  It
1954       causes  the  pattern  only to match complete lines. This is achieved by
1955       automatically inserting the code for "^(?:" at the start  of  the  com-
1956       piled  pattern  and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
1957       the matched line may be in the middle of the subject string.  This  op-
1958       tion can be used with PCRE2_LITERAL.
1959
1960         PCRE2_EXTRA_MATCH_WORD
1961
1962       This  option  is  provided  for  use  by the -w option of pcre2grep. It
1963       causes the pattern only to match strings that have a word  boundary  at
1964       the  start and the end. This is achieved by automatically inserting the
1965       code for "\b(?:" at the start of the compiled pattern and ")\b" at  the
1966       end.  The option may be used with PCRE2_LITERAL. However, it is ignored
1967       if PCRE2_EXTRA_MATCH_LINE is also set.
1968
1969
1970JUST-IN-TIME (JIT) COMPILATION
1971
1972       int pcre2_jit_compile(pcre2_code *code, uint32_t options);
1973
1974       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
1975         PCRE2_SIZE length, PCRE2_SIZE startoffset,
1976         uint32_t options, pcre2_match_data *match_data,
1977         pcre2_match_context *mcontext);
1978
1979       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
1980
1981       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
1982         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
1983
1984       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
1985         pcre2_jit_callback callback_function, void *callback_data);
1986
1987       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
1988
1989       These functions provide support for  JIT  compilation,  which,  if  the
1990       just-in-time  compiler  is available, further processes a compiled pat-
1991       tern into machine code that executes much faster than the pcre2_match()
1992       interpretive  matching function. Full details are given in the pcre2jit
1993       documentation.
1994
1995       JIT compilation is a heavyweight optimization. It can  take  some  time
1996       for  patterns  to  be analyzed, and for one-off matches and simple pat-
1997       terns the benefit of faster execution might be offset by a much  slower
1998       compilation  time.  Most (but not all) patterns can be optimized by the
1999       JIT compiler.
2000
2001
2002LOCALE SUPPORT
2003
2004       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);
2005
2006       void pcre2_maketables_free(pcre2_general_context *gcontext,
2007         const uint8_t *tables);
2008
2009       PCRE2 handles caseless matching, and determines whether characters  are
2010       letters,  digits, or whatever, by reference to a set of tables, indexed
2011       by character code point. However, this applies only to characters whose
2012       code  points  are  less than 256. By default, higher-valued code points
2013       never match escapes such as \w or \d.
2014
2015       When PCRE2 is built with Unicode support (the default), certain Unicode
2016       character  properties  can be tested with \p and \P, or, alternatively,
2017       the PCRE2_UCP option can be set when a pattern is compiled; this causes
2018       \w  and friends to use Unicode property support instead of the built-in
2019       tables.  PCRE2_UCP also causes upper/lower casing operations on charac-
2020       ters with code points greater than 127 to use Unicode properties. These
2021       effects apply even when PCRE2_UTF is not set.
2022
2023       The use of locales with Unicode is discouraged.  If  you  are  handling
2024       characters  with  code  points  greater than 127, you should either use
2025       Unicode support, or use locales, but not try to mix the two.
2026
2027       PCRE2 contains a built-in set of character tables that are used by  de-
2028       fault.   These  are sufficient for many applications. Normally, the in-
2029       ternal tables recognize only ASCII characters. However, when  PCRE2  is
2030       built, it is possible to cause the internal tables to be rebuilt in the
2031       default "C" locale of the local system, which may cause them to be dif-
2032       ferent.
2033
2034       The  built-in tables can be overridden by tables supplied by the appli-
2035       cation that calls PCRE2. These may be created  in  a  different  locale
2036       from  the  default.  As more and more applications change to using Uni-
2037       code, the need for this locale support is expected to die away.
2038
2039       External tables are built by calling the  pcre2_maketables()  function,
2040       in the relevant locale. The only argument to this function is a general
2041       context, which can be used to pass a custom memory  allocator.  If  the
2042       argument is NULL, the system malloc() is used. The result can be passed
2043       to pcre2_compile() as often as necessary, by creating a compile context
2044       and  calling  pcre2_set_character_tables()  to  set  the tables pointer
2045       therein.
2046
2047       For example, to build and use  tables  that  are  appropriate  for  the
2048       French  locale  (where accented characters with values greater than 127
2049       are treated as letters), the following code could be used:
2050
2051         setlocale(LC_CTYPE, "fr_FR");
2052         tables = pcre2_maketables(NULL);
2053         ccontext = pcre2_compile_context_create(NULL);
2054         pcre2_set_character_tables(ccontext, tables);
2055         re = pcre2_compile(..., ccontext);
2056
2057       The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
2058       if you are using Windows, the name for the French locale is "french".
2059
2060       The pointer that is passed (via the compile context) to pcre2_compile()
2061       is saved with the compiled pattern, and the same tables are used by the
2062       matching  functions.  Thus,  for  any  single  pattern, compilation and
2063       matching both happen in the same locale, but different patterns can  be
2064       processed in different locales.
2065
2066       It  is the caller's responsibility to ensure that the memory containing
2067       the tables remains available while they are still in use. When they are
2068       no  longer  needed, you can discard them using pcre2_maketables_free(),
2069       passing as its first argument the same general context that was used to
2070       create the tables.
2071
2072   Saving locale tables
2073
2074       The  tables  described above are just a sequence of binary bytes, which
2075       makes them independent of hardware characteristics such  as  endianness
2076       or  whether  the processor is 32-bit or 64-bit. A copy of the result of
2077       pcre2_maketables() can therefore be saved in a file  or  elsewhere  and
2078       re-used  later, even in a different program or on another computer. The
2079       size of the tables (number  of  bytes)  must  be  obtained  by  calling
2080       pcre2_config()   with  the  PCRE2_CONFIG_TABLES_LENGTH  option  because
2081       pcre2_maketables()  does  not  return  this  value.   Note   that   the
2082       pcre2_dftables program, which is part of the PCRE2 build system, can be
2083       used stand-alone to create a file that contains a set of binary tables.
2084       See the pcre2build documentation for details.
2085
2086
2087INFORMATION ABOUT A COMPILED PATTERN
2088
2089       int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
2090
2091       The  pcre2_pattern_info()  function returns general information about a
2092       compiled pattern. For information about callouts, see the next section.
2093       The  first  argument  for pcre2_pattern_info() is a pointer to the com-
2094       piled pattern. The second argument specifies which piece of information
2095       is  required,  and the third argument is a pointer to a variable to re-
2096       ceive the data. If the third argument is NULL, the  first  argument  is
2097       ignored,  and  the  function  returns the size in bytes of the variable
2098       that is required for the information requested. Otherwise, the yield of
2099       the function is zero for success, or one of the following negative num-
2100       bers:
2101
2102         PCRE2_ERROR_NULL           the argument code was NULL
2103         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
2104         PCRE2_ERROR_BADOPTION      the value of what was invalid
2105         PCRE2_ERROR_UNSET          the requested field is not set
2106
2107       The "magic number" is placed at the start of each compiled pattern as a
2108       simple  check  against  passing  an arbitrary memory pointer. Here is a
2109       typical call of pcre2_pattern_info(), to obtain the length of the  com-
2110       piled pattern:
2111
2112         int rc;
2113         size_t length;
2114         rc = pcre2_pattern_info(
2115           re,               /* result of pcre2_compile() */
2116           PCRE2_INFO_SIZE,  /* what is required */
2117           &length);         /* where to put the data */
2118
2119       The possible values for the second argument are defined in pcre2.h, and
2120       are as follows:
2121
2122         PCRE2_INFO_ALLOPTIONS
2123         PCRE2_INFO_ARGOPTIONS
2124         PCRE2_INFO_EXTRAOPTIONS
2125
2126       Return copies of the pattern's options. The third argument should point
2127       to  a  uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the op-
2128       tions that were passed to  pcre2_compile(),  whereas  PCRE2_INFO_ALLOP-
2129       TIONS  returns  the compile options as modified by any top-level (*XXX)
2130       option settings such as (*UTF) at the  start  of  the  pattern  itself.
2131       PCRE2_INFO_EXTRAOPTIONS  returns the extra options that were set in the
2132       compile context by calling the pcre2_set_compile_extra_options()  func-
2133       tion.
2134
2135       For  example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EX-
2136       TENDED option, the result for PCRE2_INFO_ALLOPTIONS  is  PCRE2_EXTENDED
2137       and  PCRE2_UTF.   Option settings such as (?i) that can change within a
2138       pattern do not affect the result of PCRE2_INFO_ALLOPTIONS, even if they
2139       appear  right  at the start of the pattern. (This was different in some
2140       earlier releases.)
2141
2142       A pattern compiled without PCRE2_ANCHORED is automatically anchored  by
2143       PCRE2 if the first significant item in every top-level branch is one of
2144       the following:
2145
2146         ^     unless PCRE2_MULTILINE is set
2147         \A    always
2148         \G    always
2149         .*    sometimes - see below
2150
2151       When .* is the first significant item, anchoring is possible only  when
2152       all the following are true:
2153
2154         .* is not in an atomic group
2155         .* is not in a capture group that is the subject
2156              of a backreference
2157         PCRE2_DOTALL is in force for .*
2158         Neither (*PRUNE) nor (*SKIP) appears in the pattern
2159         PCRE2_NO_DOTSTAR_ANCHOR is not set
2160
2161       For  patterns  that are auto-anchored, the PCRE2_ANCHORED bit is set in
2162       the options returned for PCRE2_INFO_ALLOPTIONS.
2163
2164         PCRE2_INFO_BACKREFMAX
2165
2166       Return the number of the highest  backreference  in  the  pattern.  The
2167       third  argument  should  point  to  a  uint32_t variable. Named capture
2168       groups acquire numbers as well as names, and these  count  towards  the
2169       highest  backreference.  Backreferences  such as \4 or \g{12} match the
2170       captured characters of the given group, but in addition, the check that
2171       a capture group is set in a conditional group such as (?(3)a|b) is also
2172       a backreference.  Zero is returned if there are no backreferences.
2173
2174         PCRE2_INFO_BSR
2175
2176       The output is a uint32_t integer whose value indicates  what  character
2177       sequences  the \R escape sequence matches. A value of PCRE2_BSR_UNICODE
2178       means that \R matches any Unicode line  ending  sequence;  a  value  of
2179       PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.
2180
2181         PCRE2_INFO_CAPTURECOUNT
2182
2183       Return  the  highest  capture  group number in the pattern. In patterns
2184       where (?| is not used, this is also the total number of capture groups.
2185       The third argument should point to a uint32_t variable.
2186
2187         PCRE2_INFO_DEPTHLIMIT
2188
2189       If  the  pattern set a backtracking depth limit by including an item of
2190       the form (*LIMIT_DEPTH=nnnn) at the start, the value is  returned.  The
2191       third argument should point to a uint32_t integer. If no such value has
2192       been set, the call to pcre2_pattern_info() returns the error  PCRE2_ER-
2193       ROR_UNSET. Note that this limit will only be used during matching if it
2194       is less than the limit set or defaulted by  the  caller  of  the  match
2195       function.
2196
2197         PCRE2_INFO_FIRSTBITMAP
2198
2199       In  the absence of a single first code unit for a non-anchored pattern,
2200       pcre2_compile() may construct a 256-bit table that defines a fixed  set
2201       of  values for the first code unit in any match. For example, a pattern
2202       that starts with [abc] results in a table with  three  bits  set.  When
2203       code  unit  values greater than 255 are supported, the flag bit for 255
2204       means "any code unit of value 255 or above". If such a table  was  con-
2205       structed,  a pointer to it is returned. Otherwise NULL is returned. The
2206       third argument should point to a const uint8_t * variable.
2207
2208         PCRE2_INFO_FIRSTCODETYPE
2209
2210       Return information about the first code unit of any matched string, for
2211       a  non-anchored  pattern. The third argument should point to a uint32_t
2212       variable. If there is a fixed first value, for example, the letter  "c"
2213       from  a  pattern such as (cat|cow|coyote), 1 is returned, and the value
2214       can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is  no  fixed
2215       first  value,  but it is known that a match can occur only at the start
2216       of the subject or following a newline in the subject,  2  is  returned.
2217       Otherwise, and for anchored patterns, 0 is returned.
2218
2219         PCRE2_INFO_FIRSTCODEUNIT
2220
2221       Return  the  value  of  the first code unit of any matched string for a
2222       pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise  return  0.
2223       The  third  argument  should point to a uint32_t variable. In the 8-bit
2224       library, the value is always less than 256. In the 16-bit  library  the
2225       value  can  be  up  to 0xffff. In the 32-bit library in UTF-32 mode the
2226       value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2227       mode.
2228
2229         PCRE2_INFO_FRAMESIZE
2230
2231       Return the size (in bytes) of the data frames that are used to remember
2232       backtracking positions when the pattern is processed  by  pcre2_match()
2233       without  the  use  of  JIT. The third argument should point to a size_t
2234       variable. The frame size depends on the number of capturing parentheses
2235       in the pattern. Each additional capture group adds two PCRE2_SIZE vari-
2236       ables.
2237
2238         PCRE2_INFO_HASBACKSLASHC
2239
2240       Return 1 if the pattern contains any instances of \C, otherwise 0.  The
2241       third argument should point to a uint32_t variable.
2242
2243         PCRE2_INFO_HASCRORLF
2244
2245       Return  1  if  the  pattern  contains any explicit matches for CR or LF
2246       characters, otherwise 0. The third argument should point to a  uint32_t
2247       variable.  An explicit match is either a literal CR or LF character, or
2248       \r or \n or one of the  equivalent  hexadecimal  or  octal  escape  se-
2249       quences.
2250
2251         PCRE2_INFO_HEAPLIMIT
2252
2253       If the pattern set a heap memory limit by including an item of the form
2254       (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2255       ment should point to a uint32_t integer. If no such value has been set,
2256       the call to pcre2_pattern_info() returns the  error  PCRE2_ERROR_UNSET.
2257       Note  that  this  limit will only be used during matching if it is less
2258       than the limit set or defaulted by the caller of the match function.
2259
2260         PCRE2_INFO_JCHANGED
2261
2262       Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
2263       otherwise  0.  The  third argument should point to a uint32_t variable.
2264       (?J) and (?-J) set and unset the local PCRE2_DUPNAMES  option,  respec-
2265       tively.
2266
2267         PCRE2_INFO_JITSIZE
2268
2269       If  the  compiled  pattern was successfully processed by pcre2_jit_com-
2270       pile(), return the size of the  JIT  compiled  code,  otherwise  return
2271       zero. The third argument should point to a size_t variable.
2272
2273         PCRE2_INFO_LASTCODETYPE
2274
2275       Returns  1 if there is a rightmost literal code unit that must exist in
2276       any matched string, other than at its start. The third argument  should
2277       point to a uint32_t variable. If there is no such value, 0 is returned.
2278       When 1 is returned, the code unit value itself can be  retrieved  using
2279       PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
2280       recorded only if it follows something of variable length. For  example,
2281       for  the pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned
2282       from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value  is
2283       0.
2284
2285         PCRE2_INFO_LASTCODEUNIT
2286
2287       Return  the value of the rightmost literal code unit that must exist in
2288       any matched string, other than  at  its  start,  for  a  pattern  where
2289       PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2290       ment should point to a uint32_t variable.
2291
2292         PCRE2_INFO_MATCHEMPTY
2293
2294       Return 1 if the pattern might match an empty string, otherwise  0.  The
2295       third argument should point to a uint32_t variable. When a pattern con-
2296       tains recursive subroutine calls it is not always possible to determine
2297       whether or not it can match an empty string. PCRE2 takes a cautious ap-
2298       proach and returns 1 in such cases.
2299
2300         PCRE2_INFO_MATCHLIMIT
2301
2302       If the pattern set a match limit by  including  an  item  of  the  form
2303       (*LIMIT_MATCH=nnnn)  at the start, the value is returned. The third ar-
2304       gument should point to a uint32_t integer. If no such  value  has  been
2305       set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UN-
2306       SET. Note that this limit will only be used during matching  if  it  is
2307       less  than  the limit set or defaulted by the caller of the match func-
2308       tion.
2309
2310         PCRE2_INFO_MAXLOOKBEHIND
2311
2312       A lookbehind assertion moves back a certain number of  characters  (not
2313       code  units)  when  it starts to process each of its branches. This re-
2314       quest returns the largest of these backward moves. The  third  argument
2315       should point to a uint32_t integer. The simple assertions \b and \B re-
2316       quire a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND  to
2317       return  1  in  the absence of anything longer. \A also registers a one-
2318       character lookbehind, though it does not actually inspect the  previous
2319       character.
2320
2321       Note that this information is useful for multi-segment matching only if
2322       the pattern contains no nested lookbehinds. For  example,  the  pattern
2323       (?<=a(?<=ba)c)  returns  a maximum lookbehind of 2, but when it is pro-
2324       cessed, the first lookbehind moves back by two characters, matches  one
2325       character,  then  the  nested lookbehind also moves back by two charac-
2326       ters. This puts the matching point three characters earlier than it was
2327       at  the start.  PCRE2_INFO_MAXLOOKBEHIND is really only useful as a de-
2328       bugging tool. See the pcre2partial documentation for  a  discussion  of
2329       multi-segment matching.
2330
2331         PCRE2_INFO_MINLENGTH
2332
2333       If  a  minimum  length  for  matching subject strings was computed, its
2334       value is returned. Otherwise the returned value is 0. This value is not
2335       computed  when PCRE2_NO_START_OPTIMIZE is set. The value is a number of
2336       characters, which in UTF mode may be different from the number of  code
2337       units.  The  third  argument  should  point to a uint32_t variable. The
2338       value is a lower bound to the length of any matching string. There  may
2339       not  be  any  strings  of that length that do actually match, but every
2340       string that does match is at least that long.
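
       The difference between characters and code units can be seen in a
       small sketch for the 8-bit library in UTF mode. It assumes the subject
       is already known to be valid UTF-8. The helper names utf8_char_count()
       and long_enough_to_match() are hypothetical and are not part of PCRE2;
       the library applies its own minimum length check, so this is only an
       illustration of how the returned value relates to a subject.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count characters in a valid UTF-8 string by counting the bytes that are
   not continuation bytes (continuation bytes have the form 10xxxxxx). */
size_t utf8_char_count(const unsigned char *subject, size_t length)
{
    size_t count = 0;
    assert(subject != NULL || length == 0);
    for (size_t i = 0; i < length; i++) {
        if ((subject[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Return 1 if the subject holds at least min_length characters, where
   min_length is a value obtained with PCRE2_INFO_MINLENGTH. A result of 0
   means that no match is possible for this subject. */
int long_enough_to_match(const unsigned char *subject, size_t length,
                         uint32_t min_length)
{
    return utf8_char_count(subject, length) >= min_length;
}
```

       A subject of six bytes can therefore hold only five characters, which
       is the figure that must be compared with the minimum length.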
2341
2342         PCRE2_INFO_NAMECOUNT
2343         PCRE2_INFO_NAMEENTRYSIZE
2344         PCRE2_INFO_NAMETABLE
2345
2346       PCRE2 supports the use of named as well as numbered capturing parenthe-
2347       ses.  The names are just an additional way of identifying the parenthe-
2348       ses, which still acquire numbers. Several convenience functions such as
2349       pcre2_substring_get_byname()  are provided for extracting captured sub-
2350       strings by name. It is also possible to extract the data  directly,  by
2351       first  converting  the  name to a number in order to access the correct
2352       pointers in the output vector (described with pcre2_match() below).  To
2353       do the conversion, you need to use the name-to-number map, which is de-
2354       scribed by these three values.
2355
2356       The map consists of a number of  fixed-size  entries.  PCRE2_INFO_NAME-
2357       COUNT  gives  the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
2358       the size of each entry in code units; both of these return  a  uint32_t
2359       value. The entry size depends on the length of the longest name.
2360
2361       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
2362       This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li-
2363       brary,  the first two bytes of each entry are the number of the captur-
2364       ing parenthesis, most significant byte first. In  the  16-bit  library,
2365       the  pointer  points  to 16-bit code units, the first of which contains
2366       the parenthesis number. In the 32-bit library, the  pointer  points  to
2367       32-bit  code units, the first of which contains the parenthesis number.
2368       The rest of the entry is the corresponding name, zero terminated.
2369
2370       The names are in alphabetical order. If (?| is used to create  multiple
2371       capture groups with the same number, as described in the section on du-
2372       plicate group numbers in the pcre2pattern page, the groups may be given
2373       the  same  name,  but  there  is only one entry in the table. Different
2374       names for groups of the same number are not permitted.
2375
2376       Duplicate names for capture groups with different numbers  are  permit-
2377       ted, but only if PCRE2_DUPNAMES is set. They appear in the table in the
2378       order in which they were found in the pattern. In the  absence  of  (?|
2379       this  is  the  order of increasing number; when (?| is used this is not
2380       necessarily the case because later capture groups may have  lower  num-
2381       bers.
2382
2383       As  a  simple  example of the name/number table, consider the following
2384       pattern after compilation by the 8-bit library  (assume  PCRE2_EXTENDED
2385       is set, so white space - including newlines - is ignored):
2386
2387         (?<date> (?<year>(\d\d)?\d\d) -
2388         (?<month>\d\d) - (?<day>\d\d) )
2389
2390       There are four named capture groups, so the table has four entries, and
2391       each entry in the table is eight bytes long. The table is  as  follows,
2392       with non-printing bytes shown in hexadecimal, and undefined bytes shown
2393       as ??:
2394
2395         00 01 d  a  t  e  00 ??
2396         00 05 d  a  y  00 ?? ??
2397         00 04 m  o  n  t  h  00
2398         00 02 y  e  a  r  00 ??
2399
2400       When writing code to extract data from named capture groups  using  the
2401       name-to-number  map,  remember that the length of the entries is likely
2402       to be different for each compiled pattern.
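
       As an illustration of the 8-bit layout described above, the following
       sketch scans a name table for a given name and returns the group
       number. The function find_group_number() and the array
       example_name_table are hypothetical names used only here; the array
       reproduces the example table, with the undefined bytes arbitrarily
       set to zero. In a real program the table pointer, entry count, and
       entry size would come from pcre2_pattern_info().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The four 8-byte entries of the example table shown above. The bytes
   marked ?? there are undefined; zero is used here for illustration. */
const unsigned char example_name_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};

/* Linear scan of an 8-bit name table. Each entry starts with the group
   number (two bytes, most significant first), followed by the
   zero-terminated name. Returns the group number of the first entry
   whose name matches, or -1 if the name is not in the table. */
int find_group_number(const unsigned char *table, unsigned name_count,
                      unsigned entry_size, const char *name)
{
    assert(table != NULL && name != NULL);
    for (unsigned i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

       The entry size is passed in as a parameter instead of being fixed,
       because it varies from one compiled pattern to another. A linear scan
       is used for simplicity; because the names are in alphabetical order,
       a binary search is also possible when there are no duplicate names.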
2403
2404         PCRE2_INFO_NEWLINE
2405
2406       The output is one of the following uint32_t values:
2407
2408         PCRE2_NEWLINE_CR       Carriage return (CR)
2409         PCRE2_NEWLINE_LF       Linefeed (LF)
2410         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
2411         PCRE2_NEWLINE_ANY      Any Unicode line ending
2412         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
2413         PCRE2_NEWLINE_NUL      The NUL character (binary zero)
2414
2415       This identifies the character sequence that will be recognized as mean-
2416       ing "newline" while matching.
2417
2418         PCRE2_INFO_SIZE
2419
2420       Return  the  size  of  the compiled pattern in bytes (for all three li-
2421       braries). The third argument should point to a  size_t  variable.  This
2422       value  includes  the  size  of the general data block that precedes the
2423       code units of the compiled pattern itself. The value that is used  when
2424       pcre2_compile()  is  getting memory in which to place the compiled pat-
2425       tern may be slightly larger than the value returned by this option, be-
2426       cause  there  are  cases where the code that calculates the size has to
2427       over-estimate. Processing a pattern with the JIT compiler does not  al-
2428       ter the value returned by this option.
2429
2430
2431INFORMATION ABOUT A PATTERN'S CALLOUTS
2432
2433       int pcre2_callout_enumerate(const pcre2_code *code,
2434         int (*callback)(pcre2_callout_enumerate_block *, void *),
2435         void *user_data);
2436
2437       A script language that supports the use of string arguments in callouts
2438       might like to scan all the callouts in a  pattern  before  running  the
2439       match. This can be done by calling pcre2_callout_enumerate(). The first
2440       argument is a pointer to a compiled pattern, the  second  points  to  a
2441       callback  function,  and the third is arbitrary user data. The callback
2442       function is called for every callout in the pattern  in  the  order  in
2443       which they appear. Its first argument is a pointer to a callout enumer-
2444       ation block, and its second argument is the user_data  value  that  was
2445       passed  to  pcre2_callout_enumerate(). The contents of the callout enu-
2446       meration block are described in the pcre2callout  documentation,  which
2447       also gives further details about callouts.
2448
2449
2450SERIALIZATION AND PRECOMPILING
2451
2452       It  is  possible  to  save  compiled patterns on disc or elsewhere, and
2453       reload them later, subject to a number of  restrictions.  The  host  on
2454       which  the  patterns  are  reloaded must be running the same version of
2455       PCRE2, with the same code unit width, and must also have the same endi-
2456       anness,  pointer  width,  and PCRE2_SIZE type. Before compiled patterns
2457       can be saved, they must be converted to a "serialized" form,  which  in
2458       the  case of PCRE2 is really just a bytecode dump.  The functions whose
2459       names begin with pcre2_serialize_ are used for converting to  and  from
2460       the  serialized form. They are described in the pcre2serialize documen-
2461       tation. Note that PCRE2 serialization does not  convert  compiled  pat-
2462       terns to an abstract format like Java or .NET serialization.
2463
2464
2465THE MATCH DATA BLOCK
2466
2467       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
2468         pcre2_general_context *gcontext);
2469
2470       pcre2_match_data *pcre2_match_data_create_from_pattern(
2471         const pcre2_code *code, pcre2_general_context *gcontext);
2472
2473       void pcre2_match_data_free(pcre2_match_data *match_data);
2474
2475       Information  about  a  successful  or unsuccessful match is placed in a
2476       match data block, which is an opaque  structure  that  is  accessed  by
2477       function  calls.  In particular, the match data block contains a vector
2478       of offsets into the subject string that define the matched parts of the
2479       subject. This is known as the ovector.
2480
2481       Before  calling  pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
2482       you must create a match data block by calling one of the creation func-
2483       tions  above.  For pcre2_match_data_create(), the first argument is the
2484       number of pairs of offsets in the ovector.
2485
2486       When using pcre2_match(), one pair of offsets is required  to  identify
2487       the  string that matched the whole pattern, with an additional pair for
2488       each captured substring. For example, a value of 4 creates enough space
2489       to  record  the matched portion of the subject plus three captured sub-
2490       strings.
2491
2492       When using pcre2_dfa_match() there may be multiple  matched  substrings
2493       of  different  lengths  at  the  same point in the subject. The ovector
2494       should be made large enough to hold as many as are expected.
2495
2496       A minimum of 1 pair is imposed by pcre2_match_data_create(),
2497       so  it  is  always possible to return the overall matched string in the
2498       case  of  pcre2_match()  or  the  longest  match   in   the   case   of
2499       pcre2_dfa_match().
2500
2501       The second argument of pcre2_match_data_create() is a pointer to a gen-
2502       eral context, which can specify custom memory management for  obtaining
2503       the memory for the match data block. If you are not using custom memory
2504       management, pass NULL, which causes malloc() to be used.
2505
2506       For pcre2_match_data_create_from_pattern(), the  first  argument  is  a
2507       pointer to a compiled pattern. The ovector is created to be exactly the
2508       right size to hold all the substrings  a  pattern  might  capture  when
2509       matched using pcre2_match(). You should not use this call when matching
2510       with pcre2_dfa_match(). The second argument is again  a  pointer  to  a
2511       general  context, but in this case if NULL is passed, the memory is ob-
2512       tained using the same allocator that was used for the compiled  pattern
2513       (custom or default).
2514
2515       A  match  data block can be used many times, with the same or different
2516       compiled patterns. You can extract information from a match data  block
2517       after  a  match  operation  has  finished, using functions that are de-
2518       scribed in the sections on matched strings and other match data below.
2519
2520       When a call of pcre2_match() fails, valid  data  is  available  in  the
2521       match  block  only  when  the  error  is PCRE2_ERROR_NOMATCH, PCRE2_ER-
2522       ROR_PARTIAL, or one of the error codes for an invalid UTF  string.  Ex-
2523       actly what is available depends on the error, and is detailed below.
2524
2525       When  one of the matching functions is called, pointers to the compiled
2526       pattern and the subject string are set in the match data block so  that
2527       they  can  be referenced by the extraction functions after a successful
2528       match. After running a match, you must not free a compiled pattern or a
2529       subject  string until after all operations on the match data block (for
2530       that match) have taken place,  unless,  in  the  case  of  the  subject
2531       string,  you  have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
2532       described in the section entitled "Option bits for  pcre2_match()"  be-
2533       low.
2534
2535       When  a match data block itself is no longer needed, it should be freed
2536       by calling pcre2_match_data_free(). If this function is called  with  a
2537       NULL argument, it returns immediately, without doing anything.
2538
2539
2540MATCHING A PATTERN: THE TRADITIONAL FUNCTION
2541
2542       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
2543         PCRE2_SIZE length, PCRE2_SIZE startoffset,
2544         uint32_t options, pcre2_match_data *match_data,
2545         pcre2_match_context *mcontext);
2546
2547       The  function pcre2_match() is called to match a subject string against
2548       a compiled pattern, which is passed in the code argument. You can  call
2549       pcre2_match() with the same code argument as many times as you like, in
2550       order to find multiple matches in the subject string or to  match  dif-
2551       ferent subject strings with the same pattern.
2552
2553       This  function is the main matching facility of the library, and it op-
2554       erates in a Perl-like manner. For specialist use there is also  an  al-
2555       ternative  matching  function,  which is described below in the section
2556       about the pcre2_dfa_match() function.
2557
2558       Here is an example of a simple call to pcre2_match():
2559
2560         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
2561         int rc = pcre2_match(
2562           re,             /* result of pcre2_compile() */
2563           "some string",  /* the subject string */
2564           11,             /* the length of the subject string */
2565           0,              /* start at offset 0 in the subject */
2566           0,              /* default options */
2567           md,             /* the match data block */
2568           NULL);          /* a match context; NULL means use defaults */
2569
2570       If the subject string is zero-terminated, the length can  be  given  as
2571       PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
2572       common matching parameters are to be changed. For details, see the sec-
2573       tion on the match context above.
2574
2575   The string to be matched by pcre2_match()
2576
2577       The  subject string is passed to pcre2_match() as a pointer in subject,
2578       a length in length, and a starting offset in  startoffset.  The  length
2579       and  offset  are  in  code units, not characters.  That is, they are in
2580       bytes for the 8-bit library, 16-bit code units for the 16-bit  library,
2581       and  32-bit  code units for the 32-bit library, whether or not UTF pro-
2582       cessing is enabled. As a special case, if subject is NULL and length is
2583       zero,  the  subject is assumed to be an empty string. If length is non-
2584       zero, an error occurs if subject is NULL.
2585
2586       If startoffset is greater than the length of the subject, pcre2_match()
2587       returns  PCRE2_ERROR_BADOFFSET.  When  the starting offset is zero, the
2588       search for a match starts at the beginning of the subject, and this  is
2589       by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2590       set must point to the start of a character, or to the end of  the  sub-
2591       ject  (in  UTF-32 mode, one code unit equals one character, so all off-
2592       sets are valid). Like the pattern string, the subject may  contain  bi-
2593       nary zeros.
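
       The rules for the starting offset in UTF-8 mode can be sketched as a
       check that a caller might apply before calling pcre2_match(). The
       function utf8_start_offset_ok() is a hypothetical helper, not part of
       PCRE2, and it assumes that the subject itself is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if start_offset is acceptable for a valid UTF-8 subject of
   the given length (in code units, that is, bytes), otherwise 0. */
int utf8_start_offset_ok(const unsigned char *subject, size_t length,
                         size_t start_offset)
{
    assert(subject != NULL || length == 0);
    if (start_offset > length)
        return 0;               /* beyond the end of the subject */
    if (start_offset == length)
        return 1;               /* the end of the subject is allowed */
    /* A byte of the form 10xxxxxx is a continuation byte, so it cannot
       be the first code unit of a character. */
    return (subject[start_offset] & 0xC0) != 0x80;
}
```

       For the four-byte subject consisting of "a", a two-byte character,
       and "z", the offsets 0, 1, 3, and 4 are acceptable, but offset 2
       points into the middle of the two-byte character.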
2594
2595       A  non-zero  starting offset is useful when searching for another match
2596       in the same subject by calling pcre2_match()  again  after  a  previous
2597       success.   Setting  startoffset  differs  from passing over a shortened
2598       string and setting PCRE2_NOTBOL in the case of a  pattern  that  begins
2599       with any kind of lookbehind. For example, consider the pattern
2600
2601         \Biss\B
2602
2603       which  finds  occurrences  of "iss" in the middle of words. (\B matches
2604       only if the current position in the subject is not  a  word  boundary.)
2605       When   applied   to   the   string  "Mississippi"  the  first  call  to
2606       pcre2_match() finds the first occurrence. If  pcre2_match()  is  called
2607       again with just the remainder of the subject, namely "issippi", it does
2608       not match, because \B is always false at  the  start  of  the  subject,
2609       which  is  deemed  to  be a word boundary. However, if pcre2_match() is
2610       passed the entire string again, but with startoffset set to 4, it finds
2611       the  second  occurrence  of "iss" because it is able to look behind the
2612       starting point to discover that it is preceded by a letter.
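
       The effect of the preceding character can be shown with a simplified,
       ASCII-only model of the \B test. This is not the library's code; the
       name at_non_word_boundary() is hypothetical, and word characters are
       taken to be letters, digits, and underscore.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Model of \B at position pos: it succeeds when the characters on each
   side of pos are either both word characters or both non-word
   characters. A position outside the subject counts as non-word. */
int at_non_word_boundary(const char *subject, size_t length, size_t pos)
{
    assert(subject != NULL && pos <= length);
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before == after;
}
```

       With the shortened subject "issippi" the test at position 0 fails,
       because nothing precedes the "i". With the whole of "Mississippi" and
       position 4, the preceding "s" is visible and the test succeeds.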
2613
2614       Finding all the matches in a subject is tricky  when  the  pattern  can
2615       match an empty string. It is possible to emulate Perl's /g behaviour by
2616       first  trying  the  match  again  at  the   same   offset,   with   the
2617       PCRE2_NOTEMPTY_ATSTART  and  PCRE2_ANCHORED  options,  and then if that
2618       fails, advancing the starting  offset  and  trying  an  ordinary  match
2619       again.  There  is  some  code  that  demonstrates how to do this in the
2620       pcre2demo sample program. In the most general case, you have  to  check
2621       to  see  if the newline convention recognizes CRLF as a newline, and if
2622       so, and the current character is CR followed by LF, advance the  start-
2623       ing offset by two characters instead of one.
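
       The offset-advancing step of that procedure might look like the
       following sketch for the 8-bit library. It covers only the step that
       moves the starting offset on after an empty match; the pcre2demo
       program remains the reference for the complete loop. The function
       name advance_start_offset() and its flag parameters are assumptions
       made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Return the next starting offset after an empty match at offset.
   crlf_is_newline should be non-zero when the newline convention
   recognizes CRLF; utf8 should be non-zero when the pattern was
   compiled in UTF mode. The caller must ensure offset < length. */
size_t advance_start_offset(const unsigned char *subject, size_t length,
                            size_t offset, int crlf_is_newline, int utf8)
{
    assert(subject != NULL && offset < length);

    /* Step over a CR LF pair as a single unit. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    offset++;
    if (utf8) {
        /* Skip continuation bytes so that the new offset points to the
           first code unit of the next character. */
        while (offset < length && (subject[offset] & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```

       When the empty match is at the very end of the subject there is
       nothing to advance over, so the caller should stop searching instead
       of calling this function.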
2624
2625       If a non-zero starting offset is passed when the pattern is anchored, a
2626       single attempt to match at the given offset is made. This can only suc-
2627       ceed  if  the  pattern does not require the match to be at the start of
2628       the subject. In other words, the anchoring must be the result  of  set-
2629       ting  the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL, not
2630       by starting the pattern with ^ or \A.
2631
2632   Option bits for pcre2_match()
2633
2634       The unused bits of the options argument for pcre2_match() must be zero.
2635       The    only    bits    that    may    be    set   are   PCRE2_ANCHORED,
2636       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL,  PCRE2_NO-
2637       TEOL,     PCRE2_NOTEMPTY,     PCRE2_NOTEMPTY_ATSTART,     PCRE2_NO_JIT,
2638       PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and  PCRE2_PARTIAL_SOFT.  Their
2639       action is described below.
2640
2641       Setting  PCRE2_ANCHORED  or PCRE2_ENDANCHORED at match time is not sup-
2642       ported by the just-in-time (JIT) compiler. If it is set,  JIT  matching
2643       is  disabled  and  the interpretive code in pcre2_match() is run. Apart
2644       from PCRE2_NO_JIT (obviously), the remaining options are supported  for
2645       JIT matching.
2646
2647         PCRE2_ANCHORED
2648
2649       The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
2650       matching position. If a pattern was compiled  with  PCRE2_ANCHORED,  or
2651       turned  out to be anchored by virtue of its contents, it cannot be made
2652       unanchored at matching time. Note that setting the option at match  time
2653       disables JIT matching.
2654
2655         PCRE2_COPY_MATCHED_SUBJECT
2656
2657       By  default,  a  pointer to the subject is remembered in the match data
2658       block so that, after a successful match, it can be  referenced  by  the
2659       substring  extraction  functions.  This means that the subject's memory
2660       must not be freed until all such operations are complete. For some  ap-
2661       plications  where the lifetime of the subject string is not guaranteed,
2662       it may be necessary to make a copy of the subject  string,  but  it  is
2663       wasteful  to do this unless the match is successful. After a successful
2664       match, if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is copied  and
2665       the  new  pointer  is remembered in the match data block instead of the
2666       original subject pointer. The memory allocator that was  used  for  the
2667       match  block  itself  is  used.  The  copy  is automatically freed when
2668       pcre2_match_data_free() is called to free the match data block.  It  is
2669       also automatically freed if the match data block is re-used for another
2670       match operation.
2671
2672         PCRE2_ENDANCHORED
2673
2674       If the PCRE2_ENDANCHORED option is set, any string  that  pcre2_match()
2675       matches  must be right at the end of the subject string. Note that set-
2676       ting the option at match time disables JIT matching.
2677
2678         PCRE2_NOTBOL
2679
2680       This option specifies that the first character of the subject string is not
2681       the  beginning  of  a  line, so the circumflex metacharacter should not
2682       match before it. Setting this without  having  set  PCRE2_MULTILINE  at
2683       compile time causes circumflex never to match. This option affects only
2684       the behaviour of the circumflex metacharacter. It does not affect \A.
2685
2686         PCRE2_NOTEOL
2687
2688       This option specifies that the end of the subject string is not the end
2689       of  a line, so the dollar metacharacter should not match it nor (except
2690       in multiline mode) a newline immediately before it. Setting this  with-
2691       out  having  set PCRE2_MULTILINE at compile time causes dollar never to
2692       match. This option affects only the behaviour of the dollar metacharac-
2693       ter. It does not affect \Z or \z.
2694
2695         PCRE2_NOTEMPTY
2696
2697       An empty string is not considered to be a valid match if this option is
2698       set. If there are alternatives in the pattern, they are tried.  If  all
2699       the  alternatives  match  the empty string, the entire match fails. For
2700       example, if the pattern
2701
2702         a?b?
2703
2704       is applied to a string not beginning with "a" or  "b",  it  matches  an
2705       empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
2706       match is not valid, so pcre2_match() searches further into  the  string
2707       for occurrences of "a" or "b".
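
       The behaviour can be imitated with a hand-written matcher for this one
       pattern. The sketch below is only a model of the semantics, not a use
       of the library; the names struct span and match_a_opt_b_opt() are
       hypothetical, and the not_empty flag stands in for PCRE2_NOTEMPTY.

```c
#include <assert.h>
#include <stddef.h>

struct span {
    int matched;        /* 1 if a match was found, otherwise 0 */
    size_t start;       /* offset of the first matched byte */
    size_t end;         /* offset just past the last matched byte */
};

/* Find the first match of a?b? in subject, trying each starting
   position in turn. When not_empty is set, an empty match is rejected
   and the search moves on to the next position. */
struct span match_a_opt_b_opt(const char *subject, size_t length,
                              int not_empty)
{
    struct span result = { 0, 0, 0 };
    assert(subject != NULL);
    for (size_t pos = 0; pos <= length; pos++) {
        size_t p = pos;
        if (p < length && subject[p] == 'a') p++;   /* greedy a? */
        if (p < length && subject[p] == 'b') p++;   /* greedy b? */
        if (p == pos && not_empty)
            continue;                   /* empty match is not valid */
        result.matched = 1;
        result.start = pos;
        result.end = p;
        return result;
    }
    return result;
}
```

       For the subject "xyab" the default behaviour gives an empty match at
       offset 0, whereas with not_empty set the match is "ab" at offsets 2
       to 4. For a subject such as "xyz" that contains neither letter, no
       match is found when not_empty is set.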
2708
2709         PCRE2_NOTEMPTY_ATSTART
2710
2711       This  is  like PCRE2_NOTEMPTY, except that it locks out an empty string
2712       match only at the first matching position, that is, at the start of the
2713       subject  plus  the  starting offset. An empty string match later in the
2714       subject is permitted.  If the pattern is anchored, such a match can oc-
2715       cur only if the pattern contains \K.
2716
2717         PCRE2_NO_JIT
2718
2719       By   default,   if   a  pattern  has  been  successfully  processed  by
2720       pcre2_jit_compile(), JIT is automatically used  when  pcre2_match()  is
2721       called  with  options  that JIT supports. Setting PCRE2_NO_JIT disables
2722       the use of JIT; it forces matching to be done by the interpreter.
2723
2724         PCRE2_NO_UTF_CHECK
2725
2726       When PCRE2_UTF is set at compile time, the validity of the subject as a
2727       UTF   string   is   checked  unless  PCRE2_NO_UTF_CHECK  is  passed  to
2728       pcre2_match() or PCRE2_MATCH_INVALID_UTF was passed to pcre2_compile().
2729       The latter special case is discussed in detail in the pcre2unicode doc-
2730       umentation.
2731
2732       In the default case, if a non-zero starting offset is given, the  check
2733       is  applied  only  to  that part of the subject that could be inspected
2734       during matching, and there is a check that the starting  offset  points
2735       to  the first code unit of a character or to the end of the subject. If
2736       there are no lookbehind assertions in the pattern, the check starts  at
2737       the starting offset.  Otherwise, it starts at the length of the longest
2738       lookbehind before the starting offset, or at the start of  the  subject
2739       if  there are not that many characters before the starting offset. Note
2740       that the sequences \b and \B are one-character lookbehinds.
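
       The point at which the check begins can be modelled as follows for
       UTF-8. This is a sketch of the rule as described above, not the
       library's implementation. The name utf8_check_start() is
       hypothetical, max_lookbehind is assumed to be a value such as that
       returned for PCRE2_INFO_MAXLOOKBEHIND, and the bytes before the
       starting offset are assumed to be well formed so that stepping back
       over them is meaningful.

```c
#include <assert.h>
#include <stddef.h>

/* Return the offset at which a validity check would begin: up to
   max_lookbehind characters before start_offset, or offset 0 if there
   are not that many characters before it. */
size_t utf8_check_start(const unsigned char *subject, size_t start_offset,
                        unsigned max_lookbehind)
{
    size_t pos = start_offset;
    assert(subject != NULL || start_offset == 0);
    for (unsigned i = 0; i < max_lookbehind && pos > 0; i++) {
        pos--;
        /* Step back over continuation bytes (10xxxxxx) to reach the
           first code unit of the character. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
    }
    return pos;
}
```

       For a subject made up of "a", a two-byte character, and "bc", with a
       starting offset of 3, a one-character lookbehind moves the start of
       the check back by two code units to offset 1.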
2741
2742       The check is carried out before any other processing takes place, and a
2743       negative  error  code is returned if the check fails. There are several
2744       UTF error codes for each code unit width,  corresponding  to  different
2745       problems  with  the code unit sequence. There are discussions about the
2746       validity of UTF-8 strings, UTF-16 strings, and UTF-32  strings  in  the
2747       pcre2unicode documentation.
2748
2749       If you know that your subject is valid, and you want to skip this check
2750       for performance reasons, you can set the PCRE2_NO_UTF_CHECK option when
2751       calling  pcre2_match().  You  might  want to do this for the second and
2752       subsequent calls to pcre2_match() if you are making repeated  calls  to
2753       find multiple matches in the same subject string.
2754
2755       Warning:  Unless  PCRE2_MATCH_INVALID_UTF was set at compile time, when
2756       PCRE2_NO_UTF_CHECK is set at match time the effect of  passing  an  in-
2757       valid string as a subject, or an invalid value of startoffset, is unde-
2758       fined.  Your program may crash or loop indefinitely or give  wrong  re-
2759       sults.
2760
2761         PCRE2_PARTIAL_HARD
2762         PCRE2_PARTIAL_SOFT
2763
2764       These options turn on the partial matching feature. A partial match oc-
2765       curs if the end of the subject  string  is  reached  successfully,  but
2766       there are not enough subject characters to complete the match. In addi-
2767       tion, either at least one character must have  been  inspected  or  the
2768       pattern  must  contain  a  lookbehind,  or the pattern must be one that
2769       could match an empty string.
2770
2771       If this situation arises when PCRE2_PARTIAL_SOFT  (but  not  PCRE2_PAR-
2772       TIAL_HARD) is set, matching continues by testing any remaining alterna-
2773       tives. Only if no complete match can be  found  is  PCRE2_ERROR_PARTIAL
2774       returned  instead  of  PCRE2_ERROR_NOMATCH.  In other words, PCRE2_PAR-
2775       TIAL_SOFT specifies that the caller is prepared  to  handle  a  partial
2776       match, but only if no complete match can be found.
2777
2778       If  PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
2779       case, if a partial match is found,  pcre2_match()  immediately  returns
2780       PCRE2_ERROR_PARTIAL,  without  considering  any  other alternatives. In
2781       other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
2782       ered to be more important than an alternative complete match.
2783
2784       There is a more detailed discussion of partial and multi-segment match-
2785       ing, with examples, in the pcre2partial documentation.
2786
2787
2788NEWLINE HANDLING WHEN MATCHING
2789
2790       When PCRE2 is built, a default newline convention is set; this is  usu-
2791       ally  the standard convention for the operating system. The default can
2792       be overridden in a compile context by calling  pcre2_set_newline().  It
2793       can  also be overridden by starting a pattern string with, for example,
2794       (*CRLF), as described in the section  on  newline  conventions  in  the
2795       pcre2pattern  page. During matching, the newline choice affects the be-
2796       haviour of the dot, circumflex, and dollar metacharacters. It may  also
2797       alter  the  way  the  match starting position is advanced after a match
2798       failure for an unanchored pattern.
2799
2800       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
2801       set  as  the  newline convention, and a match attempt for an unanchored
2802       pattern fails when the current starting position is at a CRLF sequence,
2803       and  the  pattern contains no explicit matches for CR or LF characters,
2804       the match position is advanced by two characters  instead  of  one,  in
2805       other words, to after the CRLF.
2806
2807       The above rule is a compromise that makes the most common cases work as
2808       expected. For example, if the pattern is .+A (and the PCRE2_DOTALL  op-
2809       tion  is  not set), it does not match the string "\r\nA" because, after
2810       failing at the start, it skips both the CR and the LF before  retrying.
2811       However,  the  pattern  [\r\n]A does match that string, because it con-
2812       tains an explicit CR or LF reference, and so advances only by one char-
2813       acter after the first failure.
2814
2815       An explicit match for CR or LF is either a literal appearance of one of
2816       those characters in the pattern, or one of the \r or \n  or  equivalent
2817       octal or hexadecimal escape sequences. Implicit matches such as [^X] do
2818       not count, nor does \s, even though it includes CR and LF in the  char-
2819       acters that it matches.
2820
2821       Notwithstanding  the above, anomalous effects may still occur when CRLF
2822       is a valid newline sequence and explicit \r or \n escapes appear in the
2823       pattern.
2824
2825
2826HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
2827
2828       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
2829
2830       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
2831
2832       In  general, a pattern matches a certain portion of the subject, and in
2833       addition, further substrings from the subject  may  be  picked  out  by
2834       parenthesized  parts  of  the  pattern.  Following the usage in Jeffrey
2835       Friedl's book, this is called "capturing"  in  what  follows,  and  the
2836       phrase  "capture  group" (Perl terminology) is used for a fragment of a
2837       pattern that picks out a substring. PCRE2 supports several other  kinds
2838       of parenthesized group that do not cause substrings to be captured. The
2839       pcre2_pattern_info() function can be used to find out how many  capture
2840       groups there are in a compiled pattern.
2841
2842       You  can  use  auxiliary functions for accessing captured substrings by
2843       number or by name, as described in sections below.
2844
2845       Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2846       ues,  called  the  ovector,  which  contains  the  offsets  of captured
2847       strings.  It  is  part  of  the  match  data   block.    The   function
2848       pcre2_get_ovector_pointer()  returns  the  address  of the ovector, and
2849       pcre2_get_ovector_count() returns the number of pairs of values it con-
2850       tains.
2851
2852       Within the ovector, the first in each pair of values is set to the off-
2853       set of the first code unit of a substring, and the second is set to the
2854       offset  of the first code unit after the end of a substring. These val-
2855       ues are always code unit offsets, not character offsets. That is,  they
2856       are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit li-
2857       brary, and 32-bit offsets in the 32-bit library.
2858
2859       After a partial match  (error  return  PCRE2_ERROR_PARTIAL),  only  the
2860       first  pair  of  offsets  (that is, ovector[0] and ovector[1]) are set.
2861       They identify the part of the subject that was partially  matched.  See
2862       the pcre2partial documentation for details of partial matching.
2863
2864       After  a  fully  successful match, the first pair of offsets identifies
2865       the portion of the subject string that was matched by the  entire  pat-
2866       tern.  The  next  pair is used for the first captured substring, and so
2867       on. The value returned by pcre2_match() is one more  than  the  highest
2868       numbered  pair  that  has been set. For example, if two substrings have
2869       been captured, the returned value is 3. If there are no  captured  sub-
2870       strings, the return value from a successful match is 1, indicating that
2871       just the first pair of offsets has been set.
2872
2873       If a pattern uses the \K escape sequence within a  positive  assertion,
2874       the reported start of a successful match can be greater than the end of
2875       the match.  For example, if the pattern  (?=ab\K)  is  matched  against
2876       "ab", the start and end offset values for the match are 2 and 0.
2877
2878       If  a  capture group is matched repeatedly within a single match opera-
2879       tion, it is the last portion of the subject that it matched that is re-
2880       turned.
2881
2882       If the ovector is too small to hold all the captured substring offsets,
2883       as much as possible is filled in, and the function returns a  value  of
2884       zero.  If captured substrings are not of interest, pcre2_match() may be
2885       called with a match data block whose ovector is of minimum length (that
2886       is, one pair).
2887
2888       It  is  possible for capture group number n+1 to match some part of the
2889       subject when group n has not been used at  all.  For  example,  if  the
2890       string "abc" is matched against the pattern (a|(z))(bc) the return from
2891       the function is 4, and groups 1 and 3 are matched, but 2 is  not.  When
2892       this  happens,  both values in the offset pairs corresponding to unused
2893       groups are set to PCRE2_UNSET.
2894
2895       Offset values that correspond to unused groups at the end  of  the  ex-
2896       pression  are also set to PCRE2_UNSET. For example, if the string "abc"
2897       is matched against the pattern (abc)(x(yz)?)? groups 2 and  3  are  not
2898       matched.  The  return  from the function is 2, because the highest used
2899       capture group number is 1. The offsets  for  the  second  and  third
2900       capture  groups  (assuming  the vector is large enough, of course) are
2901       set to PCRE2_UNSET.
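
       The  two  cases above (an unused group in the middle and unused groups
       at the end) can be told apart from real captures by testing for the un-
       set  value.  The sketch below works on a plain array so that it can be
       read without the library; SKETCH_UNSET is a stand-in for  PCRE2_UNSET,
       size_t  stands in for PCRE2_SIZE, and both helper names are illustra-
       tive assumptions.

```c
#include <stddef.h>

/* Stand-in for PCRE2_UNSET so the sketch compiles without pcre2.h. */
#define SKETCH_UNSET (~(size_t)0)

/* Hypothetical helper: 1 if the group has offsets, 0 if it is unset or
   has no slot in the ovector. */
int group_is_set(const size_t *ovector, size_t pair_count, size_t group)
{
    if (group >= pair_count) return 0;
    return ovector[2 * group] != SKETCH_UNSET;
}

/* Hypothetical helper: number of the highest set pair plus one, which is
   how the return value of a successful match is described above. */
int highest_set_plus_one(const size_t *ovector, size_t pair_count)
{
    for (size_t i = pair_count; i > 0; i--)
        if (group_is_set(ovector, pair_count, i - 1)) return (int)i;
    return 0;
}
```

       For  (a|(z))(bc)  against "abc" the pairs are 0,3 then 0,1 then unset,
       then 1,3, so the second helper yields 4 even though group 2 is unset.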
2902
2903       Elements in the ovector that do not correspond to capturing parentheses
2904       in the pattern are never changed. That is, if a pattern contains n cap-
2905       turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
2906       pcre2_match().  The  other  elements retain whatever values they previ-
2907       ously had. After a failed match attempt, the contents  of  the  ovector
2908       are unchanged.
2909
2910
2911OTHER INFORMATION ABOUT A MATCH
2912
2913       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
2914
2915       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
2916
2917       As  well as the offsets in the ovector, other information about a match
2918       is retained in the match data block and can be retrieved by  the  above
2919       functions  in  appropriate  circumstances.  If they are called at other
2920       times, the result is undefined.
2921
2922       After a successful match, a partial match (PCRE2_ERROR_PARTIAL),  or  a
2923       failure  to  match (PCRE2_ERROR_NOMATCH), a mark name may be available.
2924       The function pcre2_get_mark() can be called to access this name,  which
2925       can  be  specified  in  the  pattern by any of the backtracking control
2926       verbs, not just (*MARK). The same function applies to all the verbs. It
2927       returns a pointer to the zero-terminated name, which is within the com-
2928       piled pattern. If no name is available, NULL is returned. The length of
2929       the  name  (excluding  the terminating zero) is stored in the code unit
2930       that precedes the name. You should use this length instead  of  relying
2931       on the terminating zero if the name might contain a binary zero.
2932
2933       After  a  successful  match, the name that is returned is the last mark
2934       name encountered on the matching path through the pattern. Instances of
2935       backtracking  verbs  without  names do not count. Thus, for example, if
2936       the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned.
2937       After a "no match" or a partial match, the last encountered name is re-
2938       turned. For example, consider this pattern:
2939
2940         ^(*MARK:A)((*MARK:B)a|b)c
2941
2942       When it matches "bc", the returned name is A. The B mark is  "seen"  in
2943       the  first  branch of the group, but it is not on the matching path. On
2944       the other hand, when this pattern fails to  match  "bx",  the  returned
2945       name is B.
2946
2947       Warning:  By  default, certain start-of-match optimizations are used to
2948       give a fast "no match" result in some situations. For example,  if  the
2949       anchoring  is removed from the pattern above, there is an initial check
2950       for the presence of "c" in the subject before running the matching  en-
2951       gine. This check fails for "bx", causing a match failure without seeing
2952       any marks. You can disable the start-of-match optimizations by  setting
2953       the  PCRE2_NO_START_OPTIMIZE  option for pcre2_compile() or by starting
2954       the pattern with (*NO_START_OPT).
2955
2956       After a successful match, a partial match, or one of  the  invalid  UTF
2957       errors  (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
2958       be called. After a successful or partial match it returns the code unit
2959       offset  of  the character at which the match started. For a non-partial
2960       match, this can be different to the value of ovector[0] if the  pattern
2961       contains  the  \K escape sequence. After a partial match, however, this
2962       value is always the same as ovector[0] because \K does not  affect  the
2963       result of a partial match.
2964
2965       After  a UTF check failure, pcre2_get_startchar() can be used to obtain
2966       the code unit offset of the invalid UTF character. Details are given in
2967       the pcre2unicode page.
2968
2969
2970ERROR RETURNS FROM pcre2_match()
2971
2972       If  pcre2_match() fails, it returns a negative number. This can be con-
2973       verted to a text string by calling the pcre2_get_error_message()  func-
2974       tion  (see  "Obtaining a textual error message" below).  Negative error
2975       codes are also returned by other functions,  and  are  documented  with
2976       them.  The codes are given names in the header file. If UTF checking is
2977       in force and an invalid UTF subject string is detected, one of a number
2978       of  UTF-specific negative error codes is returned. Details are given in
2979       the pcre2unicode page. The following are the other errors that  may  be
2980       returned by pcre2_match():
2981
2982         PCRE2_ERROR_NOMATCH
2983
2984       The subject string did not match the pattern.
2985
2986         PCRE2_ERROR_PARTIAL
2987
2988       The  subject  string did not match, but it did match partially. See the
2989       pcre2partial documentation for details of partial matching.
2990
2991         PCRE2_ERROR_BADMAGIC
2992
2993       PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
2994       to  catch  the case when it is passed a junk pointer. This is the error
2995       that is returned when the magic number is not present.
2996
2997         PCRE2_ERROR_BADMODE
2998
2999       This error is given when a compiled pattern is passed to a function  in
3000       a  library  of a different code unit width, for example, a pattern com-
3001       piled by the 8-bit library is passed to  a  16-bit  or  32-bit  library
3002       function.
3003
3004         PCRE2_ERROR_BADOFFSET
3005
3006       The value of startoffset was greater than the length of the subject.
3007
3008         PCRE2_ERROR_BADOPTION
3009
3010       An unrecognized bit was set in the options argument.
3011
3012         PCRE2_ERROR_BADUTFOFFSET
3013
3014       The UTF code unit sequence that was passed as a subject was checked and
3015       found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but  the
3016       value  of startoffset did not point to the beginning of a UTF character
3017       or the end of the subject.
3018
3019         PCRE2_ERROR_CALLOUT
3020
3021       This error is never generated by pcre2_match() itself. It  is  provided
3022       for  use  by  callout  functions  that  want  to cause pcre2_match() or
3023       pcre2_callout_enumerate() to return a distinctive error code.  See  the
3024       pcre2callout documentation for details.
3025
3026         PCRE2_ERROR_DEPTHLIMIT
3027
3028       The nested backtracking depth limit was reached.
3029
3030         PCRE2_ERROR_HEAPLIMIT
3031
3032       The heap limit was reached.
3033
3034         PCRE2_ERROR_INTERNAL
3035
3036       An  unexpected  internal error has occurred. This error could be caused
3037       by a bug in PCRE2 or by overwriting of the compiled pattern.
3038
3039         PCRE2_ERROR_JIT_STACKLIMIT
3040
3041       This error is returned when a pattern that was successfully JIT-compiled
3042       is being matched, but the memory available for the just-in-time
3043       processing stack is not large enough. See  the  pcre2jit  documentation
3044       for more details.
3045
3046         PCRE2_ERROR_MATCHLIMIT
3047
3048       The backtracking match limit was reached.
3049
3050         PCRE2_ERROR_NOMEMORY
3051
3052       If  a  pattern contains many nested backtracking points, heap memory is
3053       used to remember them. This error is given when the  memory  allocation
3054       function  (default  or  custom)  fails.  Note  that  a different error,
3055       PCRE2_ERROR_HEAPLIMIT, is given if the amount of memory needed  exceeds
3056       the    heap   limit.   PCRE2_ERROR_NOMEMORY   is   also   returned   if
3057       PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails.
3058
3059         PCRE2_ERROR_NULL
3060
3061       Either the code, subject, or match_data argument was passed as NULL.
3062
3063         PCRE2_ERROR_RECURSELOOP
3064
3065       This error is returned when  pcre2_match()  detects  a  recursion  loop
3066       within  the  pattern. Specifically, it means that either the whole pat-
3067       tern or a capture group has been called recursively for the second time
3068       at  the  same position in the subject string. Some simple patterns that
3069       might do this are detected and faulted at compile time, but  more  com-
3070       plicated  cases,  in particular mutual recursions between two different
3071       groups, cannot be detected until matching is attempted.
3072
3073
3074OBTAINING A TEXTUAL ERROR MESSAGE
3075
3076       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
3077         PCRE2_SIZE bufflen);
3078
3079       A text message for an error code  from  any  PCRE2  function  (compile,
3080       match,  or  auxiliary)  can be obtained by calling pcre2_get_error_mes-
3081       sage(). The code is passed as the first argument,  with  the  remaining
3082       two  arguments  specifying  a  code  unit buffer and its length in code
3083       units, into which the text message is placed. The message  is  returned
3084       in  code  units  of the appropriate width for the library that is being
3085       used.
3086
3087       The returned message is terminated with a trailing zero, and the  func-
3088       tion  returns  the  number  of  code units used, excluding the trailing
3089       zero. If the error number is unknown, the negative error code PCRE2_ER-
3090       ROR_BADDATA  is  returned.  If  the buffer is too small, the message is
3091       truncated (but still with a trailing zero), and the negative error code
3092       PCRE2_ERROR_NOMEMORY  is returned.  None of the messages are very long;
3093       a buffer size of 120 code units is ample.
3094
3095
3096EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
3097
3098       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
3099         uint32_t number, PCRE2_SIZE *length);
3100
3101       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
3102         uint32_t number, PCRE2_UCHAR *buffer,
3103         PCRE2_SIZE *bufflen);
3104
3105       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
3106         uint32_t number, PCRE2_UCHAR **bufferptr,
3107         PCRE2_SIZE *bufflen);
3108
3109       void pcre2_substring_free(PCRE2_UCHAR *buffer);
3110
3111       Captured substrings can be accessed directly by using  the  ovector  as
3112       described above.  For convenience, auxiliary functions are provided for
3113       extracting  captured  substrings  as  new,  separate,   zero-terminated
3114       strings. A substring that contains a binary zero is correctly extracted
3115       and has a further zero added on the end, but  the  result  is  not,  of
3116       course, a C string.
3117
3118       The functions in this section identify substrings by number. The number
3119       zero refers to the entire matched substring, with higher numbers refer-
3120       ring  to  substrings  captured by parenthesized groups. After a partial
3121       match, only substring zero is available.  An  attempt  to  extract  any
3122       other  substring  gives the error PCRE2_ERROR_PARTIAL. The next section
3123       describes similar functions for extracting captured substrings by name.
3124
3125       If a pattern uses the \K escape sequence within a  positive  assertion,
3126       the reported start of a successful match can be greater than the end of
3127       the match.  For example, if the pattern  (?=ab\K)  is  matched  against
3128       "ab",  the  start  and  end offset values for the match are 2 and 0. In
3129       this situation, calling these functions with a  zero  substring  number
3130       extracts a zero-length empty string.
3131
3132       You  can  find the length in code units of a captured substring without
3133       extracting it by calling pcre2_substring_length_bynumber().  The  first
3134       argument  is a pointer to the match data block, the second is the group
3135       number, and the third is a pointer to a variable into which the  length
3136       is  placed.  If  you just want to know whether or not the substring has
3137       been captured, you can pass the third argument as NULL.
3138
3139       The pcre2_substring_copy_bynumber() function  copies  a  captured  sub-
3140       string  into  a supplied buffer, whereas pcre2_substring_get_bynumber()
3141       copies it into new memory, obtained using the  same  memory  allocation
3142       function  that  was  used for the match data block. The first two argu-
3143       ments of these functions are a pointer to the match data  block  and  a
3144       capture group number.
3145
3146       The final arguments of pcre2_substring_copy_bynumber() are a pointer to
3147       the buffer and a pointer to a variable that contains its length in code
3148       units.  This is updated to contain the actual number of code units used
3149       for the extracted substring, excluding the terminating zero.
3150
3151       For pcre2_substring_get_bynumber() the third and fourth arguments point
3152       to  variables that are updated with a pointer to the new memory and the
3153       number of code units that comprise the substring, again  excluding  the
3154       terminating  zero.  When  the substring is no longer needed, the memory
3155       should be freed by calling pcre2_substring_free().
3156
3157       The return value from all these functions is zero  for  success,  or  a
3158       negative  error  code.  If  the pattern match failed, the match failure
3159       code is returned.  If a substring number greater than zero is used  af-
3160       ter  a  partial  match, PCRE2_ERROR_PARTIAL is returned. Other possible
3161       error codes are:
3162
3163         PCRE2_ERROR_NOMEMORY
3164
3165       The buffer was too small for  pcre2_substring_copy_bynumber(),  or  the
3166       attempt to get memory failed for pcre2_substring_get_bynumber().
3167
3168         PCRE2_ERROR_NOSUBSTRING
3169
3170       There  is  no  substring  with that number in the pattern, that is, the
3171       number is greater than the number of capturing parentheses.
3172
3173         PCRE2_ERROR_UNAVAILABLE
3174
3175       The substring number, though not greater than the number of captures in
3176       the pattern, is greater than the number of slots in the ovector, so the
3177       substring could not be captured.
3178
3179         PCRE2_ERROR_UNSET
3180
3181       The substring did not participate in the match.  For  example,  if  the
3182       pattern  is  (abc)|(def) and the subject is "def", and the ovector con-
3183       tains at least two capturing slots, substring number 1 is unset.
3184
3185
3186EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
3187
3188       int pcre2_substring_list_get(pcre2_match_data *match_data,
3189         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
3190
3191       void pcre2_substring_list_free(PCRE2_SPTR *list);
3192
3193       The pcre2_substring_list_get() function  extracts  all  available  sub-
3194       strings  and  builds  a  list of pointers to them. It also (optionally)
3195       builds a second list that contains their lengths (in code  units),  ex-
3196       cluding  a  terminating zero that is added to each of them. All this is
3197       done in a single block of memory that is obtained using the same memory
3198       allocation function that was used to get the match data block.
3199
3200       This  function  must be called only after a successful match. If called
3201       after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
3202
3203       The address of the memory block is returned via listptr, which is  also
3204       the start of the list of string pointers. The end of the list is marked
3205       by a NULL pointer. The address of the list of lengths is  returned  via
3206       lengthsptr.  If your strings do not contain binary zeros and you do not
3207       therefore need the lengths, you may supply NULL as the lengthsptr argu-
3208       ment  to  disable  the  creation of a list of lengths. The yield of the
3209       function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the  mem-
3210       ory  block could not be obtained. When the list is no longer needed, it
3211       should be freed by calling pcre2_substring_list_free().
3212
3213       If this function encounters a substring that is unset, which can happen
3214       when  capture  group  number  n+1 matches some part of the subject, but
3215       group n has not been used at all, it returns an empty string. This  can
3216       be distinguished from a genuine zero-length substring by inspecting the
3217       appropriate offset in the ovector, which contains PCRE2_UNSET for unset
3218       substrings, or by calling pcre2_substring_length_bynumber().
3219
3220
3221EXTRACTING CAPTURED SUBSTRINGS BY NAME
3222
3223       int pcre2_substring_number_from_name(const pcre2_code *code,
3224         PCRE2_SPTR name);
3225
3226       int pcre2_substring_length_byname(pcre2_match_data *match_data,
3227         PCRE2_SPTR name, PCRE2_SIZE *length);
3228
3229       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
3230         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
3231
3232       int pcre2_substring_get_byname(pcre2_match_data *match_data,
3233         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
3234
3235       void pcre2_substring_free(PCRE2_UCHAR *buffer);
3236
3237       To  extract a substring by name, you first have to find the associated
3238       number.  For example, for this pattern:
3239
3240         (a+)b(?<xxx>\d+)...
3241
3242       the number of the capture group called "xxx" is 2. If the name is known
3243       to be unique (PCRE2_DUPNAMES was not set), you can find the number from
3244       the name by calling pcre2_substring_number_from_name(). The first argu-
3245       ment  is the compiled pattern, and the second is the name. The yield of
3246       the function is the group number, PCRE2_ERROR_NOSUBSTRING if  there  is
3247       no  group  with that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is
3248       more than one group with that name.  Given the number, you can  extract
3249       the  substring  directly from the ovector, or use one of the "bynumber"
3250       functions described above.
3251
3252       For convenience, there are also "byname" functions that  correspond  to
3253       the "bynumber" functions, the only difference being that the second ar-
3254       gument is a name instead of a number.  If  PCRE2_DUPNAMES  is  set  and
3255       there are duplicate names, these functions scan all the groups with the
3256       given name, and return the captured  substring  from  the  first  named
3257       group that is set.
3258
3259       If  there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
3260       returned. If all groups with the name have  numbers  that  are  greater
3261       than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re-
3262       turned. If there is at least one group with a slot in the ovector,  but
3263       no group is found to be set, PCRE2_ERROR_UNSET is returned.
3264
3265       Warning: If the pattern uses the (?| feature to set up multiple capture
3266       groups with the same number, as described in the section  on  duplicate
3267       group numbers in the pcre2pattern page, you cannot use names to distin-
3268       guish the different capture groups, because names are not  included  in
3269       the  compiled  code.  The  matching process uses only numbers. For this
3270       reason, the use of different names for  groups  with  the  same  number
3271       causes an error at compile time.
3272
3273
3274CREATING A NEW STRING WITH SUBSTITUTIONS
3275
3276       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
3277         PCRE2_SIZE length, PCRE2_SIZE startoffset,
3278         uint32_t options, pcre2_match_data *match_data,
3279         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
3280         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
3281         PCRE2_SIZE *outlengthptr);
3282
3283       This  function  optionally calls pcre2_match() and then makes a copy of
3284       the subject string in outputbuffer, replacing parts that  were  matched
3285       with the replacement string, whose length is supplied in rlength, which
3286       can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.  As
3287       a  special  case,  if  replacement is NULL and rlength is zero, the re-
3288       placement is assumed to be an empty string. If rlength is non-zero,  an
3289       error occurs if replacement is NULL.
3290
3291       There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
3292       turn just the replacement string(s). The default action is  to  perform
3293       just  one  replacement  if  the pattern matches, but there is an option
3294       that requests multiple replacements  (see  PCRE2_SUBSTITUTE_GLOBAL  be-
3295       low).
3296
3297       If  successful,  pcre2_substitute() returns the number of substitutions
3298       that were carried out. This may be zero if no match was found,  and  is
3299       never  greater  than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A nega-
3300       tive value is returned if an error is detected.
3301
3302       Matches in which a \K item in a lookahead in  the  pattern  causes  the
3303       match  to  end  before it starts are not supported, and give rise to an
3304       error return. For global replacements, matches in which \K in a lookbe-
3305       hind  causes the match to start earlier than the point that was reached
3306       in the previous iteration are also not supported.
3307
3308       The first seven arguments of pcre2_substitute() are  the  same  as  for
3309       pcre2_match(), except that the partial matching options are not permit-
3310       ted, and match_data may be passed as NULL, in which case a  match  data
3311       block  is obtained and freed within this function, using memory manage-
3312       ment functions from the match context, if provided, or else those  that
3313       were used to allocate memory for the compiled code.
3314
3315       If  match_data is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the
3316       provided block is used for all calls to pcre2_match(), and its contents
3317       afterwards  are  the result of the final call. For global changes, this
3318       will always be a no-match error. The contents of the ovector within the
3319       match data block may or may not have been changed.
3320
3321       As  well as the usual options for pcre2_match(), a number of additional
3322       options can be set in the options argument of pcre2_substitute().   One
3323       such  option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an external
3324       match_data block must be provided, and it must have already  been  used
3325       for an external call to pcre2_match() with the same pattern and subject
3326       arguments. The data in the match_data block (return code,  offset  vec-
3327       tor)  is  then  used  for  the  first  substitution  instead of calling
3328       pcre2_match() from within pcre2_substitute(). This allows  an  applica-
3329       tion to check for a match before choosing to substitute, without having
3330       to repeat the match.
3331
3332       The contents of the  externally  supplied  match  data  block  are  not
3333       changed   when   PCRE2_SUBSTITUTE_MATCHED   is  set.  If  PCRE2_SUBSTI-
3334       TUTE_GLOBAL is also set, pcre2_match() is called after the  first  sub-
3335       stitution  to  check for further matches, but this is done using an in-
3336       ternally obtained match data block, thus always  leaving  the  external
3337       block unchanged.
3338
3339       The  code  argument is not used for matching before the first substitu-
3340       tion when PCRE2_SUBSTITUTE_MATCHED is set, but  it  must  be  provided,
3341       even  when  PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in-
3342       formation such as the UTF setting and the number of capturing parenthe-
3343       ses in the pattern.
3344
3345       The  default  action  of  pcre2_substitute() is to return a copy of the
3346       subject string with matched substrings replaced. However, if PCRE2_SUB-
3347       STITUTE_REPLACEMENT_ONLY  is  set,  only the replacement substrings are
3348       returned. In the global case, multiple replacements are concatenated in
3349       the  output  buffer.  Substitution  callouts (see below) can be used to
3350       separate them if necessary.
3351
3352       The outlengthptr argument of pcre2_substitute() must point to  a  vari-
3353       able  that contains the length, in code units, of the output buffer. If
3354       the function is successful, the value is updated to contain the  length
3355       in  code  units  of the new string, excluding the trailing zero that is
3356       automatically added.
3357
3358       If the function is not successful, the value set via  outlengthptr  de-
3359       pends  on  the  type  of  error.  For  syntax errors in the replacement
3360       string, the value is the offset in the replacement string where the er-
3361       ror  was  detected.  For  other errors, the value is PCRE2_UNSET by de-
3362       fault. This includes the case of the output buffer being too small, un-
3363       less PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.
3364
3365       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH  changes  what happens when the output
3366       buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3367       ORY  immediately.  If  this  option is set, however, pcre2_substitute()
3368       continues to go through the motions of matching and substituting (with-
3369       out,  of course, writing anything) in order to compute the size of buf-
3370       fer that is needed. This value is  passed  back  via  the  outlengthptr
3371       variable,  with  the  result  of  the  function  still  being PCRE2_ER-
3372       ROR_NOMEMORY.
3373
3374       Passing a buffer size of zero is a permitted way  of  finding  out  how
3375       much  memory is needed for a given substitution. However, this does mean
3376       that the entire operation is carried out twice. Depending on the appli-
3377       cation,  it  may  be more efficient to allocate a large buffer and free
3378       the  excess  afterwards,  instead   of   using   PCRE2_SUBSTITUTE_OVER-
3379       FLOW_LENGTH.
3380
3381       The  replacement  string,  which  is interpreted as a UTF string in UTF
3382       mode, is checked for UTF validity unless PCRE2_NO_UTF_CHECK is set.  An
3383       invalid UTF replacement string causes an immediate return with the rel-
3384       evant UTF error code.
3385
3386       If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is  not  in-
3387       terpreted in any way. By default, however, a dollar character is an es-
3388       cape character that can specify the insertion of characters  from  cap-
3389       ture  groups  and names from (*MARK) or other control verbs in the pat-
3390       tern. The following forms are always recognized:
3391
3392         $$                  insert a dollar character
3393         $<n> or ${<n>}      insert the contents of group <n>
3394         $*MARK or ${*MARK}  insert a control verb name
3395
3396       Either a group number or a group name  can  be  given  for  <n>.  Curly
3397       brackets  are  required only if the following character would be inter-
3398       preted as part of the number or name. The number may be zero to include
3399       the  entire  matched  string.   For  example,  if  the pattern a(b)c is
3400       matched with "=abc=" and the replacement string "+$1$0$1+", the  result
3401       is "=+babcb+=".
3402
3403       $*MARK  inserts the name from the last encountered backtracking control
3404       verb on the matching path that has a name. (*MARK) must always  include
3405       a  name,  but  the  other  verbs  need not. For example, in the case of
3406       (*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B)
3407       the  relevant  name is "B". This facility can be used to perform simple
3408       simultaneous substitutions, as this pcre2test example shows:
3409
3410         /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
3411             apple lemon
3412          2: pear orange
3413
3414       PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
3415       string,  replacing every matching substring. If this option is not set,
3416       only the first matching substring is replaced. The search  for  matches
3417       takes  place in the original subject string (that is, previous replace-
3418       ments do not affect it).  Iteration is  implemented  by  advancing  the
3419       startoffset  value  for  each search, which is always passed the entire
3420       subject string. If an offset limit is set in the match context, search-
3421       ing stops when that limit is reached.
3422
3423       You  can  restrict  the effect of a global substitution to a portion of
3424       the subject string by setting either or both of startoffset and an off-
3425       set limit. Here is a pcre2test example:
3426
3427         /B/g,replace=!,use_offset_limit
3428         ABC ABC ABC ABC\=offset=3,offset_limit=12
3429          2: ABC A!C A!C ABC
3430
3431       When  continuing  with  global substitutions after matching a substring
3432       with zero length, an attempt to find a non-empty match at the same off-
3433       set is performed.  If this is not successful, the offset is advanced by
3434       one character except when CRLF is a valid newline sequence and the next
3435       two  characters are CR, LF. In this case, the offset is advanced by two
3436       characters.
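
       The advance rule can be written down as a small function. This is
       a sketch under stated assumptions rather than the library's code:
       advance_after_empty_match() is a hypothetical name, and one char-
       acter is taken to be one code unit (a non-UTF 8-bit subject).

```c
#include <assert.h>
#include <stddef.h>

/* Standalone model: where the next search starts after an empty match
 * at `offset`, when no non-empty match was found at that offset.
 * Assumes one character is one code unit (non-UTF 8-bit strings). */
static size_t advance_after_empty_match(const char *subject, size_t length,
                                        size_t offset, int crlf_is_newline)
{
    assert(offset < length);   /* there must be a character to step over */

    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;     /* step over CR LF as a single unit */

    return offset + 1;         /* otherwise advance by one character */
}
```

       In a UTF mode the single-character step would have to cover every
       code unit of the character, which this sketch does not attempt.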
3437
3438       PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that
3439       do not appear in the pattern to be treated as unset groups. This option
3440       should be used with care, because it means that a typo in a group  name
3441       or number no longer causes the PCRE2_ERROR_NOSUBSTRING error.
3442
3443       PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un-
3444       known groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be  treated
3445       as  empty  strings  when inserted as described above. If this option is
3446       not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN-
3447       SET  error.  This  option  does not influence the extended substitution
3448       syntax described below.
3449
3450       PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to  the
3451       replacement  string.  Without this option, only the dollar character is
3452       special, and only the group insertion forms  listed  above  are  valid.
3453       When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
3454
3455       Firstly,  backslash in a replacement string is interpreted as an escape
3456       character. The usual forms such as \n or \x{ddd} can be used to specify
3457       particular  character codes, and backslash followed by any non-alphanu-
3458       meric character quotes that character. Extended quoting  can  be  coded
3459       using \Q...\E, exactly as in pattern strings.
3460
3461       There  are  also four escape sequences for forcing the case of inserted
3462       letters.  The insertion mechanism has three states:  no  case  forcing,
3463       force upper case, and force lower case. The escape sequences change the
3464       current state: \U and \L change to upper or lower case forcing, respec-
3465       tively,  and  \E (when not terminating a \Q quoted sequence) reverts to
3466       no case forcing. The sequences \u and \l force the next  character  (if
3467       it  is  a  letter)  to  upper or lower case, respectively, and then the
3468       state automatically reverts to no case forcing. Case forcing applies to
3469       all  inserted  characters, including those from capture groups and let-
3470       ters within \Q...\E quoted sequences. If either PCRE2_UTF or  PCRE2_UCP
3471       was  set when the pattern was compiled, Unicode properties are used for
3472       case forcing characters whose code points are greater than 127.
3473
3474       Note that case forcing sequences such as \U...\E do not nest. For exam-
3475       ple,  the  result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
3476       \E has no effect. Note  also  that  the  PCRE2_ALT_BSUX  and  PCRE2_EX-
3477       TRA_ALT_BSUX options do not apply to replacement strings.
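
       The case forcing states can be illustrated  with  a  small  state
       machine.  The sketch below is a standalone, ASCII-only model, not
       the code used by pcre2_substitute(): force_case()  and  the  enum
       names are hypothetical, only \U, \L, \E, \u, and \l are handled,
       and \Q quoting and group insertion are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum force { FORCE_NONE, FORCE_UPPER, FORCE_LOWER, NEXT_UPPER, NEXT_LOWER };

/* Standalone ASCII-only model of the case forcing states. Handles \U,
 * \L, \E, \u and \l; any other escape is rejected. Returns 0 on
 * success, -1 on an unsupported escape or if `out` is too small. */
static int force_case(const char *in, char *out, size_t outsize)
{
    enum force state = FORCE_NONE;
    size_t n = 0;

    assert(outsize > 0);
    for (size_t i = 0; in[i] != '\0'; i++) {
        unsigned char c = (unsigned char)in[i];

        if (c == '\\') {
            switch (in[++i]) {
            case 'U': state = FORCE_UPPER; break;
            case 'L': state = FORCE_LOWER; break;
            case 'E': state = FORCE_NONE;  break;  /* no nesting: plain reset */
            case 'u': state = NEXT_UPPER;  break;
            case 'l': state = NEXT_LOWER;  break;
            default:  return -1;                   /* includes a trailing \ */
            }
            continue;
        }

        if (state == FORCE_UPPER || state == NEXT_UPPER)
            c = (unsigned char)toupper(c);
        else if (state == FORCE_LOWER || state == NEXT_LOWER)
            c = (unsigned char)tolower(c);
        if (state == NEXT_UPPER || state == NEXT_LOWER)
            state = FORCE_NONE;        /* one-shot forcing is used up */

        if (n + 1 >= outsize)
            return -1;                 /* keep room for the terminator */
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return 0;
}
```

       Because \E simply resets the state instead of popping a stack,
       the input "\Uaa\LBB\Ecc\E" produces "AAbbcc" in this model, which
       agrees with the non-nesting behaviour described above.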
3478
3479       The  second  effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
3480       flexibility to capture group substitution. The  syntax  is  similar  to
3481       that used by Bash:
3482
3483         ${<n>:-<string>}
3484         ${<n>:+<string1>:<string2>}
3485
3486       As  before,  <n> may be a group number or a name. The first form speci-
3487       fies a default value. If group <n> is set, its value  is  inserted;  if
3488       not,  <string>  is  expanded  and  the result inserted. The second form
3489       specifies strings that are expanded and inserted when group <n> is  set
3490       or  unset,  respectively. The first form is just a convenient shorthand
3491       for
3492
3493         ${<n>:+${<n>}:<string>}
3494
3495       Backslash can be used to escape colons and closing  curly  brackets  in
3496       the  replacement  strings.  A change of the case forcing state within a
3497       replacement string remains  in  force  afterwards,  as  shown  in  this
3498       pcre2test example:
3499
3500         /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
3501             body
3502          1: hello
3503             somebody
3504          1: HELLO
3505
3506       The  PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
3507       substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does  cause  un-
3508       known groups in the extended syntax forms to be treated as unset.
3509
3510       If  PCRE2_SUBSTITUTE_LITERAL  is  set,  PCRE2_SUBSTITUTE_UNKNOWN_UNSET,
3511       PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
3512       vant and are ignored.
3513
3514   Substitution errors
3515
3516       In  the  event of an error, pcre2_substitute() returns a negative error
3517       code. Except for PCRE2_ERROR_NOMATCH (which is never returned),  errors
3518       from pcre2_match() are passed straight back.
3519
3520       PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3521       tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
3522
3523       PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3524       ing  an  unknown  substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
3525       when the simple (non-extended) syntax is used and  PCRE2_SUBSTITUTE_UN-
3526       SET_EMPTY is not set.
3527
3528       PCRE2_ERROR_NOMEMORY  is  returned  if  the  output  buffer  is not big
3529       enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
3530       of  buffer  that is needed is returned via outlengthptr. Note that this
3531       does not happen by default.
3532
3533       PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the
3534       match_data  argument is NULL or if the subject or replacement arguments
3535       are NULL. For backward compatibility reasons an exception is  made  for
3536       the replacement argument if the rlength argument is also 0.
3537
3538       PCRE2_ERROR_BADREPLACEMENT  is  used for miscellaneous syntax errors in
3539       the replacement string, with more  particular  errors  being  PCRE2_ER-
3540       ROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE
3541       (closing curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION  (syntax
3542       error  in  extended group substitution), and PCRE2_ERROR_BADSUBSPATTERN
3543       (the pattern match ended before it started or the match started earlier
3544       than  the  current  position  in the subject, which can happen if \K is
3545       used in an assertion).
3546
3547       As for all PCRE2 errors, a text message that describes the error can be
3548       obtained  by  calling  the pcre2_get_error_message() function (see "Ob-
3549       taining a textual error message" above).
3550
3551   Substitution callouts
3552
3553       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
3554         int (*callout_function)(pcre2_substitute_callout_block *, void *),
3555         void *callout_data);
3556
3557       The  pcre2_set_substitute_callout()  function can be used to specify  a
3558       callout  function for pcre2_substitute(). This information is passed in
3559       a match context. The callout function is called after each substitution
3560       has been processed, but it can cause the replacement not to happen. The
3561       callout function is not called for simulated substitutions that  happen
3562       as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option.
3563
3564       The first argument of the callout function is a pointer to a substitute
3565       callout block structure, which contains the following fields, not  nec-
3566       essarily in this order:
3567
3568         uint32_t    version;
3569         uint32_t    subscount;
3570         PCRE2_SPTR  input;
3571         PCRE2_SPTR  output;
3572         PCRE2_SIZE *ovector;
3573         uint32_t    oveccount;
3574         PCRE2_SIZE  output_offsets[2];
3575
3576       The  version field contains the version number of the block format. The
3577       current version is 0. The version number will  increase  in  future  if
3578       more  fields are added, but the intention is never to remove any of the
3579       existing fields.
3580
3581       The subscount field is the number of the current match. It is 1 for the
3582       first callout, 2 for the second, and so on. The input and output point-
3583       ers are copies of the values passed to pcre2_substitute().
3584
3585       The ovector field points to the ovector, which contains the  result  of
3586       the most recent match. The oveccount field contains the number of pairs
3587       that are set in the ovector, and is always greater than zero.
3588
3589       The output_offsets vector contains the offsets of  the  replacement  in
3590       the  output  string. This has already been processed for dollar and (if
3591       requested) backslash substitutions as described above.
3592
3593       The second argument of the callout function  is  the  value  passed  as
3594       callout_data  when  the  function was registered. The value returned by
3595       the callout function is interpreted as follows:
3596
3597       If the value is zero, the replacement is accepted, and,  if  PCRE2_SUB-
3598       STITUTE_GLOBAL  is set, processing continues with a search for the next
3599       match. If the value is not zero, the current  replacement  is  not  ac-
3600       cepted.  If  the  value is greater than zero, processing continues when
3601       PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than  zero
3602       or  PCRE2_SUBSTITUTE_GLOBAL  is  not  set),  the  rest  of the input is
3603       copied to the output and the call to pcre2_substitute() exits,  return-
3604       ing the number of matches so far.
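
       The rules for the return value can be summarized as  a  decision
       function.  This is only a sketch of the description above: inter-
       pret_callout() and the enum names are hypothetical  and  are  not
       part of the PCRE2 API.

```c
#include <assert.h>

enum callout_action {
    ACCEPT_AND_CONTINUE,   /* keep replacement, search for the next match */
    ACCEPT_AND_FINISH,     /* keep replacement, no further matching       */
    REJECT_AND_CONTINUE,   /* drop replacement, search for the next match */
    REJECT_AND_FINISH      /* drop replacement, copy rest of input, exit  */
};

/* Model of how a substitute callout's return value `rc` is interpreted.
 * `global` is 1 if PCRE2_SUBSTITUTE_GLOBAL is set, 0 otherwise. */
static enum callout_action interpret_callout(int rc, int global)
{
    assert(global == 0 || global == 1);

    if (rc == 0)
        return global ? ACCEPT_AND_CONTINUE : ACCEPT_AND_FINISH;
    if (rc > 0 && global)
        return REJECT_AND_CONTINUE;
    return REJECT_AND_FINISH;   /* rc < 0, or a non-global substitution */
}
```

       A callout that wants to skip one particular match in a global
       substitution would therefore return a positive value, whereas  a
       negative value stops the whole operation early.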
3605
3606
3607DUPLICATE CAPTURE GROUP NAMES
3608
3609       int pcre2_substring_nametable_scan(const pcre2_code *code,
3610         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
3611
3612       When  a  pattern  is compiled with the PCRE2_DUPNAMES option, names for
3613       capture groups are not required to be unique. Duplicate names  are  al-
3614       ways  allowed for groups with the same number, created by using the (?|
3615       feature. Indeed, if such groups are named, they are required to use the
3616       same names.
3617
3618       Normally,  patterns  that  use duplicate names are such that in any one
3619       match, only one of each set of identically-named  groups  participates.
3620       An example is shown in the pcre2pattern documentation.
3621
3622       When   duplicates   are   present,   pcre2_substring_copy_byname()  and
3623       pcre2_substring_get_byname() return the first  substring  corresponding
3624       to the given name that is set. Only if none are set is
3625       PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
3626       function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there
3627       are duplicate names.
3628
3629       If you want to get full details of all captured substrings for a  given
3630       name,  you  must use the pcre2_substring_nametable_scan() function. The
3631       first argument is the compiled pattern, and the second is the name.  If
3632       the  third  and fourth arguments are NULL, the function returns a group
3633       number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
3634
3635       When the third and fourth arguments are not NULL, they must be pointers
3636       to  variables  that are updated by the function. After it has run, they
3637       point to the first and last entries in the name-to-number table for the
3638       given  name,  and the function returns the length of each entry in code
3639       units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there  are
3640       no entries for the given name.
3641
3642       The format of the name table is described above in the section entitled
3643       "Information about a pattern". Given all the relevant entries  for  the
3644       name,  you  can  extract  each of their numbers, and hence the captured
3645       data.
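
       Walking the entries between the two pointers might  look  like
       this.  It  is  a sketch that assumes the 8-bit library, where the
       first two bytes of each entry hold the group number, most signif-
       icant  byte first, followed by the zero-terminated name. The name
       collect_group_numbers() is hypothetical, and unsigned char point-
       ers stand in for PCRE2_SPTR so that the block needs no PCRE2
       header.

```c
#include <assert.h>
#include <stddef.h>

/* Walk the name table entries from `first` to `last` (inclusive), as
 * set by pcre2_substring_nametable_scan(), and collect group numbers.
 * Assumes the 8-bit layout: each entry is `entry_size` bytes, the first
 * two holding the group number, most significant byte first.
 * Returns the number of entries seen; stores at most `max` numbers. */
static int collect_group_numbers(const unsigned char *first,
                                 const unsigned char *last,
                                 int entry_size, int *numbers, int max)
{
    int count = 0;

    assert(entry_size > 2 && first <= last);
    for (const unsigned char *entry = first; entry <= last;
         entry += entry_size) {
        if (count < max)
            numbers[count] = (entry[0] << 8) | entry[1];
        count++;
    }
    return count;
}
```

       The entry_size argument is the value  returned  by  the  scan
       function.  Each collected number can then be passed to one of the
       functions that extract a substring by number.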
3646
3647
3648FINDING ALL POSSIBLE MATCHES AT ONE POSITION
3649
3650       The traditional matching function uses a  similar  algorithm  to  Perl,
3651       which  stops when it finds the first match at a given point in the sub-
3652       ject. If you want to find all possible matches, or the longest possible
3653       match  at  a  given  position,  consider using the alternative matching
3654       function (see below) instead. If you cannot use the  alternative  func-
3655       tion, you can kludge it up by making use of the callout facility, which
3656       is described in the pcre2callout documentation.
3657
3658       What you have to do is to insert a callout right at the end of the pat-
3659       tern.   When your callout function is called, extract and save the cur-
3660       rent matched substring. Then return 1, which  forces  pcre2_match()  to
3661       backtrack  and  try other alternatives. Ultimately, when it runs out of
3662       matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
3663
3664
3665MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
3666
3667       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
3668         PCRE2_SIZE length, PCRE2_SIZE startoffset,
3669         uint32_t options, pcre2_match_data *match_data,
3670         pcre2_match_context *mcontext,
3671         int *workspace, PCRE2_SIZE wscount);
3672
3673       The function pcre2_dfa_match() is called  to  match  a  subject  string
3674       against  a  compiled pattern, using a matching algorithm that scans the
3675       subject string just once (not counting lookaround assertions), and does
3676       not  backtrack (except when processing lookaround assertions). This has
3677       different characteristics to the normal algorithm, and is not  compati-
3678       ble  with  Perl.  Some  of  the features of PCRE2 patterns are not sup-
3679       ported. Nevertheless, there are times when this kind of matching can be
3680       useful.  For a discussion of the two matching algorithms, and a list of
3681       features that pcre2_dfa_match() does not support, see the pcre2matching
3682       documentation.
3683
3684       The  arguments  for  the pcre2_dfa_match() function are the same as for
3685       pcre2_match(), plus two extras. The ovector within the match data block
3686       is used in a different way, and this is described below. The other com-
3687       mon arguments are used in the same way as for pcre2_match(),  so  their
3688       description is not repeated here.
3689
3690       The  two  additional  arguments provide workspace for the function. The
3691       workspace vector should contain at least 20 elements. It  is  used  for
3692       keeping  track  of  multiple  paths  through  the  pattern  tree.  More
3693       workspace is needed for patterns and subjects where there are a lot  of
3694       potential matches.
3695
3696       Here is an example of a simple call to pcre2_dfa_match():
3697
3698         int wspace[20];
3699         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
3700         int rc = pcre2_dfa_match(
3701           re,             /* result of pcre2_compile() */
3702           "some string",  /* the subject string */
3703           11,             /* the length of the subject string */
3704           0,              /* start at offset 0 in the subject */
3705           0,              /* default options */
3706           md,             /* the match data block */
3707           NULL,           /* a match context; NULL means use defaults */
3708           wspace,         /* working space vector */
3709           20);            /* number of elements (NOT size in bytes) */
3710
3711   Option bits for pcre2_dfa_match()
3712
3713       The  unused  bits of the options argument for pcre2_dfa_match() must be
3714       zero.  The  only   bits   that   may   be   set   are   PCRE2_ANCHORED,
3715       PCRE2_COPY_MATCHED_SUBJECT,  PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
3716       TEOL,   PCRE2_NOTEMPTY,   PCRE2_NOTEMPTY_ATSTART,   PCRE2_NO_UTF_CHECK,
3717       PCRE2_PARTIAL_HARD,    PCRE2_PARTIAL_SOFT,    PCRE2_DFA_SHORTEST,   and
3718       PCRE2_DFA_RESTART. All but the last four of these are exactly the  same
3719       as for pcre2_match(), so their description is not repeated here.
3720
3721         PCRE2_PARTIAL_HARD
3722         PCRE2_PARTIAL_SOFT
3723
3724       These  have  the  same general effect as they do for pcre2_match(), but
3725       the details are slightly different. When PCRE2_PARTIAL_HARD is set  for
3726       pcre2_dfa_match(),  it  returns  PCRE2_ERROR_PARTIAL  if the end of the
3727       subject is reached and there is still at least one matching possibility
3728       that requires additional characters. This happens even if some complete
3729       matches have already been found. When PCRE2_PARTIAL_SOFT  is  set,  the
3730       return  code  PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
3731       if the end of the subject is  reached,  there  have  been  no  complete
3732       matches, but there is still at least one matching possibility. The por-
3733       tion of the string that was inspected when the  longest  partial  match
3734       was found is set as the first matching string in both cases. There is a
3735       more detailed discussion of partial and  multi-segment  matching,  with
3736       examples, in the pcre2partial documentation.
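
       The difference between the two options can  be  expressed  as  a
       decision  on  the  final  return  code. The function below is a
       standalone sketch with hypothetical names, not library code;  the
       two  constants  mirror  the  values  of PCRE2_ERROR_NOMATCH and
       PCRE2_ERROR_PARTIAL, and the case where the ovector is too  small
       is ignored.

```c
#include <assert.h>

/* Hypothetical names; the values mirror PCRE2_ERROR_NOMATCH (-1) and
 * PCRE2_ERROR_PARTIAL (-2). */
enum { MODEL_NOMATCH = -1, MODEL_PARTIAL = -2 };
enum partial_mode { PARTIAL_NONE, PARTIAL_SOFT, PARTIAL_HARD };

/* complete: number of complete matches found (>= 0).
 * pending:  nonzero if the end of the subject was reached while at
 *           least one possibility still needed more characters. */
static int dfa_result(int complete, int pending, enum partial_mode mode)
{
    assert(complete >= 0);

    if (mode == PARTIAL_HARD && pending)
        return MODEL_PARTIAL;          /* even if complete > 0 */
    if (mode == PARTIAL_SOFT && pending && complete == 0)
        return MODEL_PARTIAL;          /* NOMATCH becomes PARTIAL */
    return complete > 0 ? complete : MODEL_NOMATCH;
}
```

       With two complete matches and a pending possibility, the  hard
       option yields a partial match and the soft option yields 2.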
3737
3738         PCRE2_DFA_SHORTEST
3739
3740       Setting  the PCRE2_DFA_SHORTEST option causes the matching algorithm to
3741       stop as soon as it has found one match. Because of the way the alterna-
3742       tive  algorithm  works, this is necessarily the shortest possible match
3743       at the first possible matching point in the subject string.
3744
3745         PCRE2_DFA_RESTART
3746
3747       When pcre2_dfa_match() returns a partial match, it is possible to  call
3748       it again, with additional subject characters, and have it continue with
3749       the same match. The PCRE2_DFA_RESTART option requests this action; when
3750       it is set, the workspace and wscount arguments must reference the  same
3751       vector as before because data about the match so far is  left  in  them
3752       after a partial match. There is more discussion of this facility in the
3753       pcre2partial documentation.
3754
3755   Successful returns from pcre2_dfa_match()
3756
3757       When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3758       string in the subject. Note, however, that all the matches from one run
3759       of the function start at the same point in  the  subject.  The  shorter
3760       matches  are all initial substrings of the longer matches. For example,
3761       if the pattern
3762
3763         <.*>
3764
3765       is matched against the string
3766
3767         This is <something> <something else> <something further> no more
3768
3769       the three matched strings are
3770
3771         <something> <something else> <something further>
3772         <something> <something else>
3773         <something>
3774
3775       On success, the yield of the function is a number  greater  than  zero,
3776       which  is  the  number  of  matched substrings. The offsets of the sub-
3777       strings are returned in the ovector, and can be extracted by number  in
3778       the  same way as for pcre2_match(), but the numbers bear no relation to
3779       any capture groups that may exist in the pattern, because DFA  matching
3780       does not support capturing.
3781
3782       Calls  to the convenience functions that extract substrings by name re-
3783       turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
3784       ter  a  DFA match. The convenience functions that extract substrings by
3785       number never return PCRE2_ERROR_NOSUBSTRING.
3786
3787       The matched strings are stored in  the  ovector  in  reverse  order  of
3788       length;  that  is,  the longest matching string is first. If there were
3789       too many matches to fit into the ovector, the yield of the function  is
3790       zero, and the vector is filled with the longest matches.
3791
3792       NOTE:  PCRE2's  "auto-possessification" optimization usually applies to
3793       character repeats at the end of a pattern (as well as internally).  For
3794       example,  the pattern "a\d+" is compiled as if it were "a\d++". For DFA
3795       matching, this means that only one possible match is found. If you  re-
3796       ally do want multiple matches in such cases, either use an ungreedy re-
3797       peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when  com-
3798       piling.
3799
3800   Error returns from pcre2_dfa_match()
3801
3802       The pcre2_dfa_match() function returns a negative number when it fails.
3803       Many of the errors are the same  as  for  pcre2_match(),  as  described
3804       above.  There are in addition the following errors that are specific to
3805       pcre2_dfa_match():
3806
3807         PCRE2_ERROR_DFA_UITEM
3808
3809       This return is given if pcre2_dfa_match() encounters  an  item  in  the
3810       pattern  that it does not support, for instance, the use of \C in a UTF
3811       mode or a backreference.
3812
3813         PCRE2_ERROR_DFA_UCOND
3814
3815       This return is given if pcre2_dfa_match() encounters a  condition  item
3816       that uses a backreference for the condition, or a test for recursion in
3817       a specific capture group. These are not supported.
3818
3819         PCRE2_ERROR_DFA_UINVALID_UTF
3820
3821       This return is given if pcre2_dfa_match() is called for a pattern  that
3822       was  compiled  with  PCRE2_MATCH_INVALID_UTF. This is not supported for
3823       DFA matching.
3824
3825         PCRE2_ERROR_DFA_WSSIZE
3826
3827       This return is given if pcre2_dfa_match() runs  out  of  space  in  the
3828       workspace vector.
3829
3830         PCRE2_ERROR_DFA_RECURSE
3831
3832       When a recursion or subroutine call is processed, the matching function
3833       calls itself recursively, using private  memory  for  the  ovector  and
3834       workspace.   This  error  is given if the internal ovector is not large
3835       enough. This should be extremely rare, as a  vector  of  size  1000  is
3836       used.
3837
3838         PCRE2_ERROR_DFA_BADRESTART
3839
3840       When  pcre2_dfa_match()  is  called  with the PCRE2_DFA_RESTART option,
3841       some plausibility checks are made on the  contents  of  the  workspace,
3842       which  should  contain data about the previous partial match. If any of
3843       these checks fail, this error is given.
3844
3845
3846SEE ALSO
3847
3848       pcre2build(3),   pcre2callout(3),    pcre2demo(3),    pcre2matching(3),
3849       pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).
3850
3851
3852AUTHOR
3853
3854       Philip Hazel
3855       Retired from University Computing Service
3856       Cambridge, England.
3857
3858
3859REVISION
3860
3861       Last updated: 14 December 2021
3862       Copyright (c) 1997-2021 University of Cambridge.
3863------------------------------------------------------------------------------
3864
3865
3866PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)
3867
3868
3869
3870NAME
3871       PCRE2 - Perl-compatible regular expressions (revised API)
3872
3873BUILDING PCRE2
3874
3875       PCRE2  is distributed with a configure script that can be used to build
3876       the library in Unix-like environments using the applications  known  as
3877       Autotools. Also in the distribution are files to support building using
3878       CMake instead of configure. The text file README contains  general  in-
3879       formation  about building with Autotools (some of which is repeated be-
3880       low), and also has some comments about building  on  various  operating
3881       systems.  There  is a lot more information about building PCRE2 without
3882       using Autotools (including information about using CMake  and  building
3883       "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
3884       consult this file as well as the README file if you are building  in  a
3885       non-Unix-like environment.
3886
3887
3888PCRE2 BUILD-TIME OPTIONS
3889
3890       The rest of this document describes the optional features of PCRE2 that
3891       can be selected when the library is compiled. It  assumes  use  of  the
3892       configure  script,  where  the  optional features are selected or dese-
3893       lected by providing options to configure before running the  make  com-
3894       mand.  However,  the same options can be selected in both Unix-like and
3895       non-Unix-like environments if you are using CMake instead of  configure
3896       to build PCRE2.
3897
3898       If  you  are not using Autotools or CMake, option selection can be done
3899       by editing the config.h file, or by passing parameter settings  to  the
3900       compiler, as described in NON-AUTOTOOLS-BUILD.
3901
3902       The complete list of options for configure (which includes the standard
3903       ones such as the selection of the installation directory)  can  be  ob-
3904       tained by running
3905
3906         ./configure --help
3907
3908       The  following  sections include descriptions of "on/off" options whose
3909       names begin with --enable or --disable. Because of the way that config-
3910       ure  works, --enable and --disable always come in pairs, so the comple-
3911       mentary option always exists as well, but as it specifies the  default,
3912       it is not described.  Options that specify values have names that start
3913       with --with. At the end of a configure run, a summary of the configura-
3914       tion is output.
3915
3916
3917BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
3918
3919       By  default, a library called libpcre2-8 is built, containing functions
3920       that take string arguments contained in arrays  of  bytes,  interpreted
3921       either  as single-byte characters, or UTF-8 strings. You can also build
3922       two other libraries, called libpcre2-16 and libpcre2-32, which  process
3923       strings  that  are contained in arrays of 16-bit and 32-bit code units,
3924       respectively. These can be interpreted either as single-unit characters
3925       or  UTF-16/UTF-32 strings. To build these additional libraries, add one
3926       or both of the following to the configure command:
3927
3928         --enable-pcre2-16
3929         --enable-pcre2-32
3930
3931       If you do not want the 8-bit library, add
3932
3933         --disable-pcre2-8
3934
3935       as well. At least one of the three libraries must be built.  Note  that
3936       the  POSIX wrapper is for the 8-bit library only, and that pcre2grep is
3937       an 8-bit program. Neither of these is built if  you  select  only   the
3938       16-bit or 32-bit libraries.
3939
3940
3941BUILDING SHARED AND STATIC LIBRARIES
3942
3943       The  Autotools PCRE2 building process uses libtool to build both shared
3944       and static libraries by default. You can suppress an  unwanted  library
3945       by adding one of
3946
3947         --disable-shared
3948         --disable-static
3949
3950       to the configure command.
3951
3952
3953UNICODE AND UTF SUPPORT
3954
3955       By  default,  PCRE2 is built with support for Unicode and UTF character
3956       strings.  To build it without Unicode support, add
3957
3958         --disable-unicode
3959
3960       to the configure command. This setting applies to all three  libraries.
3961       It  is  not  possible to build one library with Unicode support and an-
3962       other without in the same configuration.
3963
3964       Of itself, Unicode support does not make PCRE2 treat strings as  UTF-8,
3965       UTF-16 or UTF-32. To do that, applications that use the library can set
3966       the PCRE2_UTF option when they call pcre2_compile() to compile  a  pat-
3967       tern.   Alternatively,  patterns  may be started with (*UTF) unless the
3968       application has locked this out by setting PCRE2_NEVER_UTF.
3969
3970       UTF support allows the libraries to process character code points up to
3971       0x10ffff  in  the  strings that they handle. Unicode support also gives
3972       access to the Unicode properties of characters, using  pattern  escapes
3973       such as \P, \p, and \X. Only the general category properties such as Lu
3974       and Nd, script names, and some bi-directional properties are supported.
3975       Details are given in the pcre2pattern documentation.
3976
3977       Pattern escapes such as \d and \w do not by default make use of Unicode
3978       properties. The application can request that they  do  by  setting  the
3979       PCRE2_UCP  option.  Unless  the  application has set PCRE2_NEVER_UCP, a
3980       pattern may also request this by starting with (*UCP).
3981
3982
3983DISABLING THE USE OF \C
3984
3985       The \C escape sequence, which matches a single code unit, even in a UTF
3986       mode,  can  cause unpredictable behaviour because it may leave the cur-
3987       rent matching point in the middle of a multi-code-unit  character.  The
3988       application  can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op-
3989       tion when calling pcre2_compile(). There is also a build-time option
3990
3991         --enable-never-backslash-C
3992
3993       (note the upper case C) which locks out the use of \C entirely.
3994
3995
3996JUST-IN-TIME COMPILER SUPPORT
3997
3998       Just-in-time (JIT) compiler support is included in the build by  speci-
3999       fying
4000
4001         --enable-jit
4002
4003       This  support  is available only for certain hardware architectures. If
4004       this option is set for an unsupported architecture,  a  building  error
4005       occurs.  If in doubt, use
4006
4007         --enable-jit=auto
4008
4009       which  enables  JIT  only if the current hardware is supported. You can
4010       check if JIT is enabled in the configuration summary that is output  at
4011       the  end  of a configure run. If you are enabling JIT under SELinux you
4012       may also want to add
4013
4014         --enable-jit-sealloc
4015
4016       which enables the use of an execmem allocator in JIT that is compatible
4017       with  SELinux.  This  has  no  effect  if  JIT  is not enabled. See the
4018       pcre2jit documentation for a discussion of JIT usage. When JIT  support
4019       is enabled, pcre2grep automatically makes use of it, unless you add
4020
4021         --disable-pcre2grep-jit
4022
4023       to the configure command.
4024
4025
4026NEWLINE RECOGNITION
4027
4028       By  default, PCRE2 interprets the linefeed (LF) character as indicating
4029       the end of a line. This is the normal newline  character  on  Unix-like
4030       systems.  You can compile PCRE2 to use carriage return (CR) instead, by
4031       adding
4032
4033         --enable-newline-is-cr
4034
4035       to the configure command. There is also an  --enable-newline-is-lf  op-
4036       tion, which explicitly specifies linefeed as the newline character.
4037
4038       Alternatively, you can specify that line endings are to be indicated by
4039       the two-character sequence CRLF (CR immediately followed by LF). If you
4040       want this, add
4041
4042         --enable-newline-is-crlf
4043
4044       to the configure command. There is a fourth option, specified by
4045
4046         --enable-newline-is-anycrlf
4047
4048       which  causes  PCRE2 to recognize any of the three sequences CR, LF, or
4049       CRLF as indicating a line ending. A fifth option, specified by
4050
4051         --enable-newline-is-any
4052
4053       causes PCRE2 to recognize any Unicode  newline  sequence.  The  Unicode
4054       newline sequences are the three just mentioned, plus the single charac-
4055       ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
4056       U+0085),  LS  (line  separator,  U+2028),  and PS (paragraph separator,
4057       U+2029). The final option is
4058
4059         --enable-newline-is-nul
4060
4061       which causes NUL (binary zero) to be set  as  the  default  line-ending
4062       character.
4063
4064       Whatever default line ending convention is selected when PCRE2 is built
4065       can be overridden by applications that use the library. At  build  time
4066       it is recommended to use the standard for your operating system.
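
       As a rough illustration of the "anycrlf" convention, the C sketch below
       reports the length of a line ending at a given position. The function
       toy_anycrlf_len() is a hypothetical helper written for this example; it
       is not part of the PCRE2 API and does not show how PCRE2 itself scans
       for newlines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function: returns the length (0, 1, or 2)
   of the line ending that starts at subject[pos] under the "anycrlf"
   convention, where CR, LF, and CRLF are all recognized. */
size_t toy_anycrlf_len(const char *subject, size_t length, size_t pos)
{
    assert(subject != NULL);
    if (pos >= length) return 0;               /* nothing left to inspect */
    if (subject[pos] == '\n') return 1;        /* LF on its own */
    if (subject[pos] == '\r')                  /* CR, possibly followed by LF */
        return (pos + 1 < length && subject[pos + 1] == '\n') ? 2 : 1;
    return 0;                                  /* not a line ending */
}
```

       Under the plain CR or LF conventions only one of the single-character
       cases would apply, and under CRLF only the two-character case.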
4067
4068
4069WHAT \R MATCHES
4070
4071       By  default,  the  sequence \R in a pattern matches any Unicode newline
4072       sequence, independently of what has been selected as  the  line  ending
4073       sequence. If you specify
4074
4075         --enable-bsr-anycrlf
4076
4077       the  default  is changed so that \R matches only CR, LF, or CRLF. What-
4078       ever is selected when PCRE2 is built can be overridden by  applications
4079       that use the library.
4080
4081
4082HANDLING VERY LARGE PATTERNS
4083
4084       Within  a  compiled  pattern,  offset values are used to point from one
4085       part to another (for example, from an opening parenthesis to an  alter-
4086       nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
4087       two-byte values are used for these offsets, leading to a  maximum  size
4088       for a compiled pattern of around 64 thousand code units. This is suffi-
4089       cient to handle all but the most gigantic patterns. Nevertheless,  some
4090       people do want to process truly enormous patterns, so it is possible to
4091       compile PCRE2 to use three-byte or four-byte offsets by adding  a  set-
4092       ting such as
4093
4094         --with-link-size=3
4095
4096       to  the  configure command. The value given must be 2, 3, or 4. For the
4097       16-bit library, a value of 3 is rounded up to 4.  In  these  libraries,
4098       using  longer  offsets slows down the operation of PCRE2 because it has
4099       to load additional data when handling them. For the 32-bit library  the
4100       value  is  always 4 and cannot be overridden; the value of --with-link-
4101       size is ignored.
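
       The rules above can be summarized in a few lines of C. This is only a
       sketch: toy_effective_link_size() and toy_offset_values() are
       hypothetical names, and the second function gives the number of
       distinct values an offset can hold, which is an upper bound rather
       than the exact pattern size limit that PCRE2 enforces.

```c
#include <assert.h>

/* Hypothetical helper: the link size that takes effect for a library with
   the given code unit width (8, 16, or 32) when --with-link-size=requested
   is used. Returns 0 for a value outside the permitted set 2, 3, 4. */
int toy_effective_link_size(int width, int requested)
{
    assert(width == 8 || width == 16 || width == 32);
    if (width == 32) return 4;                    /* always 4; setting ignored */
    if (requested < 2 || requested > 4) return 0; /* must be 2, 3, or 4 */
    if (width == 16 && requested == 3) return 4;  /* 3 is rounded up to 4 */
    return requested;
}

/* Number of distinct values that an offset of the given byte size can hold. */
unsigned long long toy_offset_values(int bytes)
{
    assert(bytes >= 1 && bytes <= 4);
    return 1ULL << (8 * bytes);
}
```

       With two-byte offsets, toy_offset_values(2) gives 65536, which is the
       "around 64 thousand" figure mentioned above.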
4102
4103
4104LIMITING PCRE2 RESOURCE USAGE
4105
4106       The pcre2_match() function increments a counter each time it goes round
4107       its  main  loop. Putting a limit on this counter controls the amount of
4108       computing resource used by a single call to  pcre2_match().  The  limit
4109       can be changed at run time, as described in the pcre2api documentation.
4110       The default is 10 million, but this can be changed by adding a  setting
4111       such as
4112
4113         --with-match-limit=500000
4114
4115       to   the   configure   command.   This  setting  also  applies  to  the
4116       pcre2_dfa_match() matching function, and to JIT  matching  (though  the
4117       counting is done differently).
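
       To show the general idea of a match limit, here is a toy backtracking
       matcher for patterns containing literal characters and "*" (any run of
       characters). It is not PCRE2 code: the names toy_glob and toy_budget
       are invented for this sketch, and it uses C recursion for brevity,
       which pcre2_match() does not. What it shares with the description
       above is a counter that is incremented for each unit of work and
       checked against a limit.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_LIMIT = -1, TOY_NOMATCH = 0, TOY_MATCH = 1 };

typedef struct {
    unsigned long steps;        /* work done so far */
    unsigned long step_limit;   /* plays the role of a match limit */
} toy_budget;

/* Matches all of s against p, giving up once the step limit is exceeded. */
int toy_glob(const char *p, const char *s, toy_budget *b)
{
    assert(p != NULL && s != NULL && b != NULL);
    if (++b->steps > b->step_limit) return TOY_LIMIT;
    if (*p == '\0') return (*s == '\0') ? TOY_MATCH : TOY_NOMATCH;
    if (*p == '*') {
        /* Try every possible length for the star, shortest first. */
        for (;;) {
            int rc = toy_glob(p + 1, s, b);
            if (rc != TOY_NOMATCH) return rc;   /* match, or limit reached */
            if (*s == '\0') return TOY_NOMATCH; /* nothing more to consume */
            s++;
        }
    }
    if (*s != '\0' && *p == *s) return toy_glob(p + 1, s + 1, b);
    return TOY_NOMATCH;
}
```

       A pattern such as "*a*a*a*b" applied to a long run of "a" characters
       backtracks heavily before failing. With a small step_limit the call
       returns TOY_LIMIT instead of running to completion, which is the kind
       of protection that a match limit is intended to give.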
4118
4119       The  pcre2_match() function starts out using a 20KiB vector on the sys-
4120       tem stack to record backtracking points. The more  nested  backtracking
4121       points there are (that is, the deeper the search tree), the more memory
4122       is needed. If the initial vector is not large enough,  heap  memory  is
4123       used,  up to a certain limit, which is specified in kibibytes (units of
4124       1024 bytes). The limit can be changed at run time, as described in  the
4125       pcre2api documentation. The default limit (in effect unlimited) is 20
4126       million KiB. You can change this by a setting such as
4127
4128         --with-heap-limit=500
4129
4130       which limits the amount of heap to 500 KiB. This limit applies only  to
4131       interpretive matching in pcre2_match() and pcre2_dfa_match(), which may
4132       also use the heap for internal workspace  when  processing  complicated
4133       patterns.  This limit does not apply when JIT (which has its own memory
4134       arrangements) is used.
4135
4136       You can also explicitly limit the depth of nested backtracking  in  the
4137       pcre2_match() interpreter. This limit defaults to the value that is set
4138       for --with-match-limit. You can set a lower default  limit  by  adding,
4139       for example,
4140
4141         --with-match-limit-depth=10000
4142
4143       to  the  configure  command.  This value can be overridden at run time.
4144       This depth limit indirectly limits the amount of heap  memory  that  is
4145       used,  but because the size of each backtracking "frame" depends on the
4146       number of capturing parentheses in a pattern, the amount of  heap  that
4147       is  used  before  the  limit is reached varies from pattern to pattern.
4148       This limit was more useful in versions before 10.30, where function re-
4149       cursion was used for backtracking.
4150
4151       As well as applying to pcre2_match(), the depth limit also controls the
4152       depth of recursive function calls in pcre2_dfa_match(). These are  used
4153       for  lookaround  assertions,  atomic  groups, and recursion within pat-
4154       terns.  The limit does not apply to JIT matching.
4155
4156
4157CREATING CHARACTER TABLES AT BUILD TIME
4158
4159       PCRE2 uses fixed tables for processing characters whose code points are
4160       less than 256. By default, PCRE2 is built with a set of tables that are
4161       distributed in the file src/pcre2_chartables.c.dist. These  tables  are
4162       for ASCII codes only. If you add
4163
4164         --enable-rebuild-chartables
4165
4166       to  the  configure  command, the distributed tables are no longer used.
4167       Instead, a program called pcre2_dftables is compiled and run. This
4168       outputs the source for a new set of tables, created in the default locale
4169       of your C run-time system. This method of replacing the tables does not
4170       work if you are cross compiling, because pcre2_dftables needs to be run on
4171       the local host and therefore must not be built with the cross compiler.
4172
4173       If you need to create alternative tables when cross compiling, you will
4174       have  to  do so "by hand". There may also be other reasons for creating
4175       tables manually.  To cause pcre2_dftables to  be  built  on  the  local
4176       host, run a normal compiling command, and then run the program with the
4177       output file as its argument, for example:
4178
4179         cc src/pcre2_dftables.c -o pcre2_dftables
4180         ./pcre2_dftables src/pcre2_chartables.c
4181
4182       This builds the tables in the default locale of the local host. If  you
4183       want to specify a locale, you must use the -L option:
4184
4185         LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
4186
4187       You can also specify -b (with or without -L). This causes the tables to
4188       be written in binary instead of as source code. A set of binary  tables
4189       can  be  loaded  into memory by an application and passed to pcre2_com-
4190       pile() in the same way as tables created by calling pcre2_maketables().
4191       The  tables are just a string of bytes, independent of hardware charac-
4192       teristics such as endianness. This means they can be  bundled  with  an
4193       application  that  runs in different environments, to ensure consistent
4194       behaviour.
4195
4196
4197USING EBCDIC CODE
4198
4199       PCRE2 assumes by default that it will run in an environment  where  the
4200       character  code is ASCII or Unicode, which is a superset of ASCII. This
4201       is the case for most computer operating systems. PCRE2 can, however, be
4202       compiled to run in an 8-bit EBCDIC environment by adding
4203
4204         --enable-ebcdic --disable-unicode
4205
4206       to the configure command. This setting implies --enable-rebuild-charta-
4207       bles. You should only use it if you know that you are in an EBCDIC  en-
4208       vironment (for example, an IBM mainframe operating system).
4209
4210       It  is  not possible to support both EBCDIC and UTF-8 codes in the same
4211       version of the library. Consequently,  --enable-unicode  and  --enable-
4212       ebcdic are mutually exclusive.
4213
4214       The EBCDIC character that corresponds to an ASCII LF is assumed to have
4215       the value 0x15 by default. However, in some EBCDIC  environments,  0x25
4216       is used. In such an environment you should use
4217
4218         --enable-ebcdic-nl25
4219
4220       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
4221       has the same value as in ASCII, namely, 0x0d.  Whichever  of  0x15  and
4222       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
4223       acter (which, in Unicode, is 0x85).
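
       The assignment of the three code values can be written down as a small
       C sketch. The structure and function names are hypothetical; only the
       values 0x0d, 0x15, and 0x25 come from the description above.

```c
#include <assert.h>

/* Code values for the newline-related characters in an EBCDIC build. */
typedef struct {
    unsigned char cr;
    unsigned char lf;
    unsigned char nel;
} toy_ebcdic_newlines;

/* nl25 is 1 if --enable-ebcdic-nl25 was given, otherwise 0. */
toy_ebcdic_newlines toy_ebcdic_codes(int nl25)
{
    toy_ebcdic_newlines codes;
    assert(nl25 == 0 || nl25 == 1);
    codes.cr  = 0x0d;                  /* same value as in ASCII */
    codes.lf  = nl25 ? 0x25 : 0x15;    /* the value chosen as LF */
    codes.nel = nl25 ? 0x15 : 0x25;    /* the other value becomes NEL */
    return codes;
}
```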
4224
4225       The options that select newline behaviour, such as --enable-newline-is-
4226       cr, and equivalent run-time options, refer to these character values in
4227       an EBCDIC environment.
4228
4229
4230PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS
4231
4232       By default pcre2grep supports the use of callouts with string arguments
4233       within  the patterns it is matching. There are two kinds: one that gen-
4234       erates output using local code, and another that calls an external pro-
4235       gram  or  script.   If --disable-pcre2grep-callout-fork is added to the
4236       configure command, only the first kind  of  callout  is  supported;  if
4237       --disable-pcre2grep-callout  is  used,  all callouts are completely ig-
4238       nored. For more details of pcre2grep callouts, see the pcre2grep  docu-
4239       mentation.
4240
4241
4242PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
4243
4244       By  default,  pcre2grep reads all files as plain text. You can build it
4245       so that it recognizes files whose names end in .gz or .bz2,  and  reads
4246       them with libz or libbz2, respectively, by adding one or both of
4247
4248         --enable-pcre2grep-libz
4249         --enable-pcre2grep-libbz2
4250
4251       to the configure command. These options naturally require that the rel-
4252       evant libraries are installed on your system. Configuration  will  fail
4253       if they are not.
4254
4255
4256PCRE2GREP BUFFER SIZE
4257
4258       pcre2grep  uses an internal buffer to hold a "window" on the file it is
4259       scanning, in order to be able to output "before" and "after" lines when
4260       it finds a match. The default starting size of the buffer is 20KiB. The
4261       buffer itself is three times this size, but because of the  way  it  is
4262       used for holding "before" lines, the longest line that is guaranteed to
4263       be processable is the notional buffer size. If a longer line is encoun-
4264       tered,  pcre2grep  automatically  expands the buffer, up to a specified
4265       maximum size, whose default is 1MiB or the starting size, whichever  is
4266       the  larger. You can change the default parameter values by adding, for
4267       example,
4268
4269         --with-pcre2grep-bufsize=51200
4270         --with-pcre2grep-max-bufsize=2097152
4271
4272       to the configure command. The caller of pcre2grep  can  override  these
4273       values  by  using  --buffer-size  and  --max-buffer-size on the command
4274       line.
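
       The relationship between the notional buffer size, the memory that is
       allocated, and the default maximum can be sketched in C as follows.
       The function names are hypothetical and the code is not taken from
       pcre2grep; it only restates the arithmetic described above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ONE_MIB ((size_t)1024 * 1024)

/* Memory allocated for a given notional buffer size: three times as much. */
size_t toy_allocated_bytes(size_t bufsize)
{
    assert(bufsize <= (size_t)-1 / 3);   /* guard against overflow */
    return 3 * bufsize;
}

/* Default maximum size: 1MiB or the starting size, whichever is larger. */
size_t toy_default_max(size_t start_bufsize)
{
    return (start_bufsize > TOY_ONE_MIB) ? start_bufsize : TOY_ONE_MIB;
}
```

       For the default starting size of 20KiB (20480 bytes) this gives an
       allocation of 61440 bytes, and for --with-pcre2grep-bufsize=51200 an
       allocation of 153600 bytes.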
4275
4276
4277PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
4278
4279       If you add one of
4280
4281         --enable-pcre2test-libreadline
4282         --enable-pcre2test-libedit
4283
4284       to the configure command, pcre2test is linked with the libreadline or
4285       libedit  library,  respectively, and when its input is from a terminal,
4286       it reads it using the readline() function. This  provides  line-editing
4287       and  history  facilities.  Note that libreadline is GPL-licensed, so if
4288       you distribute a binary of pcre2test linked in this way, there  may  be
4289       licensing issues. These can be avoided by linking instead with libedit,
4290       which has a BSD licence.
4291
4292       Setting --enable-pcre2test-libreadline causes the -lreadline option  to
4293       be  added to the pcre2test build. In many operating environments with a
4294       system-installed readline library this is sufficient. However, in some
4295       environments (e.g. if an unmodified distribution version of readline is
4296       in use), some extra configuration may be necessary.  The  INSTALL  file
4297       for libreadline says this:
4298
4299         "Readline uses the termcap functions, but does not link with
4300         the termcap or curses library itself, allowing applications
4301         which link with readline the to choose an appropriate library."
4302
4303       If  your environment has not been set up so that an appropriate library
4304       is automatically included, you may need to add something like
4305
4306         LIBS="-lncurses"
4307
4308       immediately before the configure command.
4309
4310
4311INCLUDING DEBUGGING CODE
4312
4313       If you add
4314
4315         --enable-debug
4316
4317       to the configure command, additional debugging code is included in  the
4318       build. This feature is intended for use by the PCRE2 maintainers.
4319
4320
4321DEBUGGING WITH VALGRIND SUPPORT
4322
4323       If you add
4324
4325         --enable-valgrind
4326
4327       to  the  configure command, PCRE2 will use valgrind annotations to mark
4328       certain memory regions as unaddressable. This allows it to  detect  in-
4329       valid memory accesses, and is mostly useful for debugging PCRE2 itself.
4330
4331
4332CODE COVERAGE REPORTING
4333
4334       If  your  C  compiler is gcc, you can build a version of PCRE2 that can
4335       generate a code coverage report for its test suite. To enable this, you
4336       must install lcov version 1.6 or above. Then specify
4337
4338         --enable-coverage
4339
4340       to the configure command and build PCRE2 in the usual way.
4341
4342       Note that using ccache (a caching C compiler) is incompatible with code
4343       coverage reporting. If you have configured ccache to run  automatically
4344       on your system, you must set the environment variable
4345
4346         CCACHE_DISABLE=1
4347
4348       before running make to build PCRE2, so that ccache is not used.
4349
4350       When --enable-coverage is used, the following additional targets are
4351       added to the Makefile:
4352
4353         make coverage
4354
4355       This creates a fresh coverage report for the PCRE2 test  suite.  It  is
4356       equivalent  to running "make coverage-reset", "make coverage-baseline",
4357       "make check", and then "make coverage-report".
4358
4359         make coverage-reset
4360
4361       This zeroes the coverage counters, but does nothing else.
4362
4363         make coverage-baseline
4364
4365       This captures baseline coverage information.
4366
4367         make coverage-report
4368
4369       This creates the coverage report.
4370
4371         make coverage-clean-report
4372
4373       This removes the generated coverage report without cleaning the  cover-
4374       age data itself.
4375
4376         make coverage-clean-data
4377
4378       This  removes  the captured coverage data without removing the coverage
4379       files created at compile time (*.gcno).
4380
4381         make coverage-clean
4382
4383       This cleans all coverage data including the generated coverage  report.
4384       For  more  information about code coverage, see the gcov and lcov docu-
4385       mentation.
4386
4387
4388DISABLING THE Z AND T FORMATTING MODIFIERS
4389
4390       The C99 standard defines formatting modifiers z and t  for  size_t  and
4391       ptrdiff_t  values, respectively. By default, PCRE2 uses these modifiers
4392       in environments other than old versions of Microsoft Visual Studio when
4393       __STDC_VERSION__  is  defined  and has a value greater than or equal to
4394       199901L (indicating support for C99).  However, there is at  least  one
4395       environment that claims to be C99 but does not support these modifiers.
4396       If
4397
4398         --disable-percent-zt
4399
4400       is specified, no use is made of the z or t modifiers. Instead of %td or
4401       %zu,  a  suitable  format is used depending on the size of long for the
4402       platform.
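
       The effect of avoiding the z modifier can be illustrated in C. This is
       a sketch, not PCRE2 source: TOY_NO_PERCENT_ZT is a hypothetical macro
       standing in for whatever the build defines, and the fallback branch
       assumes that the value fits in an unsigned long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Formats a size_t value as decimal text in buf. Returns the number of
   characters that the complete text needs, as snprintf() does. */
int toy_format_size(char *buf, size_t bufsize, size_t value)
{
    assert(buf != NULL);
#ifdef TOY_NO_PERCENT_ZT
    /* No z modifier: convert to a type that has a portable format. */
    return snprintf(buf, bufsize, "%lu", (unsigned long)value);
#else
    /* C99 z modifier for size_t. */
    return snprintf(buf, bufsize, "%zu", value);
#endif
}
```

       On a platform where size_t is wider than unsigned long, the fallback
       would need a wider type, which is why the format chosen depends on the
       size of long.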
4403
4404
4405SUPPORT FOR FUZZERS
4406
4407       There is a special option for use by people who  want  to  run  fuzzing
4408       tests on PCRE2:
4409
4410         --enable-fuzz-support
4411
4412       At present this applies only to the 8-bit library. If set, it causes an
4413       extra library called libpcre2-fuzzsupport.a to be built,  but  not  in-
4414       stalled.  This  contains  a single function called LLVMFuzzerTestOneIn-
4415       put() whose arguments are a pointer to a string and the length  of  the
4416       string.  When  called,  this  function tries to compile the string as a
4417       pattern, and if that succeeds, to match it.  This is done both with  no
4418       options  and  with some random option bits that are generated from the
4419       string.
4420
4421       Setting --enable-fuzz-support also causes  a  binary  called  pcre2fuz-
4422       zcheck  to be created. This is normally run under valgrind or used when
4423       PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing
4424       function  and  outputs  information  about  what it is doing. The input
4425       strings are specified by arguments: if an argument starts with "="  the
4426       rest  of it is a literal input string. Otherwise, it is assumed to be a
4427       file name, and the contents of the file are the test string.
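
       The argument convention just described is simple enough to show in a
       few lines of C. The function toy_literal_input() is a hypothetical
       helper for this sketch and is not taken from pcre2fuzzcheck.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the literal input string if arg starts with "=", or NULL if arg
   should instead be treated as the name of a file holding the input. */
const char *toy_literal_input(const char *arg)
{
    assert(arg != NULL);
    return (arg[0] == '=') ? arg + 1 : NULL;
}
```

       An argument consisting of "=" alone yields an empty literal string
       under this reading of the rule.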
4428
4429
4430OBSOLETE OPTION
4431
4432       In versions of PCRE2 prior to 10.30, there were two  ways  of  handling
4433       backtracking  in the pcre2_match() function. The default was to use the
4434       system stack, but if
4435
4436         --disable-stack-for-recursion
4437
4438       was set, memory on the heap was used. From release 10.30  onwards  this
4439       has  changed  (the  stack  is  no longer used) and this option now does
4440       nothing except give a warning.
4441
4442
4443SEE ALSO
4444
4445       pcre2api(3), pcre2-config(3).
4446
4447
4448AUTHOR
4449
4450       Philip Hazel
4451       University Computing Service
4452       Cambridge, England.
4453
4454
4455REVISION
4456
4457       Last updated: 08 December 2021
4458       Copyright (c) 1997-2021 University of Cambridge.
4459------------------------------------------------------------------------------
4460
4461
4462PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)
4463
4464
4465
4466NAME
4467       PCRE2 - Perl-compatible regular expressions (revised API)
4468
4469SYNOPSIS
4470
4471       #include <pcre2.h>
4472
4473       int (*pcre2_callout)(pcre2_callout_block *, void *);
4474
4475       int pcre2_callout_enumerate(const pcre2_code *code,
4476         int (*callback)(pcre2_callout_enumerate_block *, void *),
4477         void *user_data);
4478
4479
4480DESCRIPTION
4481
4482       PCRE2  provides  a feature called "callout", which is a means of tempo-
4483       rarily passing control to the caller of PCRE2 in the middle of  pattern
4484       matching.  The caller of PCRE2 provides an external function by putting
4485       its entry point in a match  context  (see  pcre2_set_callout()  in  the
4486       pcre2api documentation).
4487
4488       When  using the pcre2_substitute() function, an additional callout fea-
4489       ture is available. This does a callout after each change to the subject
4490       string and is described in the pcre2api documentation; the rest of this
4491       document is concerned with callouts during pattern matching.
4492
4493       Within a regular expression, (?C<arg>) indicates a point at  which  the
4494       external  function  is  to  be  called. Different callout points can be
4495       identified by putting a number less than 256 after the  letter  C.  The
4496       default  value is zero.  Alternatively, the argument may be a delimited
4497       string. The starting delimiter must be one of ` ' " ^ % # $ {  and  the
4498       ending delimiter is the same as the start, except for {, where the end-
4499       ing delimiter is }. If  the  ending  delimiter  is  needed  within  the
4500       string,  it  must be doubled. For example, this pattern has two callout
4501       points:
4502
4503         (?C1)abc(?C"some ""arbitrary"" text")def
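
       The delimiter rules can be made concrete with a short C sketch that
       extracts a callout string argument from pattern text. The function
       toy_callout_string() is hypothetical and is not how pcre2_compile()
       is implemented; it only applies the rules stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parser, not PCRE2 code. p points at the opening delimiter of
   a callout string argument. The argument text, with doubled ending
   delimiters reduced to single ones, is copied to out as a C string.
   Returns the number of characters copied, or -1 on error. */
long toy_callout_string(const char *p, char *out, size_t outsize)
{
    char open, close;
    size_t n = 0;

    assert(p != NULL && out != NULL);
    open = *p;
    if (outsize == 0 || open == '\0' || strchr("`'\"^%#${", open) == NULL)
        return -1;                       /* not a valid starting delimiter */
    close = (open == '{') ? '}' : open;  /* only { has a different ending */

    for (p++; *p != '\0'; p++) {
        if (*p == close) {
            if (p[1] != close) {         /* single delimiter: end of string */
                out[n] = '\0';
                return (long)n;
            }
            p++;                         /* doubled delimiter: keep one copy */
        }
        if (n + 1 >= outsize) return -1; /* output buffer too small */
        out[n++] = *p;
    }
    return -1;                           /* no ending delimiter was found */
}
```

       Applied to the text that follows (?C in the second callout above, this
       yields the 21-character string: some "arbitrary" text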
4504
4505       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
4506       PCRE2  automatically inserts callouts, all with number 255, before each
4507       item in the pattern except for immediately before or after an  explicit
4508       callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
4509
4510         A(?C3)B
4511
4512       it is processed as if it were
4513
4514         (?C255)A(?C3)B(?C255)
4515
4516       Here is a more complicated example:
4517
4518         A(\d{2}|--)
4519
4520       With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
4521
4522         (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4523
4524       Notice  that  there  is a callout before and after each parenthesis and
4525       alternation bar. If the pattern contains a conditional group whose con-
4526       dition  is  an  assertion, an automatic callout is inserted immediately
4527       before the condition. Such a callout may also be  inserted  explicitly,
4528       for example:
4529
4530         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)
4531
4532       This  applies only to assertion conditions (because they are themselves
4533       independent groups).
4534
4535       Callouts can be useful for tracking the progress of  pattern  matching.
4536       The pcre2test program has a pattern qualifier (/auto_callout) that sets
4537       automatic callouts.  When any callouts are  present,  the  output  from
4538       pcre2test  indicates  how  the pattern is being matched. This is useful
4539       information when you are trying to optimize the performance of  a  par-
4540       ticular pattern.
4541
4542
4543MISSING CALLOUTS
4544
4545       You  should  be  aware  that, because of optimizations in the way PCRE2
4546       compiles and matches patterns, callouts sometimes do not happen exactly
4547       as you might expect.
4548
4549   Auto-possessification
4550
4551       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4552       that what follows cannot be part of the repeat. For example, a+[bc]  is
4553       compiled  as if it were a++[bc]. The pcre2test output when this pattern
4554       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
4555       to the string "aaaa" is:
4556
4557         --->aaaa
4558          +0 ^        a+
4559          +2 ^   ^    [bc]
4560         No match
4561
4562       This  indicates that when matching [bc] fails, there is no backtracking
4563       into a+ (because it is being treated as a++) and therefore the callouts
4564       that  would  be  taken for the backtracks do not occur. You can disable
4565       the  auto-possessify  feature  by  passing   PCRE2_NO_AUTO_POSSESS   to
4566       pcre2_compile(),  or  starting  the pattern with (*NO_AUTO_POSSESS). In
4567       this case, the output changes to this:
4568
4569         --->aaaa
4570          +0 ^        a+
4571          +2 ^   ^    [bc]
4572          +2 ^  ^     [bc]
4573          +2 ^ ^      [bc]
4574          +2 ^^       [bc]
4575         No match
4576
4577       This time, when matching [bc] fails, the matcher backtracks into a+ and
4578       tries again, repeatedly, until a+ itself fails.
4579
4580   Automatic .* anchoring
4581
4582       By default, an optimization is applied when .* is the first significant
4583       item in a pattern. If PCRE2_DOTALL is set, so that the  dot  can  match
4584       any  character,  the pattern is automatically anchored. If PCRE2_DOTALL
4585       is not set, a match can start only after an internal newline or at  the
4586       beginning of the subject, and pcre2_compile() remembers this. If a pat-
4587       tern has more than one top-level branch, automatic anchoring occurs  if
4588       all branches are anchorable.
4589
4590       This  optimization is disabled, however, if .* is in an atomic group or
4591       if there is a backreference to the capture group in which  it  appears.
4592       It  is  also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
4593       ever, the presence of callouts does not affect it.
4594
4595       For example, if the pattern .*\d is  compiled  with  PCRE2_AUTO_CALLOUT
4596       and applied to the string "aa", the pcre2test output is:
4597
4598         --->aa
4599          +0 ^      .*
4600          +2 ^ ^    \d
4601          +2 ^^     \d
4602          +2 ^      \d
4603         No match
4604
4605       This  shows  that all match attempts start at the beginning of the sub-
4606       ject. In other words, the pattern is anchored. You can disable this op-
4607       timization  by  passing  PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
4608       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the  out-
4609       put changes to:
4610
4611         --->aa
4612          +0 ^      .*
4613          +2 ^ ^    \d
4614          +2 ^^     \d
4615          +2 ^      \d
4616          +0  ^     .*
4617          +2  ^^    \d
4618          +2  ^     \d
4619         No match
4620
4621       This  shows more match attempts, starting at the second subject charac-
4622       ter.  Another optimization, described in the next section,  means  that
4623       there is no subsequent attempt to match with an empty subject.
4624
4625   Other optimizations
4626
4627       Other  optimizations  that  provide fast "no match" results also affect
4628       callouts.  For example, if the pattern is
4629
4630         ab(?C4)cd
4631
4632       PCRE2 knows that any matching string must contain the  letter  "d".  If
4633       the  subject  string  is  "abyz",  the  lack of "d" means that matching
4634       doesn't ever start, and the callout is  never  reached.  However,  with
4635       "abyd", though the result is still no match, the callout is obeyed.
4636
4637       For  most  patterns  PCRE2  also knows the minimum length of a matching
4638       string, and will immediately give a "no match" return without  actually
4639       running  a  match if the subject is not long enough, or, for unanchored
4640       patterns, if it has been scanned far enough.
4641
4642       You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4643       MIZE  option  to  pcre2_compile(),  or  by  starting  the  pattern with
4644       (*NO_START_OPT). This slows down the matching process, but does  ensure
4645       that callouts such as the example above are obeyed.
4646
4647
4648THE CALLOUT INTERFACE
4649
4650       During  matching,  when  PCRE2  reaches a callout point, if an external
4651       function is provided in the match context, it is called.  This  applies
4652       to normal, DFA, and JIT matching. The first argument to the callout
4653       function is a pointer to a pcre2_callout_block. The second argument
4654       is  the  void * callout data that was supplied when the callout was set
4655       up by calling pcre2_set_callout() (see the pcre2api documentation). The
4656       callout  block structure contains the following fields, not necessarily
4657       in this order:
4658
4659         uint32_t      version;
4660         uint32_t      callout_number;
4661         uint32_t      capture_top;
4662         uint32_t      capture_last;
4663         uint32_t      callout_flags;
4664         PCRE2_SIZE   *offset_vector;
4665         PCRE2_SPTR    mark;
4666         PCRE2_SPTR    subject;
4667         PCRE2_SIZE    subject_length;
4668         PCRE2_SIZE    start_match;
4669         PCRE2_SIZE    current_position;
4670         PCRE2_SIZE    pattern_position;
4671         PCRE2_SIZE    next_item_length;
4672         PCRE2_SIZE    callout_string_offset;
4673         PCRE2_SIZE    callout_string_length;
4674         PCRE2_SPTR    callout_string;
4675
4676       The version field contains the version number of the block format.  The
4677       current  version  is  2; the three callout string fields were added for
4678       version 1, and the callout_flags field for version 2. If you are  writ-
4679       ing  an  application  that  might  use an earlier release of PCRE2, you
4680       should check the version number before accessing any of  these  fields.
4681       The  version  number  will increase in future if more fields are added,
4682       but the intention is never to remove any of the existing fields.
4683
4684   Fields for numerical callouts
4685
4686       For a numerical callout, callout_string  is  NULL,  and  callout_number
4687       contains  the  number  of  the callout, in the range 0-255. This is the
4688       number that follows (?C for callouts that are part of the pattern; it is
4689       255 for automatically generated callouts.
4690
4691   Fields for string callouts
4692
4693       For  callouts with string arguments, callout_number is always zero, and
4694       callout_string points to the string that is contained within  the  com-
4695       piled pattern. Its length is given by callout_string_length. Duplicated
4696       ending delimiters that were present in the original pattern string have
4697       been turned into single characters, but there is no other processing of
4698       the callout string argument. An additional code unit containing  binary
4699       zero  is  present  after the string, but is not included in the length.
4700       The delimiter that was used to start the string is also  stored  within
4701       the  pattern, immediately before the string itself. You can access this
4702       delimiter as callout_string[-1] if you need it.
4703
4704       The callout_string_offset field is the code unit offset to the start of
4705       the callout argument string within the original pattern string. This is
4706       provided for the benefit of applications such as script languages  that
4707       might need to report errors in the callout string within the pattern.
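
       A callout function typically begins by working out which kind of
       callout it has been given. The C sketch below shows one way to do
       this. To keep it self-contained it uses toy_callout_block, a cut-down
       stand-in that holds only four of the fields listed above, with const
       char * in place of PCRE2_SPTR; real code would include pcre2.h and use
       pcre2_callout_block. The name toy_callout_kind() is also hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for the callout block; not the real structure. */
typedef struct {
    uint32_t    version;
    uint32_t    callout_number;
    size_t      callout_string_length;
    const char *callout_string;      /* NULL for a numerical callout */
} toy_callout_block;

typedef enum { TOY_NUMERICAL, TOY_STRING } toy_kind;

/* Classifies a callout. For a string callout, the starting delimiter is
   stored in *delimiter when delimiter is not NULL. */
toy_kind toy_callout_kind(const toy_callout_block *cb, char *delimiter)
{
    assert(cb != NULL);
    /* The string fields exist only from block version 1 onwards. */
    if (cb->version >= 1 && cb->callout_string != NULL) {
        /* The delimiter is stored immediately before the string. */
        if (delimiter != NULL) *delimiter = cb->callout_string[-1];
        return TOY_STRING;
    }
    return TOY_NUMERICAL;
}
```

       Checking the version before touching callout_string matters only for
       applications that might run against a release whose block format
       predates version 1.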
4708
4709   Fields for all callouts
4710
4711       The  remaining  fields in the callout block are the same for both kinds
4712       of callout.
4713
4714       The offset_vector field is a pointer to a vector of  capturing  offsets
4715       (the "ovector"). You may read the elements in this vector, but you must
4716       not change any of them.
4717
4718       For calls to pcre2_match(), the offset_vector field is not  (since  re-
4719       lease  10.30)  a  pointer  to the actual ovector that was passed to the
4720       matching function in the match data block. Instead it points to an  in-
4721       ternal  ovector  of  a  size large enough to hold all possible captured
4722       substrings in the pattern. Note that whenever a recursion or subroutine
4723       call  within  a pattern completes, the capturing state is reset to what
4724       it was before.
4725
4726       The capture_last field contains the number of the  most  recently  cap-
4727       tured  substring,  and the capture_top field contains one more than the
4728       number of the highest numbered captured substring so far.  If  no  sub-
4729       strings  have yet been captured, the value of capture_last is 0 and the
4730       value of capture_top is 1. The values of these  fields  do  not  always
4731       differ   by   one;  for  example,  when  the  callout  in  the  pattern
4732       ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
4733
4734       The contents of ovector[2] to  ovector[<capture_top>*2-1]  can  be  in-
4735       spected  in  order to extract substrings that have been matched so far,
4736       in the same way as extracting substrings after a match  has  completed.
4737       The  values in ovector[0] and ovector[1] are always PCRE2_UNSET because
4738       the match is by definition not complete. Substrings that have not  been
4739       captured  but whose numbers are less than capture_top also have both of
4740       their ovector slots set to PCRE2_UNSET.
4741
4742       For DFA matching, the offset_vector field points to  the  ovector  that
4743       was  passed  to the matching function in the match data block for call-
4744       outs at the top level, but to an internal ovector during the processing
4745       of  pattern  recursions, lookarounds, and atomic groups. However, these
4746       ovectors hold no useful information because pcre2_dfa_match() does  not
4747       support  substring  capturing. The value of capture_top is always 1 and
4748       the value of capture_last is always 0 for DFA matching.
4749
4750       The subject and subject_length fields contain copies of the values that
4751       were passed to the matching function.
4752
4753       The  start_match  field normally contains the offset within the subject
4754       at which the current match attempt started. However, if the escape  se-
4755       quence  \K  has  been encountered, this value is changed to reflect the
4756       modified starting point. If the pattern is not  anchored,  the  callout
4757       function may be called several times from the same point in the pattern
4758       for different starting points in the subject.
4759
4760       The current_position field contains the offset within  the  subject  of
4761       the current match pointer.
4762
4763       The pattern_position field contains the offset in the pattern string to
4764       the next item to be matched.
4765
4766       The next_item_length field contains the length of the next item  to  be
4767       processed  in the pattern string. When the callout is at the end of the
4768       pattern, the length is zero.  When  the  callout  precedes  an  opening
4769       parenthesis, the length includes meta characters that follow the paren-
4770       thesis. For example, in a callout before an assertion  such  as  (?=ab)
4771       the  length  is  3.  For  an alternation bar or a closing parenthesis,
4772       the length is one, unless a closing parenthesis is followed by a  quan-
4773       tifier, in which case its length is included.  (This changed in release
4774       10.23. In earlier releases, before an opening  parenthesis  the  length
4775       was  that of the entire group, and before an alternation bar or a clos-
4776       ing parenthesis the length was zero.)
4777
4778       The pattern_position and next_item_length fields are intended  to  help
4779       in  distinguishing between different automatic callouts, which all have
4780       the same callout number. However, they are set for  all  callouts,  and
4781       are used by pcre2test to show the next item to be matched when display-
4782       ing callout information.
4783
4784       In callouts from pcre2_match() the mark field contains a pointer to the
4785       zero-terminated  name of the most recently passed (*MARK), (*PRUNE), or
4786       (*THEN) item in the match, or NULL if no such items have  been  passed.
4787       Instances  of  (*PRUNE)  or  (*THEN) without a name do not obliterate a
4788       previous (*MARK). In callouts from the DFA matching function this field
4789       always contains NULL.
4790
4791       The   callout_flags   field   is   always   zero   in   callouts   from
4792       pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
4793       JIT is used, the following bits may be set:
4794
4795         PCRE2_CALLOUT_STARTMATCH
4796
4797       This  is set for the first callout after the start of matching for each
4798       new starting position in the subject.
4799
4800         PCRE2_CALLOUT_BACKTRACK
4801
4802       This is set if there has been a matching backtrack since  the  previous
4803       callout,  or  since  the start of matching if this is the first callout
4804       from a pcre2_match() run.
4805
4806       Both bits are set when a backtrack has caused a "bumpalong"  to  a  new
4807       starting  position in the subject. Output from pcre2test does not indi-
4808       cate the presence of these bits unless the  callout_extra  modifier  is
4809       set.
4810
4811       The information in the callout_flags field is provided so that applica-
4812       tions can track and tell their users how matching with backtracking  is
4813       done.  This  can be useful when trying to optimize patterns, or just to
4814       understand how PCRE2 works. There is no  support  in  pcre2_dfa_match()
4815       because  there is no backtracking in DFA matching, and there is no sup-
4816       port in JIT because JIT is all about maximizing  matching  performance.
4817       In both these cases the callout_flags field is always zero.
4818
4819
4820RETURN VALUES FROM CALLOUTS
4821
4822       The external callout function returns an integer to PCRE2. If the value
4823       is zero, matching proceeds as normal. If  the  value  is  greater  than
4824       zero,  matching  fails  at  the current point, but the testing of other
4825       matching possibilities goes ahead, just as if a lookahead assertion had
4826       failed. If the value is less than zero, the match is abandoned, and the
4827       matching function returns the negative value.
4828
4829       Negative values should normally be chosen from  the  set  of  PCRE2_ER-
4830       ROR_xxx  values.  In  particular, PCRE2_ERROR_NOMATCH forces a standard
4831       "no match" failure. The error number  PCRE2_ERROR_CALLOUT  is  reserved
4832       for use by callout functions; it will never be used by PCRE2 itself.
4833
4834
4835CALLOUT ENUMERATION
4836
4837       int pcre2_callout_enumerate(const pcre2_code *code,
4838         int (*callback)(pcre2_callout_enumerate_block *, void *),
4839         void *user_data);
4840
4841       A script language that supports the use of string arguments in callouts
4842       might like to scan all the callouts in a  pattern  before  running  the
4843       match. This can be done by calling pcre2_callout_enumerate(). The first
4844       argument is a pointer to a compiled pattern, the  second  points  to  a
4845       callback  function,  and the third is arbitrary user data. The callback
4846       function is called for every callout in the pattern  in  the  order  in
4847       which they appear. Its first argument is a pointer to a callout enumer-
4848       ation block, and its second argument is the user_data  value  that  was
4849       passed  to  pcre2_callout_enumerate(). The data block contains the fol-
4850       lowing fields:
4851
4852         version                Block version number
4853         pattern_position       Offset to next item in pattern
4854         next_item_length       Length of next item in pattern
4855         callout_number         Number for numbered callouts
4856         callout_string_offset  Offset to string within pattern
4857         callout_string_length  Length of callout string
4858         callout_string         Points to callout string or is NULL
4859
4860       The version number is currently 0. It will increase if new  fields  are
4861       ever  added  to  the  block. The remaining fields are the same as their
4862       namesakes in the pcre2_callout block that is used for  callouts  during
4863       matching, as described above.
4864
4865       Note  that  the  value  of pattern_position is unique for each callout.
4866       However, if a callout occurs inside a group that is quantified  with  a
4867       non-zero minimum or a fixed maximum, the group is replicated inside the
4868       compiled pattern. For example, a pattern such as /(a){2}/  is  compiled
4869       as  if it were /(a)(a)/. This means that the callout will be enumerated
4870       more than once, but with the same value for  pattern_position  in  each
4871       case.
4872
4873       The callback function should normally return zero. If it returns a non-
4874       zero value, scanning the pattern stops, and that value is returned from
4875       pcre2_callout_enumerate().
4876
4877
4878AUTHOR
4879
4880       Philip Hazel
4881       University Computing Service
4882       Cambridge, England.
4883
4884
4885REVISION
4886
4887       Last updated: 03 February 2019
4888       Copyright (c) 1997-2019 University of Cambridge.
4889------------------------------------------------------------------------------
4890
4891
4892PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)
4893
4894
4895
4896NAME
4897       PCRE2 - Perl-compatible regular expressions (revised API)
4898
4899DIFFERENCES BETWEEN PCRE2 AND PERL
4900
4901       This  document describes some of the differences in the ways that PCRE2
4902       and Perl handle regular expressions. The differences described here are
4903       with  respect  to  Perl  version 5.34.0, but as both Perl and PCRE2 are
4904       continually changing, the information may at times be out of date.
4905
4906       1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier)  is  not  set,
4907       the behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.'
4908       matches the next character unless it is the  start  of  a  newline  se-
4909       quence.  This  means  that, if the newline setting is CR, CRLF, or NUL,
4910       '.' will match the code point LF (0x0A) in ASCII/Unicode  environments,
4911       and  NL  (either  0x15 or 0x25) when using EBCDIC. In Perl, '.' appears
4912       never to match LF, even when 0x0A is not a newline indicator.
4913
4914       2. PCRE2 has only a subset of Perl's Unicode support. Details  of  what
4915       it does have are given in the pcre2unicode page.
4916
4917       3.  Like  Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
4918       tions, but they do not mean what you might think. For example, (?!a){3}
4919       does not assert that the next three characters are not "a". It just as-
4920       serts that the next character is not "a"  three  times  (in  principle;
4921       PCRE2  optimizes this to run the assertion just once). Perl allows some
4922       repeat quantifiers on other assertions, for example, \b* , but these do
4923       not  seem  to have any use. PCRE2 does not allow any kind of quantifier
4924       on non-lookaround assertions.
4925
4926       4. Capture groups that occur inside negative lookaround assertions  are
4927       counted,  but  their  entries in the offsets vector are set only when a
4928       negative assertion is a condition that has a matching branch (that  is,
4929       the  condition  is  false).   Perl may set such capture groups in other
4930       circumstances.
4931
4932       5. The following Perl escape sequences are not supported: \F,  \l,  \L,
4933       \u, \U, and \N when followed by a character name. \N on its own, match-
4934       ing a non-newline character, and \N{U+dd..}, matching  a  Unicode  code
4935       point,  are  supported.  The  escapes that modify the case of following
4936       letters are implemented by Perl's general string-handling and  are  not
4937       part of its pattern matching engine. If any of these are encountered by
4938       PCRE2, an error is generated by default.  However,  if  either  of  the
4939       PCRE2_ALT_BSUX  or  PCRE2_EXTRA_ALT_BSUX  options is set, \U and \u are
4940       interpreted as ECMAScript interprets them.
4941
4942       6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
4943       is built with Unicode support (the default). The properties that can be
4944       tested with \p and \P are limited to the  general  category  properties
4945       such  as  Lu  and  Nd,  script  names such as Greek or Han, Bidi_Class,
4946       Bidi_Control, and the derived properties Any and LC (synonym L&).  Both
4947       PCRE2  and  Perl  support the Cs (surrogate) property, but in PCRE2 its
4948       use is limited. See the pcre2pattern  documentation  for  details.  The
4949       long  synonyms  for  property names that Perl supports (such as \p{Let-
4950       ter}) are not supported by PCRE2, nor is it permitted to prefix any  of
4951       these properties with "Is".
4952
4953       7. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
4954       in between are treated as literals. However, this is slightly different
4955       from  Perl  in  that  $  and  @ are also handled as literals inside the
4956       quotes. In Perl, they cause variable interpolation (PCRE2 does not have
4957       variables). Also, Perl does "double-quotish backslash interpolation" on
4958       any backslashes between \Q and \E which, its documentation  says,  "may
4959       lead  to confusing results". PCRE2 treats a backslash between \Q and \E
4960       just like any other character. Note the following examples:
4961
4962           Pattern            PCRE2 matches     Perl matches
4963
4964           \Qabc$xyz\E        abc$xyz           abc followed by the
4965                                                  contents of $xyz
4966           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
4967           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
4968           \QA\B\E            A\B               A\B
4969           \Q\\E              \                 \\E
4970
4971       The \Q...\E sequence is recognized both inside  and  outside  character
4972       classes by both PCRE2 and Perl.
4973
4974       8.   Fairly  obviously,  PCRE2  does  not  support  the  (?{code})  and
4975       (??{code}) constructions. However, PCRE2 does have a "callout" feature,
4976       which allows an external function to be called during pattern matching.
4977       See the pcre2callout documentation for details.
4978
4979       9. Subroutine calls (whether recursive or not) were treated  as  atomic
4980       groups  up to PCRE2 release 10.23, but from release 10.30 this changed,
4981       and backtracking into subroutine calls is now supported, as in Perl.
4982
4983       10. In PCRE2, if any of the backtracking control verbs are  used  in  a
4984       group  that  is  called  as  a subroutine (whether or not recursively),
4985       their effect is confined to that group; it does not extend to the  sur-
4986       rounding  pattern.  This is not always the case in Perl. In particular,
4987       if (*THEN) is present in a group that is called as  a  subroutine,  its
4988       action is limited to that group, even if the group does not contain any
4989       | characters. Note that such groups are processed as  anchored  at  the
4990       point where they are tested.
4991
4992       11.  If a pattern contains more than one backtracking control verb, the
4993       first one that is backtracked onto acts. For example,  in  the  pattern
4994       A(*COMMIT)B(*PRUNE)C  a  failure in B triggers (*COMMIT), but a failure
4995       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
4996       it is the same as PCRE2, but there are cases where it differs.
4997
4998       12.  There are some differences that are concerned with the settings of
4999       captured strings when part of  a  pattern  is  repeated.  For  example,
5000       matching  "aba"  against the pattern /^(a(b)?)+$/ in Perl leaves $2 un-
5001       set, but in PCRE2 it is set to "b".
5002
5003       13. PCRE2's handling of duplicate capture group numbers  and  names  is
5004       not as general as Perl's. This is a consequence of the fact that  PCRE2
5005       works internally just with numbers, using an external table  to  trans-
5006       late  between  numbers  and  names.  In  particular,  a pattern such as
5007       (?|(?<a>A)|(?<b>B)), where the two capture groups have the same  number
5008       but  different  names, is not supported, and causes an error at compile
5009       time. If it were allowed, it would not be possible to distinguish which
5010       group  matched,  because  both  names map to capture group number 1. To
5011       avoid this confusing situation, an error is given at compile time.
5012
5013       14. Perl used to recognize comments in some places that PCRE2 does not,
5014       for  example,  between  the  ( and ? at the start of a group. If the /x
5015       modifier is set, Perl allowed white space between ( and  ?  though  the
5016       latest  Perls give an error (for a while it was just deprecated). There
5017       may still be some cases where Perl behaves differently.
5018
5019       15. Perl, when in warning mode, gives warnings  for  character  classes
5020       such  as  [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
5021       als. PCRE2 has no warning features, so it gives an error in these cases
5022       because they are almost certainly user mistakes.
5023
5024       16.  In  PCRE2, the upper/lower case character properties Lu and Ll are
5025       not affected when case-independent matching is specified. For  example,
5026       \p{Lu} always matches an upper case letter. I think Perl has changed in
5027       this respect; in the release at the time of writing (5.34), \p{Lu}  and
5028       \p{Ll} match all letters, regardless of case, when case independence is
5029       specified.
5030
5031       17. From release 5.32.0, Perl locks out the use of \K in lookaround as-
5032       sertions.  From  release 10.38 PCRE2 does the same by default. However,
5033       there is an option for re-enabling the previous  behaviour.  When  this
5034       option  is  set,  \K is acted on when it occurs in positive assertions,
5035       but is ignored in negative assertions.
5036
5037       18. PCRE2 provides some extensions to the Perl regular  expression  fa-
5038       cilities.   Perl  5.10  included  new features that were not in earlier
5039       versions of Perl, some of which (such as  named  parentheses)  were  in
5040       PCRE2 for some time before. This list is with respect to Perl 5.34:
5041
5042       (a)  Although  lookbehind  assertions  in PCRE2 must match fixed length
5043       strings, each alternative toplevel branch of a lookbehind assertion can
5044       match  a  different  length of string. Perl used to require them all to
5045       have the same length, but the latest version has some  variable  length
5046       support.
5047
5048       (b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
5049       ported in lookbehinds, provided that there is no possibility of  refer-
5050       encing  a  non-unique  number or name. Perl does not support backrefer-
5051       ences in lookbehinds.
5052
5053       (c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set,  the
5054       $ meta-character matches only at the very end of the string.
5055
5056       (d)  A  backslash  followed  by  a  letter  with  no special meaning is
5057       faulted. (Perl can be made to issue a warning.)
5058
5059       (e) If PCRE2_UNGREEDY is set, the greediness of the repetition  quanti-
5060       fiers is inverted, that is, by default they are not greedy, but if fol-
5061       lowed by a question mark they are.
5062
5063       (f) PCRE2_ANCHORED can be used at matching time to force a  pattern  to
5064       be tried only at the first matching position in the subject string.
5065
5066       (g)     The     PCRE2_NOTBOL,    PCRE2_NOTEOL,    PCRE2_NOTEMPTY    and
5067       PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
5068
5069       (h) The \R escape sequence can be restricted to match only CR,  LF,  or
5070       CRLF by the PCRE2_BSR_ANYCRLF option.
5071
5072       (i)  The  callout  facility is PCRE2-specific. Perl supports codeblocks
5073       and variable interpolation, but not general hooks on every match.
5074
5075       (j) The partial matching facility is PCRE2-specific.
5076
5077       (k) The alternative matching function (pcre2_dfa_match()) matches in  a
5078       different way and is not Perl-compatible.
5079
5080       (l)  PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
5081       at the start of a pattern. These set overall  options  that  cannot  be
5082       changed within the pattern.
5083
5084       (m)  PCRE2  supports non-atomic positive lookaround assertions. This is
5085       an extension to the lookaround facilities. The default, Perl-compatible
5086       lookarounds are atomic.
5087
5088       19.  The  Perl /a modifier restricts \d digits to pure ASCII, and the
5089       /aa modifier restricts  /i  case-insensitive  matching  to  pure  ASCII,
5090       ignoring  Unicode  rules.  This  separation cannot be represented with
5091       PCRE2_UCP.
5092
5093       20. Perl's limits differ from those of  PCRE2.  See  the  pcre2limits
5094       documentation  for  details.  In  release  5.10,  Perl  changed  from
5095       recursion to iteration, keeping the intermediate matches on the  heap,
5096       which  is about 10% slower but avoids overflowing the machine stack.
5097       PCRE2 made a similar change  at  release  10.30,  and  also  has  many
5098       build-time and run-time customizable limits.
5099
5100
5101AUTHOR
5102
5103       Philip Hazel
5104       Retired from University Computing Service
5105       Cambridge, England.
5106
5107
5108REVISION
5109
5110       Last updated: 08 December 2021
5111       Copyright (c) 1997-2021 University of Cambridge.
5112------------------------------------------------------------------------------
5113
5114
5115PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)
5116
5117
5118
5119NAME
5120       PCRE2 - Perl-compatible regular expressions (revised API)
5121
5122PCRE2 JUST-IN-TIME COMPILER SUPPORT
5123
5124       Just-in-time  compiling  is a heavyweight optimization that can greatly
5125       speed up pattern matching. However, it comes at the cost of extra  pro-
5126       cessing  before  the  match is performed, so it is of most benefit when
5127       the same pattern is going to be matched many times. This does not  nec-
5128       essarily  mean many calls of a matching function; if the pattern is not
5129       anchored, matching attempts may take place many times at various  posi-
5130       tions in the subject, even for a single call. Therefore, if the subject
5131       string is very long, it may still pay  to  use  JIT  even  for  one-off
5132       matches.  JIT  support  is  available  for all of the 8-bit, 16-bit and
5133       32-bit PCRE2 libraries.
5134
5135       JIT support applies only to the  traditional  Perl-compatible  matching
5136       function.   It  does  not apply when the DFA matching function is being
5137       used. The code for this support was written by Zoltan Herczeg.
5138
5139
5140AVAILABILITY OF JIT SUPPORT
5141
5142       JIT support is an optional feature of  PCRE2.  The  "configure"  option
5143       --enable-jit  (or  equivalent  CMake  option) must be set when PCRE2 is
5144       built if you want to use JIT. The support is limited to  the  following
5145       hardware platforms:
5146
5147         ARM 32-bit (v5, v7, and Thumb2)
5148         ARM 64-bit
5149         IBM s390x 64 bit
5150         Intel x86 32-bit and 64-bit
5151         MIPS 32-bit and 64-bit
5152         Power PC 32-bit and 64-bit
5153         SPARC 32-bit
5154
5155       If --enable-jit is set on an unsupported platform, compilation fails.
5156
5157       A  program  can  tell if JIT support is available by calling pcre2_con-
5158       fig() with the PCRE2_CONFIG_JIT option. The result is  1  when  JIT  is
5159       available,  and 0 otherwise. However, a simple program does not need to
5160       check this in order to use JIT. The API is implemented in  a  way  that
5161       falls  back  to the interpretive code if JIT is not available. For pro-
5162       grams that need the best possible performance, there is  also  a  "fast
5163       path" API that is JIT-specific.
5164
5165
5166SIMPLE USE OF JIT
5167
5168       To  make use of the JIT support in the simplest way, all you have to do
5169       is to call pcre2_jit_compile() after successfully compiling  a  pattern
5170       with pcre2_compile(). This function has two arguments: the first is the
5171       compiled pattern pointer that was returned by pcre2_compile(), and  the
5172       second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
5173       PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
5174
5175       If JIT support is not available, a  call  to  pcre2_jit_compile()  does
5176       nothing  and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
5177       pattern is passed to the JIT compiler, which turns it into machine code
5178       that executes much faster than the normal interpretive code, but yields
5179       exactly the same results. The returned value  from  pcre2_jit_compile()
5180       is zero on success, or a negative error code.
5181
5182       There  is  a limit to the size of pattern that JIT supports, imposed by
5183       the size of machine stack that it uses. The exact rules are  not  docu-
5184       mented because they may change at any time, in particular, when new op-
5185       timizations are introduced.  If  a  pattern  is  too  big,  a  call  to
5186       pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.
5187
5188       PCRE2_JIT_COMPLETE  requests the JIT compiler to generate code for com-
5189       plete matches. If you want to run partial matches using the  PCRE2_PAR-
5190       TIAL_HARD  or  PCRE2_PARTIAL_SOFT  options of pcre2_match(), you should
5191       set one or both of  the  other  options  as  well  as,  or  instead  of,
5192       PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
5193       for each of the three modes (normal, soft partial, hard partial).  When
5194       pcre2_match()  is  called,  the appropriate code is run if it is avail-
5195       able. Otherwise, the pattern is matched using interpretive code.
5196
5197       You can call pcre2_jit_compile() multiple times for the  same  compiled
5198       pattern.  It does nothing if it has previously compiled code for any of
5199       the option bits. For example, you can call it once with  PCRE2_JIT_COM-
5200       PLETE  and  (perhaps  later,  when  you find you need partial matching)
5201       again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time  it
5202       will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
5203       ing. If pcre2_jit_compile() is called with no option bits set, it imme-
5204       diately returns zero. This is an alternative way of testing whether JIT
5205       is available.
5206
5207       At present, it is not possible to free JIT compiled  code  except  when
5208       the entire compiled pattern is freed by calling pcre2_code_free().
5209
5210       In  some circumstances you may need to call additional functions. These
5211       are described in the section entitled "Controlling the JIT  stack"  be-
5212       low.
5213
5214       There are some pcre2_match() options that are not supported by JIT, and
5215       there are also some pattern items that JIT cannot handle.  Details  are
5216       given  below.  In  both cases, matching automatically falls back to the
5217       interpretive code. If you want to know whether JIT  was  actually  used
5218       for  a particular match, you should arrange for a JIT callback function
5219       to be set up as described in the section entitled "Controlling the  JIT
5220       stack"  below,  even  if  you  do  not need to supply a non-default JIT
5221       stack. Such a callback function is called whenever JIT code is about to
5222       be  obeyed.  If the match-time options are not right for JIT execution,
5223       the callback function is not obeyed.
5224
5225       If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
5226       ated.  You  can find out if JIT matching is available after compiling a
5227       pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op-
5228       tion.  A  non-zero  result means that JIT compilation was successful. A
5229       result of 0 means that JIT support is not available, or the pattern was
5230       not  processed by pcre2_jit_compile(), or the JIT compiler was not able
5231       to handle the pattern.
5232
5233
5234MATCHING SUBJECTS CONTAINING INVALID UTF
5235
5236       When a pattern is compiled with the PCRE2_UTF option,  subject  strings
5237       are  normally expected to be a valid sequence of UTF code units. By de-
5238       fault, this is checked at the start of matching and an error is  gener-
5239       ated  if  invalid UTF is detected. The PCRE2_NO_UTF_CHECK option can be
5240       passed to pcre2_match() to skip the check (for improved performance) if
5241       you  are  sure  that  a subject string is valid. If this option is used
5242       with an invalid string, the result is undefined.
5243
5244       However, a way of running matches on strings that may  contain  invalid
5245       UTF   sequences   is   available.   Calling  pcre2_compile()  with  the
5246       PCRE2_MATCH_INVALID_UTF option has two effects:  it  tells  the  inter-
5247       preter  in pcre2_match() to support invalid UTF, and, if pcre2_jit_com-
5248       pile() is called, the compiled JIT code also supports invalid UTF.  De-
5249       tails  of  how this support works, in both the JIT and the interpretive
5250       cases, are given in the pcre2unicode documentation.
5251
5252       There  is  also  an  obsolete  option  for  pcre2_jit_compile()  called
5253       PCRE2_JIT_INVALID_UTF, which currently exists only for backward compat-
5254       ibility.    It   is   superseded   by   the   pcre2_compile()    option
5255       PCRE2_MATCH_INVALID_UTF and should no longer be used. It may be removed
5256       in future.
5257
5258
5259UNSUPPORTED OPTIONS AND PATTERN ITEMS
5260
5261       The pcre2_match() options that  are  supported  for  JIT  matching  are
5262       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
5263       PCRE2_NOTEMPTY_ATSTART,  PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,   and
5264       PCRE2_PARTIAL_SOFT.  The  PCRE2_ANCHORED  and PCRE2_ENDANCHORED options
5265       are not supported at match time.
5266
5267       If the PCRE2_NO_JIT option is passed to pcre2_match() it  disables  the
5268       use of JIT, forcing matching by the interpreter code.
5269
5270       The  only  unsupported  pattern items are \C (match a single data unit)
5271       when running in a UTF mode, and a callout immediately before an  asser-
5272       tion condition in a conditional group.
5273
5274
5275RETURN VALUES FROM JIT MATCHING
5276
5277       When a pattern is matched using JIT matching, the return values are the
5278       same as those given by the interpretive pcre2_match()  code,  with  the
5279       addition  of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
5280       that the memory used for the JIT stack was insufficient. See  "Control-
5281       ling the JIT stack" below for a discussion of JIT stack usage.
5282
5283       The  error  code  PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
5284       searching a very large pattern tree goes on for too long, as it  is  in
5285       the  same circumstance when JIT is not used, but the details of exactly
5286       what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
5287       is never returned when JIT matching is used.
5288
5289
5290CONTROLLING THE JIT STACK
5291
5292       When the compiled JIT code runs, it needs a block of memory to use as a
5293       stack.  By default, it uses 32KiB on the machine stack.  However,  some
5294       large  or complicated patterns need more than this. The error PCRE2_ER-
5295       ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func-
5296       tions are provided for managing blocks of memory for use as JIT stacks.
5297       There is further discussion about the use of JIT stacks in the  section
5298       entitled "JIT stack FAQ" below.
5299
5300       The  pcre2_jit_stack_create()  function  creates a JIT stack. Its argu-
5301       ments are a starting size, a maximum size, and a general  context  (for
5302       memory  allocation  functions, or NULL for standard memory allocation).
5303       It returns a pointer to an opaque structure of type pcre2_jit_stack, or
5304       NULL  if there is an error. The pcre2_jit_stack_free() function is used
5305       to free a stack that is no longer needed. If its argument is NULL, this
5306       function  returns immediately, without doing anything. (For the techni-
5307       cally minded: the address space is allocated by mmap or  VirtualAlloc.)
5308       A  maximum  stack size of 512KiB to 1MiB should be more than enough for
5309       any pattern.
5310
5311       The pcre2_jit_stack_assign() function specifies which  stack  JIT  code
5312       should use. Its arguments are as follows:
5313
5314         pcre2_match_context  *mcontext
5315         pcre2_jit_callback    callback
5316         void                 *data
5317
5318       The first argument is a pointer to a match context. When this is subse-
5319       quently passed to a matching function, its information determines which
5320       JIT stack is used. If this argument is NULL, the function returns imme-
5321       diately, without doing anything. There are three cases for  the  values
5322       of the other two arguments:
5323
5324         (1) If callback is NULL and data is NULL, an internal 32KiB block
5325             on the machine stack is used. This is the default when a match
5326             context is created.
5327
5328         (2) If callback is NULL and data is not NULL, data must be
5329             a pointer to a valid JIT stack, the result of calling
5330             pcre2_jit_stack_create().
5331
5332         (3) If callback is not NULL, it must point to a function that is
5333             called with data as an argument at the start of matching, in
5334             order to set up a JIT stack. If the return from the callback
5335             function is NULL, the internal 32KiB stack is used; otherwise the
5336             return value must be a valid JIT stack, the result of calling
5337             pcre2_jit_stack_create().
5338
5339       A  callback function is obeyed whenever JIT code is about to be run; it
5340       is not obeyed when pcre2_match() is called with options that are incom-
5341       patible  for JIT matching. A callback function can therefore be used to
5342       determine whether a match operation was executed by JIT or by  the  in-
5343       terpreter.
5344
5345       You may safely use the same JIT stack for more than one pattern (either
5346       by assigning directly or by callback), as  long  as  the  patterns  are
5347       matched sequentially in the same thread. Currently, the only way to set
5348       up non-sequential matches in one thread is to use callouts: if a  call-
5349       out  function starts another match, that match must use a different JIT
5350       stack to the one used for currently suspended match(es).
5351
5352       In a multithread application, if you do not specify a JIT stack, or  if
5353       you  assign or pass back NULL from a callback, that is thread-safe, be-
5354       cause each thread has its own machine stack. However, if you assign  or
5355       pass back a non-NULL JIT stack, this must be a different stack for each
5356       thread so that the application is thread-safe.
5357
5358       Strictly speaking, even more is allowed. You can assign the  same  non-
5359       NULL  stack  to a match context that is used by any number of patterns,
5360       as long as they are not used for matching by multiple  threads  at  the
5361       same  time.  For  example, you could use the same stack in all compiled
5362       patterns, with a global mutex in the callback to wait until  the  stack
5363       is available for use. However, this is an inefficient solution, and not
5364       recommended.
5365
5366       This is a suggestion for how a multithreaded program that needs to  set
5367       up non-default JIT stacks might operate:
5368
5369         During thread initialization
5370           thread_local_var = pcre2_jit_stack_create(...)
5371
5372         During thread exit
5373           pcre2_jit_stack_free(thread_local_var)
5374
5375         Use a one-line callback function
5376           return thread_local_var
5377
5378       All  the  functions  described in this section do nothing if JIT is not
5379       available.
5380
5381
5382JIT STACK FAQ
5383
5384       (1) Why do we need JIT stacks?
5385
5386       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
5387       where the local data of the current node is pushed before checking its
5388       child nodes. Allocating real machine stack is difficult on some
5389       platforms. On PowerPC, for example, the stack chain has to be updated
5390       every time the stack is extended. This is possible, but the time taken
5391       by the updates reduces performance, so we do the recursion in memory.
5392
5393       (2) Why don't we simply allocate blocks of memory with malloc()?
5394
5395       Modern  operating  systems have a nice feature: they can reserve an ad-
5396       dress space instead of allocating memory. We can safely allocate memory
5397       pages inside this address space, so the stack could grow without moving
5398       memory data (this is important because of pointers). Thus we can  allo-
5399       cate  1MiB  address  space,  and use only a single memory page (usually
5400       4KiB) if that is enough. However, we can still grow up to 1MiB  anytime
5401       if needed.
5402
5403       (3) Who "owns" a JIT stack?
5404
5405       The owner of the stack is the user program, not the JIT-compiled
5406       pattern or anything else. While a stack is being used by pcre2_match()
5407       (that is, it is assigned to the match context of a match that is
5408       currently running), the user program must ensure that no other thread
5409       uses it, to avoid overwriting the same memory area. The best practice
5410       for multithreaded programs is to allocate a stack for each thread, and
5411       return this stack through the JIT callback function.
5412
5413       (4) When should a JIT stack be freed?
5414
5415       You can free a JIT stack at any time, as long as it will not be used by
5416       pcre2_match() again. When you assign the stack to a match context, only
5417       a  pointer  is  set. There is no reference counting or any other magic.
5418       You can free compiled patterns, contexts, and stacks in any order, any-
5419       time.   Just do not call pcre2_match() with a match context pointing to
5420       an already freed stack, as that can cause a segmentation fault. (Also,
5421       do not free a stack currently used by pcre2_match() in another thread.)
5422       You can also replace the stack in a context at any time when it is not
5423       in use. Free the previous stack before assigning a replacement.
5424
5425       (5)  Should  I  allocate/free  a  stack every time before/after calling
5426       pcre2_match()?
5427
5428       No, because this is too costly in terms of resources. However, you
5429       could implement a scheme that releases the stack if it has not been
5430       used for, say, two minutes. The JIT callback can help to achieve this
5431       without keeping a list of patterns.
5432
5433       (6)  OK, the stack is for long term memory allocation. But what happens
5434       if a pattern causes stack overflow with a stack of 1MiB? Is  that  1MiB
5435       kept until the stack is freed?
5436
5437       Especially on embedded systems, it might be a good idea to release
5438       memory sometimes without freeing the stack. There is no API for this at
5439       the moment. A function that returns the amount of memory currently
5440       allocated for a stack, and another that releases memory (shrinking the
5441       stack), would probably be a good idea if someone needs this.
5442
5443       (7) This is too much of a headache. Isn't there any better solution for
5444       JIT stack handling?
5445
5446       No, thanks to Windows. If POSIX threads were used everywhere, we  could
5447       throw out this complicated API.
5448
5449
5450FREEING JIT SPECULATIVE MEMORY
5451
5452       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
5453
5454       The JIT executable allocator does not free all the memory that it
5455       could. It expects new allocations, and keeps some free memory around to
5456       improve  allocation  speed. However, in low memory conditions, it might
5457       be better to free all possible memory. You can cause this to happen  by
5458       calling  pcre2_jit_free_unused_memory(). Its argument is a general con-
5459       text, for custom memory management, or NULL for standard memory manage-
5460       ment.
5461
5462
5463EXAMPLE CODE
5464
5465       This  is  a  single-threaded example that specifies a JIT stack without
5466       using a callback. A real program should include  error  checking  after
5467       all the function calls.
5468
5469         int rc;
5470         pcre2_code *re;
5471         pcre2_match_data *match_data;
5472         pcre2_match_context *mcontext;
5473         pcre2_jit_stack *jit_stack;
5474
5475         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
5476           &errornumber, &erroffset, NULL);
5477         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
5478         mcontext = pcre2_match_context_create(NULL);
5479         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
5480         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
5481         match_data = pcre2_match_data_create(10, NULL);
5482         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
5483         /* Process result */
5484
5485         pcre2_code_free(re);
5486         pcre2_match_data_free(match_data);
5487         pcre2_match_context_free(mcontext);
5488         pcre2_jit_stack_free(jit_stack);
5489
5490
5491JIT FAST PATH API
5492
5493       Because the API described above falls back to interpreted matching when
5494       JIT is not available, it is convenient for programs  that  are  written
5495       for  general  use  in  many  environments.  However,  calling  JIT  via
5496       pcre2_match() does have a performance impact. Programs that are written
5497       for  use  where  JIT  is known to be available, and which need the best
5498       possible performance, can instead use a "fast path"  API  to  call  JIT
5499       matching  directly instead of calling pcre2_match() (obviously only for
5500       patterns that have been successfully processed by pcre2_jit_compile()).
5501
5502       The fast path function is called pcre2_jit_match(), and  it  takes  ex-
5503       actly  the same arguments as pcre2_match(). However, the subject string
5504       must be specified with a  length;  PCRE2_ZERO_TERMINATED  is  not  sup-
5505       ported. Unsupported option bits (for example, PCRE2_ANCHORED, PCRE2_EN-
5506       DANCHORED  and  PCRE2_COPY_MATCHED_SUBJECT)  are  ignored,  as  is  the
5507       PCRE2_NO_JIT  option.  The  return  values  are  also  the  same as for
5508       pcre2_match(), plus PCRE2_ERROR_JIT_BADOPTION if a matching mode  (par-
5509       tial or complete) is requested that was not compiled.
5510
5511       When  you call pcre2_match(), as well as testing for invalid options, a
5512       number of other sanity checks are performed on the arguments. For exam-
5513       ple,  if the subject pointer is NULL but the length is non-zero, an im-
5514       mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set,  a  UTF
5515       subject string is tested for validity. In the interests of speed, these
5516       checks do not happen on the JIT fast  path,  and  if  invalid  data  is
5517       passed, the result is undefined.
5518
5519       Bypassing  the  sanity  checks  and the pcre2_match() wrapping can give
5520       speedups of more than 10%.
5521
5522
5523SEE ALSO
5524
5525       pcre2api(3)
5526
5527
5528AUTHOR
5529
5530       Philip Hazel (FAQ by Zoltan Herczeg)
5531       University Computing Service
5532       Cambridge, England.
5533
5534
5535REVISION
5536
5537       Last updated: 30 November 2021
5538       Copyright (c) 1997-2021 University of Cambridge.
5539------------------------------------------------------------------------------
5540
5541
5542PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)
5543
5544
5545
5546NAME
5547       PCRE2 - Perl-compatible regular expressions (revised API)
5548
5549SIZE AND OTHER LIMITATIONS
5550
5551       There are some size limitations in PCRE2 but it is hoped that they will
5552       never in practice be relevant.
5553
5554       The maximum size of a compiled pattern  is  approximately  64  thousand
5555       code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5556       the default internal linkage size, which  is  2  bytes  for  these  li-
5557       braries.  If  you  want  to  process regular expressions that are truly
5558       enormous, you can compile PCRE2 with an internal linkage size of 3 or 4
5559       (when  building  the  16-bit  library,  3  is rounded up to 4). See the
5560       README file in the source distribution and the pcre2build documentation
5561       for  details.  In  these cases the limit is substantially larger.  How-
5562       ever, the speed of execution is slower. In the 32-bit library, the  in-
5563       ternal linkage size is always 4.
5564
5565       The maximum length of a source pattern string is essentially unlimited;
5566       it is the largest number a PCRE2_SIZE variable can hold.  However,  the
5567       program that calls pcre2_compile() can specify a smaller limit.
5568
5569       The maximum length (in code units) of a subject string is one less than
5570       the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un-
5571       signed integer type, usually defined as size_t. Its maximum value (that
5572       is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-termi-
5573       nated strings and unset offsets.
5574
5575       All values in repeating quantifiers must be less than 65536.
5576
5577       The maximum length of a lookbehind assertion is 65535 characters.
5578
5579       There  is no limit to the number of parenthesized groups, but there can
5580       be no more than 65535 capture groups, and there is a limit to the depth
5581       of  nesting  of parenthesized subpatterns of all kinds. This is imposed
5582       in order to limit the amount of system stack used at compile time.  The
5583       default limit can be specified when PCRE2 is built; if not, the default
5584       is set to  250.  An  application  can  change  this  limit  by  calling
5585       pcre2_set_parens_nest_limit() to set the limit in a compile context.
5586
5587       The maximum length of a name for a named capture group is 32 code
5588       units, and the maximum number of such groups is 10000.
5589
5590       The maximum length of a  name  in  a  (*MARK),  (*PRUNE),  (*SKIP),  or
5591       (*THEN)  verb  is  255  code units for the 8-bit library and 65535 code
5592       units for the 16-bit and 32-bit libraries.
5593
5594       The maximum length of a string argument to a  callout  is  the  largest
5595       number a 32-bit unsigned integer can hold.
5596
5597
5598AUTHOR
5599
5600       Philip Hazel
5601       University Computing Service
5602       Cambridge, England.
5603
5604
5605REVISION
5606
5607       Last updated: 02 February 2019
5608       Copyright (c) 1997-2019 University of Cambridge.
5609------------------------------------------------------------------------------
5610
5611
5612PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)
5613
5614
5615
5616NAME
5617       PCRE2 - Perl-compatible regular expressions (revised API)
5618
5619PCRE2 MATCHING ALGORITHMS
5620
5621       This document describes the two different algorithms that are available
5622       in PCRE2 for matching a compiled regular  expression  against  a  given
5623       subject  string.  The  "standard"  algorithm is the one provided by the
5624       pcre2_match() function. This works in the same way as Perl's matching
5625       function, and provides a Perl-compatible matching operation. The
5626       just-in-time (JIT) optimization that is described in the pcre2jit
5627       documentation is compatible with this function.
5628
5629       An alternative algorithm is provided by the pcre2_dfa_match() function;
5630       it operates in a different way, and is not Perl-compatible. This alter-
5631       native  has advantages and disadvantages compared with the standard al-
5632       gorithm, and these are described below.
5633
5634       When there is only one possible way in which a given subject string can
5635       match  a pattern, the two algorithms give the same answer. A difference
5636       arises, however, when there are multiple possibilities. For example, if
5637       the pattern
5638
5639         ^<.*>
5640
5641       is matched against the string
5642
5643         <something> <something else> <something further>
5644
5645       there are three possible answers. The standard algorithm finds only one
5646       of them, whereas the alternative algorithm finds all three.
5647
5648
5649REGULAR EXPRESSIONS AS TREES
5650
5651       The set of strings that are matched by a regular expression can be rep-
5652       resented  as  a  tree structure. An unlimited repetition in the pattern
5653       makes the tree of infinite size, but it is still a tree.  Matching  the
5654       pattern  to a given subject string (from a given starting point) can be
5655       thought of as a search of the tree.  There are two  ways  to  search  a
5656       tree:  depth-first  and  breadth-first, and these correspond to the two
5657       matching algorithms provided by PCRE2.
5658
5659
5660THE STANDARD MATCHING ALGORITHM
5661
5662       In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
5663       sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
5664       depth-first search of the pattern tree. That is, it  proceeds  along  a
5665       single path through the tree, checking that the subject matches what is
5666       required. When there is a mismatch, the algorithm  tries  any  alterna-
5667       tives  at  the  current point, and if they all fail, it backs up to the
5668       previous branch point in the  tree,  and  tries  the  next  alternative
5669       branch  at  that  level.  This often involves backing up (moving to the
5670       left) in the subject string as well.  The  order  in  which  repetition
5671       branches  are  tried  is controlled by the greedy or ungreedy nature of
5672       the quantifier.
5673
5674       If a leaf node is reached, a matching string has  been  found,  and  at
5675       that  point the algorithm stops. Thus, if there is more than one possi-
5676       ble match, this algorithm returns the first one that it finds.  Whether
5677       this  is the shortest, the longest, or some intermediate length depends
5678       on the way the alternations and the greedy or ungreedy repetition quan-
5679       tifiers are specified in the pattern.
5680
5681       Because  it  ends  up  with a single path through the tree, it is rela-
5682       tively straightforward for this algorithm to keep  track  of  the  sub-
5683       strings  that  are  matched  by portions of the pattern in parentheses.
5684       This provides support for capturing parentheses and backreferences.
5685
5686
5687THE ALTERNATIVE MATCHING ALGORITHM
5688
5689       This algorithm conducts a breadth-first search of  the  tree.  Starting
5690       from  the  first  matching  point  in the subject, it scans the subject
5691       string from left to right, once, character by character, and as it does
5692       this,  it remembers all the paths through the tree that represent valid
5693       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
5694       though  it is not implemented as a traditional finite state machine (it
5695       keeps multiple states active simultaneously).
5696
5697       Although the general principle of this matching algorithm  is  that  it
5698       scans  the subject string only once, without backtracking, there is one
5699       exception: when a lookaround assertion is encountered,  the  characters
5700       following  or  preceding the current point have to be independently in-
5701       spected.
5702
5703       The scan continues until either the end of the subject is  reached,  or
5704       there  are  no more unterminated paths. At this point, terminated paths
5705       represent the different matching possibilities (if there are none,  the
5706       match  has  failed).   Thus,  if there is more than one possible match,
5707       this algorithm finds all of them, and in particular, it finds the long-
5708       est.  The matches are returned in the output vector in decreasing order
5709       of length. There is an option to stop the  algorithm  after  the  first
5710       match (which is necessarily the shortest) is found.
5711
5712       Note that the size of the vector needed to contain all the results
5713       depends on the number of simultaneous matches, not on the number of
5714       parentheses in the pattern. Using pcre2_match_data_create_from_pattern()
5715       to create the match data block is therefore not advisable when doing
5716       DFA matching.
5717
5718       Note  also  that all the matches that are found start at the same point
5719       in the subject. If the pattern
5720
5721         cat(er(pillar)?)?
5722
5723       is matched against the string "the caterpillar catchment",  the  result
5724       is  the  three  strings "caterpillar", "cater", and "cat" that start at
5725       the fifth character of the subject. The algorithm  does  not  automati-
5726       cally move on to find matches that start at later positions.
5727
5728       PCRE2's "auto-possessification" optimization usually applies to charac-
5729       ter repeats at the end of a pattern (as well as internally). For  exam-
5730       ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
5731       is no point even considering the possibility of backtracking  into  the
5732       repeated  digits.  For  DFA matching, this means that only one possible
5733       match is found. If you really do want multiple matches in  such  cases,
5734       either  use  an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
5735       SESS option when compiling.
5736
5737       There are a number of features of PCRE2 regular  expressions  that  are
5738       not  supported  or behave differently in the alternative matching func-
5739       tion. Those that are not supported cause an error if encountered.
5740
5741       1. Because the algorithm finds all possible matches, the greedy or  un-
5742       greedy  nature of repetition quantifiers is not relevant (though it may
5743       affect auto-possessification,  as  just  described).  During  matching,
5744       greedy  and  ungreedy  quantifiers are treated in exactly the same way.
5745       However, possessive quantifiers can make a difference when what follows
5746       could  also  match  what  is  quantified, for example in a pattern like
5747       this:
5748
5749         ^a++\w!
5750
5751       This pattern matches "aaab!" but not "aaa!", which would be matched  by
5752       a  non-possessive quantifier. Similarly, if an atomic group is present,
5753       it is matched as if it were a standalone pattern at the current  point,
5754       and  the  longest match is then "locked in" for the rest of the overall
5755       pattern.
5756
5757       2. When dealing with multiple paths through the tree simultaneously, it
5758       is  not  straightforward  to  keep track of captured substrings for the
5759       different matching possibilities, and PCRE2's  implementation  of  this
5760       algorithm does not attempt to do this. This means that no captured sub-
5761       strings are available.
5762
5763       3. Because no substrings are captured, backreferences within  the  pat-
5764       tern are not supported.
5765
5766       4.  For  the same reason, conditional expressions that use a backrefer-
5767       ence as the condition or test for a specific group  recursion  are  not
5768       supported.
5769
5770       5. Again for the same reason, script runs are not supported.
5771
5772       6. Because many paths through the tree may be active, the \K escape se-
5773       quence, which resets the start of the match when encountered  (but  may
5774       be on some paths and not on others), is not supported.
5775
5776       7.  Callouts  are  supported, but the value of the capture_top field is
5777       always 1, and the value of the capture_last field is always 0.
5778
5779       8. The \C escape sequence, which (in  the  standard  algorithm)  always
5780       matches  a  single  code  unit, even in a UTF mode, is not supported in
5781       these modes, because the alternative algorithm moves through  the  sub-
5782       ject  string  one  character  (not code unit) at a time, for all active
5783       paths through the tree.
5784
5785       9. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
5786       are  not  supported.  (*FAIL)  is supported, and behaves like a failing
5787       negative assertion.
5788
5789       10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not  sup-
5790       ported by pcre2_dfa_match().
5791
5792
5793ADVANTAGES OF THE ALTERNATIVE ALGORITHM
5794
5795       The  main  advantage  of the alternative algorithm is that all possible
5796       matches (at a single point in the subject) are automatically found, and
5797       in  particular, the longest match is found. To find more than one match
5798       at the same point using the standard algorithm, you have to  do  kludgy
5799       things with callouts.
5800
5801       Partial  matching  is  possible with this algorithm, though it has some
5802       limitations. The pcre2partial documentation gives  details  of  partial
5803       matching and discusses multi-segment matching.
5804
5805
5806DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
5807
5808       The alternative algorithm suffers from a number of disadvantages:
5809
5810       1.  It  is  substantially  slower  than the standard algorithm. This is
5811       partly because it has to search for all possible matches, but  is  also
5812       because it is less susceptible to optimization.
5813
5814       2.  Capturing  parentheses,  backreferences,  script runs, and matching
5815       within invalid UTF strings are not supported.
5816
5817       3. Although atomic groups are supported, their use does not provide the
5818       performance advantage that it does for the standard algorithm.
5819
5820       4. JIT optimization is not supported.
5821
5822
5823AUTHOR
5824
5825       Philip Hazel
5826       Retired from University Computing Service
5827       Cambridge, England.
5828
5829
5830REVISION
5831
5832       Last updated: 28 August 2021
5833       Copyright (c) 1997-2021 University of Cambridge.
5834------------------------------------------------------------------------------
5835
5836
5837PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)
5838
5839
5840
5841NAME
5842       PCRE2 - Perl-compatible regular expressions
5843
5844PARTIAL MATCHING IN PCRE2
5845
5846       In  normal use of PCRE2, if there is a match up to the end of a subject
5847       string, but more characters are needed to  match  the  entire  pattern,
5848       PCRE2_ERROR_NOMATCH  is  returned,  just  like any other failing match.
5849       There are circumstances where it might be helpful to  distinguish  this
5850       "partial match" case.
5851
5852       One  example  is  an application where the subject string is very long,
5853       and not all available at once. The requirement here is to be able to do
5854       the  matching  segment  by segment, but special action is needed when a
5855       matched substring spans the boundary between two segments.
5856
5857       Another example is checking a user input string as it is typed, to  en-
5858       sure  that  it conforms to a required format. Invalid characters can be
5859       immediately diagnosed and rejected, giving instant feedback.
5860
5861       Partial matching is a PCRE2-specific feature; it is  not  Perl-compati-
5862       ble.  It  is  requested  by  setting  one  of the PCRE2_PARTIAL_HARD or
5863       PCRE2_PARTIAL_SOFT options when calling a matching function.  The  dif-
5864       ference  between  the  two options is whether or not a partial match is
5865       preferred to an alternative complete match, though the  details  differ
5866       between  the  two  types of matching function. If both options are set,
5867       PCRE2_PARTIAL_HARD takes precedence.
5868
5869       If you want to use partial matching with just-in-time  optimized  code,
5870       as  well  as  setting a partial match option for the matching function,
5871       you must also call pcre2_jit_compile() with one or both  of  these  op-
5872       tions:
5873
5874         PCRE2_JIT_PARTIAL_HARD
5875         PCRE2_JIT_PARTIAL_SOFT
5876
5877       PCRE2_JIT_COMPLETE  should also be set if you are going to run non-par-
5878       tial matches on the same pattern. Separate code is  compiled  for  each
5879       mode.  If  the appropriate JIT mode has not been compiled, interpretive
5880       matching code is used.
5881
5882       Setting a partial matching option disables two of PCRE2's standard  op-
5883       timization  hints. PCRE2 remembers the last literal code unit in a pat-
5884       tern, and abandons matching immediately if it is  not  present  in  the
5885       subject  string.  This optimization cannot be used for a subject string
5886       that might match only partially. PCRE2 also remembers a minimum  length
5887       of  a matching string, and does not bother to run the matching function
5888       on shorter strings. This optimization  is  also  disabled  for  partial
5889       matching.
5890
5891
5892REQUIREMENTS FOR A PARTIAL MATCH
5893
5894       A  possible  partial  match  occurs during matching when the end of the
5895       subject string is reached successfully, but either more characters  are
5896       needed  to complete the match, or the addition of more characters might
5897       change what is matched.
5898
5899       Example 1: if the pattern is /abc/ and the subject is "ab", more  char-
5900       acters  are  definitely  needed  to complete a match. In this case both
5901       hard and soft matching options yield a partial match.
5902
5903       Example 2: if the pattern is /ab+/ and the subject is "ab", a  complete
5904       match  can  be  found, but the addition of more characters might change
5905       what is matched. In this case, only PCRE2_PARTIAL_HARD returns  a  par-
5906       tial match; PCRE2_PARTIAL_SOFT returns the complete match.
5907
5908       On  reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if
5909       the next pattern item is \z, \Z, \b, \B, or $ there is always a partial
5910       match.   Otherwise, for both options, the next pattern item must be one
5911       that inspects a character, and at least one of the  following  must  be
5912       true:
5913
5914       (1)  At  least  one  character has already been inspected. An inspected
5915       character need not form part of the final  matched  string;  lookbehind
5916       assertions  and the \K escape sequence provide ways of inspecting char-
5917       acters before the start of a matched string.
5918
5919       (2) The pattern contains one or more lookbehind assertions. This condi-
5920       tion  exists in case there is a lookbehind that inspects characters be-
5921       fore the start of the match.
5922
5923       (3) There is a special case when the whole pattern can match  an  empty
5924       string.   When  the  starting  point  is at the end of the subject, the
5925       empty string match is a possibility, and if PCRE2_PARTIAL_SOFT  is  set
5926       and  neither  of the above conditions is true, it is returned. However,
5927       because adding more characters  might  result  in  a  non-empty  match,
5928       PCRE2_PARTIAL_HARD  returns  a  partial match, which in this case means
5929       "there is going to be a match at this point, but until some more  char-
5930       acters are added, we do not know if it will be an empty string or some-
5931       thing longer".
5932
5933
5934PARTIAL MATCHING USING pcre2_match()
5935
5936       When  a  partial  matching  option  is  set,  the  result  of   calling
5937       pcre2_match() can be one of the following:
5938
5939       A successful match
5940         A complete match has been found, starting and ending within this sub-
5941         ject.
5942
5943       PCRE2_ERROR_NOMATCH
5944         No match can start anywhere in this subject.
5945
5946       PCRE2_ERROR_PARTIAL
5947         Adding more characters may result in a complete match that  uses  one
5948         or more characters from the end of this subject.
5949
5950       When a partial match is returned, the first two elements in the ovector
5951       point to the portion of the subject that was matched, but the values in
5952       the rest of the ovector are undefined. The appearance of \K in the pat-
5953       tern has no effect for a partial match. Consider this pattern:
5954
5955         /abc\K123/
5956
5957       If it is matched against "456abc123xyz" the result is a complete match,
5958       and  the ovector defines the matched string as "123", because \K resets
5959       the "start of match" point. However, if a partial  match  is  requested
5960       and  the subject string is "456abc12", a partial match is found for the
5961       string "abc12", because all these characters are needed  for  a  subse-
5962       quent re-match with additional characters.
5963
5964       If  there  is more than one partial match, the first one that was found
5965       provides the data that is returned. Consider this pattern:
5966
5967         /123\w+X|dogY/
5968
5969       If this is matched against the subject string "abc123dog", both  alter-
5970       natives  fail  to  match,  but the end of the subject is reached during
5971       matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to  3
5972       and  9, identifying "123dog" as the first partial match. (In this exam-
5973       ple, there are two partial matches, because "dog" on its own  partially
5974       matches the second alternative.)
5975
5976   How a partial match is processed by pcre2_match()
5977
5978       What happens when a partial match is identified depends on which of the
5979       two partial matching options is set.
5980
5981       If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned  as  soon
5982       as  a partial match is found, without continuing to search for possible
5983       complete matches. This option is "hard" because it prefers  an  earlier
5984       partial match over a later complete match. For this reason, the assump-
5985       tion is made that the end of the supplied subject  string  is  not  the
5986       true  end of the available data, which is why \z, \Z, \b, \B, and $ al-
5987       ways give a partial match.
5988
5989       If PCRE2_PARTIAL_SOFT is set, the  partial  match  is  remembered,  but
5990       matching continues as normal, and other alternatives in the pattern are
5991       tried. If no complete match can be found,  PCRE2_ERROR_PARTIAL  is  re-
5992       turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it
5993       prefers a complete match over a partial match. All the various matching
5994       items  in a pattern behave as if the subject string is potentially com-
5995       plete; \z, \Z, and $ match at the end of the subject,  as  normal,  and
5996       for \b and \B the end of the subject is treated as a non-alphanumeric.
5997
5998       The  difference  between the two partial matching options can be illus-
5999       trated by a pattern such as:
6000
6001         /dog(sbody)?/
6002
6003       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
6004       the  longer  string  if  possible). If it is matched against the string
6005       "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for  "dog".
6006       However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
6007       TIAL. On the other hand, if the pattern is made ungreedy the result  is
6008       different:
6009
6010         /dog(sbody)??/
6011
6012       In  this  case  the  result  is always a complete match because that is
6013       found first, and matching never  continues  after  finding  a  complete
6014       match. It might be easier to follow this explanation by thinking of the
6015       two patterns like this:
6016
6017         /dog(sbody)?/    is the same as  /dogsbody|dog/
6018         /dog(sbody)??/   is the same as  /dog|dogsbody/
6019
6020       The second pattern will never match "dogsbody", because it will  always
6021       find the shorter match first.
6022
6023   Example of partial matching using pcre2test
6024
6025       The  pcre2test data modifiers partial_hard (or ph) and partial_soft (or
6026       ps) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,  respectively,  when
6027       calling  pcre2_match(). Here is a run of pcre2test using a pattern that
6028       matches the whole subject in the form of a date:
6029
6030           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
6031         data> 25dec3\=ph
6032         Partial match: 25dec3
6033         data> 3ju\=ph
6034         Partial match: 3ju
6035         data> 3juj\=ph
6036         No match
6037
6038       This example gives the same results for  both  hard  and  soft  partial
6039       matching options. Here is an example where there is a difference:
6040
6041           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
6042         data> 25jun04\=ps
6043          0: 25jun04
6044          1: jun
6045         data> 25jun04\=ph
6046         Partial match: 25jun04
6047
6048       With   PCRE2_PARTIAL_SOFT,  the  subject  is  matched  completely.  For
6049       PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete,
6050       so there is only a partial match.
6051
6052
6053MULTI-SEGMENT MATCHING WITH pcre2_match()
6054
6055       PCRE  was  not originally designed with multi-segment matching in mind.
6056       However, over time, features (including  partial  matching)  that  make
6057       multi-segment matching possible have been added. A very long string can
6058       be searched segment by segment  by  calling  pcre2_match()  repeatedly,
6059       with the aim of achieving the same results that would happen if the en-
6060       tire string was available for searching all  the  time.  Normally,  the
6061       strings  that  are  being  sought are much shorter than each individual
6062       segment, and are in the middle of very long strings, so the pattern  is
6063       normally not anchored.
6064
6065       Special  logic  must  be implemented to handle a matched substring that
6066       spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it
6067       returns  a  partial match at the end of a segment whenever there is the
6068       possibility of changing  the  match  by  adding  more  characters.  The
6069       PCRE2_NOTBOL option should also be set for all but the first segment.
6070
6071       When a partial match occurs, the next segment must be added to the cur-
6072       rent subject and the match re-run, using the  startoffset  argument  of
6073       pcre2_match()  to  begin  at the point where the partial match started.
6074       For example:
6075
6076           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
6077         data> ...the date is 23ja\=ph
6078         Partial match: 23ja
6079         data> ...the date is 23jan19 and on that day...\=offset=15
6080          0: 23jan19
6081          1: jan
6082
6083       Note the use of the offset modifier to start the new  match  where  the
6084       partial match was found. In this example, the next segment was added to
6085       the one in which  the  partial  match  was  found.  This  is  the  most
6086       straightforward approach, typically using a memory buffer that is twice
6087       the size of each segment. After a partial match, the first half of  the
6088       buffer  is discarded, the second half is moved to the start of the buf-
6089       fer, and a new segment is added before repeating the match  as  in  the
6090       example above. After a no match, the entire buffer can be discarded.
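
       The double-buffer scheme described above can be sketched in C as
       follows. This is only an illustration: the seg_buffer type and the
       seg_buffer_slide() function are hypothetical names, not part of the
       PCRE2 API. The sketch assumes that no segment is longer than seg_size
       bytes and that the partial match starts in the second half of the
       buffer, as is typical when the sought strings are much shorter than a
       segment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: a buffer that holds up to two segments. */
typedef struct {
    char  *data;      /* storage with a capacity of 2 * seg_size bytes */
    size_t seg_size;  /* maximum size of one segment */
    size_t len;       /* number of bytes currently in use */
} seg_buffer;

/* Make room for the next segment and append it.
 *
 * If more than one segment's worth of data is held, the first half
 * (seg_size bytes) is discarded and the remainder is moved to the start
 * of the buffer. The new segment is then appended.
 *
 * Returns the number of bytes discarded from the front, so that the
 * caller can adjust any offsets it has saved. */
static size_t seg_buffer_slide(seg_buffer *b, const char *next, size_t next_len)
{
    size_t dropped = 0;

    assert(b != NULL && b->data != NULL && next != NULL);
    assert(b->len <= 2 * b->seg_size && next_len <= b->seg_size);

    if (b->len > b->seg_size) {
        dropped = b->seg_size;
        memmove(b->data, b->data + dropped, b->len - dropped);
        b->len -= dropped;
    }

    memcpy(b->data + b->len, next, next_len);
    b->len += next_len;
    return dropped;
}
```

       The return value is the number of bytes that were discarded. Subtract
       it from the saved start of the partial match before using that value
       as the startoffset argument of the next pcre2_match() call.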
6091
6092       If there are memory constraints, you may want to discard text that pre-
6093       cedes a partial match before adding the  next  segment.  Unfortunately,
6094       this  is  not  at  present straightforward. In cases such as the above,
6095       where the pattern does not contain any lookbehinds, it is sufficient to
6096       retain  only  the  partially matched substring. However, if the pattern
6097       contains a lookbehind assertion, characters that precede the  start  of
6098       the  partial match may have been inspected during the matching process.
6099       When pcre2test displays a partial match, it indicates these  characters
6100       with '<' if the allusedtext modifier is set:
6101
6102           re> "(?<=123)abc"
6103         data> xx123ab\=ph,allusedtext
6104         Partial match: 123ab
6105                        <<<
6106
6107       However,  the  allusedtext  modifier is not available for JIT matching,
6108       because JIT matching does not record  the  first  (or  last)  consulted
6109       characters.  For this reason, this information is not available via the
6110       API. It is therefore not possible in general to obtain the exact number
6111       of characters that must be retained in order to get the right match re-
6112       sult. If you cannot retain the  entire  segment,  you  must  find  some
6113       heuristic way of choosing.
6114
6115       If  you know the approximate length of the matching substrings, you can
6116       use that to decide how much text to retain. The only lookbehind  infor-
6117       mation  that  is  currently  available via the API is the length of the
6118       longest individual lookbehind in a pattern, but this can be  misleading
6119       if  there  are  nested  lookbehinds.  The  value  returned  by  calling
6120       pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND  option  is  the
6121       maximum number of characters (not code units) that any individual look-
6122       behind  moves  back  when  it  is  processed.   A   pattern   such   as
6123       "(?<=(?<!b)a)"  has a maximum lookbehind value of one, but inspects two
6124       characters before its starting point.
6125
6126       In a non-UTF or a 32-bit case, moving back is just a  subtraction,  but
6127       in  UTF-8  or  UTF-16  you  have  to count characters while moving back
6128       through the code units.
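
       Counting characters while moving back through UTF-8 code units can be
       sketched as below. The function utf8_move_back() is a hypothetical
       helper, not a PCRE2 function. It assumes that the subject is valid
       UTF-8 and that pos lies on a character boundary. The nchars argument
       might be, for example, the value obtained for
       PCRE2_INFO_MAXLOOKBEHIND, subject to the caveat about nested
       lookbehinds given above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Move back by up to nchars characters from code unit offset pos in a
 * valid UTF-8 subject, stopping at the start of the subject.
 * Returns the new code unit offset. */
static size_t utf8_move_back(const uint8_t *subject, size_t pos, size_t nchars)
{
    assert(subject != NULL);

    while (nchars > 0 && pos > 0) {
        pos--;
        /* Continuation bytes have the form 10xxxxxx; skip over them to
           reach the first code unit of the character. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        nchars--;
    }
    return pos;
}
```

       In a non-UTF or 32-bit subject the same step reduces to subtracting
       nchars from pos, taking care not to go below zero.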
6129
6130
6131PARTIAL MATCHING USING pcre2_dfa_match()
6132
6133       The DFA function moves along the subject string character by character,
6134       without  backtracking,  searching  for  all possible matches simultane-
6135       ously. If the end of the subject is reached before the end of the  pat-
6136       tern, there is the possibility of a partial match.
6137
6138       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
6139       there have been no complete matches. Otherwise,  the  complete  matches
6140       are  returned.   If  PCRE2_PARTIAL_HARD  is  set, a partial match takes
6141       precedence over any complete matches. The portion of  the  string  that
6142       was  matched  when  the  longest  partial match was found is set as the
6143       first matching string.
6144
6145       Because the DFA function always searches for all possible matches,  and
6146       there is no difference between greedy and ungreedy repetition, its
6147       behaviour differs from that of pcre2_match(). Consider the string "dog"
6148       matched against this ungreedy pattern:
6149
6150         /dog(sbody)??/
6151
6152       Whereas  the  standard  function stops as soon as it finds the complete
6153       match for "dog", the DFA function also  finds  the  partial  match  for
6154       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
6155
6156
6157MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
6158
6159       When a partial match has been found using the DFA matching function, it
6160       is possible to continue the match by providing additional subject  data
6161       and  calling  the function again with the same compiled regular expres-
6162       sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
6163       same working space as before, because this is where details of the pre-
6164       vious partial match are stored. You can set the  PCRE2_PARTIAL_SOFT  or
6165       PCRE2_PARTIAL_HARD  options  with PCRE2_DFA_RESTART to continue partial
6166       matching over multiple segments. Here is an example using pcre2test:
6167
6168           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
6169         data> 23ja\=dfa,ps
6170         Partial match: 23ja
6171         data> n05\=dfa,dfa_restart
6172          0: n05
6173
6174       The first call has "23ja" as the subject, and requests  partial  match-
6175       ing;  the  second  call  has  "n05"  as  the  subject for the continued
6176       (restarted) match.  Notice that when the match is  complete,  only  the
6177       last  part  is  shown;  PCRE2 does not retain the previously partially-
6178       matched string. It is up to the calling program to do that if it  needs
6179       to.  This  means  that, for an unanchored pattern, if a continued match
6180       fails, it is not possible to try again at a  new  starting  point.  All
6181       this facility is capable of doing is continuing with the previous match
6182       attempt. For example, consider this pattern:
6183
6184         1234|3789
6185
6186       If the first part of the subject is "ABC123", a partial  match  of  the
6187       first  alternative  is found at offset 3. There is no partial match for
6188       the second alternative, because such a match does not start at the same
6189       point  in  the  subject  string. Attempting to continue with the string
6190       "7890" does not yield a match  because  only  those  alternatives  that
6191       match  at one point in the subject are remembered. Depending on the ap-
6192       plication, this may or may not be what you want.
6193
6194       If you do want to allow for starting again at the next  character,  one
6195       way  of  doing it is to retain some or all of the segment and try a new
6196       complete match, as described for pcre2_match() above. Another possibil-
6197       ity  is to work with two buffers. If a partial match at offset n in the
6198       first buffer is followed by "no match" when PCRE2_DFA_RESTART  is  used
6199       on  the  second buffer, you can then try a new match starting at offset
6200       n+1 in the first buffer.
6201
6202
6203AUTHOR
6204
6205       Philip Hazel
6206       University Computing Service
6207       Cambridge, England.
6208
6209
6210REVISION
6211
6212       Last updated: 04 September 2019
6213       Copyright (c) 1997-2019 University of Cambridge.
6214------------------------------------------------------------------------------
6215
6216
6217PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)
6218
6219
6220
6221NAME
6222       PCRE2 - Perl-compatible regular expressions (revised API)
6223
6224PCRE2 REGULAR EXPRESSION DETAILS
6225
6226       The  syntax and semantics of the regular expressions that are supported
6227       by PCRE2 are described in detail below. There is a quick-reference syn-
6228       tax  summary  in the pcre2syntax page. PCRE2 tries to match Perl syntax
6229       and semantics as closely as it can.  PCRE2 also supports some  alterna-
6230       tive  regular  expression syntax (which does not conflict with the Perl
6231       syntax) in order to provide some compatibility with regular expressions
6232       in Python, .NET, and Oniguruma.
6233
6234       Perl's  regular expressions are described in its own documentation, and
6235       regular expressions in general are covered in a number of  books,  some
6236       of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex-
6237       pressions", published by O'Reilly, covers regular expressions in  great
6238       detail.  This description of PCRE2's regular expressions is intended as
6239       reference material.
6240
6241       This document discusses the regular expression patterns that  are  sup-
6242       ported  by  PCRE2  when  its  main matching function, pcre2_match(), is
6243       used.   PCRE2   also   has   an    alternative    matching    function,
6244       pcre2_dfa_match(),  which  matches  using a different algorithm that is
6245       not Perl-compatible. Some of  the  features  discussed  below  are  not
6246       available  when  DFA matching is used. The advantages and disadvantages
6247       of the alternative function, and how it differs from the  normal  func-
6248       tion, are discussed in the pcre2matching page.
6249
6250
6251SPECIAL START-OF-PATTERN ITEMS
6252
6253       A  number  of options that can be passed to pcre2_compile() can also be
6254       set by special items at the start of a pattern. These are not Perl-com-
6255       patible,  but  are provided to make these options accessible to pattern
6256       writers who are not able to change the program that processes the  pat-
6257       tern.  Any  number  of these items may appear, but they must all be to-
6258       gether right at the start of the pattern string, and the  letters  must
6259       be in upper case.
6260
6261   UTF support
6262
6263       In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
6264       as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
6265       can  be  specified  for the 32-bit library, in which case it constrains
6266       the character values to valid  Unicode  code  points.  To  process  UTF
6267       strings,  PCRE2  must be built to include Unicode support (which is the
6268       default). When using UTF strings you must  either  call  the  compiling
6269       function  with  one or both of the PCRE2_UTF or PCRE2_MATCH_INVALID_UTF
6270       options, or the pattern must start with the  special  sequence  (*UTF),
6271       which is equivalent to setting the PCRE2_UTF option. How setting a
6272       UTF mode affects pattern matching is mentioned in several places below.
6273       There is also a summary of features in the pcre2unicode page.
6274
6275       Some applications that allow their users to supply patterns may wish to
6276       restrict  them  to  non-UTF  data  for   security   reasons.   If   the
6277       PCRE2_NEVER_UTF  option is passed to pcre2_compile(), (*UTF) is not al-
6278       lowed, and its appearance in a pattern causes an error.
6279
6280   Unicode property support
6281
6282       Another special sequence that may appear at the start of a  pattern  is
6283       (*UCP).   This  has the same effect as setting the PCRE2_UCP option: it
6284       causes sequences such as \d and \w to use Unicode properties to  deter-
6285       mine character types, instead of recognizing only characters with codes
6286       less than 256 via a lookup table. It also causes upper/lower casing
6287       operations to use Unicode properties for characters with code points
6288       greater than 127, even when UTF is not set.
6289
6290       Some applications that allow their users to supply patterns may wish to
6291       restrict  them  for  security reasons. If the PCRE2_NEVER_UCP option is
6292       passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
6293       a pattern causes an error.
6294
6295   Locking out empty string matching
6296
6297       Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
6298       effect as passing the PCRE2_NOTEMPTY or  PCRE2_NOTEMPTY_ATSTART  option
6299       to whichever matching function is subsequently called to match the pat-
6300       tern. These options lock out the matching of empty strings, either  en-
6301       tirely, or only at the start of the subject.
6302
6303   Disabling auto-possessification
6304
6305       If  a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
6306       setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from  making
6307       quantifiers  possessive  when  what  follows  cannot match the repeated
6308       item. For example, by default a+b is treated as a++b. For more details,
6309       see the pcre2api documentation.
6310
6311   Disabling start-up optimizations
6312
6313       If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as
6314       setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
6315       mizations  for  quickly  reaching "no match" results. For more details,
6316       see the pcre2api documentation.
6317
6318   Disabling automatic anchoring
6319
6320       If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the  same  effect
6321       as  setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
6322       tions that apply to patterns whose top-level branches all start with .*
6323       (match  any  number of arbitrary characters). For more details, see the
6324       pcre2api documentation.
6325
6326   Disabling JIT compilation
6327
6328       If a pattern that starts with (*NO_JIT) is  successfully  compiled,  an
6329       attempt  by  the  application  to apply the JIT optimization by calling
6330       pcre2_jit_compile() is ignored.
6331
6332   Setting match resource limits
6333
6334       The pcre2_match() function contains a counter that is incremented every
6335       time it goes round its main loop. The caller of pcre2_match() can set a
6336       limit on this counter, which therefore limits the amount  of  computing
6337       resource used for a match. The maximum depth of nested backtracking can
6338       also be limited; this indirectly restricts the amount  of  heap  memory
6339       that  is  used,  but there is also an explicit memory limit that can be
6340       set.
6341
6342       These facilities are provided to catch runaway matches  that  are  pro-
6343       voked  by patterns with huge matching trees. A common example is a pat-
6344       tern with nested unlimited repeats applied to a long string  that  does
6345       not  match. When one of these limits is reached, pcre2_match() gives an
6346       error return. The limits can also be set by items at the start  of  the
6347       pattern of the form
6348
6349         (*LIMIT_HEAP=d)
6350         (*LIMIT_MATCH=d)
6351         (*LIMIT_DEPTH=d)
6352
6353       where d is any number of decimal digits. However, the value of the set-
6354       ting must be less than the value set (or defaulted) by  the  caller  of
6355       pcre2_match()  for  it  to have any effect. In other words, the pattern
6356       writer can lower the limits set by the programmer, but not raise  them.
6357       If there is more than one setting of one of these limits, the lowest
6358       value is used. The heap limit is specified in kibibytes (units of  1024
6359       bytes).
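
       The rule that a pattern can lower a limit but never raise it amounts
       to taking a minimum. The following sketch models only that rule. The
       functions effective_limit() and heap_limit_bytes() are hypothetical
       helpers and are not meant to show how PCRE2 stores or applies limits
       internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the limit that is in force, given the value set (or defaulted)
 * by the caller and the values of any (*LIMIT_...=d) items found at the
 * start of the pattern. A pattern value takes effect only if it is lower
 * than the current limit, so the result is the minimum of all values. */
static uint32_t effective_limit(uint32_t caller_limit,
                                const uint32_t *pattern_limits, size_t count)
{
    uint32_t limit = caller_limit;
    size_t i;

    assert(pattern_limits != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (pattern_limits[i] < limit)
            limit = pattern_limits[i];
    }
    return limit;
}

/* The heap limit is given in kibibytes; convert it to bytes. A 64-bit
 * result avoids overflow for large settings. */
static uint64_t heap_limit_bytes(uint32_t kibibytes)
{
    return (uint64_t)kibibytes * 1024u;
}
```

       For instance, with a caller limit of 1000, a pattern that starts with
       (*LIMIT_MATCH=5000)(*LIMIT_MATCH=200) ends up with a limit of 200 under
       this model, whereas (*LIMIT_MATCH=5000) on its own leaves the limit at
       1000.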
6360
6361       Prior  to  release  10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
6362       name is still recognized for backwards compatibility.
6363
6364       The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
6365       interpreters are used for matching. It does not apply to JIT. The match
6366       limit is used (but in a different way) when JIT is being used, or  when
6367       pcre2_dfa_match() is called, to limit computing resource usage by those
6368       matching functions. The depth limit is ignored by JIT but  is  relevant
6369       for  DFA  matching, which uses function recursion for recursions within
6370       the pattern and for lookaround assertions and atomic  groups.  In  this
6371       case, the depth limit controls the depth of such recursion.
6372
6373   Newline conventions
6374
6375       PCRE2  supports six different conventions for indicating line breaks in
6376       strings: a single CR (carriage return) character, a  single  LF  (line-
6377       feed) character, the two-character sequence CRLF, any of the three pre-
6378       ceding, any Unicode newline sequence,  or  the  NUL  character  (binary
6379       zero).  The  pcre2api  page  has further discussion about newlines, and
6380       shows how to set the newline convention when calling pcre2_compile().
6381
6382       It is also possible to specify a newline convention by starting a  pat-
6383       tern string with one of the following sequences:
6384
6385         (*CR)        carriage return
6386         (*LF)        linefeed
6387         (*CRLF)      carriage return, followed by linefeed
6388         (*ANYCRLF)   any of the three above
6389         (*ANY)       all Unicode newline sequences
6390         (*NUL)       the NUL character (binary zero)
6391
6392       These override the default and the options given to the compiling func-
6393       tion. For example, on a Unix system where LF is the default newline se-
6394       quence, the pattern
6395
6396         (*CR)a.b
6397
6398       changes the convention to CR. That pattern matches "a\nb" because LF is
6399       no longer a newline. If more than one of these settings is present, the
6400       last one is used.
6401
6402       The  newline  convention affects where the circumflex and dollar asser-
6403       tions are true. It also affects the interpretation of the dot metachar-
6404       acter  when  PCRE2_DOTALL  is not set, and the behaviour of \N when not
6405       followed by an opening brace. However, it does not affect what  the  \R
6406       escape  sequence  matches.  By default, this is any Unicode newline se-
6407       quence, for Perl compatibility. However, this can be changed;  see  the
6408       next section and the description of \R in the section entitled "Newline
6409       sequences" below. A change of \R setting can be combined with a  change
6410       of newline convention.
6411
6412   Specifying what \R matches
6413
6414       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
6415       the complete set  of  Unicode  line  endings)  by  setting  the  option
6416       PCRE2_BSR_ANYCRLF  at compile time. This effect can also be achieved by
6417       starting a pattern with (*BSR_ANYCRLF).  For  completeness,  (*BSR_UNI-
6418       CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
6419
6420
6421EBCDIC CHARACTER CODES
6422
6423       PCRE2  can be compiled to run in an environment that uses EBCDIC as its
6424       character code instead of ASCII or Unicode (typically a mainframe  sys-
6425       tem).  In  the  sections below, character code values are ASCII or Uni-
6426       code; in an EBCDIC environment these characters may have different code
6427       values, and there are no code points greater than 255.
6428
6429
6430CHARACTERS AND METACHARACTERS
6431
6432       A  regular  expression  is  a pattern that is matched against a subject
6433       string from left to right. Most characters stand for  themselves  in  a
6434       pattern,  and  match  the corresponding characters in the subject. As a
6435       trivial example, the pattern
6436
6437         The quick brown fox
6438
6439       matches a portion of a subject string that is identical to itself. When
6440       caseless  matching  is  specified  (the  PCRE2_CASELESS  option or (?i)
6441       within the pattern), letters are matched independently  of  case.  Note
6442       that  there  are  two  ASCII  characters, K and S, that, in addition to
6443       their lower case ASCII equivalents, are  case-equivalent  with  Unicode
6444       U+212A  (Kelvin  sign)  and  U+017F  (long  S) respectively when either
6445       PCRE2_UTF or PCRE2_UCP is set.
6446
6447       The power of regular expressions comes from the ability to include wild
6448       cards, character classes, alternatives, and repetitions in the pattern.
6449       These are encoded in the pattern by the use of metacharacters, which do
6450       not  stand  for  themselves but instead are interpreted in some special
6451       way.
6452
6453       There are two different sets of metacharacters: those that  are  recog-
6454       nized  anywhere in the pattern except within square brackets, and those
6455       that are recognized within square brackets.  Outside  square  brackets,
6456       the metacharacters are as follows:
6457
6458         \      general escape character with several uses
6459         ^      assert start of string (or line, in multiline mode)
6460         $      assert end of string (or line, in multiline mode)
6461         .      match any character except newline (by default)
6462         [      start character class definition
6463         |      start of alternative branch
6464         (      start group or control verb
6465         )      end group or control verb
6466         *      0 or more quantifier
6467         +      1 or more quantifier; also "possessive quantifier"
6468         ?      0 or 1 quantifier; also quantifier minimizer
6469         {      start min/max quantifier
6470
6471       Part  of  a  pattern  that is in square brackets is called a "character
6472       class". In a character class the only metacharacters are:
6473
6474         \      general escape character
6475         ^      negate the class, but only if the first character
6476         -      indicates character range
6477         [      POSIX character class (if followed by POSIX syntax)
6478         ]      terminates the character class
6479
6480       If a pattern is compiled with the  PCRE2_EXTENDED  option,  most  white
6481       space  in  the pattern, other than in a character class, and characters
6482       between a # outside a character class and the next newline,  inclusive,
6483       are ignored. An escaping backslash can be used to include a white space
6484       or a # character as part of the pattern. If the PCRE2_EXTENDED_MORE op-
6485       tion is set, the same applies, but in addition unescaped space and hor-
6486       izontal tab characters are ignored inside a character class. Note: only
6487       these  two  characters  are  ignored, not the full set of pattern white
6488       space characters that are ignored outside  a  character  class.  Option
6489       settings can be changed within a pattern; see the section entitled "In-
6490       ternal Option Setting" below.
6491
6492       The following sections describe the use of each of the metacharacters.
6493
6494
6495BACKSLASH
6496
6497       The backslash character has several uses. Firstly, if it is followed by
6498       a  character that is not a digit or a letter, it takes away any special
6499       meaning that character may have. This use of  backslash  as  an  escape
6500       character applies both inside and outside character classes.
6501
6502       For  example,  if you want to match a * character, you must write \* in
6503       the pattern. This escaping action applies whether or not the  following
6504       character  would  otherwise be interpreted as a metacharacter, so it is
6505       always safe to precede a non-alphanumeric  with  backslash  to  specify
6506       that it stands for itself.  In particular, if you want to match a back-
6507       slash, you write \\.
6508
6509       Only ASCII digits and letters have any special meaning  after  a  back-
6510       slash. All other characters (in particular, those whose code points are
6511       greater than 127) are treated as literals.
6512
6513       If you want to treat all characters in a sequence as literals, you  can
6514       do so by putting them between \Q and \E. This is different from Perl in
6515       that $ and @ are handled as literals in  \Q...\E  sequences  in  PCRE2,
6516       whereas  in Perl, $ and @ cause variable interpolation. Also, Perl does
6517       "double-quotish backslash interpolation" on any backslashes between  \Q
6518       and  \E which, its documentation says, "may lead to confusing results".
6519       PCRE2 treats a backslash between \Q and \E just like any other  charac-
6520       ter. Note the following examples:
6521
6522         Pattern            PCRE2 matches   Perl matches
6523
6524         \Qabc$xyz\E        abc$xyz        abc followed by the
6525                                             contents of $xyz
6526         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
6527         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
6528         \QA\B\E            A\B            A\B
6529         \Q\\E              \              \\E
6530
6531       The  \Q...\E  sequence  is recognized both inside and outside character
6532       classes.  An isolated \E that is not preceded by \Q is ignored.  If  \Q
6533       is  not followed by \E later in the pattern, the literal interpretation
6534       continues to the end of the pattern (that is,  \E  is  assumed  at  the
6535       end).  If  the  isolated \Q is inside a character class, this causes an
6536       error, because the character class  is  not  terminated  by  a  closing
6537       square bracket.
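
       The \Q...\E rules above can be modelled outside  PCRE2.  The  following
       Python  sketch  is  illustrative only: the function name expand_quoted
       is hypothetical, it is not part of any PCRE2  API,  and  it  does  not
       model  character classes. It rewrites a pattern so that quoted text is
       backslash-escaped, an isolated \E is dropped, and  an  unterminated  \Q
       runs to the end of the pattern.

```python
def expand_quoted(pattern: str) -> str:
    """Rewrite \\Q...\\E regions as backslash-escaped literals (sketch only)."""
    out = []
    i = 0
    quoting = False
    n = len(pattern)
    while i < n:
        two = pattern[i:i + 2]
        if quoting:
            if two == "\\E":
                quoting = False          # end of the literal region
                i += 2
            else:
                ch = pattern[i]
                # ASCII letters and digits stay as they are; everything
                # else, including a backslash, is escaped to stand for itself.
                out.append(ch if (ch.isascii() and ch.isalnum()) else "\\" + ch)
                i += 1
        elif two == "\\Q":
            quoting = True
            i += 2
        elif two == "\\E":
            i += 2                       # isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < n:
            out.append(two)              # ordinary escape: copy both characters
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

       With  this  model,  \Qabc$xyz\E  becomes  abc\$xyz and \Q\\E becomes a
       single escaped backslash, in line with the PCRE2 column of  the  table
       above.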
6538
6539   Non-printing characters
6540
6541       A second use of backslash provides a way of encoding non-printing char-
6542       acters in patterns in a visible manner. There is no restriction on  the
6543       appearance  of non-printing characters in a pattern, but when a pattern
6544       is being prepared by text editing, it is often easier to use one of the
6545       following  escape  sequences  instead of the binary character it repre-
6546       sents. In an ASCII or Unicode environment, these escapes  are  as  fol-
6547       lows:
6548
6549         \a          alarm, that is, the BEL character (hex 07)
6550         \cx         "control-x", where x is any printable ASCII character
6551         \e          escape (hex 1B)
6552         \f          form feed (hex 0C)
6553         \n          linefeed (hex 0A)
6554         \r          carriage return (hex 0D) (but see below)
6555         \t          tab (hex 09)
6556         \0dd        character with octal code 0dd
6557         \ddd        character with octal code ddd, or backreference
6558         \o{ddd..}   character with octal code ddd..
6559         \xhh        character with hex code hh
6560         \x{hhh..}   character with hex code hhh..
6561         \N{U+hhh..} character with Unicode hex code point hhh..
6562
6563       By  default, after \x that is not followed by {, from zero to two hexa-
6564       decimal digits are read (letters can be in upper or  lower  case).  Any
6565       number of hexadecimal digits may appear between \x{ and }. If a charac-
6566       ter other than a hexadecimal digit appears between \x{  and  },  or  if
6567       there is no terminating }, an error occurs.
6568
6569       Characters whose code points are less than 256 can be defined by either
6570       of the two syntaxes for \x or by an octal sequence. There is no differ-
6571       ence in the way they are handled. For example, \xdc is exactly the same
6572       as \x{dc} or \334.  However, using the braced versions does  make  such
6573       sequences easier to read.
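
       As an illustration of the default \x rules, here is  a  small  Python
       sketch.  The  name  parse_x_escape  is hypothetical and the code is not
       taken from PCRE2; it takes the text that follows \x  and  returns  the
       code  point  together  with  the  number of characters consumed. Empty
       braces are rejected here, which is an assumption of the sketch.

```python
import string

HEX = set(string.hexdigits)

def parse_x_escape(text: str):
    """text is what follows '\\x'; returns (code_point, chars_consumed)."""
    if text.startswith("{"):
        end = text.find("}")
        digits = text[1:end] if end != -1 else ""
        # Missing '}', a non-hexadecimal character, or (assumed) no digits
        # at all is an error in the braced form.
        if end == -1 or not digits or any(c not in HEX for c in digits):
            raise ValueError("malformed \\x{...} escape")
        return int(digits, 16), end + 1
    # Unbraced form: from zero to two hexadecimal digits are read.
    n = 0
    while n < 2 and n < len(text) and text[n] in HEX:
        n += 1
    return (int(text[:n], 16) if n else 0), n
```

       Both  parse_x_escape("dc")  and  parse_x_escape("{dc}") yield the code
       point 0xDC, which is the same value as octal 334.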
6574
6575       Support  is  available  for some ECMAScript (aka JavaScript) escape se-
6576       quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se-
6577       quence  \x  followed  by { is not recognized. Only if \x is followed by
6578       two hexadecimal digits is it recognized as a character  escape.  Other-
6579       wise  it  is interpreted as a literal "x" character. In this mode, sup-
6580       port for code points greater than 255 is provided by \u, which must  be
6581       followed  by  four hexadecimal digits; otherwise it is interpreted as a
6582       literal "u" character.
6583
6584       PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in  ad-
6585       dition, \u{hhh..} is recognized as the character specified by hexadeci-
6586       mal code point.  There may be any number of  hexadecimal  digits.  This
6587       syntax is from ECMAScript 6.
6588
6589       The  \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper-
6590       ating in UTF mode. Perl also uses \N{name}  to  specify  characters  by
6591       Unicode  name;  PCRE2  does  not support this. Note that when \N is not
6592       followed by an opening brace (curly bracket) it has an entirely differ-
6593       ent meaning, matching any character that is not a newline.
6594
6595       There  are some legacy applications where the escape sequence \r is ex-
6596       pected to match a newline. If the  PCRE2_EXTRA_ESCAPED_CR_IS_LF  option
6597       is  set,  \r  in  a  pattern is converted to \n so that it matches a LF
6598       (linefeed) instead of a CR (carriage return) character.
6599
6600       The precise effect of \cx on ASCII characters is as follows: if x is  a
6601       lower  case  letter,  it  is converted to upper case. Then bit 6 of the
6602       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
6603       (A  is  41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
6604       hex 7B (; is 3B). If the code unit following \c has a value  less  than
6605       32 or greater than 126, a compile-time error occurs.
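
       The  \cx  computation  for ASCII can be written out directly. This is a
       minimal Python sketch of the rule as stated above; the  function  name
       control_escape is hypothetical.

```python
def control_escape(x: str) -> int:
    """Return the code generated by \\cx in an ASCII environment (sketch)."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("character after \\c must be printable ASCII")
    if "a" <= x <= "z":
        code = ord(x.upper())    # lower case letters are converted first
    return code ^ 0x40           # invert bit 6 (hex 40)
```

       For  example,  control_escape("A")  is  0x01,  control_escape("{")  is
       0x3B, and control_escape(";") is 0x7B.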
6606
6607       When  PCRE2  is  compiled in EBCDIC mode, \N{U+hhh..} is not supported.
6608       \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
6609       The \c escape is processed as specified for Perl in the perlebcdic doc-
6610       ument. The only characters that are allowed after \c are A-Z,  a-z,  or
6611       one  of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
6612       time error. The sequence \c@ encodes character code  0;  after  \c  the
6613       letters  (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
6614       \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?  be-
6615       comes either 255 (hex FF) or 95 (hex 5F).
6616
6617       Thus,  apart  from  \c?, these escapes generate the same character code
6618       values as they do in an ASCII environment, though the meanings  of  the
6619       values  mostly  differ. For example, \cG always generates code value 7,
6620       which is BEL in ASCII but DEL in EBCDIC.
6621
6622       The sequence \c? generates DEL (127, hex 7F) in an  ASCII  environment,
6623       but  because  127  is  not a control character in EBCDIC, Perl makes it
6624       generate the APC character. Unfortunately, there are  several  variants
6625       of  EBCDIC.  In  most  of them the APC character has the value 255 (hex
6626       FF), but in the one Perl calls POSIX-BC its value is 95  (hex  5F).  If
6627       certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6628       95; otherwise it generates 255.
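
       The  EBCDIC  handling  of  \c described above amounts to a small lookup.
       The Python sketch below is a model of that description only;  the  name
       ebcdic_control_escape  and  the posix_bc parameter are hypothetical and
       stand in for whatever detection PCRE2 performs internally.

```python
def ebcdic_control_escape(x: str, posix_bc: bool = False) -> int:
    """Code value generated by \\cx when PCRE2 is in EBCDIC mode (sketch)."""
    if len(x) != 1:
        raise ValueError("exactly one character must follow \\c")
    if x == "@":
        return 0
    if x.isascii() and x.isalpha():
        return ord(x.upper()) - ord("A") + 1      # letters give 1-26
    if x in "[\\]^_":
        return 27 + "[\\]^_".index(x)             # 27-31 (hex 1B to 1F)
    if x == "?":
        return 95 if posix_bc else 255            # APC differs by variant
    raise ValueError("character not allowed after \\c in EBCDIC mode")
```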
6629
6630       After \0 up to two further octal digits are read. If  there  are  fewer
6631       than  two  digits,  just  those that are present are used. Thus the se-
6632       quence \0\x\015 specifies two binary zeros followed by a  CR  character
6633       (code value 13). Make sure you supply two digits after the initial zero
6634       if the pattern character that follows is itself an octal digit.
6635
6636       The escape \o must be followed by a sequence of octal digits,  enclosed
6637       in  braces.  An  error occurs if this is not the case. This escape is a
6638       recent addition to Perl; it provides a way of specifying character  code
6639       points  as  octal  numbers  greater than 0777, and it also allows octal
6640       numbers and backreferences to be unambiguously specified.
6641
6642       For greater clarity and unambiguity, it is best to avoid following \ by
6643       a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
6644       cal character code points, and \g{} to specify backreferences. The fol-
6645       lowing paragraphs describe the old, ambiguous syntax.
6646
6647       The handling of a backslash followed by a digit other than 0 is compli-
6648       cated, and Perl has changed over time, causing PCRE2 also to change.
6649
6650       Outside a character class, PCRE2 reads the digit and any following dig-
6651       its as a decimal number. If the number is less than 10, begins with the
6652       digit 8 or 9, or if there are  at  least  that  many  previous  capture
6653       groups  in the expression, the entire sequence is taken as a backrefer-
6654       ence. A description of how this works is  given  later,  following  the
6655       discussion  of parenthesized groups.  Otherwise, up to three octal dig-
6656       its are read to form a character code.
6657
6658       Inside a character class, PCRE2 handles \8 and \9 as the literal  char-
6659       acters  "8"  and "9", and otherwise reads up to three octal digits fol-
6660       lowing the backslash, using them to generate a data character. Any sub-
6661       sequent  digits  stand for themselves. For example, outside a character
6662       class:
6663
6664         \040   is another way of writing an ASCII space
6665         \40    is the same, provided there are fewer than 40
6666                   previous capture groups
6667         \7     is always a backreference
6668         \11    might be a backreference, or another way of
6669                   writing a tab
6670         \011   is always a tab
6671         \0113  is a tab followed by the character "3"
6672         \113   might be a backreference, otherwise the
6673                   character with octal code 113
6674         \377   might be a backreference, otherwise
6675                   the value 255 (decimal)
6676         \81    is always a backreference
6677
6678       Note that octal values of 100 or greater that are specified using  this
6679       syntax  must  not be introduced by a leading zero, because no more than
6680       three octal digits are ever read.
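
       The  rules  for  a backslash followed by digits outside a character class
       can be summarized in code. The Python sketch below follows  the  wording
       of  this section; the function name and the shape of its return value are
       hypothetical, and limits on the resulting character value are not  mod-
       elled.

```python
def interpret_backslash_digits(digits: str, groups_so_far: int):
    """digits: the non-empty run of ASCII digits after a backslash.

    Returns (kind, value, rest), where rest holds any trailing digits
    that stand for themselves.
    """
    if digits[0] != "0":
        number = int(digits)
        if number < 10 or digits[0] in "89" or number <= groups_so_far:
            return ("backreference", number, "")
    # Otherwise at most three octal digits form a character code. For a
    # leading zero this is the zero plus up to two further octal digits.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in "01234567":
        n += 1
    return ("character", int(digits[:n], 8), digits[n:])
```

       For example, interpret_backslash_digits("11", 3) gives the tab  charac-
       ter  (code  9),  whereas  with eleven or more previous capture groups the
       same digits are a backreference.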
6681
6682   Constraints on character values
6683
6684       Characters that are specified using octal or  hexadecimal  numbers  are
6685       limited to certain values, as follows:
6686
6687         8-bit non-UTF mode    no greater than 0xff
6688         16-bit non-UTF mode   no greater than 0xffff
6689         32-bit non-UTF mode   no greater than 0xffffffff
6690         All UTF modes         no greater than 0x10ffff and a valid code point
6691
6692       Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
6693       (the so-called "surrogate" code points). The check  for  these  can  be
6694       disabled  by  the  caller  of  pcre2_compile()  by  setting  the option
6695       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only  in
6696       UTF-8  and  UTF-32 modes, because these values are not representable in
6697       UTF-16.
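
       The limits above can be expressed as a single check. This Python  sketch
       is  a model of the table and the surrogate rule, not PCRE2 code; the name
       escape_value_allowed is hypothetical, and the allow_surrogates  parameter
       stands for the effect of PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.

```python
NON_UTF_LIMIT = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}

def escape_value_allowed(value: int, width: int, utf: bool,
                         allow_surrogates: bool = False) -> bool:
    """True if a numeric escape may specify this value (sketch).

    width is the code unit width of the library: 8, 16, or 32.
    """
    if value < 0:
        return False
    if not utf:
        return value <= NON_UTF_LIMIT[width]
    if value > 0x10FFFF:
        return False
    if 0xD800 <= value <= 0xDFFF:
        # Surrogates can be permitted only in UTF-8 and UTF-32 modes.
        return allow_surrogates and width in (8, 32)
    return True
```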
6698
6699   Escape sequences in character classes
6700
6701       All the sequences that define a single character value can be used both
6702       inside  and  outside character classes. In addition, inside a character
6703       class, \b is interpreted as the backspace character (hex 08).
6704
6705       When not followed by an opening brace, \N is not allowed in a character
6706       class.   \B,  \R, and \X are not special inside a character class. Like
6707       other unrecognized alphabetic escape sequences, they  cause  an  error.
6708       Outside a character class, these sequences have different meanings.
6709
6710   Unsupported escape sequences
6711
6712       In  Perl,  the  sequences  \F, \l, \L, \u, and \U are recognized by its
6713       string handler and used to modify the case of following characters.  By
6714       default,  PCRE2  does  not  support these escape sequences in patterns.
6715       However, if either of the PCRE2_ALT_BSUX  or  PCRE2_EXTRA_ALT_BSUX  op-
6716       tions  is set, \U matches a "U" character, and \u can be used to define
6717       a character by code point, as described above.
6718
6719   Absolute and relative backreferences
6720
6721       The sequence \g followed by a signed or unsigned number, optionally en-
6722       closed  in  braces,  is  an absolute or relative backreference. A named
6723       backreference can be coded as \g{name}.  Backreferences  are  discussed
6724       later, following the discussion of parenthesized groups.
6725
6726   Absolute and relative subroutine calls
6727
6728       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
6729       name or a number enclosed either in angle brackets or single quotes, is
6730       an  alternative syntax for referencing a capture group as a subroutine.
6731       Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
6732       \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
6733       erence; the latter is a subroutine call.
6734
6735   Generic character types
6736
6737       Another use of backslash is for specifying generic character types:
6738
6739         \d     any decimal digit
6740         \D     any character that is not a decimal digit
6741         \h     any horizontal white space character
6742         \H     any character that is not a horizontal white space character
6743         \N     any character that is not a newline
6744         \s     any white space character
6745         \S     any character that is not a white space character
6746         \v     any vertical white space character
6747         \V     any character that is not a vertical white space character
6748         \w     any "word" character
6749         \W     any "non-word" character
6750
6751       The \N escape sequence has the same meaning as  the  "."  metacharacter
6752       when  PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
6753       the meaning of \N. Note that when \N is followed by an opening brace it
6754       has a different meaning. See the section entitled "Non-printing charac-
6755       ters" above for details. Perl also uses \N{name} to specify  characters
6756       by Unicode name; PCRE2 does not support this.
6757
6758       Each  pair of lower and upper case escape sequences partitions the com-
6759       plete set of characters into two disjoint  sets.  Any  given  character
6760       matches  one, and only one, of each pair. The sequences can appear both
6761       inside and outside character classes. They each match one character  of
6762       the  appropriate  type.  If the current matching point is at the end of
6763       the subject string, all of them fail, because there is no character  to
6764       match.
6765
6766       The  default  \s  characters  are HT (9), LF (10), VT (11), FF (12), CR
6767       (13), and space (32), which are defined as white space in the  "C"  lo-
6768       cale.  This  list may vary if locale-specific matching is taking place.
6769       For example, in some locales the "non-breaking space" character  (\xA0)
6770       is recognized as white space, and in others the VT character is not.
6771
6772       A  "word"  character is an underscore or any character that is a letter
6773       or digit.  By default, the definition of letters  and  digits  is  con-
6774       trolled by PCRE2's low-valued character tables, and may vary if locale-
6775       specific matching is taking place (see "Locale support" in the pcre2api
6776       page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
6777       systems, or "french" in Windows, some character codes greater than  127
6778       are  used  for  accented letters, and these are then matched by \w. The
6779       use of locales with Unicode is discouraged.
6780
6781       By default, characters whose code points are  greater  than  127  never
6782       match \d, \s, or \w, and always match \D, \S, and \W, although this may
6783       be different for characters in the range 128-255  when  locale-specific
6784       matching  is  happening.   These escape sequences retain their original
6785       meanings from before Unicode support was available,  mainly  for  effi-
6786       ciency  reasons.  If  the  PCRE2_UCP  option  is  set, the behaviour is
6787       changed so that Unicode properties  are  used  to  determine  character
6788       types, as follows:
6789
6790         \d  any character that matches \p{Nd} (decimal digit)
6791         \s  any character that matches \p{Z} or \h or \v
6792         \w  any character that matches \p{L} or \p{N}, plus underscore
6793
6794       The  upper case escapes match the inverse sets of characters. Note that
6795       \d matches only decimal digits, whereas \w matches any  Unicode  digit,
6796       as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
6797       affects \b and \B, because they are defined in  terms  of  \w  and  \W.
6798       Matching these sequences is noticeably slower when PCRE2_UCP is set.
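
       The difference between the default and PCRE2_UCP definitions of  \d  and
       \w  can  be  illustrated  with Python's unicodedata module. This is only
       an approximation: the function names are hypothetical,  the  default  is
       modelled  for  the "C" locale, and the Unicode version behind Python's
       tables may differ from the one PCRE2 was built with.

```python
import unicodedata

def default_digit(ch: str) -> bool:
    """Default \\d: ASCII decimal digits only."""
    return ch in "0123456789"

def ucp_digit(ch: str) -> bool:
    """\\d with PCRE2_UCP: any character that matches \\p{Nd}."""
    return unicodedata.category(ch) == "Nd"

def ucp_word(ch: str) -> bool:
    """\\w with PCRE2_UCP: \\p{L} or \\p{N}, plus underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"
```

       ARABIC-INDIC DIGIT THREE (U+0663) is matched by  ucp_digit  but  not  by
       default_digit,  and  ROMAN NUMERAL EIGHT (U+2167, category Nl) is matched
       by ucp_word but not by ucp_digit.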
6799
6800       The  sequences  \h, \H, \v, and \V, in contrast to the other sequences,
6801       which match only ASCII characters by default, always match  a  specific
6802       list  of  code  points, whether or not PCRE2_UCP is set. The horizontal
6803       space characters are:
6804
6805         U+0009     Horizontal tab (HT)
6806         U+0020     Space
6807         U+00A0     Non-break space
6808         U+1680     Ogham space mark
6809         U+180E     Mongolian vowel separator
6810         U+2000     En quad
6811         U+2001     Em quad
6812         U+2002     En space
6813         U+2003     Em space
6814         U+2004     Three-per-em space
6815         U+2005     Four-per-em space
6816         U+2006     Six-per-em space
6817         U+2007     Figure space
6818         U+2008     Punctuation space
6819         U+2009     Thin space
6820         U+200A     Hair space
6821         U+202F     Narrow no-break space
6822         U+205F     Medium mathematical space
6823         U+3000     Ideographic space
6824
6825       The vertical space characters are:
6826
6827         U+000A     Linefeed (LF)
6828         U+000B     Vertical tab (VT)
6829         U+000C     Form feed (FF)
6830         U+000D     Carriage return (CR)
6831         U+0085     Next line (NEL)
6832         U+2028     Line separator
6833         U+2029     Paragraph separator
6834
6835       In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
6836       than 256 are relevant.
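
       Because  \h  and \v are defined by fixed lists, they are easy to model as
       sets. The Python sketch below copies the code points  from  the  two  ta-
       bles above; the names are hypothetical.

```python
HORIZONTAL_SPACE = frozenset({
    0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200B),          # U+2000 (en quad) to U+200A (hair space)
    0x202F, 0x205F, 0x3000,
})

VERTICAL_SPACE = frozenset({
    0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029,
})

def matches_h(ch: str) -> bool:
    """True if ch would match \\h; \\H is the complement."""
    return ord(ch) in HORIZONTAL_SPACE

def matches_v(ch: str) -> bool:
    """True if ch would match \\v; \\V is the complement."""
    return ord(ch) in VERTICAL_SPACE
```

       Restricting HORIZONTAL_SPACE to values below 256 leaves only U+0009,
       U+0020, and U+00A0, which is the set relevant in 8-bit non-UTF mode.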
6837
6838   Newline sequences
6839
6840       Outside  a  character class, by default, the escape sequence \R matches
6841       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
6842       to the following:
6843
6844         (?>\r\n|\n|\x0b|\f|\r|\x85)
6845
6846       This is an example of an "atomic group", details of which are given be-
6847       low.  This particular group matches either the  two-character  sequence
6848       CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
6849       U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car-
6850       riage  return,  U+000D), or NEL (next line, U+0085). Because this is an
6851       atomic group, the two-character sequence is treated as  a  single  unit
6852       that cannot be split.
6853
6854       In other modes, two additional characters whose code points are greater
6855       than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
6856       rator,  U+2029).  Unicode support is not needed for these characters to
6857       be recognized.
6858
6859       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
6860       the  complete  set  of  Unicode  line  endings)  by  setting the option
6861       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation  for  "back-
6862       slash R".) This can be made the default when PCRE2 is built; if this is
6863       the case, the other behaviour can be requested via  the  PCRE2_BSR_UNI-
6864       CODE  option. It is also possible to specify these settings by starting
6865       a pattern string with one of the following sequences:
6866
6867         (*BSR_ANYCRLF)   CR, LF, or CRLF only
6868         (*BSR_UNICODE)   any Unicode newline sequence
6869
6870       These override the default and the options given to the compiling func-
6871       tion.  Note that these special settings, which are not Perl-compatible,
6872       are recognized only at the very start of a pattern, and that they  must
6873       be  in upper case. If more than one of them is present, the last one is
6874       used. They can be combined with a change of newline convention; for ex-
6875       ample, a pattern can start with:
6876
6877         (*ANY)(*BSR_ANYCRLF)
6878
6879       They  can also be combined with the (*UTF) or (*UCP) special sequences.
6880       Inside a character class, \R is treated as an unrecognized  escape  se-
6881       quence, and causes an error.
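
       The behaviour of \R outside a character class, as described  in  this
       section,  can be modelled by a function that reports how many characters
       are consumed at a given position. This Python sketch is  illustrative;
       the  function  name and its parameters are hypothetical. The anycrlf pa-
       rameter stands for the effect of  PCRE2_BSR_ANYCRLF  or  (*BSR_ANYCRLF),
       and eight_bit_non_utf stands for 8-bit non-UTF-8 mode.

```python
def match_bsr(subject: str, pos: int = 0, anycrlf: bool = False,
              eight_bit_non_utf: bool = False) -> int:
    """Number of characters \\R consumes at pos, or 0 if it does not match."""
    # CR followed by LF is tried first and, because the group is atomic,
    # it is never split to match CR alone.
    if subject.startswith("\r\n", pos):
        return 2
    if pos >= len(subject):
        return 0
    ch = subject[pos]
    if anycrlf:
        return 1 if ch in "\r\n" else 0
    singles = "\n\x0b\f\r\x85"
    if not eight_bit_non_utf:
        singles += "\u2028\u2029"    # LS and PS are added in other modes
    return 1 if ch in singles else 0
```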
6882
6883   Unicode character properties
6884
6885       When  PCRE2  is  built  with Unicode support (the default), three addi-
6886       tional escape sequences that match characters with specific  properties
6887       are available. They can be used in any mode, though in 8-bit and 16-bit
6888       non-UTF modes these sequences are of course limited to testing  charac-
6889       ters  whose code points are less than U+0100 and U+10000, respectively.
6890       In 32-bit non-UTF mode, code points greater than 0x10ffff (the  Unicode
6891       limit)  may  be  encountered. These are all treated as being in the Un-
6892       known script and with an unassigned type.
6893
6894       Matching characters by Unicode property is not fast, because PCRE2  has
6895       to  do  a  multistage table lookup in order to find a character's prop-
6896       erty. That is why the traditional escape sequences such as \d and \w do
6897       not  use  Unicode  properties  in PCRE2 by default, though you can make
6898       them do so by setting the PCRE2_UCP option or by starting  the  pattern
6899       with (*UCP).
6900
6901       The extra escape sequences that provide property support are:
6902
6903         \p{xx}   a character with the xx property
6904         \P{xx}   a character without the xx property
6905         \X       a Unicode extended grapheme cluster
6906
6907       The  property names represented by xx above are not case-sensitive, and
6908       in accordance with Unicode's "loose matching" rules,  spaces,  hyphens,
6909       and underscores are ignored. There is support for Unicode script names,
6910       Unicode general category properties, "Any", which matches any character
6911       (including  newline),  Bidi_Class,  a number of binary (yes/no) proper-
6912       ties, and some special PCRE2  properties  (described  below).   Certain
6913       other  Perl  properties such as "InMusicalSymbols" are not supported by
6914       PCRE2. Note that \P{Any} does  not  match  any  characters,  so  always
6915       causes a match failure.
6916
6917   Script properties for \p and \P
6918
6919       There are three different syntax forms for matching a script. Each Uni-
6920       code character has a basic script and,  optionally,  a  list  of  other
6921       scripts ("Script Extensions") with which it is commonly used. Using the
6922       Adlam script as an example, \p{sc:Adlam} matches characters whose basic
6923       script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
6924       that have Adlam in their extensions list. The full names  "script"  and
6925       "script extensions" for the property types are recognized, and an equals
6926       sign is an alternative to the colon. If a script name is given  without
6927       a  property  type,  for example, \p{Adlam}, it is treated as \p{scx:Ad-
6928       lam}. Perl changed to this interpretation at  release  5.26  and  PCRE2
6929       changed at release 10.40.
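
       The three syntax forms can be reduced to a property type  and  a  script
       name.  The  Python sketch below is a rough model under two assumptions:
       the caller already knows that the text between  the  braces  names  a
       script,  and  loose  matching (case, spaces, hyphens, and underscores ig-
       nored) applies to the property type as well as  to  the  script  name.
       The function name is hypothetical.

```python
def parse_script_property(body: str):
    """Split the text between the braces of \\p{...} into (type, script)."""
    def loose(s: str) -> str:
        return "".join(c for c in s.lower() if c not in " -_")

    for sep in (":", "="):
        if sep in body:
            kind, name = body.split(sep, 1)
            kind = loose(kind)
            if kind in ("sc", "script"):
                return ("sc", loose(name))
            if kind in ("scx", "scriptextensions"):
                return ("scx", loose(name))
            raise ValueError("not a script property type: " + kind)
    # A bare script name is treated as a script extensions test.
    return ("scx", loose(body))
```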
6930
6931       Unassigned characters (and in non-UTF 32-bit mode, characters with code
6932       points greater than 0x10FFFF) are assigned the "Unknown" script. Others
6933       that  are not part of an identified script are lumped together as "Com-
6934       mon". The current list of recognized script names and their 4-character
6935       abbreviations can be obtained by running this command:
6936
6937         pcre2test -LS
6938
6939
6940   The general category property for \p and \P
6941
6942       Each character has exactly one Unicode general category property, spec-
6943       ified by a two-letter abbreviation. For compatibility with Perl,  nega-
6944       tion  can  be  specified  by including a circumflex between the opening
6945       brace and the property name.  For  example,  \p{^Lu}  is  the  same  as
6946       \P{Lu}.
6947
6948       If only one letter is specified with \p or \P, it includes all the gen-
6949       eral category properties that start with that letter. In this case,  in
6950       the  absence of negation, the curly brackets in the escape sequence are
6951       optional; these two examples have the same effect:
6952
6953         \p{L}
6954         \pL
6955
6956       The following general category property codes are supported:
6957
6958         C     Other
6959         Cc    Control
6960         Cf    Format
6961         Cn    Unassigned
6962         Co    Private use
6963         Cs    Surrogate
6964
6965         L     Letter
6966         Ll    Lower case letter
6967         Lm    Modifier letter
6968         Lo    Other letter
6969         Lt    Title case letter
6970         Lu    Upper case letter
6971
6972         M     Mark
6973         Mc    Spacing mark
6974         Me    Enclosing mark
6975         Mn    Non-spacing mark
6976
6977         N     Number
6978         Nd    Decimal number
6979         Nl    Letter number
6980         No    Other number
6981
6982         P     Punctuation
6983         Pc    Connector punctuation
6984         Pd    Dash punctuation
6985         Pe    Close punctuation
6986         Pf    Final punctuation
6987         Pi    Initial punctuation
6988         Po    Other punctuation
6989         Ps    Open punctuation
6990
6991         S     Symbol
6992         Sc    Currency symbol
6993         Sk    Modifier symbol
6994         Sm    Mathematical symbol
6995         So    Other symbol
6996
6997         Z     Separator
6998         Zl    Line separator
6999         Zp    Paragraph separator
7000         Zs    Space separator
7001
7002       The special property LC, which has the synonym L&, is  also  supported:
7003       it  matches  a  character that has the Lu, Ll, or Lt property, in other
7004       words, a letter that is not classified as a modifier or "other".
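
       The general category tests can be approximated with Python's unicodedata
       module. This sketch is illustrative only: the function name is hypothet-
       ical, the property code must be given in the case shown  in  the  list
       above,  and  Python's  Unicode tables may be of a different version from
       those in PCRE2.

```python
import unicodedata

def has_general_category(ch: str, prop: str) -> bool:
    """Model of \\p{prop} for general category codes, with ^ negation."""
    negate = prop.startswith("^")
    code = prop[1:] if negate else prop
    category = unicodedata.category(ch)       # for example "Lu" or "Nd"
    if code in ("LC", "L&"):
        result = category in ("Lu", "Ll", "Lt")
    else:
        # A one-letter code covers every category that starts with it.
        result = category.startswith(code)
    return result != negate
```

       For example, has_general_category("a", "^Lu") is true, which  corre-
       sponds to both \p{^Lu} and \P{Lu} matching the letter "a".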
7005
7006       The Cs (Surrogate) property  applies  only  to  characters  whose  code
7007       points  are in the range U+D800 to U+DFFF. These characters are no dif-
7008       ferent to any other character when PCRE2 is not in UTF mode (using  the
7009       16-bit  or  32-bit  library).   However,  they are not valid in Unicode
7010       strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid-
7011       ity   checking   has   been   turned   off   (see   the  discussion  of
7012       PCRE2_NO_UTF_CHECK in the pcre2api page).
7013
7014       The long synonyms for  property  names  that  Perl  supports  (such  as
7015       \p{Letter})  are  not supported by PCRE2, nor is it permitted to prefix
7016       any of these properties with "Is".
7017
7018       No character that is in the Unicode table has the Cn (unassigned) prop-
7019       erty.  Instead, this property is assumed for any code point that is not
7020       in the Unicode table.
7021
7022       Specifying caseless matching does not affect  these  escape  sequences.
7023       For  example,  \p{Lu}  always  matches only upper case letters. This is
7024       different from the behaviour of current versions of Perl.
7025
7026   Binary (yes/no) properties for \p and \P
7027
7028       Unicode defines a number of  binary  properties,  that  is,  properties
7029       whose  only  values  are  true or false. You can obtain a list of those
7030       that are recognized by \p and \P, along with  their  abbreviations,  by
7031       running this command:
7032
7033         pcre2test -LP
7034
7035
7036   The Bidi_Class property for \p and \P
7037
7038         \p{Bidi_Class:<class>}   matches a character with the given class
7039         \p{BC:<class>}           matches a character with the given class
7040
7041       The recognized classes are:
7042
7043         AL          Arabic letter
7044         AN          Arabic number
7045         B           paragraph separator
7046         BN          boundary neutral
7047         CS          common separator
7048         EN          European number
7049         ES          European separator
7050         ET          European terminator
7051         FSI         first strong isolate
7052         L           left-to-right
7053         LRE         left-to-right embedding
7054         LRI         left-to-right isolate
7055         LRO         left-to-right override
7056         NSM         non-spacing mark
7057         ON          other neutral
7058         PDF         pop directional format
7059         PDI         pop directional isolate
7060         R           right-to-left
7061         RLE         right-to-left embedding
7062         RLI         right-to-left isolate
7063         RLO         right-to-left override
7064         S           segment separator
7065         WS          white space
7066
7067       An  equals  sign  may  be  used instead of a colon. The class names are
7068       case-insensitive; only the short names listed above are recognized.
7069
7070   Extended grapheme clusters
7071
7072       The \X escape matches any number of Unicode  characters  that  form  an
7073       "extended grapheme cluster", and treats the sequence as an atomic group
7074       (see below).  Unicode supports various kinds of composite character  by
7075       giving  each  character  a grapheme breaking property, and having rules
7076       that use these properties to define the boundaries of extended grapheme
7077       clusters.  The rules are defined in Unicode Standard Annex 29, "Unicode
7078       Text Segmentation". Unicode 11.0.0 abandoned the use of  some  previous
7079       properties  that had been used for emojis.  Instead it introduced vari-
7080       ous emoji-specific properties. PCRE2  uses  only  the  Extended  Picto-
7081       graphic property.
7082
7083       \X  always  matches  at least one character. Then it decides whether to
7084       add additional characters according to the following rules for ending a
7085       cluster:
7086
7087       1. End at the end of the subject string.
7088
7089       2.  Do not end between CR and LF; otherwise end after any control char-
7090       acter.
7091
7092       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
7093       characters  are of five types: L, V, T, LV, and LVT. An L character may
7094       be followed by an L, V, LV, or LVT character; an LV or V character  may
7095       be  followed  by  a V or T character; an LVT or T character may be fol-
7096       lowed only by a T character.
7097
7098       4. Do not end before extending  characters  or  spacing  marks  or  the
7099       "zero-width  joiner" character. Characters with the "mark" property al-
7100       ways have the "extend" grapheme breaking property.
7101
7102       5. Do not end after prepend characters.
7103
7104       6. Do not break within emoji modifier sequences or emoji zwj sequences.
7105       That is, do not break between characters with the Extended_Pictographic
7106       property.  Extend and ZWJ characters are allowed  between  the  charac-
7107       ters.
7108
7109       7.  Do not break within emoji flag sequences. That is, do not break be-
7110       tween regional indicator (RI) characters if there are an odd number  of
7111       RI characters before the break point.
7112
7113       8. Otherwise, end the cluster.
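
       Rules 3 and 7 above can be written as small predicates. The following
       C sketch is only an illustration and is not taken from the PCRE2
       sources: the hangul_type enumeration and both function names are
       hypothetical, and the caller is assumed to have already looked up the
       grapheme breaking property of each character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hangul types named in rule 3. HG_OTHER is a hypothetical catch-all
   for every non-Hangul character in this sketch. */
typedef enum { HG_OTHER, HG_L, HG_V, HG_T, HG_LV, HG_LVT } hangul_type;

/* Rule 3: true when a cluster must NOT end between prev and next. */
static bool hangul_no_break(hangul_type prev, hangul_type next)
{
    switch (prev) {
    case HG_L:                      /* L may be followed by L, V, LV, LVT */
        return next == HG_L || next == HG_V ||
               next == HG_LV || next == HG_LVT;
    case HG_LV:
    case HG_V:                      /* LV or V may be followed by V or T */
        return next == HG_V || next == HG_T;
    case HG_LVT:
    case HG_T:                      /* LVT or T may be followed only by T */
        return next == HG_T;
    default:
        return false;
    }
}

/* Rule 7: is_ri[i] is true when character i is a regional indicator.
   A break before position pos is forbidden when the characters on both
   sides are RI and the run of RI characters immediately before pos has
   odd length. */
static bool ri_no_break(const bool *is_ri, size_t len, size_t pos)
{
    assert(is_ri != NULL && pos <= len);
    if (pos == 0 || pos >= len || !is_ri[pos - 1] || !is_ri[pos])
        return false;
    size_t run = 0;
    for (size_t i = pos; i > 0 && is_ri[i - 1]; i--)
        run++;
    return (run % 2) == 1;
}
```

       With four consecutive RI characters, ri_no_break() reports that a
       break is forbidden before positions 1 and 3 but allowed before
       position 2, so the characters pair up into two flag sequences.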
7114
7115   PCRE2's additional properties
7116
7117       As  well as the standard Unicode properties described above, PCRE2 sup-
7118       ports four more that make it possible to convert traditional escape se-
7119       quences  such  as \w and \s to use Unicode properties. PCRE2 uses these
7120       non-standard, non-Perl properties internally  when  PCRE2_UCP  is  set.
7121       However, they may also be used explicitly. These properties are:
7122
7123         Xan   Any alphanumeric character
7124         Xps   Any POSIX space character
7125         Xsp   Any Perl space character
7126         Xwd   Any Perl "word" character
7127
7128       Xan  matches  characters that have either the L (letter) or the N (num-
7129       ber) property. Xps matches the characters tab, linefeed, vertical  tab,
7130       form  feed,  or carriage return, and any other character that has the Z
7131       (separator) property.  Xsp is the same as Xps; in PCRE1 it used to  ex-
7132       clude  vertical  tab,  for  Perl  compatibility,  but Perl changed. Xwd
7133       matches the same characters as Xan, plus underscore.
7134
7135       There is another non-standard property, Xuc, which matches any  charac-
7136       ter  that  can  be represented by a Universal Character Name in C++ and
7137       other programming languages. These are the characters $,  @,  `  (grave
7138       accent),  and  all  characters with Unicode code points greater than or
7139       equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note  that
7140       most  base  (ASCII) characters are excluded. (Universal Character Names
7141       are of the form \uHHHH or \UHHHHHHHH where H is  a  hexadecimal  digit.
7142       Note that the Xuc property does not match these sequences but the char-
7143       acters that they represent.)
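
       The Xuc description above amounts to a simple test on the code point.
       The C sketch below restates it; the function name is_xuc() is
       hypothetical, and the upper limit of U+10FFFF is an assumption added
       here because the text gives only the lower bound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when cp would have the Xuc property as described in the text. */
static bool is_xuc(uint32_t cp)
{
    assert(cp <= 0x10FFFF);             /* assumed Unicode upper limit */
    if (cp == '$' || cp == '@' || cp == '`')
        return true;                    /* the three ASCII exceptions */
    if (cp < 0xA0)
        return false;                   /* most base characters excluded */
    return !(cp >= 0xD800 && cp <= 0xDFFF);   /* surrogates excluded */
}
```

       For example, is_xuc('A') is false and is_xuc(0x00A0) is true.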
7144
7145   Resetting the match start
7146
7147       In normal use, the escape sequence \K  causes  any  previously  matched
7148       characters not to be included in the final matched sequence that is re-
7149       turned. For example, the pattern:
7150
7151         foo\Kbar
7152
7153       matches "foobar", but reports that it has matched "bar".  \K  does  not
7154       interact with anchoring in any way. The pattern:
7155
7156         ^foo\Kbar
7157
7158       matches  only  when  the  subject  begins with "foobar" (in single line
7159       mode), though it again reports the matched string as "bar".  This  fea-
7160       ture  is similar to a lookbehind assertion (described below).  However,
7161       in this case, the part of the subject before the real  match  does  not
7162       have  to be of fixed length, as lookbehind assertions do. The use of \K
7163       does not interfere with the setting of captured substrings.  For  exam-
7164       ple, when the pattern
7165
7166         (foo)\Kbar
7167
7168       matches "foobar", the first substring is still set to "foo".
7169
7170       From  version  5.32.0  Perl  forbids the use of \K in lookaround asser-
7171       tions. From release 10.38 PCRE2 also forbids this by default.  However,
7172       the  PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK  option  can be used when calling
7173       pcre2_compile() to re-enable the previous behaviour. When  this  option
7174       is set, \K is acted upon when it occurs inside positive assertions, but
7175       is ignored in negative assertions. Note that when  a  pattern  such  as
7176       (?=ab\K)  matches,  the reported start of the match can be greater than
7177       the end of the match. Using \K in a lookbehind assertion at  the  start
7178       of  a  pattern can also lead to odd effects. For example, consider this
7179       pattern:
7180
7181         (?<=\Kfoo)bar
7182
7183       If the subject is "foobar", a call to  pcre2_match()  with  a  starting
7184       offset  of 3 succeeds and reports the matching string as "foobar", that
7185       is, the start of the reported match is earlier  than  where  the  match
7186       started.
7187
7188   Simple assertions
7189
7190       The  final use of backslash is for certain simple assertions. An asser-
7191       tion specifies a condition that has to be met at a particular point  in
7192       a  match, without consuming any characters from the subject string. The
7193       use of groups for more complicated assertions is described below.   The
7194       backslashed assertions are:
7195
7196         \b     matches at a word boundary
7197         \B     matches when not at a word boundary
7198         \A     matches at the start of the subject
7199         \Z     matches at the end of the subject
7200                 also matches before a newline at the end of the subject
7201         \z     matches only at the end of the subject
7202         \G     matches at the first matching position in the subject
7203
7204       Inside  a  character  class, \b has a different meaning; it matches the
7205       backspace character. If any other of  these  assertions  appears  in  a
7206       character class, an "invalid escape sequence" error is generated.
7207
7208       A  word  boundary is a position in the subject string where the current
7209       character and the previous character do not both match \w or  \W  (i.e.
7210       one  matches  \w  and the other matches \W), or the start or end of the
7211       string if the first or last character matches  \w,  respectively.  When
7212       PCRE2  is  built with Unicode support, the meanings of \w and \W can be
7213       changed by setting the PCRE2_UCP option. When this is done, it also af-
7214       fects  \b and \B. Neither PCRE2 nor Perl has a separate "start of word"
7215       or "end of word" metasequence. However, whatever  follows  \b  normally
7216       determines  which  it  is. For example, the fragment \ba matches "a" at
7217       the start of a word.
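
       The word boundary definition above can be sketched in C as follows.
       This is an illustration only: the function names are hypothetical,
       \w is taken to be an ASCII letter, digit, or underscore (that is,
       PCRE2_UCP is assumed not to be set and the default character tables
       are assumed), and positions outside the string are treated as
       non-word characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* ASCII-only model of \w. */
static bool is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* True when \b would match at offset pos, where 0 <= pos <= len. */
static bool at_word_boundary(const char *s, size_t len, size_t pos)
{
    assert(s != NULL && pos <= len);
    bool prev = pos > 0   && is_word_char((unsigned char)s[pos - 1]);
    bool cur  = pos < len && is_word_char((unsigned char)s[pos]);
    return prev != cur;     /* exactly one side is a word character */
}
```

       In the subject "ab cd" this reports boundaries at offsets 0, 2, 3,
       and 5, and \B would match at offsets 1 and 4.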
7218
7219       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
7220       and dollar (described in the next section) in that they only ever match
7221       at the very start and end of the subject string, whatever  options  are
7222       set.  Thus,  they are independent of multiline mode. These three asser-
7223       tions are not affected by the  PCRE2_NOTBOL  or  PCRE2_NOTEOL  options,
7224       which  affect only the behaviour of the circumflex and dollar metachar-
7225       acters. However, if the startoffset argument of pcre2_match()  is  non-
7226       zero,  indicating  that  matching is to start at a point other than the
7227       beginning of the subject, \A can never match.  The  difference  between
7228       \Z  and \z is that \Z matches before a newline at the end of the string
7229       as well as at the very end, whereas \z matches only at the end.
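
       The positions at which \A, \Z, and \z are true can be modelled with
       three small C predicates. The names are hypothetical, and the sketch
       assumes that the newline convention is a single LF character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* \A: only at the very start of the subject. */
static bool at_A(size_t pos)
{
    return pos == 0;
}

/* \z: only at the very end of the subject. */
static bool at_z(size_t len, size_t pos)
{
    return pos == len;
}

/* \Z: at the very end, or just before a newline that ends the subject. */
static bool at_Z(const char *s, size_t len, size_t pos)
{
    assert(s != NULL && pos <= len);
    return pos == len || (pos + 1 == len && s[pos] == '\n');
}
```

       For the four-character subject "abc\n", at_Z() is true at offsets 3
       and 4, whereas at_z() is true only at offset 4. Because a match
       attempt never begins before startoffset, at_A() cannot be satisfied
       when startoffset is non-zero.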
7230
7231       The \G assertion is true only when the current matching position is  at
7232       the  start point of the matching process, as specified by the startoff-
7233       set argument of pcre2_match(). It differs from \A  when  the  value  of
7234       startoffset  is  non-zero. By calling pcre2_match() multiple times with
7235       appropriate arguments, you can mimic Perl's /g option, and it is in this
7236       kind of implementation that \G can be useful.
7237
7238       Note,  however,  that  PCRE2's  implementation of \G, being true at the
7239       starting character of the matching process, is  subtly  different  from
7240       Perl's,  which  defines it as true at the end of the previous match. In
7241       Perl, these can be different when the  previously  matched  string  was
7242       empty. Because PCRE2 does just one match at a time, it cannot reproduce
7243       this behaviour.
7244
7245       If all the alternatives of a pattern begin with \G, the  expression  is
7246       anchored to the starting match position, and the "anchored" flag is set
7247       in the compiled regular expression.
7248
7249
7250CIRCUMFLEX AND DOLLAR
7251
7252       The circumflex and dollar  metacharacters  are  zero-width  assertions.
7253       That  is,  they test for a particular condition being true without con-
7254       suming any characters from the subject string. These two metacharacters
7255       are  concerned  with matching the starts and ends of lines. If the new-
7256       line convention is set so that only the two-character sequence CRLF  is
7257       recognized  as  a newline, isolated CR and LF characters are treated as
7258       ordinary data characters, and are not recognized as newlines.
7259
7260       Outside a character class, in the default matching mode, the circumflex
7261       character  is  an  assertion  that is true only if the current matching
7262       point is at the start of the subject string. If the  startoffset  argu-
7263       ment  of  pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
7264       flex can never match if the PCRE2_MULTILINE option is unset.  Inside  a
7265       character  class, circumflex has an entirely different meaning (see be-
7266       low).
7267
7268       Circumflex need not be the first character of the pattern if  a  number
7269       of  alternatives are involved, but it should be the first thing in each
7270       alternative in which it appears if the pattern is ever  to  match  that
7271       branch.  If all possible alternatives start with a circumflex, that is,
7272       if the pattern is constrained to match only at the start  of  the  sub-
7273       ject,  it  is  said  to be an "anchored" pattern. (There are also other
7274       constructs that can cause a pattern to be anchored.)
7275
7276       The dollar character is an assertion that is true only if  the  current
7277       matching  point is at the end of the subject string, or immediately be-
7278       fore a newline at the end of the string (by default), unless  PCRE2_NO-
7279       TEOL  is  set.  Note, however, that it does not actually match the new-
7280       line. Dollar need not be the last character of the pattern if a  number
7281       of  alternatives  are  involved,  but it should be the last item in any
7282       branch in which it appears. Dollar has no special meaning in a  charac-
7283       ter class.
7284
7285       The  meaning  of  dollar  can be changed so that it matches only at the
7286       very end of the string, by setting the PCRE2_DOLLAR_ENDONLY  option  at
7287       compile time. This does not affect the \Z assertion.
7288
7289       The meanings of the circumflex and dollar metacharacters are changed if
7290       the PCRE2_MULTILINE option is set. When this  is  the  case,  a  dollar
7291       character  matches before any newlines in the string, as well as at the
7292       very end, and a circumflex matches immediately after internal  newlines
7293       as  well as at the start of the subject string. It does not match after
7294       a newline that ends the string, for compatibility with  Perl.  However,
7295       this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
7296
7297       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
7298       (where \n represents a newline) in multiline mode, but  not  otherwise.
7299       Consequently,  patterns  that  are anchored in single line mode because
7300       all branches start with ^ are not anchored in  multiline  mode,  and  a
7301       match  for  circumflex  is  possible  when  the startoffset argument of
7302       pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
7303       if PCRE2_MULTILINE is set.
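
       The effect of PCRE2_MULTILINE and PCRE2_DOLLAR_ENDONLY on circumflex
       and dollar can be summarized by the following C sketch. It is a
       simplified model, not PCRE2 code: the type and function names are
       hypothetical, the newline is assumed to be a single LF, and the
       PCRE2_NOTBOL, PCRE2_NOTEOL, and PCRE2_ALT_CIRCUMFLEX options and the
       startoffset argument are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool multiline;        /* models PCRE2_MULTILINE */
    bool dollar_endonly;   /* models PCRE2_DOLLAR_ENDONLY */
} line_opts;

/* True when ^ would match at offset pos. */
static bool caret_matches(const char *s, size_t len, size_t pos, line_opts o)
{
    assert(s != NULL && pos <= len);
    if (pos == 0)
        return true;
    /* After an internal newline, but not after one that ends the string. */
    return o.multiline && s[pos - 1] == '\n' && pos < len;
}

/* True when $ would match at offset pos. */
static bool dollar_matches(const char *s, size_t len, size_t pos, line_opts o)
{
    assert(s != NULL && pos <= len);
    if (pos == len)
        return true;
    if (o.multiline)
        return s[pos] == '\n';   /* before any newline; ENDONLY is ignored */
    if (o.dollar_endonly)
        return false;            /* only the very end qualifies */
    return pos + 1 == len && s[pos] == '\n';
}
```

       For the subject "def\nabc", caret_matches() is true at offset 4 only
       when multiline is set, which is why /^abc$/ matches there in
       multiline mode but not otherwise.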
7304
7305       When the newline convention (see "Newline conventions" above) recognizes
7306       the two-character sequence CRLF as a newline, this  is  preferred,
7307       even  if  the  single  characters CR and LF are also recognized as new-
7308       lines. For example, if the newline convention  is  "any",  a  multiline
7309       mode  circumflex matches before "xyz" in the string "abc\r\nxyz" rather
7310       than after CR, even though CR on its own is a valid newline.  (It  also
7311       matches at the very start of the string, of course.)
7312
7313       Note  that  the sequences \A, \Z, and \z can be used to match the start
7314       and end of the subject in both modes, and if all branches of a  pattern
7315       start  with \A it is always anchored, whether or not PCRE2_MULTILINE is
7316       set.
7317
7318
7319FULL STOP (PERIOD, DOT) AND \N
7320
7321       Outside a character class, a dot in the pattern matches any one charac-
7322       ter  in  the subject string except (by default) a character that signi-
7323       fies the end of a line. One or more characters may be specified as line
7324       terminators (see "Newline conventions" above).
7325
7326       Dot  never matches a single line-ending character. When the two-charac-
7327       ter sequence CRLF is the only line ending, dot does not match CR if  it
7328       is  immediately followed by LF, but otherwise it matches all characters
7329       (including isolated CRs and LFs). When ANYCRLF  is  selected  for  line
7330       endings,  no  occurrences of  CR or LF match dot. When all Unicode line
7331       endings are being recognized, dot does not match CR or LF or any of the
7332       other line ending characters.
7333
7334       The  behaviour  of  dot  with regard to newlines can be changed. If the
7335       PCRE2_DOTALL option is set, a dot matches any  one  character,  without
7336       exception.   If  the two-character sequence CRLF is present in the sub-
7337       ject string, it takes two dots to match it.
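
       The dot rules above can be sketched in C for two of the newline
       conventions. The names below are hypothetical and the model is
       deliberately small: it covers only the CRLF and ANYCRLF conventions
       and works on single bytes rather than on UTF characters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { NL_CRLF, NL_ANYCRLF } newline_conv;

/* True when a dot would match the character at offset pos. */
static bool dot_matches(const char *s, size_t len, size_t pos,
                        newline_conv nl, bool dotall)
{
    assert(s != NULL && pos < len);
    char c = s[pos];
    if (dotall)
        return true;                /* PCRE2_DOTALL: no exceptions */
    if (nl == NL_CRLF)              /* only CR directly before LF is refused */
        return !(c == '\r' && pos + 1 < len && s[pos + 1] == '\n');
    return c != '\r' && c != '\n';  /* ANYCRLF: no CR or LF matches */
}
```

       In the subject "a\r\nb\rc" with the CRLF convention, the CR at offset
       1 is not matched because LF follows it, but the isolated CR at offset
       4 is matched. With dotall set, the CR and the LF of a CRLF pair are
       matched separately, which is why two dots are needed.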
7338
7339       The handling of dot is entirely independent of the handling of  circum-
7340       flex  and  dollar,  the  only relationship being that they both involve
7341       newlines. Dot has no special meaning in a character class.
7342
7343       The escape sequence \N when not followed by an  opening  brace  behaves
7344       like  a dot, except that it is not affected by the PCRE2_DOTALL option.
7345       In other words, it matches any character except one that signifies  the
7346       end of a line.
7347
7348       When \N is followed by an opening brace it has a different meaning. See
7349       the section entitled "Non-printing characters" above for details.  Perl
7350       also  uses  \N{name}  to specify characters by Unicode name; PCRE2 does
7351       not support this.
7352
7353
7354MATCHING A SINGLE CODE UNIT
7355
7356       Outside a character class, the escape sequence \C matches any one  code
7357       unit,  whether or not a UTF mode is set. In the 8-bit library, one code
7358       unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
7359       32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
7360       line-ending characters. The feature is provided in  Perl  in  order  to
7361       match individual bytes in UTF-8 mode, but it is unclear how it can use-
7362       fully be used.
7363
7364       Because \C breaks up characters into individual  code  units,  matching
7365       one  unit  with  \C  in UTF-8 or UTF-16 mode means that the rest of the
7366       string may start with a malformed UTF character. This has undefined re-
7367       sults, because PCRE2 assumes that it is matching character by character
7368       in a valid UTF string (by default it checks the subject string's valid-
7369       ity  at  the  start  of  processing  unless  the  PCRE2_NO_UTF_CHECK or
7370       PCRE2_MATCH_INVALID_UTF option is used).
7371
7372       An  application  can  lock  out  the  use  of   \C   by   setting   the
7373       PCRE2_NEVER_BACKSLASH_C  option  when  compiling  a pattern. It is also
7374       possible to build PCRE2 with the use of \C permanently disabled.
7375
7376       PCRE2 does not allow \C to appear in lookbehind  assertions  (described
7377       below)  in UTF-8 or UTF-16 modes, because this would make it impossible
7378       to calculate the length of  the  lookbehind.  Neither  the  alternative
7379       matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
7380       these UTF modes.  The former gives a match-time error; the latter fails
7381       to optimize and so the match is always run using the interpreter.
7382
7383       In  the  32-bit  library, however, \C is always supported (when not ex-
7384       plicitly locked out) because it always  matches  a  single  code  unit,
7385       whether or not UTF-32 is specified.
7386
7387       In general, the \C escape sequence is best avoided. However, one way of
7388       using it that avoids the problem of malformed UTF-8 or  UTF-16  charac-
7389       ters  is  to use a lookahead to check the length of the next character,
7390       as in this pattern, which could be used with  a  UTF-8  string  (ignore
7391       white space and line breaks):
7392
7393         (?| (?=[\x00-\x7f])(\C) |
7394             (?=[\x80-\x{7ff}])(\C)(\C) |
7395             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
7396             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
7397
7398       In  this  example,  a  group  that starts with (?| resets the capturing
7399       parentheses numbers in each alternative (see "Duplicate Group  Numbers"
7400       below). The assertions at the start of each branch check the next UTF-8
7401       character for values whose encoding uses 1, 2, 3, or 4  bytes,  respec-
7402       tively.  The  character's individual bytes are then captured by the ap-
7403       propriate number of \C groups.
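
       The four lookahead ranges in the pattern above correspond to the
       number of code units that UTF-8 uses for a character. The C sketch
       below shows the same mapping and the bytes that the \C groups would
       capture. The function names are hypothetical. The upper bound of
       0x1fffff mirrors the pattern, although valid Unicode code points end
       at 0x10ffff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-8 code units for cp, using the ranges in the pattern. */
static int utf8_units_for(uint32_t cp)
{
    assert(cp <= 0x1FFFFF);
    if (cp <= 0x7F)   return 1;
    if (cp <= 0x7FF)  return 2;
    if (cp <= 0xFFFF) return 3;
    return 4;
}

/* Writes the UTF-8 bytes of cp to out and returns how many were written. */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    int n = utf8_units_for(cp);
    assert(out != NULL);
    switch (n) {
    case 1:
        out[0] = (unsigned char)cp;
        break;
    case 2:
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 3:
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    default:
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    }
    return n;
}
```

       For U+20AC the third branch of the pattern applies, and the three
       captured bytes would be 0xE2, 0x82, and 0xAC.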
7404
7405
7406SQUARE BRACKETS AND CHARACTER CLASSES
7407
7408       An opening square bracket introduces a character class, terminated by a
7409       closing square bracket. A closing square bracket on its own is not spe-
7410       cial by default.  If a closing square bracket is required as  a  member
7411       of the class, it should be the first data character in the class (after
7412       an initial circumflex, if present) or escaped with  a  backslash.  This
7413       means  that,  by default, an empty class cannot be defined. However, if
7414       the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket  at
7415       the start does end the (empty) class.
7416
7417       A  character class matches a single character in the subject. A matched
7418       character must be in the set of characters defined by the class, unless
7419       the  first  character in the class definition is a circumflex, in which
7420       case the subject character must not be in the set defined by the class.
7421       If  a  circumflex is actually required as a member of the class, ensure
7422       it is not the first character, or escape it with a backslash.
7423
7424       For example, the character class [aeiou] matches any lower case  vowel,
7425       while  [^aeiou]  matches  any character that is not a lower case vowel.
7426       Note that a circumflex is just a convenient notation for specifying the
7427       characters  that  are in the class by enumerating those that are not. A
7428       class that starts with a circumflex is not an assertion; it still  con-
7429       sumes  a  character  from the subject string, and therefore it fails if
7430       the current pointer is at the end of the string.
7431
7432       Characters in a class may be specified by their code points  using  \o,
7433       \x,  or \N{U+hh..} in the usual way. When caseless matching is set, any
7434       letters in a class represent both their upper case and lower case  ver-
7435       sions,  so  for example, a caseless [aeiou] matches "A" as well as "a",
7436       and a caseless [^aeiou] does not match "A", whereas a  caseful  version
7437       would.  Note that there are two ASCII characters, K and S, that, in ad-
7438       dition to their lower case ASCII equivalents, are case-equivalent  with
7439       Unicode  U+212A (Kelvin sign) and U+017F (long S) respectively when ei-
7440       ther PCRE2_UTF or PCRE2_UCP is set.
7441
7442       Characters that might indicate line breaks are  never  treated  in  any
7443       special  way  when matching character classes, whatever line-ending se-
7444       quence is  in  use,  and  whatever  setting  of  the  PCRE2_DOTALL  and
7445       PCRE2_MULTILINE  options  is  used. A class such as [^a] always matches
7446       one of these characters.
7447
7448       The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
7449       \S,  \v,  \V,  \w,  and \W may appear in a character class, and add the
7450       characters that they  match  to  the  class.  For  example,  [\dABCDEF]
7451       matches  any  hexadecimal digit. In UTF modes, the PCRE2_UCP option af-
7452       fects the meanings of \d, \s, \w and their upper case partners, just as
7453       it does when they appear outside a character class, as described in the
7454       section entitled "Generic character types" above. The  escape  sequence
7455       \b  has  a  different  meaning inside a character class; it matches the
7456       backspace character. The sequences \B, \R, and \X are not  special  in-
7457       side  a  character class. Like any other unrecognized escape sequences,
7458       they cause an error. The same is true for \N when not  followed  by  an
7459       opening brace.
7460
7461       The  minus (hyphen) character can be used to specify a range of charac-
7462       ters in a character class. For example, [d-m] matches  any  letter  be-
7463       tween  d and m, inclusive. If a minus character is required in a class,
7464       it must be escaped with a backslash or appear in a  position  where  it
7465       cannot  be interpreted as indicating a range, typically as the first or
7466       last character in the class, or immediately after a range. For example,
7467       [b-d-z] matches letters in the range b to d, a hyphen character, or z.
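
       The treatment of hyphens can be illustrated with a very small class
       matcher in C. This is a sketch under strong assumptions, not the
       PCRE2 parser: the function name is hypothetical, the argument is the
       text between the square brackets, and escapes, negation, and POSIX
       classes are not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True when c is a member of the class whose body is given. */
static bool class_body_matches(const char *body, char c)
{
    assert(body != NULL);
    size_t n = strlen(body);
    unsigned char uc = (unsigned char)c;
    size_t i = 0;
    while (i < n) {
        /* A hyphen forms a range only when it sits between two characters. */
        if (i + 2 < n && body[i + 1] == '-') {
            if (uc >= (unsigned char)body[i] &&
                uc <= (unsigned char)body[i + 2])
                return true;
            i += 3;     /* a hyphen directly after a range is a literal */
        } else {
            if (c == body[i])
                return true;
            i += 1;
        }
    }
    return false;
}
```

       With the body "b-d-z" the matcher accepts "b", "c", "d", "-", and
       "z", but not "e", because the second hyphen follows a range and is
       therefore a literal.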
7468
7469       Perl treats a hyphen as a literal if it appears before or after a POSIX
7470       class (see below) or before or after a character type escape  such  as
7471       \d  or  \H.   However,  unless  the hyphen is the last character in the
7472       class, Perl outputs a warning in its warning  mode,  as  this  is  most
7473       likely  a user error. As PCRE2 has no facility for warning, an error is
7474       given in these cases.
7475
7476       It is not possible to have the literal character "]" as the end charac-
7477       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
7478       two characters ("W" and "-") followed by a literal string "46]", so  it
7479       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
7480       backslash it is interpreted as the end of range, so [W-\]46] is  inter-
7481       preted  as a class containing a range followed by two other characters.
7482       The octal or hexadecimal representation of "]" can also be used to  end
7483       a range.
7484
7485       Ranges normally include all code points between the start and end char-
7486       acters, inclusive. They can also be used for code points specified  nu-
7487       merically,  for  example [\000-\037]. Ranges can include any characters
7488       that are valid for the current mode. In any  UTF  mode,  the  so-called
7489       "surrogate"  characters (those whose code points lie between 0xd800 and
7490       0xdfff inclusive) may not  be  specified  explicitly  by  default  (the
7491       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  option  disables this check). How-
7492       ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7493       are always permitted.
7494
7495       There  is  a  special  case in EBCDIC environments for ranges whose end
7496       points are both specified as literal letters in the same case. For com-
7497       patibility  with Perl, EBCDIC code points within the range that are not
7498       letters are omitted. For example, [h-k] matches only  four  characters,
7499       even though the codes for h and k are 0x88 and 0x92, a range of 11 code
7500       points. However, if the range is specified  numerically,  for  example,
7501       [\x88-\x92] or [h-\x92], all code points are included.
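
       The [h-k] example can be checked with a short C sketch. The function
       names are hypothetical, and the lower case letter positions used
       (0x81-0x89, 0x91-0x99, and 0xa2-0xa9) are an assumption based on
       common EBCDIC code pages; the text itself gives only the codes for h
       and k.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed EBCDIC lower case letter positions. */
static bool ebcdic_is_lower(unsigned code)
{
    return (code >= 0x81 && code <= 0x89) ||    /* a to i */
           (code >= 0x91 && code <= 0x99) ||    /* j to r */
           (code >= 0xA2 && code <= 0xA9);      /* s to z */
}

/* Counts the code points in lo..hi that belong to the class. When
   letters_only is true, the range is treated as one written with two
   literal letters, so non-letter code points are omitted. */
static unsigned range_members(unsigned lo, unsigned hi, bool letters_only)
{
    assert(lo <= hi && hi <= 0xFF);
    unsigned count = 0;
    for (unsigned c = lo; c <= hi; c++)
        if (!letters_only || ebcdic_is_lower(c))
            count++;
    return count;
}
```

       Here range_members(0x88, 0x92, true) gives 4, corresponding to
       [h-k], and range_members(0x88, 0x92, false) gives 11, corresponding
       to [\x88-\x92].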
7502
7503       If a range that includes letters is used when caseless matching is set,
7504       it matches the letters in either case. For example, [W-c] is equivalent
7505       to  [][\\^_`wxyzabc],  matched  caselessly,  and  in a non-UTF mode, if
7506       character tables for a French locale are in  use,  [\xc8-\xcb]  matches
7507       accented E characters in both cases.
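
       The [W-c] example can be reproduced with a C sketch of caseless range
       matching. The function name is hypothetical, and the sketch assumes
       ASCII with the default "C" locale rather than PCRE2's own character
       tables.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* True when c lies in lo..hi, or, if caseless is set, when the other
   case of c lies in that range. */
static bool range_matches(unsigned char lo, unsigned char hi,
                          unsigned char c, bool caseless)
{
    assert(lo <= hi);
    if (c >= lo && c <= hi)
        return true;
    if (!caseless)
        return false;
    unsigned char other = c;
    if (isupper(c))
        other = (unsigned char)tolower(c);
    else if (islower(c))
        other = (unsigned char)toupper(c);
    return other >= lo && other <= hi;
}
```

       With lo set to 'W' and hi set to 'c', a caseless test accepts "w"
       and "A" as well as the thirteen characters that lie in the range
       itself, which agrees with the equivalent class shown above.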
7508
7509       A  circumflex  can  conveniently  be used with the upper case character
7510       types to specify a more restricted set of characters than the  matching
7511       lower  case  type.  For example, the class [^\W_] matches any letter or
7512       digit, but not underscore, whereas [\w] includes underscore. A positive
7513       character class should be read as "something OR something OR ..." and a
7514       negative class as "NOT something AND NOT something AND NOT ...".
7515
7516       The only metacharacters that are recognized in  character  classes  are
7517       backslash,  hyphen  (only  where  it can be interpreted as specifying a
7518       range), circumflex (only at the start), opening  square  bracket  (only
7519       when  it can be interpreted as introducing a POSIX class name, or for a
7520       special compatibility feature - see the next  two  sections),  and  the
7521       terminating  closing  square  bracket.  However, escaping other non-al-
7522       phanumeric characters does no harm.
7523
7524
7525POSIX CHARACTER CLASSES
7526
7527       Perl supports the POSIX notation for character classes. This uses names
7528       enclosed  by [: and :] within the enclosing square brackets. PCRE2 also
7529       supports this notation. For example,
7530
7531         [01[:alpha:]%]
7532
7533       matches "0", "1", any alphabetic character, or "%". The supported class
7534       names are:
7535
7536         alnum    letters and digits
7537         alpha    letters
7538         ascii    character codes 0 - 127
7539         blank    space or tab only
7540         cntrl    control characters
7541         digit    decimal digits (same as \d)
7542         graph    printing characters, excluding space
7543         lower    lower case letters
7544         print    printing characters, including space
7545         punct    printing characters, excluding letters and digits and space
7546         space    white space (the same as \s from PCRE 8.34)
7547         upper    upper case letters
7548         word     "word" characters (same as \w)
7549         xdigit   hexadecimal digits
7550
7551       The  default  "space" characters are HT (9), LF (10), VT (11), FF (12),
7552       CR (13), and space (32). If locale-specific matching is  taking  place,
7553       the  list  of  space characters may be different; there may be fewer or
7554       more of them. "Space" and \s match the same set of characters.
7555
7556       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
7557       from  Perl  5.8. Another Perl extension is negation, which is indicated
7558       by a ^ character after the colon. For example,
7559
7560         [12[:^digit:]]
7561
7562       matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7563       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
7564       these are not supported, and an error is given if they are encountered.
7565
7566       By default, characters with values greater than 127 do not match any of
7567       the POSIX character classes, although this may be different for charac-
7568       ters in the range 128-255 when locale-specific matching  is  happening.
7569       However,  if the PCRE2_UCP option is passed to pcre2_compile(), some of
7570       the classes are changed so that Unicode character properties are  used.
7571       This  is  achieved  by  replacing  certain POSIX classes with other se-
7572       quences, as follows:
7573
7574         [:alnum:]  becomes  \p{Xan}
7575         [:alpha:]  becomes  \p{L}
7576         [:blank:]  becomes  \h
7577         [:cntrl:]  becomes  \p{Cc}
7578         [:digit:]  becomes  \p{Nd}
7579         [:lower:]  becomes  \p{Ll}
7580         [:space:]  becomes  \p{Xps}
7581         [:upper:]  becomes  \p{Lu}
7582         [:word:]   becomes  \p{Xwd}
7583
7584       Negated versions, such as [:^alpha:] use \P instead of \p. Three  other
7585       POSIX classes are handled specially in UCP mode:
7586
7587       [:graph:] This  matches  characters that have glyphs that mark the page
7588                 when printed. In Unicode property terms, it matches all char-
7589                 acters with the L, M, N, P, S, or Cf properties, except for:
7590
7591                   U+061C           Arabic Letter Mark
7592                   U+180E           Mongolian Vowel Separator
7593                   U+2066 - U+2069  Various "isolate"s
7594
7595
7596       [:print:] This  matches  the  same  characters  as [:graph:] plus space
7597                 characters that are not controls, that  is,  characters  with
7598                 the Zs property.
7599
7600       [:punct:] This matches all characters that have the Unicode P (punctua-
7601                 tion) property, plus those characters with code  points  less
7602                 than 256 that have the S (Symbol) property.
7603
7604       The  other  POSIX classes are unchanged, and match only characters with
7605       code points less than 256.
7606
7607
7608COMPATIBILITY FEATURE FOR WORD BOUNDARIES
7609
7610       In the POSIX.2 compliant library that was included in 4.4BSD Unix,  the
7611       ugly  syntax  [[:<:]]  and [[:>:]] is used for matching "start of word"
7612       and "end of word". PCRE2 treats these items as follows:
7613
7614         [[:<:]]  is converted to  \b(?=\w)
7615         [[:>:]]  is converted to  \b(?<=\w)
7616
7617       Only these exact character sequences are recognized. A sequence such as
7618       [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
7619       support is not compatible with Perl. It is provided to help  migrations
7620       from other environments, and is best not used in any new patterns. Note
7621       that \b matches at the start and the end of a word (see "Simple  asser-
7622       tions"  above),  and in a Perl-style pattern the preceding or following
7623       character normally shows which is wanted, without the need for the  as-
7624       sertions  that are used above in order to give exactly the POSIX behav-
7625       iour.
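
       The two conversions above can be tried in any engine that supports \b
       and lookaround. The sketch below uses Python's re module as a stand-in
       (an assumption: it is not PCRE2, but \b, (?=\w) and (?<=\w) behave the
       same way for this ASCII subject). The helper name boundary_offsets is
       illustrative only.

```python
import re

# Translations given in the text for the 4.4BSD word-boundary items.
START_OF_WORD = r"\b(?=\w)"    # what [[:<:]] is converted to
END_OF_WORD = r"\b(?<=\w)"     # what [[:>:]] is converted to

def boundary_offsets(pattern, subject):
    """Return the offsets at which a zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, subject)]

subject = "hello world"
starts = boundary_offsets(START_OF_WORD, subject)   # word starts only
ends = boundary_offsets(END_OF_WORD, subject)       # word ends only
plain = boundary_offsets(r"\b", subject)            # both kinds together
```

       Plain \b reports both kinds of boundary; the added assertion is what
       restricts each item to one side of a word.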
7626
7627
7628VERTICAL BAR
7629
7630       Vertical bar characters are used to separate alternative patterns.  For
7631       example, the pattern
7632
7633         gilbert|sullivan
7634
7635       matches  either "gilbert" or "sullivan". Any number of alternatives may
7636       appear, and an empty  alternative  is  permitted  (matching  the  empty
7637       string). The matching process tries each alternative in turn, from left
7638       to right, and the first one that succeeds is used. If the  alternatives
7639       are  within a group (defined below), "succeeds" means matching the rest
7640       of the main pattern as well as the alternative in the group.
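
       The left-to-right rule can be observed directly. This is a minimal
       sketch using Python's re module, assumed here to follow the same
       ordering rule as PCRE2 for these simple patterns; the variable names
       are illustrative.

```python
import re

# Alternatives are tried left to right; the first that succeeds is used.
first = re.match(r"gilbert|sullivan", "sullivan").group()

# Order matters: "a" succeeds before "ab" is ever tried.
short_first = re.match(r"a|ab", "ab").group()
long_first = re.match(r"ab|a", "ab").group()

# Inside a group, "succeeds" includes the rest of the pattern: the "a"
# branch is abandoned because "c" cannot follow it, so "ab" is used.
with_tail = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"cat(aract|erpillar|)", "cat").group(1)
```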
7641
7642
7643INTERNAL OPTION SETTING
7644
7645       The settings  of  the  PCRE2_CASELESS,  PCRE2_MULTILINE,  PCRE2_DOTALL,
7646       PCRE2_EXTENDED,  PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options
7647       can be changed from within the pattern by a  sequence  of  letters  en-
7648       closed  between  "(?"   and ")". These options are Perl-compatible, and
7649       are described in detail in the pcre2api documentation. The option  let-
7650       ters are:
7651
7652         i  for PCRE2_CASELESS
7653         m  for PCRE2_MULTILINE
7654         n  for PCRE2_NO_AUTO_CAPTURE
7655         s  for PCRE2_DOTALL
7656         x  for PCRE2_EXTENDED
7657         xx for PCRE2_EXTENDED_MORE
7658
7659       For example, (?im) sets caseless, multiline matching. It is also possi-
7660       ble to unset these options by preceding the relevant letters with a hy-
7661       phen,  for  example (?-im). The two "extended" options are not indepen-
7662       dent; unsetting either one cancels the effects of both of them.
7663
7664       A  combined  setting  and  unsetting  such  as  (?im-sx),  which   sets
7665       PCRE2_CASELESS  and  PCRE2_MULTILINE  while  unsetting PCRE2_DOTALL and
7666       PCRE2_EXTENDED, is also permitted. Only one hyphen may  appear  in  the
7667       options  string.  If a letter appears both before and after the hyphen,
7668       the option is unset. An empty options setting "(?)" is  allowed.  Need-
7669       less to say, it has no effect.
7670
7671       If  the  first character following (? is a circumflex, it causes all of
7672       the above options to be unset. Thus, (?^) is equivalent  to  (?-imnsx).
7673       Letters  may  follow  the circumflex to cause some options to be re-in-
7674       stated, but a hyphen may not appear.
7675
7676       The PCRE2-specific options PCRE2_DUPNAMES  and  PCRE2_UNGREEDY  can  be
7677       changed  in  the  same  way as the Perl-compatible options by using the
7678       characters J and U respectively. However, these are not unset by (?^).
7679
7680       When one of these option changes occurs at top level (that is, not  in-
7681       side  group  parentheses),  the  change applies to the remainder of the
7682       pattern that follows. An option change within a group (see below for  a
7683       description of groups) affects only that part of the group that follows
7684       it, so
7685
7686         (a(?i)b)c
7687
7688       matches abc and aBc and no other strings  (assuming  PCRE2_CASELESS  is
7689       not  used).   By this means, options can be made to have different set-
7690       tings in different parts of the pattern. Any changes made in one alter-
7691       native  do carry on into subsequent branches within the same group. For
7692       example,
7693
7694         (a(?i)b|c)
7695
7696       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
7697       first  branch  is  abandoned before the option setting. This is because
7698       the effects of option settings happen at compile time. There  would  be
7699       some very weird behaviour otherwise.
7700
7701       As  a  convenient shorthand, if any option settings are required at the
7702       start of a non-capturing group (see the next section), the option  let-
7703       ters may appear between the "?" and the ":". Thus the two patterns
7704
7705         (?i:saturday|sunday)
7706         (?:(?i)saturday|sunday)
7707
7708       match exactly the same set of strings.
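
       The scoped form can be tried as follows. This sketch uses Python's re
       module as a stand-in for PCRE2 (an assumption). Python accepts the
       (?i:...) form but does not allow a bare (?i) in the middle of a
       pattern, so the earlier (a(?i)b)c example is written with the scoped
       form instead. The helper fully_matches is hypothetical.

```python
import re

# (?i:...) limits caseless matching to the enclosed group.
weekend = re.compile(r"(?i:saturday|sunday)")

# Scoped equivalent of the document's (a(?i)b)c: only "b" is caseless.
mixed = re.compile(r"(a(?i:b))c")

def fully_matches(regex, subject):
    """True if the whole subject matches the compiled pattern."""
    return regex.fullmatch(subject) is not None
```

       With these definitions, fully_matches(mixed, "aBc") is true, while
       "Abc" and "abC" are rejected because the option does not extend
       outside the inner group.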
7709
7710       Note:  There  are  other  PCRE2-specific options, applying to the whole
7711       pattern, which can be set by the application when the  compiling  func-
7712       tion  is  called.  In addition, the pattern can contain special leading
7713       sequences such as (*CRLF) to override what the application has  set  or
7714       what  has  been  defaulted.   Details are given in the section entitled
7715       "Newline sequences" above. There are also the (*UTF) and (*UCP) leading
7716       sequences  that can be used to set UTF and Unicode property modes; they
7717       are equivalent to setting the PCRE2_UTF and PCRE2_UCP options,  respec-
7718       tively.  However,  the  application  can  set  the  PCRE2_NEVER_UTF and
7719       PCRE2_NEVER_UCP options, which lock out  the  use  of  the  (*UTF)  and
7720       (*UCP) sequences.
7721
7722
7723GROUPS
7724
7725       Groups  are  delimited  by  parentheses  (round brackets), which can be
7726       nested.  Turning part of a pattern into a group does two things:
7727
7728       1. It localizes a set of alternatives. For example, the pattern
7729
7730         cat(aract|erpillar|)
7731
7732       matches "cataract", "caterpillar", or "cat". Without  the  parentheses,
7733       it would match "cataract", "erpillar" or an empty string.
7734
7735       2.  It  creates a "capture group". This means that, when the whole pat-
7736       tern matches, the portion of the subject string that matched the  group
7737       is  passed back to the caller, separately from the portion that matched
7738       the whole pattern.  (This applies  only  to  the  traditional  matching
7739       function; the DFA matching function does not support capturing.)
7740
7741       Opening parentheses are counted from left to right (starting from 1) to
7742       obtain numbers for capture groups. For example, if the string "the  red
7743       king" is matched against the pattern
7744
7745         the ((red|white) (king|queen))
7746
7747       the captured substrings are "red king", "red", and "king", and are num-
7748       bered 1, 2, and 3, respectively.
7749
7750       The fact that plain parentheses fulfil  two  functions  is  not  always
7751       helpful.   There are often times when grouping is required without cap-
7752       turing. If an opening parenthesis is followed by a question mark and  a
7753       colon,  the  group  does  not do any capturing, and is not counted when
7754       computing the number of any subsequent capture groups. For example,  if
7755       the string "the white queen" is matched against the pattern
7756
7757         the ((?:red|white) (king|queen))
7758
7759       the captured substrings are "white queen" and "queen", and are numbered
7760       1 and 2. The maximum number of capture groups is 65535.
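
       Both numbering examples can be checked with a short script. Python's
       re module is used as a stand-in here (an assumption; it numbers
       capture groups by opening parenthesis in the same way).

```python
import re

# Every opening parenthesis starts a numbered capture group.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
numbered = capturing.groups()

# (?:...) groups without capturing and is skipped when numbering.
non_capturing = re.match(r"the ((?:red|white) (king|queen))",
                         "the white queen")
renumbered = non_capturing.groups()
```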
7761
7762       As a convenient shorthand, if any option settings are required  at  the
7763       start  of  a non-capturing group, the option letters may appear between
7764       the "?" and the ":". Thus the two patterns
7765
7766         (?i:saturday|sunday)
7767         (?:(?i)saturday|sunday)
7768
7769       match exactly the same set of strings. Because alternative branches are
7770       tried  from  left  to right, and options are not reset until the end of
7771       the group is reached, an option setting in one branch does affect  sub-
7772       sequent branches, so the above patterns match "SUNDAY" as well as "Sat-
7773       urday".
7774
7775
7776DUPLICATE GROUP NUMBERS
7777
7778       Perl 5.10 introduced a feature whereby each alternative in a group uses
7779       the  same  numbers  for  its capturing parentheses. Such a group starts
7780       with (?| and is itself a non-capturing  group.  For  example,  consider
7781       this pattern:
7782
7783         (?|(Sat)ur|(Sun))day
7784
7785       Because  the two alternatives are inside a (?| group, both sets of cap-
7786       turing parentheses are numbered one. Thus, when  the  pattern  matches,
7787       you  can  look  at captured substring number one, whichever alternative
7788       matched. This construct is useful when you want to  capture  part,  but
7789       not all, of one of a number of alternatives. Inside a (?| group, paren-
7790       theses are numbered as usual, but the number is reset at the  start  of
7791       each  branch.  The numbers of any capturing parentheses that follow the
7792       whole group start after the highest number used in any branch. The fol-
7793       lowing example is taken from the Perl documentation. The numbers under-
7794       neath show in which buffer the captured content will be stored.
7795
7796         # before  ---------------branch-reset----------- after
7797         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
7798         # 1            2         2  3        2     3     4
7799
7800       A backreference to a capture group uses the most recent value  that  is
7801       set for the group. The following pattern matches "abcabc" or "defdef":
7802
7803         /(?|(abc)|(def))\1/
7804
7805       In  contrast, a subroutine call to a capture group always refers to the
7806       first one in the pattern with the given number. The  following  pattern
7807       matches "abcabc" or "defabc":
7808
7809         /(?|(abc)|(def))(?1)/
7810
7811       A relative reference such as (?-1) is no different: it is just a conve-
7812       nient way of computing an absolute group number.
7813
7814       If a condition test for a group's having matched refers to a non-unique
7815       number, the test is true if any group with that number has matched.
7816
7817       An  alternative approach to using this "branch reset" feature is to use
7818       duplicate named groups, as described in the next section.
7819
7820
7821NAMED CAPTURE GROUPS
7822
7823       Identifying capture groups by number is simple, but it can be very hard
7824       to  keep  track of the numbers in complicated patterns. Furthermore, if
7825       an expression is modified, the numbers may change. To  help  with  this
7826       difficulty,  PCRE2  supports the naming of capture groups. This feature
7827       was not added to Perl until release 5.10. Python had the  feature  ear-
7828       lier,  and PCRE1 introduced it at release 4.0, using the Python syntax.
7829       PCRE2 supports both the Perl and the Python syntax.
7830
7831       In PCRE2,  a  capture  group  can  be  named  in  one  of  three  ways:
7832       (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
7833       Names may be up to 32 code units long. When PCRE2_UTF is not set,  they
7834       may  contain  only  ASCII  alphanumeric characters and underscores, but
7835       must start with a non-digit. When PCRE2_UTF is set, the syntax of group
7836       names is extended to allow any Unicode letter or Unicode decimal digit.
7837       In other words, group names must match one of these patterns:
7838
7839         ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
7840         ^[_\p{L}][_\p{L}\p{Nd}]*\z  when PCRE2_UTF is set
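
       An application that generates patterns may want to check names before
       compiling. The following is a sketch of the non-UTF rule only, written
       in Python; the function and constant names are hypothetical, and
       fullmatch() stands in for the ^ and \z anchors shown above.

```python
import re

MAX_NAME_LENGTH = 32  # limit stated in the text, in code units

# Non-UTF rule: ASCII letters, digits and underscore, not starting with a digit.
_NAME_RULE = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")

def is_valid_group_name(name):
    """Check a candidate group name against the non-UTF naming rule."""
    return (len(name) <= MAX_NAME_LENGTH
            and _NAME_RULE.fullmatch(name) is not None)
```

       Because the rule admits only ASCII characters, the character count
       used here equals the code unit count for any name that passes.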
7841
7842       References to capture groups from other parts of the pattern,  such  as
7843       backreferences,  recursion,  and conditions, can all be made by name as
7844       well as by number.
7845
7846       Named capture groups are allocated numbers as well as names, exactly as
7847       if  the  names were not present. In both PCRE2 and Perl, capture groups
7848       are primarily identified by numbers; any names  are  just  aliases  for
7849       these numbers. The PCRE2 API provides function calls for extracting the
7850       complete name-to-number translation table from a compiled  pattern,  as
7851       well  as  convenience  functions  for extracting captured substrings by
7852       name.
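
       The idea that names are aliases for numbers can be illustrated with
       Python's re module, which is used here only as an analogy (an
       assumption): its groupindex attribute plays the role of a
       name-to-number table, and it is not the PCRE2 API. The pattern uses
       the Python (?P<name>...) syntax, which the text says PCRE2 also
       accepts.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Named groups still receive ordinary numbers; unnamed groups have no entry.
name_table = dict(date.groupindex)

m = date.fullmatch("2024-05-17")
by_name = m.group("month")   # lookup through the alias
by_number = m.group(2)       # lookup through the underlying number
day = m.group(3)             # unnamed group, reachable by number only
```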
7853
7854       Warning: When more than one capture group has the same number,  as  de-
7855       scribed in the previous section, a name given to one of them applies to
7856       all of them. Perl allows identically numbered groups to have  different
7857       names.  Consider this pattern, where there are two capture groups, both
7858       numbered 1:
7859
7860         (?|(?<AA>aa)|(?<BB>bb))
7861
7862       Perl allows this, with both names AA and BB  as  aliases  of  group  1.
7863       Thus, after a successful match, both names yield the same value (either
7864       "aa" or "bb").
7865
7866       In an attempt to reduce confusion, PCRE2 does not allow the same  group
7867       number to be associated with more than one name. The example above pro-
7868       vokes a compile-time error. However, there is still  scope  for  confu-
7869       sion. Consider this pattern:
7870
7871         (?|(?<AA>aa)|(bb))
7872
7873       Although the second group number 1 is not explicitly named, the name AA
7874       is still an alias for any group 1. Whether the pattern matches "aa"  or
7875       "bb", a reference by name to group AA yields the matched string.
7876
7877       By  default, a name must be unique within a pattern, except that dupli-
7878       cate names are permitted for groups with the same number, for example:
7879
7880         (?|(?<AA>aa)|(?<AA>bb))
7881
7882       The duplicate name constraint can be disabled by setting the PCRE2_DUP-
7883       NAMES option at compile time, or by the use of (?J) within the pattern,
7884       as described in the section entitled "Internal Option Setting" above.
7885
7886       Duplicate names can be useful for patterns where only one  instance  of
7887       the  named  capture group can match. Suppose you want to match the name
7888       of a weekday, either as a 3-letter abbreviation or as  the  full  name,
7889       and  in  both  cases you want to extract the abbreviation. This pattern
7890       (ignoring the line breaks) does the job:
7891
7892         (?J)
7893         (?<DN>Mon|Fri|Sun)(?:day)?|
7894         (?<DN>Tue)(?:sday)?|
7895         (?<DN>Wed)(?:nesday)?|
7896         (?<DN>Thu)(?:rsday)?|
7897         (?<DN>Sat)(?:urday)?
7898
7899       There are five capture groups, but only one is ever set after a  match.
7900       The convenience functions for extracting the data by name return the
7901       substring for the first (and in this example, the only) group  of  that
7902       name that matched. This saves searching to find which numbered group it
7903       was. (An alternative way of solving this problem is to  use  a  "branch
7904       reset" group, as described in the previous section.)
7905
7906       If  you make a backreference to a non-unique named group from elsewhere
7907       in the pattern, the groups to which the name refers are checked in  the
7908       order  in  which they appear in the overall pattern. The first one that
7909       is set is used for the reference. For  example,  this  pattern  matches
7910       both "foofoo" and "barbar" but not "foobar" or "barfoo":
7911
7912         (?J)(?:(?<n>foo)|(?<n>bar))\k<n>
7913
7914
7915       If you make a subroutine call to a non-unique named group, the one that
7916       corresponds to the first occurrence of the name is used. In the absence
7917       of duplicate numbers this is the one with the lowest number.
7918
7919       If you use a named reference in a condition test (see the section about
7920       conditions below), either to check whether a capture group has matched,
7921       or to check for recursion, all groups with the same name are tested. If
7922       the condition is true for any one of them,  the  overall  condition  is
7923       true.  This is the same behaviour as testing by number. For further de-
7924       tails of the interfaces for handling  named  capture  groups,  see  the
7925       pcre2api documentation.
7926
7927
7928REPETITION
7929
7930       Repetition  is  specified  by  quantifiers, which can follow any of the
7931       following items:
7932
7933         a literal data character
7934         the dot metacharacter
7935         the \C escape sequence
7936         the \R escape sequence
7937         the \X escape sequence
7938         an escape such as \d or \pL that matches a single character
7939         a character class
7940         a backreference
7941         a parenthesized group (including lookaround assertions)
7942         a subroutine call (recursive or otherwise)
7943
7944       The general repetition quantifier specifies a minimum and maximum  num-
7945       ber  of  permitted matches, by giving the two numbers in curly brackets
7946       (braces), separated by a comma. The numbers must be  less  than  65536,
7947       and the first must be less than or equal to the second. For example,
7948
7949         z{2,4}
7950
7951       matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
7952       special character. If the second number is omitted, but  the  comma  is
7953       present,  there  is  no upper limit; if the second number and the comma
7954       are both omitted, the quantifier specifies an exact number of  required
7955       matches. Thus
7956
7957         [aeiou]{3,}
7958
7959       matches at least 3 successive vowels, but may match many more, whereas
7960
7961         \d{8}
7962
7963       matches  exactly  8  digits. An opening curly bracket that appears in a
7964       position where a quantifier is not allowed, or one that does not  match
7965       the  syntax of a quantifier, is taken as a literal character. For exam-
7966       ple, {,6} is not a quantifier, but a literal string of four characters.
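
       The three quantifier forms can be exercised as below. Python's re
       module is used as a stand-in (an assumption). One known difference is
       that Python reads {,6} as {0,6} rather than as literal text, so that
       case is not shown. The helper matches_whole is hypothetical.

```python
import re

def matches_whole(pattern, subject):
    """True if the pattern matches the entire subject."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions.
candidates = ("z", "zz", "zzz", "zzzz", "zzzzz")
bounded = [s for s in candidates if matches_whole(r"z{2,4}", s)]

# {3,}: at least three, no upper limit.
open_ended = matches_whole(r"[aeiou]{3,}", "aeiouaeiou")
too_few = matches_whole(r"[aeiou]{3,}", "ae")

# {8}: exactly eight.
exact = [matches_whole(r"\d{8}", s)
         for s in ("1234567", "12345678", "123456789")]
```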
7967
7968       In UTF modes, quantifiers apply to characters rather than to individual
7969       code  units. Thus, for example, \x{100}{2} matches two characters, each
7970       of which is represented by a two-byte sequence in a UTF-8 string. Simi-
7971       larly,  \X{3} matches three Unicode extended grapheme clusters, each of
7972       which may be several code units long (and  they  may  be  of  different
7973       lengths).
7974
7975       The quantifier {0} is permitted, causing the expression to behave as if
7976       the previous item and the quantifier were not present. This may be use-
7977       ful  for  capture  groups that are referenced as subroutines from else-
7978       where in the pattern (but see also the section entitled "Defining  cap-
7979       ture groups for use by reference only" below). Except for parenthesized
7980       groups, items that have a {0} quantifier are omitted from the  compiled
7981       pattern.
7982
7983       For  convenience, the three most common quantifiers have single-charac-
7984       ter abbreviations:
7985
7986         *    is equivalent to {0,}
7987         +    is equivalent to {1,}
7988         ?    is equivalent to {0,1}
7989
7990       It is possible to construct infinite loops by following  a  group  that
7991       can  match no characters with a quantifier that has no upper limit, for
7992       example:
7993
7994         (a?)*
7995
7996       Earlier versions of Perl and PCRE1 used to give  an  error  at  compile
7997       time for such patterns. However, because there are cases where this can
7998       be useful, such patterns are now accepted, but whenever an iteration of
7999       such  a group matches no characters, matching moves on to the next item
8000       in the pattern instead of repeatedly matching  an  empty  string.  This
8001       does  not  prevent  backtracking into any of the iterations if a subse-
8002       quent item fails to match.
8003
8004       By default, quantifiers are "greedy", that is, they match  as  much  as
8005       possible (up to the maximum number of permitted times), without causing
8006       the rest of the pattern to fail. The  classic  example  of  where  this
8007       gives  problems is in trying to match comments in C programs. These ap-
8008       pear between /* and */ and within the comment, individual * and / char-
8009       acters  may appear. An attempt to match C comments by applying the pat-
8010       tern
8011
8012         /\*.*\*/
8013
8014       to the string
8015
8016         /* first comment */  not comment  /* second comment */
8017
8018       fails, because it matches the entire string owing to the greediness  of
8019       the  .*  item. However, if a quantifier is followed by a question mark,
8020       it ceases to be greedy, and instead matches the minimum number of times
8021       possible, so the pattern
8022
8023         /\*.*?\*/
8024
8025       does  the  right  thing with the C comments. The meaning of the various
8026       quantifiers is not otherwise changed,  just  the  preferred  number  of
8027       matches.   Do  not  confuse this use of question mark with its use as a
8028       quantifier in its own right. Because it has two uses, it can  sometimes
8029       appear doubled, as in
8030
8031         \d??\d
8032
8033       which matches one digit by preference, but can match two if that is the
8034       only way the rest of the pattern matches.
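
       The C comment example and the doubled question mark can both be
       reproduced. Python's re module serves as a stand-in for PCRE2 here (an
       assumption; greedy and lazy quantifiers behave alike in both for these
       patterns).

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Lazy .*? stops at the first */ that lets the pattern match.
lazy = re.search(r"/\*.*?\*/", subject).group()
all_comments = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```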
8035
8036       If the PCRE2_UNGREEDY option is set (an option that is not available in
8037       Perl),  the  quantifiers are not greedy by default, but individual ones
8038       can be made greedy by following them with a  question  mark.  In  other
8039       words, it inverts the default behaviour.
8040
8041       When  a  parenthesized  group is quantified with a minimum repeat count
8042       that is greater than 1 or with a limited maximum, more  memory  is  re-
8043       quired for the compiled pattern, in proportion to the size of the mini-
8044       mum or maximum.
8045
8046       If a pattern starts with  .*  or  .{0,}  and  the  PCRE2_DOTALL  option
8047       (equivalent  to  Perl's /s) is set, thus allowing the dot to match new-
8048       lines, the pattern is implicitly  anchored,  because  whatever  follows
8049       will  be  tried against every character position in the subject string,
8050       so there is no point in retrying the overall match at any position  af-
8051       ter  the  first. PCRE2 normally treats such a pattern as though it were
8052       preceded by \A.
8053
8054       In cases where it is known that the subject  string  contains  no  new-
8055       lines,  it  is worth setting PCRE2_DOTALL in order to obtain this opti-
8056       mization, or alternatively, using ^ to indicate anchoring explicitly.
8057
8058       However, there are some cases where the optimization  cannot  be  used.
8059       When  .*   is  inside  capturing  parentheses that are the subject of a
8060       backreference elsewhere in the pattern, a match at the start  may  fail
8061       where a later one succeeds. Consider, for example:
8062
8063         (.*)abc\1
8064
8065       If  the subject is "xyz123abc123" the match point is the fourth charac-
8066       ter. For this reason, such a pattern is not implicitly anchored.
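
       The match point in this example can be confirmed with a short check.
       Python's re module is used as a stand-in (an assumption). It shows
       only where the match starts, and says nothing about which
       optimizations either engine applies.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

match_offset = m.start()   # zero-based offset of the match
captured = m.group(1)      # text that \1 had to repeat
whole = m.group()
```

       Offset 3 is the fourth character, so a pattern of this shape cannot be
       treated as if it were anchored at the start of the subject.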
8067
8068       Another case where implicit anchoring is not applied is when the  lead-
8069       ing  .* is inside an atomic group. Once again, a match at the start may
8070       fail where a later one succeeds. Consider this pattern:
8071
8072         (?>.*?a)b
8073
8074       It matches "ab" in the subject "aab". The use of the backtracking
8075       control verbs (*PRUNE) and (*SKIP) also disables this optimization, and
8076       there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
8077
8078       When a capture group is repeated, the value captured is  the  substring
8079       that matched the final iteration. For example, after
8080
8081         (tweedle[dume]{3}\s*)+
8082
8083       has matched "tweedledum tweedledee" the value of the captured substring
8084       is "tweedledee". However, if there are nested capture groups, the  cor-
8085       responding  captured  values  may have been set in previous iterations.
8086       For example, after
8087
8088         (a|(b))+
8089
8090       matches "aba" the value of the second captured substring is "b".
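
       Both statements can be checked with the following sketch, which uses
       Python's re module as a stand-in for PCRE2 (an assumption; for these
       two patterns it keeps captured values across iterations in the same
       way).

```python
import re

# A repeated group reports the text matched by its final iteration.
last_iteration = re.fullmatch(r"(tweedle[dume]{3}\s*)+",
                              "tweedledum tweedledee").group(1)

# The inner group was last set in the second iteration and keeps that
# value, although the final iteration took the "a" branch.
nested = re.fullmatch(r"(a|(b))+", "aba").groups()
```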
8091
8092
8093ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
8094
8095       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
8096       repetition,  failure  of what follows normally causes the repeated item
8097       to be re-evaluated to see if a different number of repeats  allows  the
8098       rest  of  the pattern to match. Sometimes it is useful to prevent this,
8099       either to change the nature of the match, or to cause it to fail earlier
8100       than  it otherwise might, when the author of the pattern knows there is
8101       no point in carrying on.
8102
8103       Consider, for example, the pattern \d+foo when applied to  the  subject
8104       line
8105
8106         123456bar
8107
8108       After matching all 6 digits and then failing to match "foo", the normal
8109       action of the matcher is to try again with only 5 digits  matching  the
8110       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
8111       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
8112       the means for specifying that once a group has matched, it is not to be
8113       re-evaluated in this way.
8114
8115       If we use atomic grouping for the previous example, the  matcher  gives
8116       up  immediately  on failing to match "foo" the first time. The notation
8117       is a kind of special parenthesis, starting with (?> as in this example:
8118
8119         (?>\d+)foo
8120
8121       Perl 5.28 introduced an experimental alphabetic form starting  with  (*
8122       which may be easier to remember:
8123
8124         (*atomic:\d+)foo
8125
8126       This  kind of parenthesized group "locks up" the part of the pattern it
8127       contains once it has matched, and a failure further into the pattern is
8128       prevented  from  backtracking into it. Backtracking past it to previous
8129       items, however, works as normal.
8130
8131       An alternative description is that a group of this type matches exactly
8132       the  string  of  characters  that an identical standalone pattern would
8133       match, if anchored at the current point in the subject string.
8134
8135       Atomic groups are not capture groups. Simple cases such  as  the  above
8136       example  can be thought of as a maximizing repeat that must swallow ev-
8137       erything it can.  So, while both \d+ and \d+? are  prepared  to  adjust
8138       the  number  of digits they match in order to make the rest of the pat-
8139       tern match, (?>\d+) can only match an entire sequence of digits.
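
       The difference between an ordinary repeat and an atomic one can be
       demonstrated even in engines that lack the (?> syntax. The sketch
       below uses Python's re module and does not use a real atomic group: it
       swaps in the well-known emulation of a capturing lookahead followed by
       a backreference, which also cannot be re-entered once it has matched.
       (Python 3.11 and later accept (?>...) directly.) The group name "run"
       is illustrative.

```python
import re

subject = "12345"

# Ordinary greedy repeat: \d+ gives back the final "5" so the literal matches.
backtracking = re.search(r"\d+5", subject)

# Emulated atomic group: the lookahead captures the longest digit run, the
# backreference consumes exactly that text, and nothing can be given back.
ATOMIC_DIGITS = r"(?=(?P<run>\d+))(?P=run)"
emulated_atomic = re.search(ATOMIC_DIGITS + "5", subject)

# When what follows does not need any of the digits, the match still succeeds.
atomic_ok = re.search(ATOMIC_DIGITS + "foo", "123foo")
```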
8140
8141       Atomic groups in general can of course contain arbitrarily  complicated
8142       expressions, and can be nested. However, when the content of an atomic
8143       group is just a single repeated item, as in the example above, a
8144       simpler notation, called a "possessive quantifier", can be used. This
8145       consists of an additional + character following a quantifier. Using this
8146       notation, the previous example can be rewritten as
8147
8148         \d++foo
8149
8150       Note that a possessive quantifier can be used with an entire group, for
8151       example:
8152
8153         (abc|xyz){2,3}+
8154
8155       Possessive quantifiers are always greedy; the setting of the  PCRE2_UN-
8156       GREEDY  option  is ignored. They are a convenient notation for the sim-
8157       pler forms of atomic group. However, there  is  no  difference  in  the
8158       meaning  of  a  possessive  quantifier and the equivalent atomic group,
8159       though there may be a performance  difference;  possessive  quantifiers
8160       should be slightly faster.
8161
8162       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
8163       tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
8164       edition of his book. Mike McCloskey liked it, so implemented it when he
8165       built Sun's Java package, and PCRE1 copied it from there. It found  its
8166       way into Perl at release 5.10.
8167
8168       PCRE2  has  an  optimization  that automatically "possessifies" certain
8169       simple pattern constructs. For example, the sequence A+B is treated  as
8170       A++B  because  there is no point in backtracking into a sequence of A's
8171       when B must follow.  This feature can be disabled by the PCRE2_NO_AUTO-
8172       POSSESS option, or starting the pattern with (*NO_AUTO_POSSESS).
8173
8174       When a pattern contains an unlimited repeat inside a group that can it-
8175       self be repeated an unlimited number of times, the  use  of  an  atomic
8176       group  is the only way to avoid some failing matches taking a very long
8177       time indeed. The pattern
8178
8179         (\D+|<\d+>)*[!?]
8180
8181       matches an unlimited number of substrings that either consist  of  non-
8182       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
8183       matches, it runs quickly. However, if it is applied to
8184
8185         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
8186
8187       it takes a long time before reporting  failure.  This  is  because  the
8188       string  can be divided between the internal \D+ repeat and the external
8189       * repeat in a large number of ways, and all have to be tried. (The  ex-
8190       ample uses [!?] rather than a single character at the end, because both
8191       PCRE2 and Perl have an optimization that allows for fast failure when a
8192       single  character is used. They remember the last single character that
8193       is required for a match, and fail early if it is  not  present  in  the
8194       string.)  If  the  pattern  is changed so that it uses an atomic group,
8195       like this:
8196
8197         ((?>\D+)|<\d+>)*[!?]
8198
8199       sequences of non-digits cannot be broken, and failure happens quickly.
8200
8201
8202BACKREFERENCES
8203
8204       Outside a character class, a backslash followed by a digit greater than
8205       0  (and  possibly further digits) is a backreference to a capture group
8206       earlier (that is, to its left) in the pattern, provided there have been
8207       that many previous capture groups.
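
       A backreference matches the text that the group captured, not the
       group's pattern. This is a minimal sketch using Python's re module as
       a stand-in for PCRE2 (an assumption; the \1 form shown here is common
       to both). The helper is_doubled is hypothetical.

```python
import re

# \1 must repeat whatever group 1 actually matched.
doubled = re.compile(r"(a|b)\1")

def is_doubled(subject):
    """True if the subject is one of the two letters written twice."""
    return doubled.fullmatch(subject) is not None
```

       Here "aa" and "bb" are accepted, but "ab" is not, even though "b"
       would match the group's pattern.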
8208
8209       However,  if the decimal number following the backslash is less than 8,
8210       it is always taken as a backreference, and  causes  an  error  only  if
8211       there  are not that many capture groups in the entire pattern. In other
8212       words, the group that is referenced need not be to the left of the ref-
8213       erence  for numbers less than 8. A "forward backreference" of this type
8214       can make sense when a repetition is involved and the group to the right
8215       has participated in an earlier iteration.
8216
8217       It  is  not  possible  to have a numerical "forward backreference" to a
8218       group whose number is 8 or more using this syntax  because  a  sequence
8219       such  as  \50  is  interpreted as a character defined in octal. See the
8220       subsection entitled "Non-printing characters" above for further details
8221       of  the  handling of digits following a backslash. Other forms of back-
8222       referencing do not suffer from this restriction. In  particular,  there
8223       is no problem when named capture groups are used (see below).
8224
8225       Another  way  of  avoiding  the ambiguity inherent in the use of digits
8226       following a backslash is to use the \g  escape  sequence.  This  escape
8227       must be followed by a signed or unsigned number, optionally enclosed in
8228       braces. These examples are all identical:
8229
8230         (ring), \1
8231         (ring), \g1
8232         (ring), \g{1}
8233
8234       An unsigned number specifies an absolute reference without the  ambigu-
8235       ity that is present in the older syntax. It is also useful when literal
8236       digits follow the reference. A signed number is a  relative  reference.
8237       Consider this example:
8238
8239         (abc(def)ghi)\g{-1}
8240
8241       The sequence \g{-1} is a reference to the most recently started capture
8242       group before \g, that is, it is equivalent to \2 in this example. Simi-
8243       larly, \g{-2} would be equivalent to \1. The use of relative references
8244       can be helpful in long patterns, and also in patterns that are  created
8245       by  joining  together  fragments  that  contain references within them-
8246       selves.
8247
8248       The sequence \g{+1} is a reference to the next capture group. This kind
8249       of  forward  reference can be useful in patterns that repeat. Perl does
8250       not support the use of + in this way.
8251
8252       A backreference matches whatever actually  most  recently  matched  the
8253       capture  group  in  the current subject string, rather than anything at
8254       all that matches the group (see "Groups as subroutines" below for a way
8255       of doing that). So the pattern
8256
8257         (sens|respons)e and \1ibility
8258
8259       matches  "sense and sensibility" and "response and responsibility", but
8260       not "sense and responsibility". If caseful matching is in force at  the
8261       time  of  the backreference, the case of letters is relevant. For exam-
8262       ple,
8263
8264         ((?i)rah)\s+\1
8265
8266       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
8267       original capture group is matched caselessly.
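
       Both examples can be tried in any engine that shares this  backreference
       behaviour.  The  sketch that follows assumes Python's re module rather
       than PCRE2 itself. Python requires the scoped form (?i:...) in place  of
       a mid-pattern (?i); the names SENSE, RAH and matches are invented here.

```python
import re

# \1 must repeat the text that group 1 actually matched.
SENSE = re.compile(r"(sens|respons)e and \1ibility")

# Group 1 is matched caselessly, but the backreference sits outside the
# scoped (?i:...) setting, so it is compared with case taken into account.
RAH = re.compile(r"((?i:rah))\s+\1")


def matches(pattern, text):
    """True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


for text in ("sense and sensibility", "response and responsibility",
             "sense and responsibility"):
    print(text, "->", matches(SENSE, text))

for text in ("rah rah", "RAH RAH", "RAH rah"):
    print(text, "->", matches(RAH, text))
```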
8268
8269       There  are  several  different  ways of writing backreferences to named
8270       capture groups. The .NET syntax \k{name} and the Perl  syntax  \k<name>
8271       or  \k'name'  are  supported,  as  is the Python syntax (?P=name). Perl
8272       5.10's unified backreference syntax, in which \g can be used  for  both
8273       numeric  and  named references, is also supported. We could rewrite the
8274       above example in any of the following ways:
8275
8276         (?<p1>(?i)rah)\s+\k<p1>
8277         (?'p1'(?i)rah)\s+\k{p1}
8278         (?P<p1>(?i)rah)\s+(?P=p1)
8279         (?<p1>(?i)rah)\s+\g{p1}
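
       The Python-syntax form in the third line above can be  exercised  with
       Python's  own re module. This is a sketch in a different engine, not a
       PCRE2 call; (?i:...) is used because Python does not accept an unscoped
       (?i) in mid-pattern, and NAMED_RAH and first_word are invented names.

```python
import re

# Named group p1 and a named backreference to it, in Python syntax.
NAMED_RAH = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")


def first_word(text):
    """Return what group p1 captured, or None if `text` does not match."""
    m = NAMED_RAH.fullmatch(text)
    return m.group("p1") if m else None


print(first_word("RAH RAH"))  # RAH
print(first_word("RAH rah"))  # None
```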
8280
8281       A capture group that is referenced by name may appear  in  the  pattern
8282       before or after the reference.
8283
8284       There  may be more than one backreference to the same group. If a group
8285       has not actually been used in a particular match, backreferences to  it
8286       always fail by default. For example, the pattern
8287
8288         (a|(bc))\2
8289
8290       always  fails  if  it starts to match "a" rather than "bc". However, if
8291       the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
8292       erence to an unset value matches an empty string.
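
       The default behaviour can be observed with a short sketch. It  assumes
       Python's re module, a different engine whose backreferences to an unset
       group also fail; it does not demonstrate  PCRE2_MATCH_UNSET_BACKREF.
       UNSET_REF and whole_match are invented names.

```python
import re

UNSET_REF = re.compile(r"(a|(bc))\2")


def whole_match(text):
    """True if the whole of `text` matches (a|(bc))\\2."""
    return UNSET_REF.fullmatch(text) is not None


print(whole_match("bcbc"))  # True: group 2 holds "bc", so \2 can match
print(whole_match("a"))     # False: group 2 took no part, so \2 fails
print(whole_match("aa"))    # False, for the same reason
```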
8293
8294       Because  there may be many capture groups in a pattern, all digits fol-
8295       lowing a backslash are taken as part of a potential backreference  num-
8296       ber.  If  the  pattern continues with a digit character, some delimiter
8297       must be used to terminate the backreference. If the  PCRE2_EXTENDED  or
8298       PCRE2_EXTENDED_MORE  option is set, this can be white space. Otherwise,
8299       the \g{} syntax or an empty comment (see "Comments" below) can be used.
8300
8301   Recursive backreferences
8302
8303       A backreference that occurs inside the group to which it  refers  fails
8304       when  the  group  is  first used, so, for example, (a\1) never matches.
8305       However, such references can be useful inside repeated groups. For  ex-
8306       ample, the pattern
8307
8308         (a|b\1)+
8309
8310       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
8311       ation of the group, the backreference matches the character string cor-
8312       responding  to  the  previous iteration. In order for this to work, the
8313       pattern must be such that the first iteration does not  need  to  match
8314       the  backreference. This can be done using alternation, as in the exam-
8315       ple above, or by a quantifier with a minimum of zero.
8316
8317       In versions of PCRE2 earlier than 10.25, backreferences  of  this  type
8318       caused the group that they reference to be treated as an atomic  group.
8319       This restriction no longer applies, and backtracking into  such  groups
8320       can occur as normal.
8321
8322
8323ASSERTIONS
8324
8325       An  assertion  is  a  test on the characters following or preceding the
8326       current matching point that does not consume any characters. The simple
8327       assertions  coded  as  \b,  \B,  \A,  \G, \Z, \z, ^ and $ are described
8328       above.
8329
8330       More complicated assertions are coded as  parenthesized  groups.  There
8331       are  two  kinds:  those  that look ahead of the current position in the
8332       subject string, and those that look behind it, and in each case an  as-
8333       sertion  may  be  positive (must match for the assertion to be true) or
8334       negative (must not match for the assertion to be  true).  An  assertion
8335       group is matched in the normal way, and if it is true, matching contin-
8336       ues after it, but with the matching position in the subject string  re-
8337       set to what it was before the assertion was processed.
8338
8339       The  Perl-compatible  lookaround assertions are atomic. If an assertion
8340       is true, but there is a subsequent matching failure, there is no  back-
8341       tracking  into  the assertion. However, there are some cases where non-
8342       atomic assertions can be useful. PCRE2 has some support for these,  de-
8343       scribed in the section entitled "Non-atomic assertions" below, but they
8344       are not Perl-compatible.
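
       The atomic behaviour is easiest to see when a group is captured  inside
       the  assertion  and  referenced afterwards. The following sketch assumes
       Python's re module, whose lookaheads are also atomic; it is not a PCRE2
       call, and the pattern and helper names are invented for illustration.

```python
import re

# The lookahead captures the longest digit run, "123". The rest of the
# pattern then fails because "123" does not occur again, and the assertion
# is not re-entered to try the shorter capture "12".
ATOMIC_LOOKAHEAD = re.compile(r"(?=(\d+))\w+\1")

# As an ordinary group, (\d+) can backtrack from "123" to "12".
PLAIN_GROUP = re.compile(r"(\d+)\w*\1")


def try_match(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


print(try_match(ATOMIC_LOOKAHEAD, "123x12"))  # None
print(try_match(PLAIN_GROUP, "123x12"))       # 123x12
```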
8345
8346       A lookaround assertion may appear as the  condition  in  a  conditional
8347       group  (see  below). In this case, the result of matching the assertion
8348       determines which branch of the condition is followed.
8349
8350       Assertion groups are not capture groups. If an assertion contains  cap-
8351       ture  groups within it, these are counted for the purposes of numbering
8352       the capture groups in the whole pattern. Within each branch of  an  as-
8353       sertion,  locally  captured  substrings  may be referenced in the usual
8354       way. For example, a sequence such as (.)\g{-1} can  be  used  to  check
8355       that two adjacent characters are the same.
8356
8357       When  a  branch within an assertion fails to match, any substrings that
8358       were captured are discarded (as happens with any  pattern  branch  that
8359       fails  to  match).  A  negative  assertion  is  true  only when all its
8360       branches fail to match; this means that no captured substrings are ever
8361       retained  after a successful negative assertion. When an assertion con-
8362       tains a matching branch, what happens depends on the type of assertion.
8363
8364       For a positive assertion, internally captured substrings  in  the  suc-
8365       cessful  branch are retained, and matching continues with the next pat-
8366       tern item after the assertion. For a  negative  assertion,  a  matching
8367       branch  means  that  the assertion is not true. If such an assertion is
8368       being used as a condition in a conditional group (see below),  captured
8369       substrings  are  retained,  because  matching  continues  with the "no"
8370       branch of the condition. For other failing negative assertions, control
8371       passes to the previous backtracking point, thus discarding any captured
8372       strings within the assertion.
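
       A small sketch of the two ordinary cases follows. It assumes  Python's
       re  module,  which  treats  captures inside lookaheads in the same way
       for these two patterns; conditional groups with assertion conditions are
       not shown. POSITIVE, NEGATIVE and capture are invented names.

```python
import re

# Positive lookahead: the capture made inside the assertion is retained.
POSITIVE = re.compile(r"(?=(\w+))\w")

# Negative lookahead: it is true only when its branch fails to match, so
# the group inside it is left unset after a successful match.
NEGATIVE = re.compile(r"(?!(\d))\w")


def capture(pattern, text):
    """Return group 1 after matching at the start, or "no match"."""
    m = pattern.match(text)
    return m.group(1) if m else "no match"


print(capture(POSITIVE, "hello"))  # hello
print(capture(NEGATIVE, "hello"))  # None
```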
8373
8374       Most assertion groups may be repeated; though it makes no sense to  as-
8375       sert the same thing several times, the side effect of capturing in pos-
8376       itive assertions may occasionally be useful. However, an assertion that
8377       forms  the  condition  for  a  conditional group may not be quantified.
8378       PCRE2 used to restrict the repetition of assertions, but  from  release
8379       10.35  the  only restriction is that an unlimited maximum repetition is
8380       changed to be one more than the minimum. For example, {3,}  is  treated
8381       as {3,4}.
8382
8383   Alphabetic assertion names
8384
8385       Traditionally,  symbolic  sequences such as (?= and (?<= have been used
8386       to specify lookaround assertions. Perl 5.28 introduced some  experimen-
8387       tal alphabetic alternatives which might be easier to remember. They all
8388       start with (* instead of (? and must be written using lower  case  let-
8389       ters. PCRE2 supports the following synonyms:
8390
8391         (*positive_lookahead:  or (*pla: is the same as (?=
8392         (*negative_lookahead:  or (*nla: is the same as (?!
8393         (*positive_lookbehind: or (*plb: is the same as (?<=
8394         (*negative_lookbehind: or (*nlb: is the same as (?<!
8395
8396       For  example,  (*pla:foo) is the same assertion as (?=foo). In the fol-
8397       lowing sections, the various assertions are described using the  origi-
8398       nal symbolic forms.
8399
8400   Lookahead assertions
8401
8402       Lookahead assertions start with (?= for positive assertions and (?! for
8403       negative assertions. For example,
8404
8405         \w+(?=;)
8406
8407       matches a word followed by a semicolon, but does not include the  semi-
8408       colon in the match, and
8409
8410         foo(?!bar)
8411
8412       matches  any  occurrence  of  "foo" that is not followed by "bar". Note
8413       that the apparently similar pattern
8414
8415         (?!foo)bar
8416
8417       does not find an occurrence of "bar"  that  is  preceded  by  something
8418       other  than "foo"; it finds any occurrence of "bar" whatsoever, because
8419       the assertion (?!foo) is always true when the next three characters are
8420       "bar". A lookbehind assertion is needed to achieve the other effect.
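
       The three patterns above can be compared side by side. The sketch that
       follows assumes Python's re module, which uses the same lookahead  syn-
       tax; the helper name found and the subject strings are invented.

```python
import re


def found(pattern, text):
    """Return (start offset, matched text) for each match of `pattern`."""
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, text)]


# A word that is followed by a semicolon; the semicolon is not consumed.
print(found(r"\w+(?=;)", "alpha; beta gamma;"))  # [(0, 'alpha'), (12, 'gamma')]

# Only the second "foo" qualifies, because the first is followed by "bar".
print(found(r"foo(?!bar)", "foobar foobaz"))     # [(7, 'foo')]

# The assertion is tested where "bar" starts, so it is always true there.
print(found(r"(?!foo)bar", "foobar"))            # [(3, 'bar')]
```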
8421
8422       If you want to force a matching failure at some point in a pattern, the
8423       most convenient way to do it is with (?!) because an empty  string  al-
8424       ways  matches,  so  an assertion that requires there not to be an empty
8425       string must always fail.  The backtracking control verb (*FAIL) or (*F)
8426       is a synonym for (?!).
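
       As a brief illustration, assuming Python's re module (which accepts (?!)
       but not the (*FAIL) verb), a forced failure makes the matcher  abandon
       one alternative and try the next. The names used are invented.

```python
import re

# (?!) can never succeed, so this pattern cannot match anything.
ALWAYS_FAILS = re.compile(r"abc(?!)")

# The first alternative is always abandoned at (?!), so only "a" can match.
FALLS_THROUGH = re.compile(r"ab(?!)|a")


def matched_text(pattern, text):
    """Return the first match of `pattern` in `text`, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(matched_text(ALWAYS_FAILS, "abc"))  # None
print(matched_text(FALLS_THROUGH, "ab"))  # a
```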
8427
8428   Lookbehind assertions
8429
8430       Lookbehind  assertions start with (?<= for positive assertions and (?<!
8431       for negative assertions. For example,
8432
8433         (?<!foo)bar
8434
8435       does find an occurrence of "bar" that is not  preceded  by  "foo".  The
8436       contents  of  a  lookbehind  assertion are restricted such that all the
8437       strings it matches must have a fixed length. However, if there are sev-
8438       eral  top-level  alternatives,  they  do  not all have to have the same
8439       fixed length. Thus
8440
8441         (?<=bullock|donkey)
8442
8443       is permitted, but
8444
8445         (?<!dogs?|cats?)
8446
8447       causes an error at compile time. Branches that match  different  length
8448       strings  are permitted only at the top level of a lookbehind assertion.
8449       This is an extension compared with Perl, which requires all branches to
8450       match the same length of string. An assertion such as
8451
8452         (?<=ab(c|de))
8453
8454       is  not  permitted,  because  its single top-level branch can match two
8455       different lengths, but it is acceptable to PCRE2 if  rewritten  to  use
8456       two top-level branches:
8457
8458         (?<=abc|abde)
8459
8460       In  some  cases, the escape sequence \K (see above) can be used instead
8461       of a lookbehind assertion to get round the fixed-length restriction.
8462
8463       The implementation of lookbehind assertions is, for  each  alternative,
8464       to  temporarily  move the current position back by the fixed length and
8465       then try to match. If there are insufficient characters before the cur-
8466       rent position, the assertion fails.
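
       The move-back-and-match  rule,  including  the  case  of  insufficient
       preceding  characters,  can be sketched as follows. Python's re module
       is assumed as a stand-in for PCRE2 (it also requires fixed-length look-
       behinds); the pattern and helper names are invented.

```python
import re

NOT_AFTER_FOO = re.compile(r"(?<!foo)bar")
AFTER_ABC = re.compile(r"(?<=abc)d")


def starts(pattern, text):
    """Return the start offset of every match of `pattern` in `text`."""
    return [m.start() for m in pattern.finditer(text)]


# The "bar" at offset 3 is preceded by "foo" and is rejected.
print(starts(NOT_AFTER_FOO, "foobar xbar bar"))  # [8, 12]

# Nothing precedes "bar", so "foo" cannot match and the negative
# assertion is true.
print(starts(NOT_AFTER_FOO, "bar"))              # [0]

# The first "d" has only two characters before it, so the positive
# assertion fails there.
print(starts(AFTER_ABC, "bcd abcd"))             # [7]
```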
8467
8468       In  UTF-8  and  UTF-16 modes, PCRE2 does not allow the \C escape (which
8469       matches a single code unit even in a UTF mode) to appear in  lookbehind
8470       assertions,  because  it makes it impossible to calculate the length of
8471       the lookbehind. The \X and \R escapes, which can match  different  num-
8472       bers of code units, are never permitted in lookbehinds.
8473
8474       "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in
8475       lookbehinds, as long as the called capture group matches a fixed-length
8476       string.  However,  recursion, that is, a "subroutine" call into a group
8477       that is already active, is not supported.
8478
8479       Perl does not support backreferences in lookbehinds. PCRE2 does support
8480       them,  but  only  if  certain  conditions  are met. The PCRE2_MATCH_UN-
8481       SET_BACKREF option must not be set, there must be no use of (?| in  the
8482       pattern  (it creates duplicate group numbers), and if the backreference
8483       is by name, the name must be unique. Of course,  the  referenced  group
8484       must  itself  match  a  fixed  length  substring. The following pattern
8485       matches words containing at least two characters  that  begin  and  end
8486       with the same character:
8487
8488          \b(\w)\w++(?<=\1)
8489
8490       Possessive  quantifiers  can be used in conjunction with lookbehind as-
8491       sertions to specify efficient matching of fixed-length strings  at  the
8492       end of subject strings. Consider a simple pattern such as
8493
8494         abcd$
8495
8496       when  applied  to  a  long string that does not match. Because matching
8497       proceeds from left to right, PCRE2 will look for each "a" in  the  sub-
8498       ject  and  then see if what follows matches the rest of the pattern. If
8499       the pattern is specified as
8500
8501         ^.*abcd$
8502
8503       the initial .* matches the entire string at first, but when this  fails
8504       (because there is no following "a"), it backtracks to match all but the
8505       last character, then all but the last two characters, and so  on.  Once
8506       again  the search for "a" covers the entire string, from right to left,
8507       so we are no better off. However, if the pattern is written as
8508
8509         ^.*+(?<=abcd)
8510
8511       there can be no backtracking for the .*+ item because of the possessive
8512       quantifier; it can match only the entire string. The subsequent lookbe-
8513       hind assertion does a single test on the last four  characters.  If  it
8514       fails,  the  match  fails  immediately. For long strings, this approach
8515       makes a significant difference to the processing time.
8516
8517   Using multiple assertions
8518
8519       Several assertions (of any sort) may occur in succession. For example,
8520
8521         (?<=\d{3})(?<!999)foo
8522
8523       matches "foo" preceded by three digits that are not "999". Notice  that
8524       each  of  the  assertions is applied independently at the same point in
8525       the subject string. First there is a  check  that  the  previous  three
8526       characters  are  all  digits,  and  then there is a check that the same
8527       three characters are not "999". This pattern does not match "foo"  pre-
8528       ceded  by  six  characters, the first three of which are digits and the
8529       last three of which are not "999". For example, it does not match
8530       "123abcfoo". A pattern to do that is
8531
8532         (?<=\d{3}...)(?<!999)foo
8533
8534       This  time  the  first assertion looks at the preceding six characters,
8535       checking that the first three are digits, and then the second assertion
8536       checks that the preceding three characters are not "999".
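
       The difference between the two patterns shows up on a handful of  sub-
       jects.  The  sketch assumes Python's re module, which accepts both look-
       behinds because each has a fixed length; SAME_THREE, SIX_BACK and hits
       are invented names.

```python
import re

SAME_THREE = re.compile(r"(?<=\d{3})(?<!999)foo")
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def hits(pattern, text):
    """True if `pattern` matches anywhere in `text`."""
    return pattern.search(text) is not None


for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject, hits(SAME_THREE, subject), hits(SIX_BACK, subject))
```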
8537
8538       Assertions can be nested in any combination. For example,
8539
8540         (?<=(?<!foo)bar)baz
8541
8542       matches  an occurrence of "baz" that is preceded by "bar" which in turn
8543       is not preceded by "foo", while
8544
8545         (?<=\d{3}(?!999)...)foo
8546
8547       is another pattern that matches "foo" preceded by three digits and  any
8548       three characters that are not "999".
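
       Nested assertions can be tried in the same way. This sketch  again  as-
       sumes  Python's  re  module  in place of PCRE2, and the names BAZ, FOO
       and hits_nested are invented for the example.

```python
import re

BAZ = re.compile(r"(?<=(?<!foo)bar)baz")
FOO = re.compile(r"(?<=\d{3}(?!999)...)foo")


def hits_nested(pattern, text):
    """True if `pattern` matches anywhere in `text`."""
    return pattern.search(text) is not None


print(hits_nested(BAZ, "xyzbarbaz"))  # True: "bar" is not preceded by "foo"
print(hits_nested(BAZ, "foobarbaz"))  # False
print(hits_nested(FOO, "123abcfoo"))  # True
print(hits_nested(FOO, "123999foo"))  # False: the last three are "999"
```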
8549
8550
8551NON-ATOMIC ASSERTIONS
8552
8553       The  traditional Perl-compatible lookaround assertions are atomic. That
8554       is, if an assertion is true, but there is a subsequent  matching  fail-
8555       ure,  there  is  no backtracking into the assertion. However, there are
8556       some cases where non-atomic positive assertions can  be  useful.  PCRE2
8557       provides these using the following syntax:
8558
8559         (*non_atomic_positive_lookahead:  or (*napla: or (?*
8560         (*non_atomic_positive_lookbehind: or (*naplb: or (?<*
8561
8562       Consider  the  problem  of finding the right-most word in a string that
8563       also appears earlier in the string, that is, it must  appear  at  least
8564       twice  in  total.  This pattern returns the required result as captured
8565       substring 1:
8566
8567         ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
8568
8569       For a subject such as "word1 word2 word3 word2 word3 word4" the  result
8570       is  "word3".  How does it work? At the start, ^(?x) anchors the pattern
8571       and sets the "x" option, which causes white space (introduced for read-
8572       ability)  to  be  ignored. Inside the assertion, the greedy .* at first
8573       consumes the entire string, but then has to backtrack until the rest of
8574       the  assertion can match a word, which is captured by group 1. In other
8575       words, when the assertion first succeeds, it  captures  the  right-most
8576       word in the string.
8577
8578       The  current  matching point is then reset to the start of the subject,
8579       and the rest of the pattern match checks for  two  occurrences  of  the
8580       captured  word,  using  an  ungreedy .*? to scan from the left. If this
8581       succeeds, we are done, but if the last word in the string does not  oc-
8582       cur  twice,  this  part  of  the pattern fails. If a traditional atomic
8583       lookahead (?= or (*pla: had been used, the assertion could not  be  re-
8584       entered, and the whole match would fail. The pattern would succeed only
8585       if the very last word in the subject was found twice.
8586
8587       Using a non-atomic lookahead, however, means that when  the  last  word
8588       does  not  occur  twice  in the string, the lookahead can backtrack and
8589       find the second-last word, and so on, until either the match  succeeds,
8590       or all words have been tested.
8591
8592       Two conditions must be met for a non-atomic assertion to be useful: the
8593       contents of one or more capturing groups must change after a  backtrack
8594       into  the  assertion,  and  there  must be a backreference to a changed
8595       group later in the pattern. If this is not the case, the  rest  of  the
8596       pattern  match  fails exactly as before because nothing has changed, so
8597       using a non-atomic assertion just wastes resources.
8598
8599       There is one exception to backtracking into a non-atomic assertion.  If
8600       do not contain any control verbs such as (*ACCEPT). (This may change in
8601       the  future.)  Note  that  assertions  that  appear  as  conditions for
8602       conditional groups (see below) must be atomic.
8603
8604       Non-atomic  assertions  are  not  supported by the alternative matching
8605       function pcre2_dfa_match(). They are supported by JIT, but only if they
8606       do not contain any control verbs such as (*ACCEPT). (This may change in
8607       future). Note that assertions that appear as conditions for conditional
8608       groups (see below) must be atomic.
8609
8610
8611SCRIPT RUNS
8612
8613       In  concept, a script run is a sequence of characters that are all from
8614       the same Unicode script such as Latin or Greek. However,  because  some
8615       scripts  are  commonly  used together, and because some diacritical and
8616       other marks are used with multiple scripts,  it  is  not  that  simple.
8617       There is a full description of the rules that PCRE2 uses in the section
8618       entitled "Script Runs" in the pcre2unicode documentation.
8619
8620       If part of a pattern is enclosed between (*script_run: or (*sr:  and  a
8621       closing  parenthesis,  it  fails  if the sequence of characters that it
8622       matches is not a script run. After a failure, normal backtracking  oc-
8623       curs.  Script runs can be used to detect spoofing attacks using charac-
8624       ters that look the same, but are from  different  scripts.  The  string
8625       "paypal.com"  is an infamous example, where the letters could be a mix-
8626       ture of Latin and Cyrillic. This pattern ensures that the matched char-
8627       acters in a sequence of non-spaces that follow white space are a script
8628       run:
8629
8630         \s+(*sr:\S+)
8631
8632       To be sure that they are all from the Latin  script  (for  example),  a
8633       lookahead can be used:
8634
8635         \s+(?=\p{Latin})(*sr:\S+)
8636
8637       This works as long as the first character is expected to be a character
8638       in that script, and not (for example)  punctuation,  which  is  allowed
8639       with  any script. If this is not the case, a more creative lookahead is
8640       needed. For example, if digits, underscore, and dots are  permitted  at
8641       the start:
8642
8643         \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
8644
8645
8646       In  many  cases, backtracking into a script run pattern fragment is not
8647       desirable. The script run can employ an atomic group to  prevent  this.
8648       Because  this is a common requirement, a shorthand notation is provided
8649       by (*atomic_script_run: or (*asr:
8650
8651         (*asr:...) is the same as (*sr:(?>...))
8652
8653       Note that the atomic group is inside the script run. Putting it outside
8654       would not prevent backtracking into the script run pattern.
8655
8656       Support  for  script runs is not available if PCRE2 is compiled without
8657       Unicode support. A compile-time error is given if any of the above con-
8658       structs  is encountered. Script runs are not supported by the alternate
8659       matching function, pcre2_dfa_match() because they use the  same  mecha-
8660       nism as capturing parentheses.
8661
8662       Warning:  The  (*ACCEPT)  control  verb  (see below) should not be used
8663       within a script run group, because it causes an immediate exit from the
8664       group, bypassing the script run checking.
8665
8666
8667CONDITIONAL GROUPS
8668
8669       It is possible to cause the matching process to obey a pattern fragment
8670       conditionally or to choose between two alternative fragments, depending
8671       on  the result of an assertion, or whether a specific capture group has
8672       already been matched. The two possible forms of conditional group are:
8673
8674         (?(condition)yes-pattern)
8675         (?(condition)yes-pattern|no-pattern)
8676
8677       If the condition is satisfied, the yes-pattern is used;  otherwise  the
8678       no-pattern  (if present) is used. An absent no-pattern is equivalent to
8679       an empty string (it always matches). If there are more than two  alter-
8680       natives  in the group, a compile-time error occurs. Each of the two al-
8681       ternatives may itself contain nested groups of any form, including con-
8682       ditional  groups;  the  restriction to two alternatives applies only at
8683       the level of the condition itself. This pattern fragment is an  example
8684       where the alternatives are complex:
8685
8686         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
8687
8688
8689       There are five kinds of condition: references to capture groups, refer-
8690       ences to recursion, two pseudo-conditions called  DEFINE  and  VERSION,
8691       and assertions.
8692
8693   Checking for a used capture group by number
8694
8695       If  the  text between the parentheses consists of a sequence of digits,
8696       the condition is true if a capture group of that number has  previously
8697       matched.  If  there is more than one capture group with the same number
8698       (see the earlier section about duplicate group numbers), the  condition
8699       is true if any of them have matched. An alternative notation is to pre-
8700       cede the digits with a plus or minus sign. In this case, the group num-
8701       ber  is relative rather than absolute. The most recently opened capture
8702       group can be referenced by (?(-1), the next most recent by (?(-2),  and
8703       so  on.  Inside  loops  it  can  also make sense to refer to subsequent
8704       groups. The next capture group can be referenced as (?(+1), and so  on.
8705       (The  value  zero in any of these forms is not used; it provokes a com-
8706       pile-time error.)
8707
8708       Consider the following pattern, which  contains  non-significant  white
8709       space  to  make it more readable (assume the PCRE2_EXTENDED option) and
8710       to divide it into three parts for ease of discussion:
8711
8712         ( \( )?    [^()]+    (?(1) \) )
8713
8714       The first part matches an optional opening  parenthesis,  and  if  that
8715       character is present, sets it as the first captured substring. The sec-
8716       ond part matches one or more characters that are not  parentheses.  The
8717       third  part  is a conditional group that tests whether or not the first
8718       capture group matched. If it did, that is, if the subject started  with
8719       an opening parenthesis, the condition is true, and so the yes-pattern is
8720       executed and a closing parenthesis is required.  Otherwise,  since  no-
8721       pattern is not present, the conditional group matches nothing. In other
8722       words, this pattern matches a sequence of  non-parentheses,  optionally
8723       enclosed in parentheses.
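
       The three-part pattern can be exercised directly. The sketch that fol-
       lows  assumes Python's re module, whose re.VERBOSE flag plays the role
       of PCRE2_EXTENDED here and which accepts the same (?(1)...)  condition;
       PAREN and accepts are invented names.

```python
import re

# re.VERBOSE makes the white space in the pattern non-significant.
PAREN = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def accepts(text):
    """True if the whole of `text` matches the pattern."""
    return PAREN.fullmatch(text) is not None


print(accepts("(abc)"))  # True: opening parenthesis seen, closing required
print(accepts("abc"))    # True: group 1 unset, nothing more required
print(accepts("(abc"))   # False: the required closing parenthesis is missing
print(accepts("abc)"))   # False: the trailing parenthesis is left unmatched
```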
8724
8725       If  you  were  embedding  this pattern in a larger one, you could use a
8726       relative reference:
8727
8728         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
8729
8730       This makes the fragment independent of the parentheses  in  the  larger
8731       pattern.
8732
8733   Checking for a used capture group by name
8734
8735       Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
8736       used capture group by name. For compatibility with earlier versions  of
8737       PCRE1,  which had this facility before Perl, the syntax (?(name)...) is
8738       also recognized.  Note, however, that undelimited names  consisting  of
8739       the  letter  R followed by digits are ambiguous (see the following sec-
8740       tion). Rewriting the above example to use a named group gives this:
8741
8742         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
8743
8744       If the name used in a condition of this kind is a duplicate,  the  test
8745       is  applied  to  all groups of the same name, and is true if any one of
8746       them has matched.
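
       A named version can be sketched with Python's re module, assumed  here
       as a stand-in for PCRE2. Python does not accept (?(<OPEN>)...), so the
       sketch uses the Python group syntax (?P<OPEN>...) together with the un-
       delimited  condition  (?(OPEN)...), both of which the text describes as
       recognized by PCRE2. NAMED_PAREN and accepts_named are invented names.

```python
import re

NAMED_PAREN = re.compile(
    r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE
)


def accepts_named(text):
    """True if the whole of `text` matches the pattern."""
    return NAMED_PAREN.fullmatch(text) is not None


print(accepts_named("(abc)"))  # True
print(accepts_named("abc"))    # True
print(accepts_named("(abc"))   # False
```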
8747
8748   Checking for pattern recursion
8749
8750       "Recursion" in this sense refers to any subroutine-like call  from  one
8751       part  of  the  pattern to another, whether or not it is actually recur-
8752       sive. See the sections entitled "Recursive  patterns"  and  "Groups  as
8753       subroutines" below for details of recursion and subroutine calls.
8754
8755       If  a  condition  is the string (R), and there is no capture group with
8756       the name R, the condition is true if matching is currently in a  recur-
8757       sion  or  subroutine call to the whole pattern or any capture group. If
8758       digits follow the letter R, and there is no group with that  name,  the
8759       condition  is  true  if  the  most recent call is into a group with the
8760       given number, which must exist somewhere in the overall  pattern.  This
8761       is a contrived example that is equivalent to a+b:
8762
8763         ((?(R1)a+|(?1)b))
8764
8765       However,  in  both  cases,  if there is a capture group with a matching
8766       name, the condition tests for its being set, as described in  the  sec-
8767       tion  above,  instead of testing for recursion. For example, creating a
8768       group with the name R1 by adding (?<R1>)  to  the  above  pattern  com-
8769       pletely changes its meaning.
8770
8771       If a name preceded by ampersand follows the letter R, for example:
8772
8773         (?(R&name)...)
8774
8775       the  condition  is true if the most recent recursion is into a group of
8776       that name (which must exist within the pattern).
8777
8778       This condition does not check the entire recursion stack. It tests only
8779       the  current  level.  If the name used in a condition of this kind is a
8780       duplicate, the test is applied to all groups of the same name,  and  is
8781       true if any one of them is the most recent recursion.
8782
8783       At "top level", all these recursion test conditions are false.
8784
8785   Defining capture groups for use by reference only
8786
8787       If the condition is the string (DEFINE), the condition is always false,
8788       even if there is a group with the name DEFINE. In this case, there  may
8789       be only one alternative in the rest of the conditional group. It is al-
8790       ways skipped if control reaches this point in the pattern; the idea  of
8791       DEFINE  is that it can be used to define subroutines that can be refer-
8792       enced from elsewhere. (The use of subroutines is described below.)  For
8793       example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
8794       could be written like this (ignore white space and line breaks):
8795
8796         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
8797         \b (?&byte) (\.(?&byte)){3} \b
8798
8799       The first part of the pattern is a DEFINE group  inside  which  another
8800       group  named "byte" is defined. This matches an individual component of
8801       an IPv4 address (a number less than 256). When  matching  takes  place,
8802       this  part  of  the pattern is skipped because DEFINE acts like a false
8803       condition. The rest of the pattern uses references to the  named  group
8804       to  match the four dot-separated components of an IPv4 address, insist-
8805       ing on a word boundary at each end.
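
       For comparison, the same "define once, reference several times" idea
       can be sketched in a regex flavour that lacks DEFINE. The following is
       a minimal sketch using Python's standard re module, which supports
       neither (?(DEFINE)...) nor (?&name); the names BYTE, IPV4, and
       is_ipv4 are illustrative only, and the subroutine call is imitated by
       substituting the fragment text into the pattern.

```python
import re

# Reusable fragment playing the role of the "byte" group in the DEFINE
# example: a decimal number less than 256.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Python's re has no (?(DEFINE)...) or (?&name), so the fragment text is
# substituted wherever PCRE2 would make a subroutine call.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def is_ipv4(text):
    """Return True if the whole of text is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None
```

       With these definitions, is_ipv4("192.168.23.245") is true, whereas a
       string whose last component is 256 is rejected.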
8806
8807   Checking the PCRE2 version
8808
8809       Programs that link with a PCRE2 library can check the version by  call-
8810       ing  pcre2_config()  with  appropriate arguments. Users of applications
8811       that do not have access to the underlying code cannot do this.  A  spe-
8812       cial  "condition" called VERSION exists to allow such users to discover
8813       which version of PCRE2 they are dealing with by using this condition to
8814       match  a string such as "yesno". VERSION must be followed either by "="
8815       or ">=" and a version number.  For example:
8816
8817         (?(VERSION>=10.4)yes|no)
8818
8819       This pattern matches "yes" if the PCRE2 version is greater than or equal
8820       to 10.4, or "no" otherwise. The fractional part of the version number
8821       may not contain more than two digits.
8822
8823   Assertion conditions
8824
8825       If the condition is not in any of the  above  formats,  it  must  be  a
8826       parenthesized  assertion.  This may be a positive or negative lookahead
8827       or lookbehind assertion. However, it must be a traditional  atomic  as-
8828       sertion, not one of the PCRE2-specific non-atomic assertions.
8829
8830       Consider  this  pattern,  again containing non-significant white space,
8831       and with the two alternatives on the second line:
8832
8833         (?(?=[^a-z]*[a-z])
8834         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
8835
8836       The condition is a positive lookahead assertion  that  matches  an  op-
8837       tional sequence of non-letters followed by a letter. In other words, it
8838       tests for the presence of at least one letter in the subject. If a let-
8839       ter  is  found,  the  subject is matched against the first alternative;
8840       otherwise it is  matched  against  the  second.  This  pattern  matches
8841       strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
8842       letters and dd are digits.
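
       An assertion condition of the form (?(?=COND)YES|NO) behaves like the
       alternation (?=COND)YES|(?!COND)NO. The sketch below uses that rewrite
       in Python's standard re module, which accepts only group numbers or
       names as conditions; DATE and matches_date are hypothetical names
       chosen for the illustration.

```python
import re

# (?(?=COND)YES|NO) rewritten as (?=COND)YES|(?!COND)NO.
DATE = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
      | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2}      # no letter lies ahead
    )
    """,
    re.VERBOSE,
)


def matches_date(text):
    """Return True if text has the form dd-aaa-dd or dd-dd-dd."""
    return DATE.fullmatch(text) is not None
```

       As with the conditional group, the second alternative is never tried
       when the lookahead succeeds, because its negated form then fails.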
8843
8844       When an assertion that is a condition contains capture groups, any cap-
8845       turing  that  occurs  in  a matching branch is retained afterwards, for
8846       both positive and negative assertions, because matching always  contin-
8847       ues  after  the  assertion, whether it succeeds or fails. (Compare non-
8848       conditional assertions, for which captures are retained only for  posi-
8849       tive assertions that succeed.)
8850
8851
8852COMMENTS
8853
8854       There are two ways of including comments in patterns that are processed
8855       by PCRE2. In both cases, the start of the comment  must  not  be  in  a
8856       character  class,  nor  in  the middle of any other sequence of related
8857       characters such as (?: or a group name or number. The  characters  that
8858       make up a comment play no part in the pattern matching.
8859
8860       The  sequence (?# marks the start of a comment that continues up to the
8861       next closing parenthesis. Nested parentheses are not permitted. If  the
8862       PCRE2_EXTENDED  or  PCRE2_EXTENDED_MORE  option  is set, an unescaped #
8863       character also introduces a comment, which in this  case  continues  to
8864       immediately  after  the next newline character or character sequence in
8865       the pattern. Which characters are interpreted as newlines is controlled
8866       by  an option passed to the compiling function or by a special sequence
8867       at the start of the pattern, as described in the section entitled "New-
8868       line conventions" above. Note that the end of this type of comment is a
8869       literal newline sequence in the pattern; escape sequences  that  happen
8870       to represent a newline do not count. For example, consider this pattern
8871       when PCRE2_EXTENDED is set, and the default newline convention (a  sin-
8872       gle linefeed character) is in force:
8873
8874         abc #comment \n still comment
8875
8876       On  encountering  the # character, pcre2_compile() skips along, looking
8877       for a newline in the pattern. The sequence \n is still literal at  this
8878       stage,  so  it does not terminate the comment. Only an actual character
8879       with the code value 0x0a (the default newline) does so.
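
       Python's re.VERBOSE mode treats #-comments in a comparable way (its
       newline is always a linefeed), so the distinction can be tried out
       with the standard library. This is an analogy in a different regex
       engine, not PCRE2 itself; the variable names are illustrative.

```python
import re

# Raw string: the pattern holds a backslash followed by "n". Inside a
# #-comment that escape is never interpreted, so the comment runs to the
# end of the pattern and only "abc" remains.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: Python turns \n into a real linefeed before the regex
# compiler sees it, so the comment stops there and "still comment" becomes
# part of the pattern (with its white space ignored).
literal = re.compile("abc #comment \n still comment", re.VERBOSE)
```

       Here escaped matches just "abc", whereas literal requires
       "abcstillcomment".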
8880
8881
8882RECURSIVE PATTERNS
8883
8884       Consider the problem of matching a string in parentheses, allowing  for
8885       unlimited  nested  parentheses.  Without the use of recursion, the best
8886       that can be done is to use a pattern that  matches  up  to  some  fixed
8887       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
8888       depth.
8889
8890       For some time, Perl has provided a facility that allows regular expres-
8891       sions  to recurse (amongst other things). It does this by interpolating
8892       Perl code in the expression at run time, and the code can refer to  the
8893       expression itself. A Perl pattern using code interpolation to solve the
8894       parentheses problem can be created like this:
8895
8896         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
8897
8898       The (?p{...}) item interpolates Perl code at run time, and in this case
8899       refers recursively to the pattern in which it appears.
8900
8901       Obviously,  PCRE2  cannot  support  the interpolation of Perl code. In-
8902       stead, it supports special syntax for recursion of the entire  pattern,
8903       and also for individual capture group recursion. After its introduction
8904       in PCRE1 and Python, this kind of recursion was subsequently introduced
8905       into Perl at release 5.10.
8906
8907       A  special  item  that consists of (? followed by a number greater than
8908       zero and a closing parenthesis is a recursive subroutine  call  of  the
8909       capture  group of the given number, provided that it occurs inside that
8910       group. (If not, it is a non-recursive subroutine  call,  which  is  de-
8911       scribed in the next section.) The special item (?R) or (?0) is a recur-
8912       sive call of the entire regular expression.
8913
8914       This PCRE2 pattern solves the nested parentheses  problem  (assume  the
8915       PCRE2_EXTENDED option is set so that white space is ignored):
8916
8917         \( ( [^()]++ | (?R) )* \)
8918
8919       First  it matches an opening parenthesis. Then it matches any number of
8920       substrings which can either be a sequence of non-parentheses, or a  re-
8921       cursive match of the pattern itself (that is, a correctly parenthesized
8922       substring).  Finally there is a closing parenthesis. Note the use of  a
8923       possessive  quantifier  to  avoid  backtracking  into sequences of non-
8924       parentheses.
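
       The steps just described can be mirrored by a small hand-written
       recursive function. This is a sketch in Python of what the pattern
       does, not of how PCRE2 implements it; match_parens is a hypothetical
       name, and the match is anchored at the given position.

```python
def match_parens(subject, pos=0):
    r"""Mirror \( ( [^()]++ | (?R) )* \) anchored at pos.

    Return the index just past the balanced group, or None on failure.
    """
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1                                   # \(
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":                          # \)
            return pos + 1
        if ch == "(":                          # (?R): recurse for nested group
            pos = match_parens(subject, pos)
            if pos is None:
                return None
        else:                                  # [^()]++ : never gives back
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return None                                # subject ended before \)
```

       Because a run of non-parentheses is consumed in one step and never
       revisited, a subject with an unclosed parenthesis fails after a
       single pass, which is the effect the possessive quantifier has in the
       pattern.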
8925
8926       If this were part of a larger pattern, you would not  want  to  recurse
8927       the entire pattern, so instead you could use this:
8928
8929         ( \( ( [^()]++ | (?1) )* \) )
8930
8931       We  have  put the pattern into parentheses, and caused the recursion to
8932       refer to them instead of the whole pattern.
8933
8934       In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
8935       tricky.  This is made easier by the use of relative references. Instead
8936       of (?1) in the pattern above you can write (?-2) to refer to the second
8937       most  recently  opened  parentheses  preceding  the recursion. In other
8938       words, a negative number counts capturing  parentheses  leftwards  from
8939       the point at which it is encountered.
8940
8941       Be  aware  however, that if duplicate capture group numbers are in use,
8942       relative references refer to the earliest group  with  the  appropriate
8943       number. Consider, for example:
8944
8945         (?|(a)|(b)) (c) (?-2)
8946
8947       The first two capture groups (a) and (b) are both numbered 1, and group
8948       (c) is number 2. When the reference (?-2) is  encountered,  the  second
8949       most  recently opened parentheses has the number 1, but it is the first
8950       such group (the (a) group) to which the recursion refers. This would be
8951       the  same if an absolute reference (?1) was used. In other words, rela-
8952       tive references are just a shorthand for computing a group number.
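
       The arithmetic behind relative references can be sketched as follows.
       This is an illustration in Python only; resolve_relative and its
       parameter names are hypothetical and are not part of the PCRE2 API.

```python
def resolve_relative(ref, groups_opened):
    """Turn a relative reference such as -2 or +1 into a group number.

    groups_opened is the highest capture group number opened to the left
    of the reference; duplicate numbers inside (?| count once.
    """
    if ref < 0:
        number = groups_opened + ref + 1       # -1 is the most recent group
    elif ref > 0:
        number = groups_opened + ref           # +1 is the next group to open
    else:
        raise ValueError("0 is not a relative reference")
    if number < 1:
        raise ValueError("reference points before the first group")
    return number
```

       For (?|(a)|(b)) (c) (?-2) the highest group number opened is 2, so
       resolve_relative(-2, 2) gives 1, the same as writing (?1).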
8953
8954       It is also possible to refer to subsequent capture groups,  by  writing
8955       references  such  as  (?+2). However, these cannot be recursive because
8956       the reference is not inside the parentheses that are  referenced.  They
8957       are  always  non-recursive  subroutine  calls, as described in the next
8958       section.
8959
8960       An alternative approach is to use named parentheses.  The  Perl  syntax
8961       for  this  is  (?&name);  PCRE1's earlier syntax (?P>name) is also sup-
8962       ported. We could rewrite the above example as follows:
8963
8964         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
8965
8966       If there is more than one group with the same name, the earliest one is
8967       used.
8968
8969       The example pattern that we have been looking at contains nested unlim-
8970       ited repeats, and so the use of a possessive  quantifier  for  matching
8971       strings  of  non-parentheses  is important when applying the pattern to
8972       strings that do not match. For example, when this pattern is applied to
8973
8974         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
8975
8976       it yields "no match" quickly. However, if a  possessive  quantifier  is
8977       not  used, the match runs for a very long time indeed because there are
8978       so many different ways the + and * repeats can carve  up  the  subject,
8979       and all have to be tested before failure can be reported.
8980
8981       At  the  end  of a match, the values of capturing parentheses are those
8982       from the outermost level. If you want to obtain intermediate values,  a
8983       callout function can be used (see below and the pcre2callout documenta-
8984       tion). If the pattern above is matched against
8985
8986         (ab(cd)ef)
8987
8988       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
8989       which  is  the last value taken on at the top level. If a capture group
8990       is not matched at the top level, its final  captured  value  is  unset,
8991       even  if it was (temporarily) set at a deeper level during the matching
8992       process.
8993
8994       Do not confuse the (?R) item with the condition (R),  which  tests  for
8995       recursion.   Consider  this pattern, which matches text in angle brack-
8996       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
8997       brackets  (that is, when recursing), whereas any characters are permit-
8998       ted at the outer level.
8999
9000         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
9001
9002       In this pattern, (?(R) is the start of a conditional  group,  with  two
9003       different  alternatives  for the recursive and non-recursive cases. The
9004       (?R) item is the actual recursive call.
9005
9006   Differences in recursion processing between PCRE2 and Perl
9007
9008       Some former differences between PCRE2 and Perl no longer exist.
9009
9010       Before release 10.30, recursion processing in PCRE2 differed from  Perl
9011       in  that  a  recursive  subroutine call was always treated as an atomic
9012       group. That is, once it had matched some of the subject string, it  was
9013       never  re-entered,  even if it contained untried alternatives and there
9014       was a subsequent matching failure. (Historical note:  PCRE  implemented
9015       recursion before Perl did.)
9016
9017       Starting  with  release 10.30, recursive subroutine calls are no longer
9018       treated as atomic. That is, they can be re-entered to try unused alter-
9019       natives  if  there  is a matching failure later in the pattern. This is
9020       now compatible with the way Perl works. If you want a  subroutine  call
9021       to be atomic, you must explicitly enclose it in an atomic group.
9022
9023       Supporting backtracking into recursions simplifies certain types of re-
9024       cursive pattern. For example, this pattern matches palindromic strings:
9025
9026         ^((.)(?1)\2|.?)$
9027
9028       The second branch in the group matches a single  central  character  in
9029       the  palindrome  when there are an odd number of characters, or nothing
9030       when there are an even number of characters, but in order  to  work  it
9031       has  to  be  able  to  try the second case when the rest of the pattern
9032       match fails. If you want to match typical palindromic phrases, the pat-
9033       tern  has  to  ignore  all  non-word characters, which can be done like
9034       this:
9035
9036         ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
9037
9038       If run with the PCRE2_CASELESS option,  this  pattern  matches  phrases
9039       such  as "A man, a plan, a canal: Panama!". Note the use of the posses-
9040       sive quantifier *+ to avoid backtracking  into  sequences  of  non-word
9041       characters. Without this, PCRE2 takes a great deal longer (ten times or
9042       more) to match typical phrases, and Perl takes so long that  you  think
9043       it has gone into a loop.
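
       The reliance of the simple palindrome pattern ^((.)(?1)\2|.?)$ on
       backtracking into the recursion can be made visible with a small
       model. The sketch below is written in Python and is not PCRE2 code;
       group1 and is_palindrome are hypothetical names. Each call yields
       every end position that the group can reach, in the order in which a
       backtracking matcher would try them, so a caller whose \2 test fails
       can ask the recursion for its next result.

```python
def group1(subject, pos):
    r"""Yield each end position at which ((.)(?1)\2|.?) can match from pos."""
    # First alternative: (.)(?1)\2
    if pos < len(subject):
        first = subject[pos]
        for inner_end in group1(subject, pos + 1):   # (?1), can be re-entered
            if inner_end < len(subject) and subject[inner_end] == first:
                yield inner_end + 1                  # \2 matched
    # Second alternative: .? (greedy, so one character first, then none)
    if pos < len(subject):
        yield pos + 1
    yield pos


def is_palindrome(subject):
    r"""Emulate ^((.)(?1)\2|.?)$ for a subject that contains no newlines."""
    return any(end == len(subject) for end in group1(subject, 0))
```

       For the subject "abba", the first result of the innermost recursion
       consumes too much, and the match succeeds only because a later, shorter
       result is then tried.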
9044
9045       Another  way  in which PCRE2 and Perl used to differ in their recursion
9046       processing is in the handling of captured  values.  Formerly  in  Perl,
9047       when  a  group  was called recursively or as a subroutine (see the next
9048       section), it had no access to any values that were captured outside the
9049       recursion,  whereas  in  PCRE2 these values can be referenced. Consider
9050       this pattern:
9051
9052         ^(.)(\1|a(?2))
9053
9054       This pattern matches "bab". The first capturing parentheses match  "b",
9055       then in the second group, when the backreference \1 fails to match "b",
9056       the second alternative matches "a" and then recurses. In the recursion,
9057       \1  does now match "b" and so the whole match succeeds. This match used
9058       to fail in Perl, but in later versions (I tried 5.024) it now works.
9059
9060
9061GROUPS AS SUBROUTINES
9062
9063       If the syntax for a recursive group call (either by number or by  name)
9064       is  used  outside the parentheses to which it refers, it operates a bit
9065       like a subroutine in a programming  language.  More  accurately,  PCRE2
9066       treats the referenced group as an independent subpattern which it tries
9067       to match at the current matching position. The called group may be  de-
9068       fined  before or after the reference. A numbered reference can be abso-
9069       lute or relative, as in these examples:
9070
9071         (...(absolute)...)...(?2)...
9072         (...(relative)...)...(?-1)...
9073         (...(?+1)...(relative)...
9074
9075       An earlier example pointed out that the pattern
9076
9077         (sens|respons)e and \1ibility
9078
9079       matches "sense and sensibility" and "response and responsibility",  but
9080       not "sense and responsibility". If instead the pattern
9081
9082         (sens|respons)e and (?1)ibility
9083
9084       is  used, it does match "sense and responsibility" as well as the other
9085       two strings. Another example is  given  in  the  discussion  of  DEFINE
9086       above.
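
       The difference between a backreference and a subroutine call can be
       tried out in Python's standard re module, with one caveat: re has no
       (?1), so the sketch below imitates the non-recursive call by repeating
       the text of the subpattern. STEM, backref, and subroutine are
       illustrative names, and the imitation reproduces only which strings
       match, not the capture behaviour.

```python
import re

STEM = r"(?:sens|respons)"

# \1 must repeat the text that group 1 actually captured.
backref = re.compile(r"(sens|respons)e and \1ibility")

# A subroutine call re-runs the group's pattern, so either stem may follow.
# Repeating the subpattern text accepts the same strings as (?1) would.
subroutine = re.compile(rf"{STEM}e and {STEM}ibility")
```

       Only the second pattern accepts "sense and responsibility".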
9087
9088       Like  recursions,  subroutine  calls  used to be treated as atomic, but
9089       this changed at PCRE2 release 10.30, so  backtracking  into  subroutine
9090       calls  can  now  occur. However, any capturing parentheses that are set
9091       during the subroutine call revert to their previous values afterwards.
9092
9093       Processing options such as case-independence are fixed when a group  is
9094       defined,  so  if  it  is  used  as a subroutine, such options cannot be
9095       changed for different calls. For example, consider this pattern:
9096
9097         (abc)(?i:(?-1))
9098
9099       It matches "abcabc". It does not match "abcABC" because the  change  of
9100       processing option does not affect the called group.
9101
9102       The  behaviour  of  backtracking control verbs in groups when called as
9103       subroutines is described in the section entitled "Backtracking verbs in
9104       subroutines" below.
9105
9106
9107ONIGURUMA SUBROUTINE SYNTAX
9108
9109       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
9110       name or a number enclosed either in angle brackets or single quotes, is
9111       an alternative syntax for calling a group as a subroutine, possibly re-
9112       cursively. Here are two of the examples  used  above,  rewritten  using
9113       this syntax:
9114
9115         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
9116         (sens|respons)e and \g'1'ibility
9117
9118       PCRE2  supports an extension to Oniguruma: if a number is preceded by a
9119       plus or a minus sign it is taken as a relative reference. For example:
9120
9121         (abc)(?i:\g<-1>)
9122
9123       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
9124       synonymous.  The  former is a backreference; the latter is a subroutine
9125       call.
9126
9127
9128CALLOUTS
9129
9130       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
9131       Perl  code to be obeyed in the middle of matching a regular expression.
9132       This makes it possible, amongst other things, to extract different sub-
9133       strings that match the same pair of parentheses when there is a repeti-
9134       tion.
9135
9136       PCRE2 provides a similar feature, but of course it  cannot  obey  arbi-
9137       trary  Perl  code. The feature is called "callout". The caller of PCRE2
9138       provides an external function by putting its entry  point  in  a  match
9139       context  using  the function pcre2_set_callout(), and then passing that
9140       context to pcre2_match() or pcre2_dfa_match(). If no match  context  is
9141       passed, or if the callout entry point is set to NULL, callouts are dis-
9142       abled.
9143
9144       Within a regular expression, (?C<arg>) indicates a point at  which  the
9145       external  function  is  to  be  called. There are two kinds of callout:
9146       those with a numerical argument and those with a string argument.  (?C)
9147       on  its  own with no argument is treated as (?C0). A numerical argument
9148       allows the  application  to  distinguish  between  different  callouts.
9149       String  arguments  were added for release 10.20 to make it possible for
9150       script languages that use PCRE2 to embed short scripts within  patterns
9151       in a similar way to Perl.
9152
9153       During matching, when PCRE2 reaches a callout point, the external func-
9154       tion is called. It is provided with the number or  string  argument  of
9155       the  callout, the position in the pattern, and one item of data that is
9156       also set in the match block. The callout function may cause matching to
9157       proceed, to backtrack, or to fail.
9158
9159       By  default,  PCRE2  implements  a  number of optimizations at matching
9160       time, and one side-effect is that sometimes callouts  are  skipped.  If
9161       you  need all possible callouts to happen, you need to set options that
9162       disable the relevant optimizations. More details, including a  complete
9163       description  of  the programming interface to the callout function, are
9164       given in the pcre2callout documentation.
9165
9166   Callouts with numerical arguments
9167
9168       If you just want to have  a  means  of  identifying  different  callout
9169       points,  put  a  number  less than 256 after the letter C. For example,
9170       this pattern has two callout points:
9171
9172         (?C1)abc(?C2)def
9173
9174       If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(),  numerical
9175       callouts  are  automatically installed before each item in the pattern.
9176       They are all numbered 255. If there is a conditional group in the  pat-
9177       tern whose condition is an assertion, an additional callout is inserted
9178       just before the condition. An explicit callout may also be set at  this
9179       position, as in this example:
9180
9181         (?(?C9)(?=a)abc|def)
9182
9183       Note that this applies only to assertion conditions, not to other types
9184       of condition.
9185
9186   Callouts with string arguments
9187
9188       A delimited string may be used instead of a number as a  callout  argu-
9189       ment.  The  starting  delimiter  must be one of ` ' " ^ % # $ { and the
9190       ending delimiter is the same as the start, except for {, where the end-
9191       ing  delimiter  is  }.  If  the  ending  delimiter is needed within the
9192       string, it must be doubled. For example:
9193
9194         (?C'ab ''c'' d')xyz(?C{any text})pqr
9195
9196       The doubling is removed before the string  is  passed  to  the  callout
9197       function.
9198
9199
9200BACKTRACKING CONTROL
9201
9202       There  are  a  number  of  special "Backtracking Control Verbs" (to use
9203       Perl's terminology) that modify the behaviour  of  backtracking  during
9204       matching.  They are generally of the form (*VERB) or (*VERB:NAME). Some
9205       verbs take either form, and may behave differently depending on whether
9206       or  not  a  name  argument is present. The names are not required to be
9207       unique within the pattern.
9208
9209       By default, for compatibility with Perl, a  name  is  any  sequence  of
9210       characters that does not include a closing parenthesis. The name is not
9211       processed in any way, and it is  not  possible  to  include  a  closing
9212       parenthesis   in  the  name.   This  can  be  changed  by  setting  the
9213       PCRE2_ALT_VERBNAMES option, but the result is no  longer  Perl-compati-
9214       ble.
9215
9216       When  PCRE2_ALT_VERBNAMES  is  set,  backslash processing is applied to
9217       verb names and only an unescaped  closing  parenthesis  terminates  the
9218       name.  However, the only backslash items that are permitted are \Q, \E,
9219       and sequences such as \x{100} that define character code points.  Char-
9220       acter type escapes such as \d are faulted.
9221
9222       A closing parenthesis can be included in a name either as \) or between
9223       \Q and \E. In addition to backslash processing, if  the  PCRE2_EXTENDED
9224       or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
9225       names is skipped, and #-comments are recognized, exactly as in the rest
9226       of  the  pattern.  PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
9227       verb names unless PCRE2_ALT_VERBNAMES is also set.
9228
9229       The maximum length of a name is 255 in the 8-bit library and  65535  in
9230       the  16-bit and 32-bit libraries. If the name is empty, that is, if the
9231       closing parenthesis immediately follows the colon, the effect is as  if
9232       the colon were not there. Any number of these verbs may occur in a pat-
9233       tern. Except for (*ACCEPT), they may not be quantified.
9234
9235       Since these verbs are specifically related  to  backtracking,  most  of
9236       them  can be used only when the pattern is to be matched using the tra-
9237       ditional matching function, because that uses a backtracking algorithm.
9238       With  the  exception  of (*FAIL), which behaves like a failing negative
9239       assertion, the backtracking control verbs cause an error if encountered
9240       by the DFA matching function.
9241
9242       The  behaviour  of  these  verbs in repeated groups, assertions, and in
9243       capture groups called as subroutines (whether or  not  recursively)  is
9244       documented below.
9245
9246   Optimizations that affect backtracking verbs
9247
9248       PCRE2 contains some optimizations that are used to speed up matching by
9249       running some checks at the start of each match attempt. For example, it
9250       may  know  the minimum length of matching subject, or that a particular
9251       character must be present. When one of these optimizations bypasses the
9252       running  of  a  match,  any  included  backtracking  verbs will not, of
9253       course, be processed. You can suppress the start-of-match optimizations
9254       by  setting  the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
9255       pile(), or by starting the pattern with (*NO_START_OPT). There is  more
9256       discussion of this option in the section entitled "Compiling a pattern"
9257       in the pcre2api documentation.
9258
9259       Experiments with Perl suggest that it too  has  similar  optimizations,
9260       and like PCRE2, turning them off can change the result of a match.
9261
9262   Verbs that act immediately
9263
9264       The following verbs act as soon as they are encountered.
9265
9266          (*ACCEPT) or (*ACCEPT:NAME)
9267
9268       This  verb causes the match to end successfully, skipping the remainder
9269       of the pattern. However, when it is inside  a  capture  group  that  is
9270       called as a subroutine, only that group is ended successfully. Matching
9271       then continues at the outer level. If (*ACCEPT) is triggered in a posi-
9272       tive  assertion,  the  assertion succeeds; in a negative assertion, the
9273       assertion fails.
9274
9275       If (*ACCEPT) is inside capturing parentheses, the data so far  is  cap-
9276       tured. For example:
9277
9278         A((?:A|B(*ACCEPT)|C)D)
9279
9280       This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
9281       tured by the outer parentheses.
9282
9283       (*ACCEPT) is the only backtracking verb that is allowed to  be  quanti-
9284       fied  because  an  ungreedy  quantification with a minimum of zero acts
9285       only when a backtrack happens. Consider, for example,
9286
9287         (A(*ACCEPT)??B)C
9288
9289       where A, B, and C may be complex expressions. After matching  "A",  the
9290       matcher  processes  "BC"; if that fails, causing a backtrack, (*ACCEPT)
9291       is triggered and the match succeeds. In both cases, all but C  is  cap-
9292       tured.  Whereas  (*COMMIT) (see below) means "fail on backtrack", a re-
9293       peated (*ACCEPT) of this type means "succeed on backtrack".
9294
9295       Warning: (*ACCEPT) should not be used within a script  run  group,  be-
9296       cause  it causes an immediate exit from the group, bypassing the script
9297       run checking.
9298
9299         (*FAIL) or (*FAIL:NAME)
9300
9301       This verb causes a matching failure, forcing backtracking to occur.  It
9302       may  be  abbreviated  to  (*F).  It is equivalent to (?!) but easier to
9303       read. The Perl documentation notes that it is probably useful only when
9304       combined with (?{}) or (??{}). Those are, of course, Perl features that
9305       are not present in PCRE2. The nearest equivalent is  the  callout  fea-
9306       ture, as for example in this pattern:
9307
9308         a+(?C)(*FAIL)
9309
9310       A  match  with the string "aaaa" always fails, but the callout is taken
9311       before each backtrack happens (in this example, 10 times).
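
       The figure of 10 can be checked by counting. The sketch below is a
       Python model of the unanchored search, assuming that every starting
       position is tried and that no optimization skips a callout;
       count_callouts is a hypothetical name.

```python
def count_callouts(subject):
    """Count how often (?C) is reached while a+(?C)(*FAIL) fails."""
    calls = 0
    for start in range(len(subject)):           # unanchored: try each start
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one "a" on each
        # backtrack, so the callout is reached once for each of the
        # lengths run, run-1, ..., 1.
        calls += run
    return calls
```

       For "aaaa" the runs found at the four starting positions have lengths
       4, 3, 2, and 1, which sum to 10.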
9312
9313       (*ACCEPT:NAME) and (*FAIL:NAME) behave the  same  as  (*MARK:NAME)(*AC-
9314       CEPT)  and  (*MARK:NAME)(*FAIL),  respectively,  that  is, a (*MARK) is
9315       recorded just before the verb acts.
9316
9317   Recording which path was taken
9318
9319       There is one verb whose main purpose is to track how a  match  was  ar-
9320       rived  at,  though  it also has a secondary use in conjunction with ad-
9321       vancing the match starting point (see (*SKIP) below).
9322
9323         (*MARK:NAME) or (*:NAME)
9324
9325       A name is always required with this verb. For all the other  backtrack-
9326       ing control verbs, a NAME argument is optional.
9327
9328       When a match succeeds, the last-encountered mark name on
9329       the matching path is passed back to the caller as described in the sec-
9330       tion entitled "Other information about the match" in the pcre2api docu-
9331       mentation. This applies to all instances of (*MARK)  and  other  verbs,
9332       including those inside assertions and atomic groups. However, there are
9333       differences in those cases when (*MARK) is  used  in  conjunction  with
9334       (*SKIP) as described below.
9335
9336       The  mark name that was last encountered on the matching path is passed
9337       back. A verb without a NAME argument is ignored for this purpose.  Here
9338       is  an  example of pcre2test output, where the "mark" modifier requests
9339       the retrieval and outputting of (*MARK) data:
9340
9341           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
9342         data> XY
9343          0: XY
9344         MK: A
9345         XZ
9346          0: XZ
9347         MK: B
9348
9349       The (*MARK) name is tagged with "MK:" in this output, and in this exam-
9350       ple  it indicates which of the two alternatives matched. This is a more
9351       efficient way of obtaining this information than putting each  alterna-
9352       tive in its own capturing parentheses.
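
       For comparison, this is what the capturing-parentheses approach looks
       like in a flavour without (*MARK). The sketch uses Python's standard
       re module; WHICH and taken_path are illustrative names. Unlike
       (*MARK), it reports nothing after a failed match.

```python
import re

# Each alternative gets its own named group; after a successful match,
# lastgroup names the group that took part.
WHICH = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def taken_path(subject):
    """Return "A" or "B" for the alternative that matched, else None."""
    m = WHICH.search(subject)
    return m.lastgroup if m else None
```

       The result is "A" for the subject XY and "B" for XZ, corresponding to
       the MK: lines in the pcre2test output above.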
9353
9354       If  a  verb  with a name is encountered in a positive assertion that is
9355       true, the name is recorded and passed back if it  is  the  last-encoun-
9356       tered. This does not happen for negative assertions or failing positive
9357       assertions.
9358
9359       After a partial match or a failed match, the last encountered  name  in
9360       the entire match process is returned. For example:
9361
9362           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
9363         data> XP
9364         No match, mark = B
9365
9366       Note  that  in  this  unanchored  example the mark is retained from the
9367       match attempt that started at the letter "X" in the subject. Subsequent
9368       match attempts starting at "P" and then with an empty string do not get
9369       as far as the (*MARK) item, but nevertheless do not reset it.
9370
9371       If you are interested in  (*MARK)  values  after  failed  matches,  you
9372       should  probably  set the PCRE2_NO_START_OPTIMIZE option (see above) to
9373       ensure that the match is always attempted.
9374
9375   Verbs that act after backtracking
9376
9377       The following verbs do nothing when they are encountered. Matching con-
9378       tinues  with  what follows, but if there is a subsequent match failure,
9379       causing a backtrack to the verb, a failure is forced.  That  is,  back-
9380       tracking  cannot  pass  to  the  left of the verb. However, when one of
9381       these verbs appears inside an atomic group or in a lookaround assertion
9382       that  is  true,  its effect is confined to that group, because once the
9383       group has been matched, there is never any backtracking into it.  Back-
9384       tracking from beyond an assertion or an atomic group ignores the entire
9385       group, and seeks a preceding backtracking point.
9386
9387       These verbs differ in exactly what kind of failure  occurs  when  back-
9388       tracking  reaches  them.  The behaviour described below is what happens
9389       when the verb is not in a subroutine or an assertion.  Subsequent  sec-
9390       tions cover these special cases.
9391
9392         (*COMMIT) or (*COMMIT:NAME)
9393
9394       This  verb  causes the whole match to fail outright if there is a later
9395       matching failure that causes backtracking to reach it. Even if the pat-
9396       tern  is  unanchored,  no further attempts to find a match by advancing
9397       the starting point take place. If (*COMMIT) is  the  only  backtracking
       verb that is encountered, once it has been passed, pcre2_match() is
       committed to finding a match at the current starting point, or not at
       all.
9400       For example:
9401
9402         a+(*COMMIT)b
9403
9404       This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
9405       of dynamic anchor, or "I've started, so I must finish."
9406
9407       The behaviour of (*COMMIT:NAME) is not the same  as  (*MARK:NAME)(*COM-
9408       MIT).  It is like (*MARK:NAME) in that the name is remembered for pass-
9409       ing back to the caller. However, (*SKIP:NAME) searches only  for  names
9410       that are set with (*MARK), ignoring those set by any of the other back-
9411       tracking verbs.
9412
9413       If there is more than one backtracking verb in a pattern,  a  different
9414       one  that  follows  (*COMMIT) may be triggered first, so merely passing
9415       (*COMMIT) during a match does not always guarantee that a match must be
9416       at this starting point.
9417
9418       Note that (*COMMIT) at the start of a pattern is not the same as an an-
9419       chor, unless PCRE2's start-of-match optimizations are  turned  off,  as
9420       shown in this output from pcre2test:
9421
9422           re> /(*COMMIT)abc/
9423         data> xyzabc
9424          0: abc
9425         data>
           re> /(*COMMIT)abc/no_start_optimize
9427         data> xyzabc
9428         No match
9429
9430       For  the first pattern, PCRE2 knows that any match must start with "a",
9431       so the optimization skips along the subject to "a" before applying  the
9432       pattern  to the first set of data. The match attempt then succeeds. The
9433       second pattern disables the optimization that skips along to the  first
9434       character.  The  pattern  is  now  applied  starting at "x", and so the
9435       (*COMMIT) causes the match to fail without trying  any  other  starting
9436       points.
9437
9438         (*PRUNE) or (*PRUNE:NAME)
9439
9440       This  verb causes the match to fail at the current starting position in
9441       the subject if there is a later matching failure that causes backtrack-
9442       ing  to  reach it. If the pattern is unanchored, the normal "bumpalong"
9443       advance to the next starting character then happens.  Backtracking  can
9444       occur  as  usual to the left of (*PRUNE), before it is reached, or when
9445       matching to the right of (*PRUNE), but if there  is  no  match  to  the
9446       right,  backtracking cannot cross (*PRUNE). In simple cases, the use of
9447       (*PRUNE) is just an alternative to an atomic group or possessive  quan-
9448       tifier, but there are some uses of (*PRUNE) that cannot be expressed in
9449       any other way. In an anchored pattern (*PRUNE) has the same  effect  as
9450       (*COMMIT).
9451
9452       The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
9453       It is like (*MARK:NAME) in that the name is remembered for passing back
9454       to  the  caller. However, (*SKIP:NAME) searches only for names set with
9455       (*MARK), ignoring those set by other backtracking verbs.
9456
9457         (*SKIP)
9458
9459       This verb, when given without a name, is like (*PRUNE), except that  if
9460       the  pattern  is unanchored, the "bumpalong" advance is not to the next
9461       character, but to the position in the subject where (*SKIP) was encoun-
9462       tered.  (*SKIP)  signifies that whatever text was matched leading up to
9463       it cannot be part of a successful match if there is a  later  mismatch.
9464       Consider:
9465
9466         a+(*SKIP)b
9467
9468       If  the  subject  is  "aaaac...",  after  the first match attempt fails
9469       (starting at the first character in the  string),  the  starting  point
9470       skips on to start the next attempt at "c". Note that a possessive quan-
9471       tifier does not have the same effect as this example; although it would
9472       suppress  backtracking  during  the first match attempt, the second at-
9473       tempt would start at the second character instead  of  skipping  on  to
9474       "c".
9475
9476       If  (*SKIP) is used to specify a new starting position that is the same
9477       as the starting position of the current match, or (by  being  inside  a
9478       lookbehind)  earlier, the position specified by (*SKIP) is ignored, and
9479       instead the normal "bumpalong" occurs.
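
       The difference in starting points can be illustrated with a toy  model
       in C. This is only a sketch: toy_match() is a hypothetical helper that
       hard-codes "one or more a, then b" and ignores PCRE2's  start-of-match
       optimizations;  it is not how PCRE2 is implemented. It records each
       starting offset that is tried, using either the normal bumpalong (as for
       a++b) or the (*SKIP) rule (as for a+(*SKIP)b).

```c
#include <stddef.h>
#include <string.h>

/* Toy model of an unanchored match of "one or more 'a', then 'b'".
   use_skip == 0 models a++b (normal bumpalong after a failure);
   use_skip != 0 models a+(*SKIP)b.
   Each start offset tried is stored in starts[] (up to max_starts) and
   the number of attempts is written to *n_attempts.
   Returns the offset at which a match starts, or -1 if there is none. */
static long toy_match(const char *subject, int use_skip,
                      size_t *starts, size_t max_starts, size_t *n_attempts)
{
    size_t len = strlen(subject);
    size_t start = 0;

    *n_attempts = 0;
    while (start < len) {
        if (*n_attempts < max_starts) starts[*n_attempts] = start;
        (*n_attempts)++;

        /* Greedy a+: count the run of 'a' characters at this start. */
        size_t run = 0;
        while (start + run < len && subject[start + run] == 'a') run++;

        if (run > 0 && subject[start + run] == 'b')
            return (long)start;            /* a+ then b: success */

        /* Failure. (*SKIP) is reached only if a+ matched something; the
           next attempt then begins where (*SKIP) was encountered. */
        if (use_skip && run > 0)
            start += run;
        else
            start += 1;                    /* normal "bumpalong" */
    }
    return -1;
}
```

       For the subject "aaaacab", the normal bumpalong tries offsets 0, 1,  2,
       3,  4,  and  5,  whereas the (*SKIP) variant tries only 0, 4, and 5; both
       find the match at offset 5. In this model (*SKIP) is reached only after
       at  least  one  "a"  has been matched, so the new starting position is
       always later than the current one.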
9480
9481         (*SKIP:NAME)
9482
9483       When (*SKIP) has an associated name, its behaviour  is  modified.  When
9484       such  a  (*SKIP) is triggered, the previous path through the pattern is
9485       searched for the most recent (*MARK) that has the same name. If one  is
9486       found,  the  "bumpalong" advance is to the subject position that corre-
9487       sponds to that (*MARK) instead of to where (*SKIP) was encountered.  If
9488       no (*MARK) with a matching name is found, the (*SKIP) is ignored.
9489
9490       The  search  for a (*MARK) name uses the normal backtracking mechanism,
9491       which means that it does not  see  (*MARK)  settings  that  are  inside
9492       atomic groups or assertions, because they are never re-entered by back-
9493       tracking. Compare the following pcre2test examples:
9494
           re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
         data> abc
          0: a
          1: a
         data>
           re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
         data> abc
          0: b
          1: b
9504
9505       In the first example, the (*MARK) setting is in an atomic group, so  it
9506       is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
9507       This allows the second branch of the pattern to be tried at  the  first
9508       character  position.  In the second example, the (*MARK) setting is not
9509       in an atomic group. This allows (*SKIP:X) to find the (*MARK)  when  it
9510       backtracks, and this causes a new matching attempt to start at the sec-
9511       ond character. This time, the (*MARK) is never seen  because  "a"  does
9512       not match "b", so the matcher immediately jumps to the second branch of
9513       the pattern.
9514
9515       Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME).  It
9516       ignores names that are set by other backtracking verbs.
9517
9518         (*THEN) or (*THEN:NAME)
9519
9520       This  verb  causes  a skip to the next innermost alternative when back-
9521       tracking reaches it. That  is,  it  cancels  any  further  backtracking
9522       within  the  current  alternative.  Its name comes from the observation
9523       that it can be used for a pattern-based if-then-else block:
9524
9525         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
9526
9527       If the COND1 pattern matches, FOO is tried (and possibly further  items
9528       after  the  end  of the group if FOO succeeds); on failure, the matcher
9529       skips to the second alternative and tries COND2,  without  backtracking
9530       into  COND1.  If that succeeds and BAR fails, COND3 is tried. If subse-
9531       quently BAZ fails, there are no more alternatives, so there is a  back-
9532       track  to  whatever came before the entire group. If (*THEN) is not in-
9533       side an alternation, it acts like (*PRUNE).
9534
9535       The behaviour of (*THEN:NAME) is not the same  as  (*MARK:NAME)(*THEN).
9536       It is like (*MARK:NAME) in that the name is remembered for passing back
9537       to the caller. However, (*SKIP:NAME) searches only for names  set  with
9538       (*MARK), ignoring those set by other backtracking verbs.
9539
9540       A  group  that does not contain a | character is just a part of the en-
9541       closing alternative; it is not a nested alternation with only  one  al-
9542       ternative. The effect of (*THEN) extends beyond such a group to the en-
9543       closing alternative.  Consider this pattern, where A, B, etc. are  com-
9544       plex  pattern  fragments  that  do not contain any | characters at this
9545       level:
9546
9547         A (B(*THEN)C) | D
9548
9549       If A and B are matched, but there is a failure in C, matching does  not
9550       backtrack into A; instead it moves to the next alternative, that is, D.
9551       However, if the group containing (*THEN) is given  an  alternative,  it
9552       behaves differently:
9553
9554         A (B(*THEN)C | (*FAIL)) | D
9555
9556       The effect of (*THEN) is now confined to the inner group. After a fail-
9557       ure in C, matching moves to (*FAIL), which causes the  whole  group  to
9558       fail  because  there  are  no  more  alternatives to try. In this case,
9559       matching does backtrack into A.
9560
9561       Note that a conditional group is not considered as having two  alterna-
9562       tives,  because  only one is ever used. In other words, the | character
9563       in a conditional group has a different meaning. Ignoring  white  space,
9564       consider:
9565
9566         ^.*? (?(?=a) a | b(*THEN)c )
9567
9568       If the subject is "ba", this pattern does not match. Because .*? is un-
9569       greedy, it initially matches zero characters. The condition (?=a)  then
9570       fails,  the  character  "b"  is matched, but "c" is not. At this point,
9571       matching does not backtrack to .*? as might perhaps  be  expected  from
9572       the  presence  of the | character. The conditional group is part of the
9573       single alternative that comprises the whole pattern, and so  the  match
       fails. (If there were a backtrack into .*?, allowing it to  match  "b",
9575       the match would succeed.)
9576
9577       The verbs just described provide four different "strengths" of  control
9578       when subsequent matching fails. (*THEN) is the weakest, carrying on the
9579       match at the next alternative. (*PRUNE) comes next, failing  the  match
9580       at  the  current starting position, but allowing an advance to the next
9581       character (for an unanchored pattern). (*SKIP) is similar, except  that
9582       the advance may be more than one character. (*COMMIT) is the strongest,
9583       causing the entire match to fail.
9584
9585   More than one backtracking verb
9586
9587       If more than one backtracking verb is present in  a  pattern,  the  one
9588       that  is  backtracked  onto first acts. For example, consider this pat-
9589       tern, where A, B, etc. are complex pattern fragments:
9590
9591         (A(*COMMIT)B(*THEN)C|ABD)
9592
9593       If A matches but B fails, the backtrack to (*COMMIT) causes the  entire
9594       match to fail. However, if A and B match, but C fails, the backtrack to
9595       (*THEN) causes the next alternative (ABD) to be tried.  This  behaviour
9596       is  consistent,  but is not always the same as Perl's. It means that if
       two or more backtracking verbs appear in succession, all but  the  last
       of them have no effect. Consider this example:
9599
9600         ...(*COMMIT)(*PRUNE)...
9601
9602       If there is a matching failure to the right, backtracking onto (*PRUNE)
9603       causes it to be triggered, and its action is taken. There can never  be
9604       a backtrack onto (*COMMIT).
9605
9606   Backtracking verbs in repeated groups
9607
9608       PCRE2 sometimes differs from Perl in its handling of backtracking verbs
9609       in repeated groups. For example, consider:
9610
9611         /(a(*COMMIT)b)+ac/
9612
9613       If the subject is "abac", Perl matches  unless  its  optimizations  are
9614       disabled,  but  PCRE2  always fails because the (*COMMIT) in the second
9615       repeat of the group acts.
9616
9617   Backtracking verbs in assertions
9618
9619       (*FAIL) in any assertion has its normal effect: it forces an  immediate
9620       backtrack.  The  behaviour  of  the other backtracking verbs depends on
       whether the assertion is standalone or is acting as  the  condition
9622       in a conditional group.
9623
9624       (*ACCEPT)  in  a  standalone positive assertion causes the assertion to
9625       succeed without any further processing; captured  strings  and  a  mark
9626       name  (if  set) are retained. In a standalone negative assertion, (*AC-
9627       CEPT) causes the assertion to fail without any further processing; cap-
9628       tured substrings and any mark name are discarded.
9629
9630       If  the  assertion is a condition, (*ACCEPT) causes the condition to be
9631       true for a positive assertion and false for a  negative  one;  captured
9632       substrings are retained in both cases.
9633
9634       The remaining verbs act only when a later failure causes a backtrack to
9635       reach them. This means that, for the Perl-compatible assertions,  their
9636       effect is confined to the assertion, because Perl lookaround assertions
9637       are atomic. A backtrack that occurs after such an assertion is complete
9638       does  not  jump  back  into  the  assertion.  Note in particular that a
9639       (*MARK) name that is set in an assertion is not "seen" by  an  instance
9640       of (*SKIP:NAME) later in the pattern.
9641
9642       PCRE2  now supports non-atomic positive assertions, as described in the
9643       section entitled "Non-atomic assertions" above. These  assertions  must
9644       be  standalone  (not used as conditions). They are not Perl-compatible.
9645       For these assertions, a later backtrack does jump back into the  asser-
9646       tion,  and  therefore verbs such as (*COMMIT) can be triggered by back-
9647       tracks from later in the pattern.
9648
9649       The effect of (*THEN) is not allowed to escape beyond an assertion.  If
9650       there  are no more branches to try, (*THEN) causes a positive assertion
9651       to be false, and a negative assertion to be true.
9652
9653       The other backtracking verbs are not treated specially if  they  appear
9654       in  a  standalone  positive assertion. In a conditional positive asser-
9655       tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
9656       or  (*PRUNE) causes the condition to be false. However, for both stand-
9657       alone and conditional negative assertions, backtracking into (*COMMIT),
9658       (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9659       ing any further alternative branches.
9660
9661   Backtracking verbs in subroutines
9662
9663       These behaviours occur whether or not the group is called recursively.
9664
9665       (*ACCEPT) in a group called as a subroutine causes the subroutine match
9666       to  succeed without any further processing. Matching then continues af-
9667       ter the subroutine call. Perl documents this behaviour.  Perl's  treat-
9668       ment of the other verbs in subroutines is different in some cases.
9669
9670       (*FAIL)  in  a  group  called as a subroutine has its normal effect: it
9671       forces an immediate backtrack.
9672
9673       (*COMMIT), (*SKIP), and (*PRUNE) cause the  subroutine  match  to  fail
9674       when  triggered  by being backtracked to in a group called as a subrou-
9675       tine. There is then a backtrack at the outer level.
9676
9677       (*THEN), when triggered, skips to the next alternative in the innermost
9678       enclosing  group that has alternatives (its normal behaviour). However,
9679       if there is no such group within the subroutine's group, the subroutine
9680       match fails and there is a backtrack at the outer level.
9681
9682
9683SEE ALSO
9684
9685       pcre2api(3),    pcre2callout(3),    pcre2matching(3),   pcre2syntax(3),
9686       pcre2(3).
9687
9688
9689AUTHOR
9690
9691       Philip Hazel
9692       Retired from University Computing Service
9693       Cambridge, England.
9694
9695
9696REVISION
9697
9698       Last updated: 12 January 2022
9699       Copyright (c) 1997-2022 University of Cambridge.
9700------------------------------------------------------------------------------
9701
9702
9703PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)
9704
9705
9706
9707NAME
9708       PCRE2 - Perl-compatible regular expressions (revised API)
9709
9710PCRE2 PERFORMANCE
9711
9712       Two  aspects  of performance are discussed below: memory usage and pro-
9713       cessing time. The way you express your pattern as a regular  expression
9714       can affect both of them.
9715
9716
9717COMPILED PATTERN MEMORY USAGE
9718
9719       Patterns are compiled by PCRE2 into a reasonably efficient interpretive
9720       code, so that most simple patterns do not use much memory  for  storing
9721       the compiled version. However, there is one case where the memory usage
9722       of a compiled pattern can be unexpectedly  large.  If  a  parenthesized
9723       group  has  a quantifier with a minimum greater than 1 and/or a limited
9724       maximum, the whole group is repeated in the compiled code. For example,
9725       the pattern
9726
9727         (abc|def){2,4}
9728
9729       is compiled as if it were
9730
9731         (abc|def)(abc|def)((abc|def)(abc|def)?)?
9732
9733       (Technical  aside:  It is done this way so that backtrack points within
9734       each of the repetitions can be independently maintained.)
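
       The shape of this expansion, and how quickly  it  grows,  can  be  seen
       with  a small C sketch that builds the equivalent pattern text for a
       group with a {min,max} quantifier.  The  function  names  append()  and
       expand_repeat()  are  hypothetical,  and  the  sketch works on pattern
       text only; PCRE2 itself repeats the group in its compiled code, not  in
       the source pattern.

```c
#include <stddef.h>
#include <string.h>

/* Append src to dst (capacity cap, current length *len).
   Returns 0 on success or -1 if the result would not fit. */
static int append(char *dst, size_t cap, size_t *len, const char *src)
{
    size_t n = strlen(src);
    if (*len + n + 1 > cap) return -1;
    memcpy(dst + *len, src, n + 1);
    *len += n;
    return 0;
}

/* Write into out the text equivalent of group{min,max}: min mandatory
   copies followed by (max - min) nested optional copies.
   Returns the length of the result, or -1 on bad arguments or overflow. */
static long expand_repeat(const char *group, int min, int max,
                          char *out, size_t cap)
{
    size_t len = 0;
    int optional = max - min;

    if (cap == 0 || min < 0 || optional < 0) return -1;
    out[0] = '\0';

    for (int i = 0; i < min; i++)
        if (append(out, cap, &len, group) != 0) return -1;

    /* Open one extra parenthesis for every optional copy but the last. */
    for (int i = 1; i < optional; i++) {
        if (append(out, cap, &len, "(") != 0) return -1;
        if (append(out, cap, &len, group) != 0) return -1;
    }
    if (optional >= 1) {
        if (append(out, cap, &len, group) != 0) return -1;
        if (append(out, cap, &len, "?") != 0) return -1;
    }
    for (int i = 1; i < optional; i++)
        if (append(out, cap, &len, ")?") != 0) return -1;

    return (long)len;
}
```

       Calling expand_repeat("(abc|def)", 2, 4, buf, sizeof buf) produces  the
       40-character  text  shown  above, compared with 9 characters for the
       group on its own.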
9735
9736       For regular expressions whose quantifiers use only small numbers,  this
9737       is  not  usually a problem. However, if the numbers are large, and par-
9738       ticularly if such repetitions are nested, the memory usage  can  become
9739       an embarrassment. For example, the very simple pattern
9740
9741         ((ab){1,1000}c){1,3}
9742
9743       uses  over  50KiB  when compiled using the 8-bit library. When PCRE2 is
9744       compiled with its default internal pointer size of two bytes, the  size
9745       limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
9746       libraries, and this is reached with the above pattern if the outer rep-
9747       etition  is  increased from 3 to 4. PCRE2 can be compiled to use larger
9748       internal pointers and thus handle larger compiled patterns, but  it  is
9749       better to try to rewrite your pattern to use less memory if you can.
9750
9751       One  way  of reducing the memory usage for such patterns is to make use
9752       of PCRE2's "subroutine" facility. Re-writing the above pattern as
9753
9754         ((ab)(?2){0,999}c)(?1){0,2}
9755
9756       reduces the memory requirements to around 16KiB, and indeed it  remains
9757       under  20KiB  even with the outer repetition increased to 100. However,
9758       this kind of pattern is not always exactly equivalent, because any cap-
9759       tures  within  subroutine calls are lost when the subroutine completes.
9760       If this is not a problem, this kind of  rewriting  will  allow  you  to
       process patterns that PCRE2 cannot otherwise handle. The matching  per-
       formance  of  the two different versions of the pattern is roughly the
       same. (This applies from release 10.30; things were different in  ear-
       lier releases.)
9765
9766
9767STACK AND HEAP USAGE AT RUN TIME
9768
9769       From release 10.30, the interpretive (non-JIT) version of pcre2_match()
9770       uses  very  little system stack at run time. In earlier releases recur-
9771       sive function calls could use a great deal of  stack,  and  this  could
9772       cause  problems, but this usage has been eliminated. Backtracking posi-
9773       tions are now explicitly remembered in memory frames controlled by  the
9774       code.  An  initial  20KiB  vector  of frames is allocated on the system
9775       stack (enough for about 100 frames for small patterns), but if this  is
9776       insufficient,  heap  memory  is  used. The amount of heap memory can be
9777       limited; if the limit is set to zero, only the initial stack vector  is
9778       used.  Rewriting patterns to be time-efficient, as described below, may
9779       also reduce the memory requirements.
9780
9781       In contrast to  pcre2_match(),  pcre2_dfa_match()  does  use  recursive
9782       function  calls,  but only for processing atomic groups, lookaround as-
9783       sertions, and recursion within the pattern. The original version of the
9784       code  used  to  allocate  quite large internal workspace vectors on the
9785       stack, which caused some problems for  some  patterns  in  environments
9786       with  small  stacks.  From release 10.32 the code for pcre2_dfa_match()
9787       has been re-factored to use heap memory  when  necessary  for  internal
9788       workspace  when  recursing,  though  recursive function calls are still
9789       used.
9790
9791       The "match depth" parameter can be used to limit the depth of  function
9792       recursion,  and  the  "match  heap"  parameter  to limit heap memory in
9793       pcre2_dfa_match().
9794
9795
9796PROCESSING TIME
9797
9798       Certain items in regular expression patterns are processed  more  effi-
9799       ciently than others. It is more efficient to use a character class like
9800       [aeiou]  than  a  set  of   single-character   alternatives   such   as
9801       (a|e|i|o|u).  In  general,  the simplest construction that provides the
9802       required behaviour is usually the most efficient. Jeffrey Friedl's book
9803       contains  a  lot  of useful general discussion about optimizing regular
9804       expressions for efficient performance. This document contains a few ob-
9805       servations about PCRE2.
9806
9807       Using  Unicode  character  properties  (the  \p, \P, and \X escapes) is
9808       slow, because PCRE2 has to use a multi-stage table lookup  whenever  it
9809       needs  a  character's  property. If you can find an alternative pattern
9810       that does not use character properties, it will probably be faster.
9811
9812       By default, the escape sequences \b, \d, \s,  and  \w,  and  the  POSIX
9813       character  classes  such  as  [:alpha:]  do not use Unicode properties,
9814       partly for backwards compatibility, and partly for performance reasons.
9815       However,  you  can  set  the PCRE2_UCP option or start the pattern with
9816       (*UCP) if you want Unicode character properties to be  used.  This  can
9817       double  the  matching  time  for  items  such  as \d, when matched with
9818       pcre2_match(); the performance loss is less with a DFA  matching  func-
9819       tion, and in both cases there is not much difference for \b.
9820
9821       When  a pattern begins with .* not in atomic parentheses, nor in paren-
9822       theses that are the subject of a backreference,  and  the  PCRE2_DOTALL
9823       option  is  set,  the pattern is implicitly anchored by PCRE2, since it
9824       can match only at the start of a subject string.  If  the  pattern  has
9825       multiple top-level branches, they must all be anchorable. The optimiza-
9826       tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is  au-
9827       tomatically disabled if the pattern contains (*PRUNE) or (*SKIP).
9828
9829       If  PCRE2_DOTALL  is  not set, PCRE2 cannot make this optimization, be-
9830       cause the dot metacharacter does not then match a newline, and  if  the
9831       subject  string contains newlines, the pattern may match from the char-
9832       acter immediately following one of them instead of from the very start.
9833       For example, the pattern
9834
9835         .*second
9836
9837       matches  the subject "first\nand second" (where \n stands for a newline
9838       character), with the match starting at the seventh character. In  order
9839       to  do  this, PCRE2 has to retry the match starting after every newline
9840       in the subject.
9841
9842       If you are using such a pattern with subject strings that do  not  con-
9843       tain   newlines,   the   best   performance   is  obtained  by  setting
9844       PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate  ex-
9845       plicit  anchoring.  That saves PCRE2 from having to scan along the sub-
9846       ject looking for a newline to restart at.
9847
9848       Beware of patterns that contain nested indefinite  repeats.  These  can
9849       take  a  long time to run when applied to a string that does not match.
9850       Consider the pattern fragment
9851
9852         ^(a+)*
9853
9854       This can match "aaaa" in 16 different ways, and this  number  increases
9855       very  rapidly  as the string gets longer. (The * repeat can match 0, 1,
9856       2, 3, or 4 times, and for each of those cases other than 0 or 4, the  +
9857       repeats  can  match  different numbers of times.) When the remainder of
9858       the pattern is such that the entire match is going to fail,  PCRE2  has
9859       in  principle to try every possible variation, and this can take an ex-
9860       tremely long time, even for relatively short strings.
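
       The count of 16 can be checked with a short C sketch  that  enumerates
       the  ways  in which ^(a+)* can consume a prefix of a string of "a"
       characters. The function names ways_exact() and ways_prefix() are
       hypothetical;  the  code  counts possibilities only and says nothing
       about the order in which PCRE2 tries them.

```c
/* Number of ways to split exactly len characters into an ordered
   sequence of non-empty runs (the empty sequence counts once). */
static unsigned long long ways_exact(unsigned len)
{
    if (len == 0) return 1;
    unsigned long long total = 0;
    /* The first a+ iteration takes 'first' characters; recurse on the rest. */
    for (unsigned first = 1; first <= len; first++)
        total += ways_exact(len - first);
    return total;
}

/* Number of ways ^(a+)* can match at the start of n 'a' characters:
   it may stop after consuming any prefix of length 0..n. */
static unsigned long long ways_prefix(unsigned n)
{
    unsigned long long total = 0;
    for (unsigned k = 0; k <= n; k++)
        total += ways_exact(k);
    return total;
}
```

       ways_prefix(4) gives 16, and the count doubles with each additional
       character,  so  a  subject of 20 characters already allows 1048576
       variations.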
9861
       An optimization catches some of the simpler cases such as
9863
9864         (a+)*b
9865
9866       where a literal character follows. Before  embarking  on  the  standard
9867       matching  procedure, PCRE2 checks that there is a "b" later in the sub-
9868       ject string, and if there is not, it fails the match immediately.  How-
9869       ever,  when  there  is no following literal this optimization cannot be
9870       used. You can see the difference by comparing the behaviour of
9871
9872         (a+)*\d
9873
       with the pattern above. The earlier pattern, which ends in a literal
       "b", fails almost instantly when applied to a whole line of "a"
       characters, whereas the \d version takes an appreciable time with
       strings longer than about 20 characters.
9877
9878       In many cases, the solution to this kind of performance issue is to use
9879       an  atomic group or a possessive quantifier. This can often reduce mem-
9880       ory requirements as well. As another example, consider this pattern:
9881
9882         ([^<]|<(?!inet))+
9883
9884       It matches from wherever it starts until it encounters "<inet"  or  the
9885       end  of  the  data,  and is the kind of pattern that might be used when
9886       processing an XML file. Each iteration of the outer parentheses matches
9887       either  one  character that is not "<" or a "<" that is not followed by
9888       "inet". However, each time a parenthesis is processed,  a  backtracking
9889       position  is  passed,  so this formulation uses a memory frame for each
9890       matched character. For a long string, a lot of memory is required. Con-
9891       sider  now  this  rewritten  pattern,  which  matches  exactly the same
9892       strings:
9893
9894         ([^<]++|<(?!inet))+
9895
9896       This runs much faster, because sequences of characters that do not con-
9897       tain "<" are "swallowed" in one item inside the parentheses, and a pos-
9898       sessive quantifier is used to stop any backtracking into  the  runs  of
9899       non-"<"  characters.  This  version also uses a lot less memory because
9900       entry to a new set of parentheses happens only  when  a  "<"  character
9901       that  is  not  followed by "inet" is encountered (and we assume this is
9902       relatively rare).
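
       The saving can be estimated with a toy C sketch that counts  how  many
       times  the  outer  group is entered for each formulation. The function
       count_group_iterations() is hypothetical and hard-codes  the  "<inet"
       example; it models only the number of group iterations, not PCRE2's
       actual memory frames.

```c
#include <stddef.h>
#include <string.h>

/* Count iterations of the outer group up to "<inet" or the end of s.
   swallow_runs == 0 models ([^<]|<(?!inet))+   (one character each time);
   swallow_runs != 0 models ([^<]++|<(?!inet))+ (a whole run each time). */
static size_t count_group_iterations(const char *s, int swallow_runs)
{
    size_t i = 0, iterations = 0;

    while (s[i] != '\0') {
        if (s[i] == '<') {
            /* (?!inet) fails here, so the repeat stops. */
            if (strncmp(s + i, "<inet", 5) == 0) break;
            i++;                                      /* a lone "<" */
        } else if (swallow_runs) {
            while (s[i] != '\0' && s[i] != '<') i++;  /* [^<]++ */
        } else {
            i++;                                      /* [^<] */
        }
        iterations++;
    }
    return iterations;
}
```

       For the subject "abc<b>def<inet>" the first formulation needs 9  itera-
       tions  and  the  rewritten  one needs 3, and the gap widens as the runs
       between "<" characters get longer.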
9903
9904       This example shows that one way of optimizing performance when matching
9905       long  subject strings is to write repeated parenthesized subpatterns to
9906       match more than one character whenever possible.
9907
9908   SETTING RESOURCE LIMITS
9909
9910       You can set limits on the amount of processing that  takes  place  when
9911       matching,  and  on  the amount of heap memory that is used. The default
9912       values of the limits are very large, and unlikely ever to operate. They
9913       can  be  changed  when  PCRE2  is  built, and they can also be set when
9914       pcre2_match() or pcre2_dfa_match() is called. For details of these  in-
9915       terfaces,  see  the  pcre2build  documentation and the section entitled
9916       "The match context" in the pcre2api documentation.
9917
9918       The pcre2test test program has a modifier called  "find_limits"  which,
9919       if  applied  to  a  subject line, causes it to find the smallest limits
9920       that allow a pattern to match. This is done by repeatedly matching with
9921       different limits.
9922
9923
9924AUTHOR
9925
9926       Philip Hazel
9927       University Computing Service
9928       Cambridge, England.
9929
9930
9931REVISION
9932
9933       Last updated: 03 February 2019
9934       Copyright (c) 1997-2019 University of Cambridge.
9935------------------------------------------------------------------------------
9936
9937
9938PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)
9939
9940
9941
9942NAME
9943       PCRE2 - Perl-compatible regular expressions (revised API)
9944
9945SYNOPSIS
9946
9947       #include <pcre2posix.h>
9948
9949       int pcre2_regcomp(regex_t *preg, const char *pattern,
9950            int cflags);
9951
9952       int pcre2_regexec(const regex_t *preg, const char *string,
9953            size_t nmatch, regmatch_t pmatch[], int eflags);
9954
9955       size_t pcre2_regerror(int errcode, const regex_t *preg,
9956            char *errbuf, size_t errbuf_size);
9957
9958       void pcre2_regfree(regex_t *preg);
9959
9960
9961DESCRIPTION
9962
9963       This  set of functions provides a POSIX-style API for the PCRE2 regular
9964       expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
9965       16-bit  and  32-bit libraries. See the pcre2api documentation for a de-
9966       scription of PCRE2's native API, which contains much  additional  func-
9967       tionality.
9968
9969       The functions described here are wrapper functions that ultimately call
9970       the PCRE2 native API. Their prototypes are defined in the  pcre2posix.h
       header file, and they all have unique names starting with pcre2_.  How-
       ever, the pcre2posix.h header also contains macro definitions that con-
       vert the standard POSIX names such as regcomp() into pcre2_regcomp() etc.
9974       This means that a program can use the usual POSIX names without running
9975       the  risk of accidentally linking with POSIX functions from a different
9976       library.
9977
9978       On Unix-like systems the PCRE2 POSIX library is called  libpcre2-posix,
9979       so  can  be accessed by adding -lpcre2-posix to the command for linking
9980       an application. Because the POSIX functions call the native ones, it is
9981       also necessary to add -lpcre2-8.
9982
       Although they were not defined as prototypes in pcre2posix.h,  releases
9984       10.33 to 10.36 of the library contained functions with the POSIX  names
9985       regcomp()  etc.  These simply passed their arguments to the PCRE2 func-
9986       tions. These functions were provided for backwards  compatibility  with
9987       earlier  versions  of  PCRE2, which had only POSIX names. However, this
9988       has proved troublesome in situations where a program links with several
9989       libraries,  some  of which use PCRE2's POSIX interface while others use
9990       the real POSIX functions.  For this reason, the POSIX names  have  been
9991       removed since release 10.37.
9992
9993       Calling  the  header  file  pcre2posix.h avoids any conflict with other
9994       POSIX libraries. It can, of course, be renamed or aliased  as  regex.h,
9995       which  is  the  "correct"  name,  if there is no clash. It provides two
9996       structure types, regex_t for compiled internal  forms,  and  regmatch_t
9997       for returning captured substrings. It also defines some constants whose
9998       names start with "REG_"; these are used for setting options and identi-
9999       fying error codes.
10000
10001
10002USING THE POSIX FUNCTIONS
10003
10004       Those  POSIX  option bits that can reasonably be mapped to PCRE2 native
10005       options have been implemented. In addition, the option REG_EXTENDED  is
10006       defined  with  the  value  zero. This has no effect, but since programs
10007       that are written to the POSIX interface often use  it,  this  makes  it
10008       easier  to  slot in PCRE2 as a replacement library. Other POSIX options
10009       are not even defined.
10010
10011       There are also some options that are not defined by POSIX.  These  have
10012       been  added  at  the  request  of users who want to make use of certain
10013       PCRE2-specific features via the POSIX calling interface or to  add  BSD
10014       or GNU functionality.
10015
10016       When  PCRE2  is  called via these functions, it is only the API that is
10017       POSIX-like in style. The syntax and semantics of  the  regular  expres-
10018       sions  themselves  are  still  those of Perl, subject to the setting of
10019       various PCRE2 options, as described below. "POSIX-like in style"  means
10020       that  the  API  approximates  to  the POSIX definition; it is not fully
10021       POSIX-compatible, and in multi-unit encoding  domains  it  is  probably
10022       even less compatible.
10023
10024       The  descriptions  below use the actual names of the functions, but, as
10025       described above, the standard POSIX names (without the  pcre2_  prefix)
10026       may also be used.
10027
10028
10029COMPILING A PATTERN
10030
10031       The function pcre2_regcomp() is called to compile a pattern into an in-
10032       ternal form. By default, the pattern is a C string terminated by a  bi-
10033       nary zero (but see REG_PEND below). The preg argument is a pointer to a
10034       regex_t structure that is used as a base for storing information  about
10035       the  compiled  regular  expression.  (It  is  also  used for input when
10036       REG_PEND is set.)
10037
10038       The argument cflags is either zero, or contains one or more of the bits
10039       defined by the following macros:
10040
10041         REG_DOTALL
10042
10043       The  PCRE2_DOTALL  option  is set when the regular expression is passed
10044       for compilation to the native function. Note  that  REG_DOTALL  is  not
10045       part of the POSIX standard.
10046
10047         REG_ICASE
10048
10049       The  PCRE2_CASELESS option is set when the regular expression is passed
10050       for compilation to the native function.
10051
10052         REG_NEWLINE
10053
10054       The PCRE2_MULTILINE option is set when the regular expression is passed
10055       for  compilation  to the native function. Note that this does not mimic
10056       the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
10057       tion).
10058
10059         REG_NOSPEC
10060
10061       The  PCRE2_LITERAL  option is set when the regular expression is passed
10062       for compilation to the native function. This disables all meta  charac-
10063       ters  in the pattern, causing it to be treated as a literal string. The
10064       only other options that are  allowed  with  REG_NOSPEC  are  REG_ICASE,
10065       REG_NOSUB,  REG_PEND,  and REG_UTF. Note that REG_NOSPEC is not part of
10066       the POSIX standard.
10067
10068         REG_NOSUB
10069
       When a pattern that is compiled with this flag is passed to
       pcre2_regexec() for matching, the nmatch and pmatch arguments are
       ignored, and no captured strings are returned. Versions of the PCRE2
       library prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile
       option, but this no longer happens because it disables the use of
       backreferences.
10076
10077         REG_PEND
10078
       If this option is set, the re_endp field in the preg structure (which
       has the type const char *) must be set to point to the character beyond
       the end of the pattern before calling pcre2_regcomp(). The pattern
       itself may now contain binary zeros, which are treated as data
       characters. Without REG_PEND, a binary zero terminates the pattern and
       the re_endp field is ignored. This is a GNU extension to the POSIX
       standard and should be used with caution in software intended to be
       portable to other systems.
10087
10088         REG_UCP
10089
       The PCRE2_UCP option is set when the regular expression is passed for
       compilation to the native function. This causes PCRE2 to use Unicode
       properties when matching \d, \w, etc., instead of just recognizing
       ASCII values. Note that REG_UCP is not part of the POSIX standard.
10094
10095         REG_UNGREEDY
10096
10097       The  PCRE2_UNGREEDY option is set when the regular expression is passed
10098       for compilation to the native function. Note that REG_UNGREEDY  is  not
10099       part of the POSIX standard.
10100
10101         REG_UTF
10102
10103       The  PCRE2_UTF  option is set when the regular expression is passed for
10104       compilation to the native function. This causes the pattern itself  and
10105       all  data  strings used for matching it to be treated as UTF-8 strings.
10106       Note that REG_UTF is not part of the POSIX standard.
10107
       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE2 default
       semantics. In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way. Note that setting
       PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by the dot
       metacharacter (they are not) or by a negative class such as [^a] (they
       are).
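
       The cflags bits described above are combined with a bitwise OR. The
       sketch below, which is only an illustration, combines REG_ICASE with
       REG_NOSUB for a caseless yes/no test in which no offsets are wanted.
       The function contains_ignoring_case() and the macro USE_PCRE2_POSIX
       are hypothetical names; without the macro the system's <regex.h> is
       used as a stand-in so that the code builds without PCRE2.

```c
#include <stddef.h>

/* USE_PCRE2_POSIX is a hypothetical build switch, not part of PCRE2. */
#ifdef USE_PCRE2_POSIX
#include <pcre2posix.h>
#else
#include <regex.h>        /* assumption: system POSIX regex as a stand-in */
#endif

/* Returns 1 if pattern matches somewhere in subject regardless of case,
   0 if it does not (or matching failed), and -1 if compilation failed. */
int contains_ignoring_case(const char *pattern, const char *subject)
{
  regex_t re;
  int rc;

  /* REG_NOSUB: nmatch and pmatch are ignored, so 0 and NULL are passed. */
  if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
    return -1;

  rc = regexec(&re, subject, 0, NULL, 0);
  regfree(&re);
  return rc == 0;
}
```

       REG_EXTENDED is included only because the stand-in library needs it;
       as noted earlier, it has the value zero in pcre2posix.h.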
10115
10116       The yield of pcre2_regcomp() is zero on success,  and  non-zero  other-
10117       wise.  The preg structure is filled in on success, and one other member
10118       of the structure (as well as re_endp) is public: re_nsub  contains  the
10119       number  of capturing subpatterns in the regular expression. Various er-
10120       ror codes are defined in the header file.
10121
10122       NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
10123       to use the contents of the preg structure. If, for example, you pass it
10124       to pcre2_regexec(), the result is undefined and your program is  likely
10125       to crash.
10126
10127
10128MATCHING NEWLINE CHARACTERS
10129
10130       This area is not simple, because POSIX and Perl take different views of
10131       things.  It is not possible to get PCRE2 to obey POSIX  semantics,  but
10132       then PCRE2 was never intended to be a POSIX engine. The following table
10133       lists the different possibilities for matching  newline  characters  in
10134       Perl and PCRE2:
10135
10136                                 Default   Change with
10137
10138         . matches newline          no     PCRE2_DOTALL
10139         newline matches [^a]       yes    not changeable
10140         $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
10141         $ matches \n in middle     no     PCRE2_MULTILINE
10142         ^ matches \n in middle     no     PCRE2_MULTILINE
10143
10144       This is the equivalent table for a POSIX-compatible pattern matcher:
10145
10146                                 Default   Change with
10147
10148         . matches newline          yes    REG_NEWLINE
10149         newline matches [^a]       yes    REG_NEWLINE
10150         $ matches \n at end        no     REG_NEWLINE
10151         $ matches \n in middle     no     REG_NEWLINE
10152         ^ matches \n in middle     no     REG_NEWLINE
10153
10154       This  behaviour  is not what happens when PCRE2 is called via its POSIX
10155       API. By default, PCRE2's behaviour is the same as Perl's,  except  that
10156       there  is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
10157       and Perl, there is no way to stop newline from matching [^a].
10158
10159       Default POSIX newline handling can be obtained by setting  PCRE2_DOTALL
10160       and  PCRE2_DOLLAR_ENDONLY  when  calling  pcre2_compile() directly, but
10161       there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac-
10162       tion.  When  using  the  POSIX  API,  passing  REG_NEWLINE  to  PCRE2's
10163       pcre2_regcomp()  function  causes  PCRE2_MULTILINE  to  be  passed   to
10164       pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way to
10165       pass PCRE2_DOLLAR_ENDONLY.
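
       One row of the tables above on which PCRE2 and a POSIX matcher agree
       is the treatment of ^ after a newline in the middle of the subject: it
       matches there only when REG_NEWLINE is passed. The following sketch
       tries to show just that row; it does not demonstrate the dot or [^a]
       rows, where the two engines differ. The function first_match_offset()
       and the macro USE_PCRE2_POSIX are hypothetical names, and without the
       macro the system's <regex.h> is used as a stand-in.

```c
#include <stddef.h>

/* USE_PCRE2_POSIX is a hypothetical build switch, not part of PCRE2. */
#ifdef USE_PCRE2_POSIX
#include <pcre2posix.h>
#else
#include <regex.h>        /* assumption: system POSIX regex as a stand-in */
#endif

/* Returns the byte offset of the first match of pattern in subject, or -1
   if there is no match or the pattern cannot be compiled. */
long first_match_offset(const char *pattern, const char *subject, int cflags)
{
  regex_t re;
  regmatch_t m[1];
  long offset = -1;

  if (regcomp(&re, pattern, REG_EXTENDED | cflags) != 0)
    return -1;

  if (regexec(&re, subject, 1, m, 0) == 0)
    offset = (long)m[0].rm_so;

  regfree(&re);
  return offset;
}
```

       With the subject "a\nb", the pattern "^b" is not expected to match
       when cflags is zero, and is expected to match at offset 2 when cflags
       is REG_NEWLINE.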
10166
10167
10168MATCHING A PATTERN
10169
10170       The function pcre2_regexec() is called to match a compiled pattern preg
10171       against  a  given string, which is by default terminated by a zero byte
10172       (but see REG_STARTEND below), subject to the options in eflags.   These
10173       can be:
10174
10175         REG_NOTBOL
10176
10177       The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
10178       ing function.
10179
10180         REG_NOTEMPTY
10181
10182       The PCRE2_NOTEMPTY option is set  when  calling  the  underlying  PCRE2
10183       matching  function.  Note  that  REG_NOTEMPTY  is not part of the POSIX
10184       standard. However, setting this option can give more POSIX-like  behav-
10185       iour in some situations.
10186
10187         REG_NOTEOL
10188
10189       The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
10190       ing function.
10191
10192         REG_STARTEND
10193
10194       When this option  is  set,  the  subject  string  starts  at  string  +
10195       pmatch[0].rm_so  and  ends  at  string  + pmatch[0].rm_eo, which should
10196       point to the first character beyond the string. There may be binary ze-
10197       ros  within  the  subject string, and indeed, using REG_STARTEND is the
10198       only way to pass a subject string that contains a binary zero.
10199
10200       Whatever the value of  pmatch[0].rm_so,  the  offsets  of  the  matched
10201       string  and  any  captured  substrings  are still given relative to the
10202       start of string itself. (Before PCRE2 release 10.30  these  were  given
10203       relative  to  string + pmatch[0].rm_so, but this differs from other im-
10204       plementations.)
10205
       This is a BSD extension, compatible with but not specified by IEEE
       Standard 1003.2 (POSIX.2), and should be used with caution in software
       intended to be portable to other systems. Note that a non-zero rm_so
       does not imply REG_NOTBOL; REG_STARTEND affects only the location and
       length of the string, not how it is matched. REG_STARTEND cannot be
       combined with a NULL pmatch argument; if it is, the error REG_INVARG
       is returned.
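
       A possible use of REG_STARTEND is sketched below: pmatch[0] is filled
       in with the bounds of the region to be searched before the call, and
       on success it holds offsets relative to the start of the whole string.
       The function find_in_range() and the macro USE_PCRE2_POSIX are
       hypothetical names. Without the macro the system's <regex.h> is used
       as a stand-in, which assumes a C library that provides REG_STARTEND as
       an extension (the GNU C library does).

```c
#include <stddef.h>

/* USE_PCRE2_POSIX is a hypothetical build switch, not part of PCRE2. */
#ifdef USE_PCRE2_POSIX
#include <pcre2posix.h>
#else
#include <regex.h>        /* assumption: a libc that offers REG_STARTEND */
#endif

/* Searches only subject[start] up to (not including) subject[end]. Returns
   the match start relative to subject itself, or -1 if nothing matched. */
long find_in_range(const char *pattern, const char *subject,
  size_t start, size_t end)
{
  regex_t re;
  regmatch_t m[1];
  long offset = -1;

  if (regcomp(&re, pattern, REG_EXTENDED) != 0)
    return -1;

  m[0].rm_so = (regoff_t)start;    /* input: first byte of the region */
  m[0].rm_eo = (regoff_t)end;      /* input: first byte beyond the region */

  if (regexec(&re, subject, 1, m, REG_STARTEND) == 0)
    offset = (long)m[0].rm_so;     /* output: relative to subject */

  regfree(&re);
  return offset;
}
```

       For the subject "abc abc", searching for "abc" in the region from 1 to
       7 should report offset 4, not 3, because the result is not relative to
       the start of the region.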
10213
10214       If the pattern was compiled with the REG_NOSUB flag, no data about  any
10215       matched  strings  is  returned.  The  nmatch  and  pmatch  arguments of
10216       pcre2_regexec() are ignored (except possibly  as  input  for  REG_STAR-
10217       TEND).
10218
       The value of nmatch may be zero, and the value of pmatch may be NULL
       (unless REG_STARTEND is set); in both these cases no data about any
       matched strings is returned.
10222
10223       Otherwise,  the  portion  of  the string that was matched, and also any
10224       captured substrings, are returned via the pmatch argument, which points
10225       to  an  array  of  nmatch structures of type regmatch_t, containing the
10226       members rm_so and rm_eo. These contain the byte  offset  to  the  first
10227       character of each substring and the offset to the first character after
10228       the end of each substring, respectively. The 0th element of the  vector
10229       relates  to  the  entire portion of string that was matched; subsequent
10230       elements relate to the capturing subpatterns of the regular expression.
10231       Unused entries in the array have both structure members set to -1.
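
       The following sketch shows one way of sizing the pmatch array from
       re_nsub and copying out a captured substring, treating an entry of -1
       as "not set". The function capture_copy() and the macro
       USE_PCRE2_POSIX are hypothetical names; without the macro the system's
       <regex.h> is used as a stand-in so that the code builds without PCRE2.

```c
#include <stdlib.h>
#include <string.h>

/* USE_PCRE2_POSIX is a hypothetical build switch, not part of PCRE2. */
#ifdef USE_PCRE2_POSIX
#include <pcre2posix.h>
#else
#include <regex.h>        /* assumption: system POSIX regex as a stand-in */
#endif

/* Copies capture group number `group` (0 is the whole match) into out as a
   zero-terminated string. Returns its length, or -1 if the pattern does not
   compile, does not match, the group is unset, or out is too small. */
int capture_copy(const char *pattern, const char *subject, size_t group,
  char *out, size_t out_size)
{
  regex_t re;
  regmatch_t *m;
  size_t nmatch;
  int len = -1;

  if (regcomp(&re, pattern, REG_EXTENDED) != 0)
    return -1;

  nmatch = re.re_nsub + 1;         /* element 0 plus one per capture group */
  m = malloc(nmatch * sizeof *m);

  if (m != NULL && group < nmatch &&
      regexec(&re, subject, nmatch, m, 0) == 0 &&
      m[group].rm_so != -1)        /* -1 marks an unused entry */
    {
    size_t n = (size_t)(m[group].rm_eo - m[group].rm_so);
    if (n < out_size)
      {
      memcpy(out, subject + m[group].rm_so, n);
      out[n] = '\0';
      len = (int)n;
      }
    }

  free(m);
  regfree(&re);
  return len;
}
```

       Allocating re_nsub + 1 entries is a design choice of this sketch; a
       smaller array may be passed if only the first few substrings are
       wanted.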
10232
10233       A  successful  match  yields a zero return; various error codes are de-
10234       fined in the header file, of which REG_NOMATCH is the "expected"  fail-
10235       ure code.
10236
10237
10238ERROR MESSAGES
10239
       The pcre2_regerror() function maps a non-zero error code from either
       pcre2_regcomp() or pcre2_regexec() to a printable message. If preg is
       not NULL, the error should have arisen from the use of that structure.
       A message terminated by a binary zero is placed in errbuf. If the
       buffer is too short, only the first errbuf_size - 1 characters of the
       error message are used. The yield of the function is the size of the
       buffer needed to hold the whole message, including the terminating
       zero. This value is greater than errbuf_size if the message was
       truncated.
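
       Because the yield is the size needed for the whole message, a caller
       can ask once with a small buffer and then allocate exactly enough. The
       sketch below illustrates this; full_error_message() and the macro
       USE_PCRE2_POSIX are hypothetical names, and without the macro the
       system's <regex.h> is used as a stand-in so that the code builds
       without PCRE2.

```c
#include <stdlib.h>

/* USE_PCRE2_POSIX is a hypothetical build switch, not part of PCRE2. */
#ifdef USE_PCRE2_POSIX
#include <pcre2posix.h>
#else
#include <regex.h>        /* assumption: system POSIX regex as a stand-in */
#endif

/* Returns a malloc()ed copy of the complete message for errcode, or NULL if
   memory cannot be obtained. The caller frees the result with free(). */
char *full_error_message(int errcode, const regex_t *preg)
{
  char probe[8];
  char *msg;

  /* First call: the message may be truncated, but the yield is the size
     needed for all of it, including the terminating zero. */
  size_t needed = regerror(errcode, preg, probe, sizeof probe);

  msg = malloc(needed);
  if (msg != NULL)
    regerror(errcode, preg, msg, needed);   /* second call: fits exactly */
  return msg;
}
```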
10248
10249
10250MEMORY USAGE
10251
10252       Compiling a regular expression causes memory to be allocated and  asso-
10253       ciated  with the preg structure. The function pcre2_regfree() frees all
10254       such memory, after which preg may no longer be used as a  compiled  ex-
10255       pression.
10256
10257
10258AUTHOR
10259
10260       Philip Hazel
10261       University Computing Service
10262       Cambridge, England.
10263
10264
10265REVISION
10266
10267       Last updated: 26 April 2021
10268       Copyright (c) 1997-2021 University of Cambridge.
10269------------------------------------------------------------------------------
10270
10271
10272PCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3)
10273
10274
10275
10276NAME
10277       PCRE2 - Perl-compatible regular expressions (revised API)
10278
10279PCRE2 SAMPLE PROGRAM
10280
10281       A  simple, complete demonstration program to get you started with using
10282       PCRE2 is supplied in the file pcre2demo.c in the src directory  in  the
10283       PCRE2 distribution. A listing of this program is given in the pcre2demo
10284       documentation. If you do not have a copy of the PCRE2 distribution, you
10285       can save this listing to re-create the contents of pcre2demo.c.
10286
10287       The  demonstration  program compiles the regular expression that is its
10288       first argument, and matches it against the subject string in its second
10289       argument.  No  PCRE2  options are set, and default character tables are
10290       used. If matching succeeds, the program outputs the portion of the sub-
10291       ject  that  matched,  together  with  the contents of any captured sub-
10292       strings.
10293
10294       If the -g option is given on the command line, the program then goes on
10295       to check for further matches of the same regular expression in the same
10296       subject string. The logic is a little bit tricky because of the  possi-
10297       bility  of  matching an empty string. Comments in the code explain what
10298       is going on.
10299
10300       The code in pcre2demo.c is an 8-bit program that uses the  PCRE2  8-bit
10301       library.  It  handles  strings  and characters that are stored in 8-bit
10302       code units.  By default, one character corresponds to  one  code  unit,
10303       but  if  the  pattern starts with "(*UTF)", both it and the subject are
10304       treated as UTF-8 strings, where characters  may  occupy  multiple  code
10305       units.
10306
10307       If  PCRE2  is installed in the standard include and library directories
10308       for your operating system, you should be able to compile the demonstra-
10309       tion program using a command like this:
10310
10311         cc -o pcre2demo pcre2demo.c -lpcre2-8
10312
10313       If PCRE2 is installed elsewhere, you may need to add additional options
10314       to the command line. For example, on a Unix-like system that has  PCRE2
10315       installed  in /usr/local, you can compile the demonstration program us-
10316       ing a command like this:
10317
10318         cc -o pcre2demo -I/usr/local/include pcre2demo.c \
10319            -L/usr/local/lib -lpcre2-8
10320
10321       Once you have built the demonstration program, you can run simple tests
10322       like this:
10323
10324         ./pcre2demo 'cat|dog' 'the cat sat on the mat'
10325         ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
10326
10327       Note  that  there  is  a  much  more comprehensive test program, called
10328       pcre2test, which supports many more facilities for testing regular  ex-
10329       pressions  using  all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
10330       though not all three need be installed). The pcre2demo program is  pro-
10331       vided as a relatively simple coding example.
10332
10333       If you try to run pcre2demo when PCRE2 is not installed in the standard
10334       library directory, you may get an error like  this  on  some  operating
10335       systems (e.g. Solaris):
10336
10337         ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
10338       or directory
10339
10340       This is caused by the way shared library support works  on  those  sys-
10341       tems. You need to add
10342
10343         -R/usr/local/lib
10344
10345       (for example) to the compile command to get round this problem.
10346
10347
10348AUTHOR
10349
10350       Philip Hazel
10351       University Computing Service
10352       Cambridge, England.
10353
10354
10355REVISION
10356
10357       Last updated: 02 February 2016
10358       Copyright (c) 1997-2016 University of Cambridge.
10359------------------------------------------------------------------------------
10360PCRE2SERIALIZE(3)          Library Functions Manual          PCRE2SERIALIZE(3)
10361
10362
10363
10364NAME
10365       PCRE2 - Perl-compatible regular expressions (revised API)
10366
10367SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
10368
10369       int32_t pcre2_serialize_decode(pcre2_code **codes,
10370         int32_t number_of_codes, const uint8_t *bytes,
10371         pcre2_general_context *gcontext);
10372
10373       int32_t pcre2_serialize_encode(const pcre2_code **codes,
10374         int32_t number_of_codes, uint8_t **serialized_bytes,
10375         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
10376
10377       void pcre2_serialize_free(uint8_t *bytes);
10378
10379       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
10380
10381       If  you  are running an application that uses a large number of regular
10382       expression patterns, it may be useful to store them  in  a  precompiled
10383       form  instead  of  having to compile them every time the application is
10384       run. However, if you are using the just-in-time  optimization  feature,
10385       it is not possible to save and reload the JIT data, because it is posi-
10386       tion-dependent. The host on which the patterns  are  reloaded  must  be
10387       running  the  same version of PCRE2, with the same code unit width, and
10388       must also have the same endianness, pointer width and PCRE2_SIZE  type.
10389       For  example, patterns compiled on a 32-bit system using PCRE2's 16-bit
10390       library cannot be reloaded on a 64-bit system, nor can they be reloaded
10391       using the 8-bit library.
10392
10393       Note  that  "serialization" in PCRE2 does not convert compiled patterns
10394       to an abstract format like Java or .NET serialization.  The  serialized
10395       output  is  really  just  a  bytecode dump, which is why it can only be
10396       reloaded in the same environment as the one that created it. Hence  the
10397       restrictions  mentioned  above.   Applications  that are not statically
10398       linked with a fixed version of PCRE2 must be prepared to recompile pat-
10399       terns from their sources, in order to be immune to PCRE2 upgrades.
10400
10401
10402SECURITY CONCERNS
10403
10404       The facility for saving and restoring compiled patterns is intended for
10405       use within individual applications.  As  such,  the  data  supplied  to
10406       pcre2_serialize_decode()  is expected to be trusted data, not data from
10407       arbitrary external sources.  There  is  only  some  simple  consistency
10408       checking, not complete validation of what is being re-loaded. Corrupted
10409       data may cause undefined results. For example, if the length field of a
10410       pattern in the serialized data is corrupted, the deserializing code may
10411       read beyond the end of the byte stream that is passed to it.
10412
10413
10414SAVING COMPILED PATTERNS
10415
10416       Before compiled patterns can be saved they must be serialized, which in
10417       PCRE2  means converting the pattern to a stream of bytes. A single byte
10418       stream may contain any number of compiled patterns, but they  must  all
10419       use  the same character tables. A single copy of the tables is included
10420       in the byte stream (its size is 1088 bytes). For more details of  char-
10421       acter  tables,  see the section on locale support in the pcre2api docu-
10422       mentation.
10423
       The function pcre2_serialize_encode() creates a serialized byte stream
       from a list of compiled patterns. Its first two arguments specify the
       list, being a pointer to a vector of pointers to compiled patterns, and
       the length of the vector. The third and fourth arguments point to
       variables which are set to point to the created byte stream and its
       length, respectively. The final argument is a pointer to a general
       context, which can be used to specify custom memory management
       functions. If this argument is NULL, malloc() is used to obtain memory
       for the byte stream. The yield of the function is the number of
       serialized patterns, or one of the following negative error codes:
10434
10435         PCRE2_ERROR_BADDATA      the number of patterns is zero or less
10436         PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
10437         PCRE2_ERROR_MEMORY       memory allocation failed
10438         PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
10439         PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL
10440
10441       PCRE2_ERROR_BADMAGIC  means  either that a pattern's code has been cor-
10442       rupted, or that a slot in the vector does not point to a compiled  pat-
10443       tern.
10444
10445       Once a set of patterns has been serialized you can save the data in any
10446       appropriate manner. Here is sample code that compiles two patterns  and
10447       writes them to a file. It assumes that the variable fd refers to a file
10448       that is open for output. The error checking that should be present in a
10449       real application has been omitted for simplicity.
10450
         int errorcode;
         uint8_t *bytes;
         PCRE2_SIZE erroroffset;
         PCRE2_SIZE bytescount;
         pcre2_code *list_of_codes[2];

         /* Compile the two patterns that are to be saved. */
         list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);

         /* Serialize both patterns into a single byte stream. */
         errorcode = pcre2_serialize_encode((const pcre2_code **)list_of_codes,
           2, &bytes, &bytescount, NULL);

         /* fwrite() needs fd to be a FILE *; it yields the count written. */
         size_t written = fwrite(bytes, 1, bytescount, fd);
10463
10464       Note  that  the  serialized data is binary data that may contain any of
10465       the 256 possible byte values. On systems that make  a  distinction  be-
10466       tween  binary  and non-binary data, be sure that the file is opened for
10467       binary output.
10468
10469       Serializing a set of patterns leaves the original  data  untouched,  so
10470       they  can  still  be used for matching. Their memory must eventually be
10471       freed in the usual way by calling pcre2_code_free(). When you have fin-
10472       ished with the byte stream, it too must be freed by calling pcre2_seri-
10473       alize_free(). If this function is called with a NULL argument,  it  re-
10474       turns immediately without doing anything.
10475
10476
10477RE-USING PRECOMPILED PATTERNS
10478
10479       In  order to re-use a set of saved patterns you must first make the se-
10480       rialized byte stream available in main memory (for example, by  reading
10481       from a file). The management of this memory block is up to the applica-
10482       tion. You can use the pcre2_serialize_get_number_of_codes() function to
10483       find  out how many compiled patterns are in the serialized data without
10484       actually decoding the patterns:
10485
10486         uint8_t *bytes = <serialized data>;
10487         int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
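
       How the byte stream reaches main memory is left to the application.
       As one possibility, the sketch below reads a whole file, opened in
       binary mode, into a block obtained from malloc(). It uses only the
       standard C library; read_byte_stream() is a hypothetical helper
       introduced for this example and is not a PCRE2 function.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads the whole of the file at path, in binary mode, into new memory.
   Returns the block and sets *size on success. Returns NULL if the file
   cannot be read or is empty. The caller releases the block with free(). */
uint8_t *read_byte_stream(const char *path, size_t *size)
{
  FILE *f = fopen(path, "rb");     /* "b" matters on some systems */
  uint8_t *buf = NULL;
  long len = 0;

  if (f == NULL) return NULL;

  /* Find the file length, then go back to the start. */
  if (fseek(f, 0, SEEK_END) == 0 && (len = ftell(f)) > 0 &&
      fseek(f, 0, SEEK_SET) == 0)
    {
    buf = malloc((size_t)len);
    if (buf != NULL && fread(buf, 1, (size_t)len, f) != (size_t)len)
      {
      free(buf);                   /* short read: give up */
      buf = NULL;
      }
    }

  fclose(f);
  if (buf != NULL) *size = (size_t)len;
  return buf;
}
```

       The returned block could then be used where <serialized data> appears
       in the fragments in this section.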
10488
       The pcre2_serialize_decode() function reads a byte stream and recreates
       the compiled patterns in new memory blocks, setting pointers to them in
       a vector. The first two arguments are a pointer to a suitable vector
       and its length, and the third argument points to a byte stream. The
       final argument is a pointer to a general context, which can be used to
       specify custom memory management functions for the decoded patterns.
       If this argument is NULL, malloc() and free() are used. After
       deserialization, the byte stream is no longer needed and can be
       discarded.
10497
10498         pcre2_code *list_of_codes[2];
10499         uint8_t *bytes = <serialized data>;
10500         int32_t number_of_codes =
10501           pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
10502
10503       If  the  vector  is  not  large enough for all the patterns in the byte
10504       stream, it is filled with those that fit, and  the  remainder  are  ig-
10505       nored.  The yield of the function is the number of decoded patterns, or
10506       one of the following negative error codes:
10507
10508         PCRE2_ERROR_BADDATA    second argument is zero or less
10509         PCRE2_ERROR_BADMAGIC   mismatch of id bytes in the data
10510         PCRE2_ERROR_BADMODE    mismatch of code unit size or PCRE2 version
10511         PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
10512         PCRE2_ERROR_MEMORY     memory allocation failed
10513         PCRE2_ERROR_NULL       first or third argument is NULL
10514
10515       PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it  was
10516       compiled on a system with different endianness.
10517
10518       Decoded patterns can be used for matching in the usual way, and must be
10519       freed by calling pcre2_code_free(). However, be aware that there  is  a
10520       potential  race  issue if you are using multiple patterns that were de-
10521       coded from a single byte stream in a multithreaded application. A  sin-
10522       gle  copy  of  the character tables is used by all the decoded patterns
10523       and a reference count is used to arrange for its memory to be automati-
10524       cally  freed when the last pattern is freed, but there is no locking on
10525       this reference count. Therefore, if you want to call  pcre2_code_free()
10526       for  these  patterns  in  different  threads, you must arrange your own
10527       locking, and ensure that pcre2_code_free()  cannot  be  called  by  two
10528       threads at the same time.
10529
10530       If  a pattern was processed by pcre2_jit_compile() before being serial-
10531       ized, the JIT data is discarded and so is no longer available  after  a
10532       save/restore  cycle.  You can, however, process a restored pattern with
10533       pcre2_jit_compile() if you wish.
10534
10535
10536AUTHOR
10537
10538       Philip Hazel
10539       University Computing Service
10540       Cambridge, England.
10541
10542
10543REVISION
10544
10545       Last updated: 27 June 2018
10546       Copyright (c) 1997-2018 University of Cambridge.
10547------------------------------------------------------------------------------
10548
10549
10550PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)
10551
10552
10553
10554NAME
10555       PCRE2 - Perl-compatible regular expressions (revised API)
10556
10557PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY
10558
10559       The  full syntax and semantics of the regular expressions that are sup-
10560       ported by PCRE2 are described in the pcre2pattern  documentation.  This
10561       document contains a quick-reference summary of the syntax.
10562
10563
10564QUOTING
10565
10566         \x         where x is non-alphanumeric is a literal x
10567         \Q...\E    treat enclosed characters as literal
10568
10569
10570ESCAPED CHARACTERS
10571
10572       This  table  applies to ASCII and Unicode environments. An unrecognized
10573       escape sequence causes an error.
10574
10575         \a         alarm, that is, the BEL character (hex 07)
10576         \cx        "control-x", where x is any ASCII printing character
10577         \e         escape (hex 1B)
10578         \f         form feed (hex 0C)
10579         \n         newline (hex 0A)
10580         \r         carriage return (hex 0D)
10581         \t         tab (hex 09)
10582         \0dd       character with octal code 0dd
10583         \ddd       character with octal code ddd, or backreference
10584         \o{ddd..}  character with octal code ddd..
10585         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
10586         \xhh       character with hex code hh
10587         \x{hh..}   character with hex code hh..
10588
10589       If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
10590       following are also recognized:
10591
10592         \U         the character "U"
10593         \uhhhh     character with hex code hhhh
10594         \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX
10595
10596       When  \x  is not followed by {, from zero to two hexadecimal digits are
10597       read, but in ALT_BSUX mode \x must be followed by two hexadecimal  dig-
10598       its  to  be  recognized as a hexadecimal escape; otherwise it matches a
10599       literal "x".  Likewise, if \u (in ALT_BSUX mode)  is  not  followed  by
10600       four  hexadecimal  digits or (in EXTRA_ALT_BSUX mode) a sequence of hex
10601       digits in curly brackets, it matches a literal "u".
10602
10603       Note that \0dd is always an octal code. The treatment of backslash fol-
10604       lowed  by  a non-zero digit is complicated; for details see the section
10605       "Non-printing characters" in the pcre2pattern documentation, where  de-
10606       tails  of  escape  processing  in  EBCDIC  environments are also given.
10607       \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
10608       EBCDIC  environments.  Note  that  \N  not followed by an opening curly
10609       bracket has a different meaning (see below).
10610
10611
10612CHARACTER TYPES
10613
10614         .          any character except newline;
10615                      in dotall mode, any character whatsoever
10616         \C         one code unit, even in UTF mode (best avoided)
10617         \d         a decimal digit
10618         \D         a character that is not a decimal digit
10619         \h         a horizontal white space character
10620         \H         a character that is not a horizontal white space character
10621         \N         a character that is not a newline
10622         \p{xx}     a character with the xx property
10623         \P{xx}     a character without the xx property
10624         \R         a newline sequence
10625         \s         a white space character
10626         \S         a character that is not a white space character
10627         \v         a vertical white space character
10628         \V         a character that is not a vertical white space character
10629         \w         a "word" character
10630         \W         a "non-word" character
10631         \X         a Unicode extended grapheme cluster
10632
10633       \C is dangerous because it may leave the current matching point in  the
10634       middle of a UTF-8 or UTF-16 character. The application can lock out the
10635       use of \C by setting the PCRE2_NEVER_BACKSLASH_C  option.  It  is  also
10636       possible to build PCRE2 with the use of \C permanently disabled.
10637
10638       By  default,  \d, \s, and \w match only ASCII characters, even in UTF-8
10639       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
10640       matching  is  happening,  \s and \w may also match characters with code
10641       points in the range 128-255. If the PCRE2_UCP option is set, the behav-
10642       iour of these escape sequences is changed to use Unicode properties and
10643       they match many more characters.
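
       The default behaviour of \d, \s, and \w can be sketched in C as follows.
       This is an illustration only: it  assumes  the  default  (C  locale)
       character tables with PCRE2_UCP unset, and  the  function  name
       default_type_matches is hypothetical, not a PCRE2 function.

```c
#include <assert.h>

/* Returns non-zero if code point c matches \d, \s, or \w (type is 'd', 's',
   or 'w') under the default ASCII-only rules. */
int default_type_matches(char type, unsigned c)
{
    int is_digit = c >= '0' && c <= '9';
    int is_alpha = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');

    switch (type) {
    case 'd': return is_digit;
    case 's': return c == ' ' || (c >= 0x09 && c <= 0x0D); /* HT LF VT FF CR */
    case 'w': return is_digit || is_alpha || c == '_';
    default:
        assert(!"type must be 'd', 's', or 'w'");
        return 0;
    }
}
```

       Under these rules U+0663 (Arabic-Indic digit three) does not  match  \d
       and U+00E9 (e with acute) does not match \w; both would match if
       PCRE2_UCP were set.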
10644
10645       Property descriptions in \p and \P are matched caselessly; hyphens, un-
10646       derscores,  and  white  space are ignored, in accordance with Unicode's
10647       "loose matching" rules.
10648
10649
10650GENERAL CATEGORY PROPERTIES FOR \p and \P
10651
10652         C          Other
10653         Cc         Control
10654         Cf         Format
10655         Cn         Unassigned
10656         Co         Private use
10657         Cs         Surrogate
10658
10659         L          Letter
10660         Ll         Lower case letter
10661         Lm         Modifier letter
10662         Lo         Other letter
10663         Lt         Title case letter
10664         Lu         Upper case letter
10665         Lc         Ll, Lu, or Lt
10666         L&         Ll, Lu, or Lt
10667
10668         M          Mark
10669         Mc         Spacing mark
10670         Me         Enclosing mark
10671         Mn         Non-spacing mark
10672
10673         N          Number
10674         Nd         Decimal number
10675         Nl         Letter number
10676         No         Other number
10677
10678         P          Punctuation
10679         Pc         Connector punctuation
10680         Pd         Dash punctuation
10681         Pe         Close punctuation
10682         Pf         Final punctuation
10683         Pi         Initial punctuation
10684         Po         Other punctuation
10685         Ps         Open punctuation
10686
10687         S          Symbol
10688         Sc         Currency symbol
10689         Sk         Modifier symbol
10690         Sm         Mathematical symbol
10691         So         Other symbol
10692
10693         Z          Separator
10694         Zl         Line separator
10695         Zp         Paragraph separator
10696         Zs         Space separator
10697
10698
10699PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
10700
10701         Xan        Alphanumeric: union of properties L and N
10702         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
10703         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
10705                      represented by a Universal Character Name
10706         Xwd        Perl word: property Xan or underscore
10707
10708       Perl and POSIX space are now the same. Perl added VT to its space char-
10709       acter set at release 5.18.
10710
10711
10712BINARY PROPERTIES FOR \p AND \P
10713
10714       Unicode  defines  a  number  of  binary properties, that is, properties
10715       whose only values are true or false. You can obtain  a  list  of  those
10716       that  are  recognized  by \p and \P, along with their abbreviations, by
10717       running this command:
10718
10719         pcre2test -LP
10720
10721
10722SCRIPT MATCHING WITH \p AND \P
10723
10724       Many script names and their 4-letter abbreviations  are  recognized  in
10725       \p{sc:...}  or  \p{scx:...} items, or on their own with \p (and also \P
10726       of course). You can obtain a list of these scripts by running this com-
10727       mand:
10728
10729         pcre2test -LS
10730
10731
10732THE BIDI_CLASS PROPERTY FOR \p AND \P
10733
10734         \p{Bidi_Class:<class>}   matches a character with the given class
10735         \p{BC:<class>}           matches a character with the given class
10736
10737       The recognized classes are:
10738
10739         AL          Arabic letter
10740         AN          Arabic number
10741         B           paragraph separator
10742         BN          boundary neutral
10743         CS          common separator
10744         EN          European number
10745         ES          European separator
10746         ET          European terminator
10747         FSI         first strong isolate
10748         L           left-to-right
10749         LRE         left-to-right embedding
10750         LRI         left-to-right isolate
10751         LRO         left-to-right override
10752         NSM         non-spacing mark
10753         ON          other neutral
10754         PDF         pop directional format
10755         PDI         pop directional isolate
10756         R           right-to-left
10757         RLE         right-to-left embedding
10758         RLI         right-to-left isolate
10759         RLO         right-to-left override
10760         S           segment separator
         WS          white space
10762
10763
10764CHARACTER CLASSES
10765
10766         [...]       positive character class
10767         [^...]      negative character class
10768         [x-y]       range (can be used for hex characters)
10769         [[:xxx:]]   positive POSIX named set
10770         [[:^xxx:]]  negative POSIX named set
10771
10772         alnum       alphanumeric
10773         alpha       alphabetic
10774         ascii       0-127
10775         blank       space or tab
10776         cntrl       control character
10777         digit       decimal digit
10778         graph       printing, excluding space
10779         lower       lower case letter
10780         print       printing, including space
10781         punct       printing, excluding alphanumeric
10782         space       white space
10783         upper       upper case letter
10784         word        same as \w
10785         xdigit      hexadecimal digit
10786
10787       In  PCRE2, POSIX character set names recognize only ASCII characters by
10788       default, but some of them use Unicode properties if PCRE2_UCP  is  set.
10789       You can use \Q...\E inside a character class.
10790
10791
10792QUANTIFIERS
10793
10794         ?           0 or 1, greedy
10795         ?+          0 or 1, possessive
10796         ??          0 or 1, lazy
10797         *           0 or more, greedy
10798         *+          0 or more, possessive
10799         *?          0 or more, lazy
10800         +           1 or more, greedy
10801         ++          1 or more, possessive
10802         +?          1 or more, lazy
10803         {n}         exactly n
10804         {n,m}       at least n, no more than m, greedy
10805         {n,m}+      at least n, no more than m, possessive
10806         {n,m}?      at least n, no more than m, lazy
10807         {n,}        n or more, greedy
10808         {n,}+       n or more, possessive
10809         {n,}?       n or more, lazy
10810
10811
10812ANCHORS AND SIMPLE ASSERTIONS
10813
10814         \b          word boundary
10815         \B          not a word boundary
10816         ^           start of subject
10817                       also after an internal newline in multiline mode
10818                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
10819         \A          start of subject
10820         $           end of subject
10821                       also before newline at end of subject
10822                       also before internal newline in multiline mode
10823         \Z          end of subject
10824                       also before newline at end of subject
10825         \z          end of subject
10826         \G          first matching position in subject
10827
10828
10829REPORTED MATCH POINT SETTING
10830
10831         \K          set reported start of match
10832
10833       From  release 10.38 \K is not permitted by default in lookaround asser-
10834       tions, for compatibility with Perl.  However,  if  the  PCRE2_EXTRA_AL-
10835       LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled.
10836       When this option is set, \K is honoured in positive assertions, but ig-
10837       nored in negative ones.
10838
10839
10840ALTERNATION
10841
10842         expr|expr|expr...
10843
10844
10845CAPTURING
10846
10847         (...)           capture group
10848         (?<name>...)    named capture group (Perl)
10849         (?'name'...)    named capture group (Perl)
10850         (?P<name>...)   named capture group (Python)
10851         (?:...)         non-capture group
10852         (?|...)         non-capture group; reset group numbers for
10853                          capture groups in each alternative
10854
10855       In  non-UTF  modes, names may contain underscores and ASCII letters and
10856       digits; in UTF modes, any Unicode letters and  Unicode  decimal  digits
10857       are permitted. In both cases, a name must not start with a digit.
10858
10859
10860ATOMIC GROUPS
10861
10862         (?>...)         atomic non-capture group
10863         (*atomic:...)   atomic non-capture group
10864
10865
10866COMMENT
10867
10868         (?#....)        comment (not nestable)
10869
10870
OPTION SETTING

       Changes  of these options within a group are automatically cancelled at
       the end of the group.
10874
10875         (?i)            caseless
10876         (?J)            allow duplicate named groups
10877         (?m)            multiline
10878         (?n)            no auto capture
10879         (?s)            single line (dotall)
10880         (?U)            default ungreedy (lazy)
10881         (?x)            extended: ignore white space except in classes
10882         (?xx)           as (?x) but also ignore space and tab in classes
10883         (?-...)         unset option(s)
10884         (?^)            unset imnsx options
10885
       Unsetting x or xx unsets both. Several options may be set at once,  and
       a mixture of setting and unsetting such as (?i-x) is allowed, but there
       may be only one hyphen. Setting (but  not  unsetting)  is  allowed  after
       (?^, for example (?^in). An option setting may appear at the start of a
       non-capture group, for example (?i:...).
10891
10892       The following are recognized only at the very start of a pattern or af-
10893       ter one of the newline or \R options with similar syntax. More than one
10894       of them may appear. For the first three, d is a decimal number.
10895
         (*LIMIT_DEPTH=d)     set the backtracking limit to d
         (*LIMIT_HEAP=d)      set the heap size limit to d * 1024 bytes
         (*LIMIT_MATCH=d)     set the match limit to d
         (*NOTEMPTY)          set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART)  set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS)   no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)            disable JIT optimization
         (*NO_START_OPT)      no start-match optimization
                                (PCRE2_NO_START_OPTIMIZE)
         (*UTF)               set appropriate UTF mode for the library in use
         (*UCP)               set PCRE2_UCP (use Unicode properties for \d etc)
10907
10908       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce  the
10909       value   of   the   limits   set  by  the  caller  of  pcre2_match()  or
10910       pcre2_dfa_match(), not increase them. LIMIT_RECURSION  is  an  obsolete
10911       synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
10912       and (*UCP) by setting the PCRE2_NEVER_UTF or  PCRE2_NEVER_UCP  options,
10913       respectively, at compile time.
10914
10915
10916NEWLINE CONVENTION
10917
10918       These are recognized only at the very start of the pattern or after op-
10919       tion settings with a similar syntax.
10920
10921         (*CR)           carriage return only
10922         (*LF)           linefeed only
10923         (*CRLF)         carriage return followed by linefeed
10924         (*ANYCRLF)      all three of the above
10925         (*ANY)          any Unicode newline sequence
10926         (*NUL)          the NUL character (binary zero)
10927
10928
10929WHAT \R MATCHES
10930
       These are recognized only at the very start of the  pattern  or  after
       option settings with a similar syntax.
10933
10934         (*BSR_ANYCRLF)  CR, LF, or CRLF
10935         (*BSR_UNICODE)  any Unicode newline sequence
10936
10937
10938LOOKAHEAD AND LOOKBEHIND ASSERTIONS
10939
10940         (?=...)                     )
10941         (*pla:...)                  ) positive lookahead
10942         (*positive_lookahead:...)   )
10943
10944         (?!...)                     )
10945         (*nla:...)                  ) negative lookahead
10946         (*negative_lookahead:...)   )
10947
10948         (?<=...)                    )
10949         (*plb:...)                  ) positive lookbehind
10950         (*positive_lookbehind:...)  )
10951
10952         (?<!...)                    )
10953         (*nlb:...)                  ) negative lookbehind
10954         (*negative_lookbehind:...)  )
10955
10956       Each top-level branch of a lookbehind must be of a fixed length.
10957
10958
10959NON-ATOMIC LOOKAROUND ASSERTIONS
10960
10961       These assertions are specific to PCRE2 and are not Perl-compatible.
10962
10963         (?*...)                                )
10964         (*napla:...)                           ) synonyms
10965         (*non_atomic_positive_lookahead:...)   )
10966
10967         (?<*...)                               )
10968         (*naplb:...)                           ) synonyms
10969         (*non_atomic_positive_lookbehind:...)  )
10970
10971
10972SCRIPT RUNS
10973
10974         (*script_run:...)           ) script run, can be backtracked into
10975         (*sr:...)                   )
10976
10977         (*atomic_script_run:...)    ) atomic script run
10978         (*asr:...)                  )
10979
10980
10981BACKREFERENCES
10982
10983         \n              reference by number (can be ambiguous)
10984         \gn             reference by number
10985         \g{n}           reference by number
10986         \g+n            relative reference by number (PCRE2 extension)
10987         \g-n            relative reference by number
10988         \g{+n}          relative reference by number (PCRE2 extension)
10989         \g{-n}          relative reference by number
10990         \k<name>        reference by name (Perl)
10991         \k'name'        reference by name (Perl)
10992         \g{name}        reference by name (Perl)
10993         \k{name}        reference by name (.NET)
10994         (?P=name)       reference by name (Python)
10995
10996
10997SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
10998
10999         (?R)            recurse whole pattern
11000         (?n)            call subroutine by absolute number
11001         (?+n)           call subroutine by relative number
11002         (?-n)           call subroutine by relative number
11003         (?&name)        call subroutine by name (Perl)
11004         (?P>name)       call subroutine by name (Python)
11005         \g<name>        call subroutine by name (Oniguruma)
11006         \g'name'        call subroutine by name (Oniguruma)
11007         \g<n>           call subroutine by absolute number (Oniguruma)
11008         \g'n'           call subroutine by absolute number (Oniguruma)
11009         \g<+n>          call subroutine by relative number (PCRE2 extension)
11010         \g'+n'          call subroutine by relative number (PCRE2 extension)
11011         \g<-n>          call subroutine by relative number (PCRE2 extension)
11012         \g'-n'          call subroutine by relative number (PCRE2 extension)
11013
11014
11015CONDITIONAL PATTERNS
11016
11017         (?(condition)yes-pattern)
11018         (?(condition)yes-pattern|no-pattern)
11019
11020         (?(n)               absolute reference condition
11021         (?(+n)              relative reference condition
11022         (?(-n)              relative reference condition
11023         (?(<name>)          named reference condition (Perl)
11024         (?('name')          named reference condition (Perl)
11025         (?(name)            named reference condition (PCRE2, deprecated)
11026         (?(R)               overall recursion condition
11027         (?(Rn)              specific numbered group recursion condition
11028         (?(R&name)          specific named group recursion condition
11029         (?(DEFINE)          define groups for reference
11030         (?(VERSION[>]=n.m)  test PCRE2 version
11031         (?(assert)          assertion condition
11032
11033       Note  the  ambiguity of (?(R) and (?(Rn) which might be named reference
11034       conditions or recursion tests. Such a condition  is  interpreted  as  a
11035       reference condition if the relevant named group exists.
11036
11037
11038BACKTRACKING CONTROL
11039
11040       All  backtracking  control  verbs  may be in the form (*VERB:NAME). For
11041       (*MARK) the name is mandatory, for the others it is  optional.  (*SKIP)
11042       changes  its  behaviour if :NAME is present. The others just set a name
       for passing back to the caller, but this is not a name that (*SKIP) can
       see. The following act as soon as they are reached:
11045
11046         (*ACCEPT)       force successful match
11047         (*FAIL)         force backtrack; synonym (*F)
11048         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
11049
11050       The  following  act only when a subsequent match failure causes a back-
11051       track to reach them. They all force a match failure, but they differ in
11052       what happens afterwards. Those that advance the start-of-match point do
11053       so only if the pattern is not anchored.
11054
11055         (*COMMIT)       overall failure, no advance of starting point
11056         (*PRUNE)        advance to next starting character
11057         (*SKIP)         advance to current matching position
11058         (*SKIP:NAME)    advance to position corresponding to an earlier
11059                         (*MARK:NAME); if not found, the (*SKIP) is ignored
11060         (*THEN)         local failure, backtrack to next alternation
11061
11062       The effect of one of these verbs in a group called as a  subroutine  is
11063       confined to the subroutine call.
11064
11065
11066CALLOUTS
11067
11068         (?C)            callout (assumed number 0)
11069         (?Cn)           callout with numerical data n
11070         (?C"text")      callout with string data
11071
11072       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
11073       the start and the end), and the starting delimiter { matched  with  the
11074       ending  delimiter  }. To encode the ending delimiter within the string,
11075       double it.
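
       The delimiter rule can be sketched in  C  as  follows.  The  function
       format_string_callout is a hypothetical helper for building the  pattern
       text of a string callout; it is not part of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Writes "(?C" + start delimiter + text + end delimiter + ")" to out,
   doubling every occurrence of the end delimiter inside text. Returns the
   length written (excluding the terminating NUL), or -1 if the delimiter is
   not one of the allowed ones or the buffer is too small. */
int format_string_callout(char *out, size_t out_size, const char *text,
                          char start_delim)
{
    assert(out != NULL && text != NULL);

    /* Allowed starting delimiters: ` ' " ^ % # $ and { */
    if (start_delim == '\0' || strchr("`'\"^%#${", start_delim) == NULL)
        return -1;
    char end_delim = (start_delim == '{') ? '}' : start_delim;

    /* "(?C" + two delimiters + ")" + text, with end delimiters doubled. */
    size_t need = 6;
    for (const char *p = text; *p != '\0'; p++)
        need += (*p == end_delim) ? 2 : 1;
    if (need + 1 > out_size)
        return -1;

    char *w = out;
    memcpy(w, "(?C", 3);
    w += 3;
    *w++ = start_delim;
    for (const char *p = text; *p != '\0'; p++) {
        if (*p == end_delim)
            *w++ = end_delim;       /* double the ending delimiter */
        *w++ = *p;
    }
    *w++ = end_delim;
    *w++ = ')';
    *w = '\0';
    return (int)(w - out);
}
```

       For example, the text a}b with the { delimiter becomes (?C{a}}b}).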
11076
11077
11078SEE ALSO
11079
11080       pcre2pattern(3),   pcre2api(3),   pcre2callout(3),    pcre2matching(3),
11081       pcre2(3).
11082
11083
11084AUTHOR
11085
11086       Philip Hazel
11087       Retired from University Computing Service
11088       Cambridge, England.
11089
11090
11091REVISION
11092
11093       Last updated: 12 January 2022
11094       Copyright (c) 1997-2022 University of Cambridge.
11095------------------------------------------------------------------------------
11096
11097
11098PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)
11099
11100
11101
11102NAME
       PCRE2 - Perl-compatible regular expressions (revised API)
11104
11105UNICODE AND UTF SUPPORT
11106
11107       PCRE2 is normally built with Unicode support, though if you do not need
11108       it, you can build it  without,  in  which  case  the  library  will  be
11109       smaller. With Unicode support, PCRE2 has knowledge of Unicode character
11110       properties and can process strings of text in UTF-8, UTF-16, and UTF-32
11111       format (depending on the code unit width), but this is not the default.
11112       Unless specifically requested, PCRE2 treats each code unit in a  string
11113       as one character.
11114
11115       There  are two ways of telling PCRE2 to switch to UTF mode, where char-
11116       acters may consist of more than one code unit and the range  of  values
11117       is constrained. The program can call pcre2_compile() with the PCRE2_UTF
11118       option, or the pattern may start with the  sequence  (*UTF).   However,
11119       the  latter  facility  can be locked out by the PCRE2_NEVER_UTF option.
11120       That is, the programmer can prevent the supplier of  the  pattern  from
11121       switching to UTF mode.
11122
11123       Note   that  the  PCRE2_MATCH_INVALID_UTF  option  (see  below)  forces
11124       PCRE2_UTF to be set.
11125
11126       In UTF mode, both the pattern and any subject strings that are  matched
11127       against  it are treated as UTF strings instead of strings of individual
11128       one-code-unit characters. There are also some other changes to the  way
11129       characters are handled, as documented below.
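
       The difference between code units and characters can be sketched  in  C
       for the 8-bit library as follows. This is an illustration  only:  the
       function count_characters is hypothetical, and it  assumes  that  the
       subject is valid UTF-8 when UTF mode is requested.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the characters in an 8-bit subject string. In non-UTF mode every
   code unit is one character. In UTF mode a character may span several code
   units, so UTF-8 continuation bytes (binary 10xxxxxx) are not counted. */
size_t count_characters(const unsigned char *s, size_t length, int utf)
{
    assert(s != NULL || length == 0);

    if (!utf)
        return length;

    size_t count = 0;
    for (size_t i = 0; i < length; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* first code unit of a character */
            count++;
    }
    return count;
}
```

       The five code units 63 61 66 C3 A9 (hex) are five characters in non-UTF
       mode, but only four in UTF mode, because C3 A9 encodes the single
       character U+00E9.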
11130
11131
11132UNICODE PROPERTY SUPPORT
11133
11134       When  PCRE2 is built with Unicode support, the escape sequences \p{..},
11135       \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
11136       ting.   The Unicode properties that can be tested are a subset of those
11137       that Perl supports. Currently they are limited to the general  category
11138       properties such as Lu for an upper case letter or Nd for a decimal num-
11139       ber, the Unicode script  names  such  as  Arabic  or  Han,  Bidi_Class,
11140       Bidi_Control,  and the derived properties Any and LC (synonym L&). Full
11141       lists are given in the pcre2pattern and pcre2syntax  documentation.  In
11142       general,  only the short names for properties are supported.  For exam-
11143       ple, \p{L} matches a letter. Its longer  synonym,  \p{Letter},  is  not
11144       supported. Furthermore, in Perl, many properties may optionally be pre-
11145       fixed by "Is", for compatibility with Perl 5.6. PCRE2 does not  support
11146       this.
11147
11148
11149WIDE CHARACTERS AND UTF MODES
11150
11151       Code points less than 256 can be specified in patterns by either braced
11152       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
11153       Larger  values have to use braced sequences. Unbraced octal code points
11154       up to \777 are also recognized; larger ones can be coded using \o{...}.
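
       As a sketch of the rule for hexadecimal escapes, the following C function
       chooses between the unbraced and braced forms  when  generating  pattern
       text. The function format_hex_escape is hypothetical and is not part  of
       the PCRE2 API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Writes a hexadecimal escape for code point cp into out. Code points below
   256 use the unbraced two-digit form; larger values need braces. Returns the
   number of characters written, or -1 if the buffer is too small. */
int format_hex_escape(char *out, size_t out_size, uint32_t cp)
{
    assert(out != NULL);

    int n = (cp < 256)
        ? snprintf(out, out_size, "\\x%02x", (unsigned)cp)
        : snprintf(out, out_size, "\\x{%x}", (unsigned)cp);

    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}
```

       For example, code point hex b3 is written as \xb3, and code  point  hex
       100 as \x{100}.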
11155
11156       The escape sequence \N{U+<hex digits>} is recognized as another way  of
11157       specifying  a  Unicode character by code point in a UTF mode. It is not
11158       allowed in non-UTF mode.
11159
11160       In UTF mode, repeat quantifiers apply to complete UTF  characters,  not
11161       to individual code units.
11162
11163       In UTF mode, the dot metacharacter matches one UTF character instead of
11164       a single code unit.
11165
11166       In UTF mode, capture group names are not restricted to ASCII,  and  may
11167       contain any Unicode letters and decimal digits, as well as underscore.
11168
11169       The  escape  sequence \C can be used to match a single code unit in UTF
11170       mode, but its use can lead to some strange effects because it breaks up
11171       multi-unit  characters  (see  the description of \C in the pcre2pattern
11172       documentation). For this reason, there is a build-time option that dis-
11173       ables  support  for  \C completely. There is also a less draconian com-
11174       pile-time option for locking out the use of \C when a pattern  is  com-
11175       piled.
11176
11177       The  use  of  \C  is not supported by the alternative matching function
11178       pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
11179       ter  may  consist  of  more  than one code unit. The use of \C in these
11180       modes provokes a match-time error. Also, the JIT optimization does  not
11181       support \C in these modes. If JIT optimization is requested for a UTF-8
11182       or UTF-16 pattern that contains \C, it will not succeed,  and  so  when
11183       pcre2_match() is called, the matching will be carried out by the inter-
11184       pretive function.
11185
11186       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
11187       characters  of  any  code  value,  but, by default, the characters that
11188       PCRE2 recognizes as digits, spaces, or word characters remain the  same
11189       set  as  in  non-UTF mode, all with code points less than 256. This re-
11190       mains true even when PCRE2 is built to include Unicode support, because
11191       to  do  otherwise  would  slow down matching in many common cases. Note
11192       that this also applies to \b and \B, because they are defined in  terms
11193       of  \w  and \W. If you want to test for a wider sense of, say, "digit",
11194       you can use explicit Unicode property tests such  as  \p{Nd}.  Alterna-
11195       tively, if you set the PCRE2_UCP option, the way that the character es-
11196       capes work is changed so that Unicode properties are used to  determine
11197       which  characters  match.  There  are  more  details  in the section on
11198       generic character types in the pcre2pattern documentation.
11199
11200       Similarly, characters that match the POSIX named character classes  are
11201       all low-valued characters, unless the PCRE2_UCP option is set.
11202
11203       However,  the  special horizontal and vertical white space matching es-
11204       capes (\h, \H, \v, and \V) do match all the appropriate Unicode charac-
11205       ters, whether or not PCRE2_UCP is set.
11206
11207
11208UNICODE CASE-EQUIVALENCE
11209
11210       If  either  PCRE2_UTF  or PCRE2_UCP is set, upper/lower case processing
11211       makes use of Unicode properties except for characters whose code points
11212       are less than 128 and that have at most two case-equivalent values. For
11213       these, a direct table lookup is used for speed. A few  Unicode  charac-
11214       ters  such as Greek sigma have more than two code points that are case-
11215       equivalent, and these are treated specially. Setting PCRE2_UCP  without
11216       PCRE2_UTF  allows  Unicode-style  case processing for non-UTF character
11217       encodings such as UCS-2.
11218
11219
11220SCRIPT RUNS
11221
11222       The pattern constructs (*script_run:...) and  (*atomic_script_run:...),
11223       with  synonyms (*sr:...) and (*asr:...), verify that the string matched
11224       within the parentheses is a script run. In concept, a script run  is  a
11225       sequence  of characters that are all from the same Unicode script. How-
11226       ever, because some scripts are commonly used together, and because some
11227       diacritical  and  other marks are used with multiple scripts, it is not
11228       that simple.
11229
11230       Every Unicode character has a Script property, mostly with a value cor-
11231       responding  to the name of a script, such as Latin, Greek, or Cyrillic.
11232       There are also three special values:
11233
11234       "Unknown" is used for code points that have not been assigned, and also
11235       for  the surrogate code points. In the PCRE2 32-bit library, characters
11236       whose code points are greater  than  the  Unicode  maximum  (U+10FFFF),
11237       which  are  accessible  only  in non-UTF mode, are assigned the Unknown
11238       script.
11239
11240       "Common" is used for characters that are used with many scripts.  These
11241       include  punctuation,  emoji,  mathematical, musical, and currency sym-
11242       bols, and the ASCII digits 0 to 9.
11243
11244       "Inherited" is used for characters such as diacritical marks that  mod-
11245       ify a previous character. These are considered to take on the script of
11246       the character that they modify.
11247
       Some Inherited characters are used with many scripts, but many of  them
       are normally used with only a small number of scripts.  For  example,
11250       U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop-
11251       tic.  In  order  to  make it possible to check this, a Unicode property
11252       called Script Extension exists. Its value is a list of scripts that ap-
11253       ply to the character. For the majority of characters, the list contains
11254       just one script, the same one as  the  Script  property.  However,  for
11255       characters  such  as  U+102E0 more than one Script is listed. There are
11256       also some Common characters that have a single,  non-Common  script  in
11257       their Script Extension list.
11258
11259       The next section describes the basic rules for deciding whether a given
11260       string of characters is a script run. Note,  however,  that  there  are
11261       some  special cases involving the Chinese Han script, and an additional
11262       constraint for decimal digits. These are  covered  in  subsequent  sec-
11263       tions.

   Basic script run rules

       A string that is less than two characters long is a script run. This is
       the only case in which an Unknown character can be  part  of  a  script
       run.  Longer strings are checked using only the Script Extensions prop-
       erty, not the basic Script property.

       If a character's Script Extension property is the single value  "Inher-
       ited", it is always accepted as part of a script run. This is also true
       for the property "Common", subject to the checking  of  decimal  digits
       described below. All the remaining characters in a script run must have
       at least one script in common in their Script Extension lists. In  set-
       theoretic terminology, the intersection of all the sets of scripts must
       not be empty.

       A simple example is an Internet name such as "google.com". The  letters
       are all in the Latin script, and the dot is Common, so this string is a
       script run.  However, the Cyrillic letter "o" looks exactly the same as
       the  Latin "o"; a string that looks the same, but with Cyrillic "o"s is
       not a script run.

       More interesting examples involve characters with more than one  script
       in their Script Extension. Consider the following characters:

         U+060C  Arabic comma
         U+06D4  Arabic full stop

       The  first  has the Script Extension list Arabic, Hanifi Rohingya, Syr-
       iac, and Thaana; the second has just Arabic and Hanifi  Rohingya.  Both
       of  them  could  appear  in  script runs of either Arabic or Hanifi Ro-
       hingya. The first could also appear in Syriac or  Thaana  script  runs,
       but the second could not.
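
       The basic rule can be sketched in C as a set intersection.  This  is
       an illustration only, not PCRE2's implementation: the SC_* flags and
       the function is_basic_script_run() are hypothetical names, each char-
       acter is represented directly by its Script Extension set, an  empty
       set stands in for an Unknown character, and the Han and decimal digit
       checks described in the following sections are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-bit-per-script flags; Unicode defines many more scripts. */
enum {
    SC_ARABIC          = 1 << 0,
    SC_HANIFI_ROHINGYA = 1 << 1,
    SC_SYRIAC          = 1 << 2,
    SC_THAANA          = 1 << 3,
    SC_LATIN           = 1 << 4,
    SC_CYRILLIC        = 1 << 5,
    SC_COMMON          = 1 << 6,
    SC_INHERITED       = 1 << 7
};

/* ext[i] is the Script Extension set of the i-th character.
   Returns 1 if the string satisfies the basic script run rule, else 0. */
int is_basic_script_run(const uint32_t *ext, size_t n)
{
    uint32_t shared = ~(uint32_t)0;   /* intersection of the sets seen so far */
    size_t i;

    assert(ext != NULL || n == 0);
    if (n < 2) return 1;              /* short strings are always script runs */

    for (i = 0; i < n; i++) {
        /* A lone Common or Inherited value is always accepted. */
        if (ext[i] == SC_COMMON || ext[i] == SC_INHERITED) continue;
        shared &= ext[i];
        if (shared == 0) return 0;    /* no script left in common */
    }
    return 1;
}
```

       With these definitions, a Syriac letter followed by U+060C passes the
       check, whereas a Syriac letter followed by U+06D4 does  not,  because
       the two Script Extension lists have no script in common.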

   The Chinese Han script

       The  Chinese  Han  script  is  commonly  used in conjunction with other
       scripts for writing certain languages. Japanese uses the  Hiragana  and
       Katakana  scripts  together  with Han; Korean uses Hangul and Han; Tai-
       wanese Mandarin uses Bopomofo and Han.  These  three  combinations  are
       treated  as special cases when checking script runs and are, in effect,
       "virtual scripts". Thus, a script run may contain a  mixture  of  Hira-
       gana,  Katakana,  and Han, or a mixture of Hangul and Han, or a mixture
       of Bopomofo and Han, but not, for example,  a  mixture  of  Hangul  and
       Bopomofo  and  Han. PCRE2 (like Perl) follows Unicode's Technical Stan-
       dard 39 ("Unicode Security Mechanisms",
       http://unicode.org/reports/tr39/) in allowing such mixtures.

   Decimal digits

       Unicode contains many sets of 10 decimal digits in different scripts,
       and some scripts (including the Common script) contain more than one
       set. Some of these decimal digits are visually indistinguishable from
       the common ASCII digits. In addition to the script checking described
       above, if a script run contains any decimal digits, they must all come
       from the same set of 10 adjacent characters.
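
       The digit constraint can be illustrated with a  small  C  sketch.  It
       assumes a hand-written table holding the zero of only a few digit sets
       (a real check needs the complete Unicode data), and the names
       digit_zero, digit_set_of() and digits_from_one_set() are hypothetical,
       not part of PCRE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code point of the digit zero in a few of Unicode's decimal digit sets.
   Illustrative subset only; a real check needs the full Unicode data. */
static const uint32_t digit_zero[] = {
    0x0030,   /* ASCII digits (Common script) */
    0xFF10,   /* fullwidth digits (also Common script) */
    0x0660,   /* Arabic-Indic digits */
    0x06F0,   /* extended Arabic-Indic digits */
    0x0966    /* Devanagari digits */
};

/* Returns the zero of the set that cp belongs to, or 0 if cp is not a
   decimal digit known to the table above. */
uint32_t digit_set_of(uint32_t cp)
{
    size_t i;
    for (i = 0; i < sizeof digit_zero / sizeof digit_zero[0]; i++)
        if (cp >= digit_zero[i] && cp <= digit_zero[i] + 9) return digit_zero[i];
    return 0;
}

/* Returns 1 if every decimal digit in cps[0..n) comes from one set. */
int digits_from_one_set(const uint32_t *cps, size_t n)
{
    uint32_t first = 0;
    size_t i;

    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        uint32_t set = digit_set_of(cps[i]);
        if (set == 0) continue;            /* not a digit: nothing to check */
        if (first == 0) first = set;       /* remember the first set seen */
        else if (set != first) return 0;   /* digits from two different sets */
    }
    return 1;
}
```

       Note that ASCII "1" followed by a fullwidth digit is rejected by this
       check even though both digits belong to the Common script.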


VALIDITY OF UTF STRINGS

       When the PCRE2_UTF option is set, the strings passed  as  patterns  and
       subjects are (by default) checked for validity on entry to the relevant
       functions. If an invalid UTF string is passed, a negative error code is
       returned.  The  code  unit offset to the offending character can be ex-
       tracted from the match data  block  by  calling  pcre2_get_startchar(),
       which is used for this purpose after a UTF error.

       In  some  situations, you may already know that your strings are valid,
       and therefore want to skip these checks in  order  to  improve  perfor-
       mance,  for  example in the case of a long subject string that is being
       scanned repeatedly.  If you set the PCRE2_NO_UTF_CHECK option  at  com-
       pile  time  or at match time, PCRE2 assumes that the pattern or subject
       it is given (respectively) contains only valid UTF code unit sequences.

       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is  set,  the
       result  is undefined and your program may crash or loop indefinitely or
       give incorrect results. There is, however, one mode  of  matching  that
       can  handle  invalid  UTF  subject  strings. This is enabled by passing
       PCRE2_MATCH_INVALID_UTF to pcre2_compile() and is discussed in the next
       section. The rest of this section covers the case when
       PCRE2_MATCH_INVALID_UTF is not set.

       Passing PCRE2_NO_UTF_CHECK to pcre2_compile()  just  disables  the  UTF
       check  for  the  pattern; it does not also apply to subject strings. If
       you want to disable the check for a subject string you must  pass  this
       same option to pcre2_match() or pcre2_dfa_match().

       UTF-16 and UTF-32 strings can indicate their endianness by a special
       code known as a byte-order mark (BOM). The PCRE2 functions do not han-
       dle this, expecting strings to be in host byte order.

       Unless  PCRE2_NO_UTF_CHECK  is  set, a UTF string is checked before any
       other  processing  takes  place.  In  the  case  of  pcre2_match()  and
       pcre2_dfa_match()  calls  with a non-zero starting offset, the check is
       applied only to that part of the subject that could be inspected during
       matching,  and  there is a check that the starting offset points to the
       first code unit of a character or to the end of the subject.  If  there
       are  no  lookbehind  assertions in the pattern, the check starts at the
       starting offset.  Otherwise, it starts at the  length  of  the  longest
       lookbehind  before  the starting offset, or at the start of the subject
       if there are not that many characters before the starting offset.  Note
       that the sequences \b and \B are one-character lookbehinds.
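
       For UTF-8, the two calculations in the paragraph above can be sketched
       as follows. This is an illustration under stated assumptions, not the
       library's code: utf8_offset_is_char_start() and utf8_check_start() are
       hypothetical helpers, the lookbehind length is counted in characters,
       and the bytes being stepped over are assumed to be  well  formed.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if offset is the end of the subject or the first code unit of
   a character (anything other than a 10xxxxxx continuation byte). */
int utf8_offset_is_char_start(const unsigned char *s, size_t len, size_t offset)
{
    assert(s != NULL && offset <= len);
    return offset == len || (s[offset] & 0xc0) != 0x80;
}

/* Returns the offset at which the validity check would begin: up to
   max_lookbehind characters before start, but never before offset 0. */
size_t utf8_check_start(const unsigned char *s, size_t start, size_t max_lookbehind)
{
    size_t pos = start;

    assert(s != NULL);
    while (max_lookbehind-- > 0 && pos > 0) {
        pos--;                                             /* into the previous character */
        while (pos > 0 && (s[pos] & 0xc0) == 0x80) pos--;  /* skip continuation bytes */
    }
    return pos;
}
```

       A pattern that uses \b has a one-character lookbehind, so in this
       sketch the check would begin one whole character before the starting
       offset, however many bytes that character occupies.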

       In  addition  to checking the format of the string, there is a check to
       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
       the  surrogate  area. The so-called "non-character" code points are not
       excluded because Unicode corrigendum #9 makes it clear that they should
       not be.

       Characters  in  the "Surrogate Area" of Unicode are reserved for use by
       UTF-16, where they are used in pairs to encode code points with  values
       greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
       are available independently in the  UTF-8  and  UTF-32  encodings.  (In
       other  words, the whole surrogate thing is a fudge for UTF-16 which un-
       fortunately messes up UTF-8 and UTF-32.)

       Setting PCRE2_NO_UTF_CHECK at compile time does not disable  the  error
       that  is  given if an escape sequence for an invalid Unicode code point
       is encountered in the pattern. If you want to  allow  escape  sequences
       such as \x{d800} (a surrogate code point) you can set the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
       sible only in UTF-8 and UTF-32 modes, because these values are not rep-
       resentable in UTF-16.

   Errors in UTF-8 strings

       The following negative error codes are given for invalid UTF-8 strings:

         PCRE2_ERROR_UTF8_ERR1
         PCRE2_ERROR_UTF8_ERR2
         PCRE2_ERROR_UTF8_ERR3
         PCRE2_ERROR_UTF8_ERR4
         PCRE2_ERROR_UTF8_ERR5

       The string ends with a truncated UTF-8 character;  the  code  specifies
       how  many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
       characters to be no longer than 4 bytes, the  encoding  scheme  (origi-
       nally  defined  by  RFC  2279)  allows  for  up to 6 bytes, and this is
       checked first; hence the possibility of 4 or 5 missing bytes.

         PCRE2_ERROR_UTF8_ERR6
         PCRE2_ERROR_UTF8_ERR7
         PCRE2_ERROR_UTF8_ERR8
         PCRE2_ERROR_UTF8_ERR9
         PCRE2_ERROR_UTF8_ERR10

       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
       the  character  do  not have the binary value 0b10 (that is, either the
       most significant bit is 0, or the next bit is 1).

         PCRE2_ERROR_UTF8_ERR11
         PCRE2_ERROR_UTF8_ERR12

       A character that is valid by the RFC 2279 rules is either 5 or 6  bytes
       long; these code points are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR13

       A 4-byte character has a value greater than 0x10ffff; these code points
       are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR14

       A 3-byte character has a value in the  range  0xd800  to  0xdfff;  this
       range of code points is reserved by RFC 3629 for use with UTF-16, and
       so is excluded from UTF-8.

         PCRE2_ERROR_UTF8_ERR15
         PCRE2_ERROR_UTF8_ERR16
         PCRE2_ERROR_UTF8_ERR17
         PCRE2_ERROR_UTF8_ERR18
         PCRE2_ERROR_UTF8_ERR19

       A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it  codes
       for  a  value that can be represented by fewer bytes, which is invalid.
       For example, the two bytes 0xc0, 0xae give the value 0x2e,  whose  cor-
       rect coding uses just one byte.

         PCRE2_ERROR_UTF8_ERR20

       The two most significant bits of the first byte of a character have the
       binary value 0b10 (that is, the most significant bit is 1 and the  sec-
       ond  is  0). Such a byte can only validly occur as the second or subse-
       quent byte of a multi-byte character.

         PCRE2_ERROR_UTF8_ERR21

       The first byte of a character has the value 0xfe or 0xff. These  values
       can never occur in a valid UTF-8 string.
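
       The categories above can be made concrete with a standalone C sketch.
       It is written from the descriptions in this section and  is  not  the
       code that PCRE2 uses. The function utf8_first_error() is a hypotheti-
       cal name; it returns 0 for a valid string, or the number n (1 to  21)
       of the matching PCRE2_ERROR_UTF8_ERRn category. Apart from the trunca-
       tion test coming first, as stated above, the order in which  the  re-
       maining conditions are tested is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if s[0..len) is valid UTF-8, otherwise a number from 1 to 21
   mirroring the suffix of the PCRE2_ERROR_UTF8_ERRn categories described
   above. On error, *offset is the index of the offending character. */
int utf8_first_error(const unsigned char *s, size_t len, size_t *offset)
{
    /* Smallest value that legitimately needs 1..6 bytes (RFC 2279 scheme). */
    static const uint32_t min_value[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t i = 0;

    assert(s != NULL && offset != NULL);
    while (i < len) {
        unsigned char c = s[i];
        size_t n, k;
        uint32_t v;

        *offset = i;                            /* meaningful only on error */
        if (c < 0x80) { i++; continue; }        /* ASCII */
        if (c < 0xc0) return 20;                /* isolated 10xxxxxx byte */
        if (c >= 0xfe) return 21;               /* 0xfe and 0xff never occur */

        /* Total length implied by the first byte under the 6-byte scheme. */
        n = (c < 0xe0) ? 2 : (c < 0xf0) ? 3 : (c < 0xf8) ? 4 : (c < 0xfc) ? 5 : 6;
        if (len - i < n) return (int)(n - (len - i));   /* 1..5 bytes missing */

        /* Decode while checking that each trailing byte is 10xxxxxx. */
        v = c & (0xffu >> (n + 1));
        for (k = 1; k < n; k++) {
            if ((s[i + k] & 0xc0) != 0x80) return 5 + (int)k;   /* 6..10 */
            v = (v << 6) | (s[i + k] & 0x3fu);
        }

        if (v < min_value[n]) return 13 + (int)n;       /* overlong: 15..19 */
        if (n >= 5) return 6 + (int)n;                  /* 5 or 6 bytes: 11, 12 */
        if (v > 0x10ffff) return 13;                    /* beyond Unicode */
        if (v >= 0xd800 && v <= 0xdfff) return 14;      /* surrogate */
        i += n;
    }
    return 0;
}
```

       Applied to the two bytes 0xc0, 0xae from the example above, this func-
       tion reports category 15, the overlong 2-byte case.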

   Errors in UTF-16 strings

       The  following  negative  error  codes  are  given  for  invalid UTF-16
       strings:

         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
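
       The three UTF-16 conditions can be sketched in the same style. Again,
       utf16_first_error() is a hypothetical helper written from the descrip-
       tions above, returning 0 for a valid string or the number n of the
       matching PCRE2_ERROR_UTF16_ERRn category; code units are assumed to be
       in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if s[0..len) is valid UTF-16, otherwise 1, 2 or 3 as in the
   list above. On error, *offset is the index of the offending code unit. */
int utf16_first_error(const uint16_t *s, size_t len, size_t *offset)
{
    size_t i;

    assert(s != NULL && offset != NULL);
    for (i = 0; i < len; i++) {
        uint16_t c = s[i];

        if ((c & 0xf800) != 0xd800) continue;   /* not a surrogate at all */
        *offset = i;
        if ((c & 0x0400) != 0) return 3;        /* low surrogate with no high */
        if (i + 1 >= len) return 1;             /* high surrogate at the end */
        if ((s[i + 1] & 0xfc00) != 0xdc00) return 2;  /* not followed by a low */
        i++;                                    /* skip the low surrogate */
    }
    return 0;
}
```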


   Errors in UTF-32 strings

       The following  negative  error  codes  are  given  for  invalid  UTF-32
       strings:

         PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
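
       For UTF-32 the check reduces to two range tests per code unit. The
       sketch below uses the hypothetical name utf32_first_error() and re-
       turns 0 for a valid string or the number n of the matching
       PCRE2_ERROR_UTF32_ERRn category. In line with the earlier remark about
       corrigendum #9, "non-character" code points such as U+FFFE are ac-
       cepted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if s[0..len) is valid UTF-32, otherwise 1 or 2 as in the list
   above. On error, *offset is the index of the offending code unit. */
int utf32_first_error(const uint32_t *s, size_t len, size_t *offset)
{
    size_t i;

    assert(s != NULL && offset != NULL);
    for (i = 0; i < len; i++) {
        *offset = i;                                      /* meaningful only on error */
        if (s[i] >= 0xd800 && s[i] <= 0xdfff) return 1;   /* surrogate */
        if (s[i] > 0x10ffff) return 2;                    /* beyond Unicode */
    }
    return 0;
}
```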


MATCHING IN INVALID UTF STRINGS

       You can run pattern matches on subject strings that may contain invalid
       UTF sequences if you call pcre2_compile() with the
       PCRE2_MATCH_INVALID_UTF option. This is supported by pcre2_match(),
       including JIT matching, but not by pcre2_dfa_match(). When
       PCRE2_MATCH_INVALID_UTF is set, it forces PCRE2_UTF to be set as well.
       Note, however, that the pattern itself must be a valid UTF string.

       Setting PCRE2_MATCH_INVALID_UTF does not affect what pcre2_compile()
       generates, but if pcre2_jit_compile() is subsequently called, it does
       generate different code. If JIT is not used, the option affects the
       behaviour of the interpretive code in pcre2_match(). When
       PCRE2_MATCH_INVALID_UTF is set at compile time, PCRE2_NO_UTF_CHECK is
       ignored at match time.

       In  this  mode,  an  invalid  code  unit  sequence in the subject never
       matches any pattern item. It does not match  dot,  it  does  not  match
       \p{Any},  it does not even match negative items such as [^X]. A lookbe-
       hind assertion fails if it encounters an invalid sequence while  moving
       the  current  point backwards. In other words, an invalid UTF code unit
       sequence acts as a barrier which no match can cross.

       You can also think of this as the subject being split up into fragments
       of  valid UTF, delimited internally by invalid code unit sequences. The
       pattern is matched fragment by fragment. The  result  of  a  successful
       match,  however,  is  given  as code unit offsets in the entire subject
       string in the usual way. There are a few points to consider:

       The internal boundaries are not interpreted as the beginnings  or  ends
       of  lines  and  so  do not match circumflex or dollar characters in the
       pattern.

       If pcre2_match() is called with an offset that  points  to  an  invalid
       UTF-sequence,  that  sequence  is  skipped, and the match starts at the
       next valid UTF character, or the end of the subject.

       At internal fragment boundaries, \b and \B behave in the same way as at
       the  beginning  and end of the subject. For example, a sequence such as
       \bWORD\b would match an instance of WORD that is surrounded by  invalid
       UTF code units.

       Using  PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi-
       trary data, knowing that any matched  strings  that  are  returned  are
       valid UTF. This can be useful when searching for UTF text in executable
       or other binary files.
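
       The "fragments of valid UTF" view can be illustrated for  UTF-8  with
       the following C sketch. It only models the splitting and is not how
       PCRE2 works internally: utf8_char_len() and valid_fragments() are hy-
       pothetical helpers, and treating each invalid byte as a separate bar-
       rier is an assumption made for simplicity. Offsets refer to the whole
       subject, as match results do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the valid UTF-8 character at s (1 to 4), or 0 if the
   bytes at s do not start a valid character. avail is the bytes remaining. */
static size_t utf8_char_len(const unsigned char *s, size_t avail)
{
    unsigned char c = s[0];
    size_t n, k;
    uint32_t v;

    if (c < 0x80) return 1;
    if (c < 0xc2 || c > 0xf4) return 0;     /* stray, overlong or out of range */
    n = (c < 0xe0) ? 2 : (c < 0xf0) ? 3 : 4;
    if (avail < n) return 0;                /* truncated */
    v = c & (0xffu >> (n + 1));
    for (k = 1; k < n; k++) {
        if ((s[k] & 0xc0) != 0x80) return 0;
        v = (v << 6) | (s[k] & 0x3fu);
    }
    if (n == 3 && (v < 0x800 || (v >= 0xd800 && v <= 0xdfff))) return 0;
    if (n == 4 && (v < 0x10000 || v > 0x10ffff)) return 0;
    return n;
}

/* Records up to max non-empty fragments of valid UTF-8 as [start, end)
   byte offsets and returns how many were recorded. */
size_t valid_fragments(const unsigned char *s, size_t len,
                       size_t *start, size_t *end, size_t max)
{
    size_t i = 0, count = 0, begin = 0;

    assert(s != NULL && start != NULL && end != NULL);
    while (i <= len) {
        size_t n = (i < len) ? utf8_char_len(s + i, len - i) : 0;
        if (n > 0) { i += n; continue; }    /* still inside a fragment */
        if (i > begin && count < max) {     /* close a non-empty fragment */
            start[count] = begin;
            end[count] = i;
            count++;
        }
        begin = ++i;                        /* resume after the invalid byte */
    }
    return count;
}
```

       For the five bytes "ab", 0xff, "cd" this records the two  fragments
       [0,2) and [3,5); a pattern would be matched within each one but never
       across the 0xff byte.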


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 22 December 2021
       Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------


