-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------


PCRE2(3)                   Library Functions Manual                   PCRE2(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

INTRODUCTION

       PCRE2 is the name used for a revised API for the PCRE library, which is
       a set of functions, written in C,  that  implement  regular  expression
       pattern matching using the same syntax and semantics as Perl, with just
       a few differences. After nearly two decades,  the  limitations  of  the
       original  API  were  making development increasingly difficult. The new
       API is more extensible, and it was simplified by abolishing  the  sepa-
       rate  "study" optimizing function; in PCRE2, patterns are automatically
       optimized where possible. Since forking from PCRE1, the code  has  been
       extensively  refactored and new features introduced. The old library is
       now obsolete and is no longer maintained.

       As well as Perl-style regular expression patterns, some  features  that
       appeared  in  Python and the original PCRE before they appeared in Perl
       are available using the Python syntax. There is also some  support  for
       one  or  two .NET and Oniguruma syntax items, and there are options for
       requesting some minor changes that give better  ECMAScript  (aka  Java-
       Script) compatibility.

       The  source code for PCRE2 can be compiled to support strings of 8-bit,
       16-bit, or 32-bit code units, which means that up to three separate li-
       braries may be installed, one for each code unit size. The size of code
       unit is not related to the bit size of the underlying  hardware.  In  a
       64-bit  environment that also supports 32-bit applications, versions of
       PCRE2 that are compiled in both 64-bit and 32-bit modes may be needed.

       The original work to extend PCRE to 16-bit and 32-bit  code  units  was
       done by Zoltan Herczeg and Christian Persch, respectively. In all three
       cases, strings can be interpreted either  as  one  character  per  code
       unit, or as UTF-encoded Unicode, with support for Unicode general cate-
       gory properties. Unicode support is optional at build time (but is  the
       default). However, processing strings as UTF code units must be enabled
       explicitly at run time. The version of Unicode in use can be discovered
       by running

         pcre2test -C

       The  three  libraries  contain  identical sets of functions, with names
       ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
       pile_8()).  However,  by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
       32, a program that uses just one code unit width can be  written  using
       generic names such as pcre2_compile(), and the documentation is written
       assuming that this is the case.

       In addition to the Perl-compatible matching function, PCRE2 contains an
       alternative  function that matches the same compiled patterns in a dif-
       ferent way. In certain circumstances, the alternative function has some
       advantages.   For  a discussion of the two matching algorithms, see the
       pcre2matching page.

       Details of exactly which Perl regular expression features are  and  are
       not  supported  by  PCRE2  are  given  in  separate  documents. See the
       pcre2pattern and pcre2compat pages. There is a syntax  summary  in  the
       pcre2syntax page.

       Some  features  of PCRE2 can be included, excluded, or changed when the
       library is built. The pcre2_config() function makes it possible  for  a
       client  to  discover  which  features are available. The features them-
       selves are described in the pcre2build page. Documentation about build-
       ing  PCRE2 for various operating systems can be found in the README and
       NON-AUTOTOOLS_BUILD files in the source distribution.

       The libraries contain a number of undocumented internal functions  and
       data  tables  that  are  used by more than one of the exported external
       functions, but which are not intended  for  use  by  external  callers.
       Their  names  all begin with "_pcre2", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external  symbols  are  exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


SECURITY CONSIDERATIONS

       If you are using PCRE2 in a non-UTF application that permits  users  to
       supply  arbitrary  patterns  for  compilation, you should be aware of a
       feature that allows users to turn on UTF support from within a pattern.
       For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
       mode, which interprets patterns and subjects as strings of  UTF-8  code
       units instead of individual 8-bit characters. This causes both the pat-
       tern and any data against which it is matched to be checked  for  UTF-8
       validity.  If the data string is very long, such a check might use suf-
       ficiently many resources as to cause your application to  lose  perfor-
       mance.

       One  way  of guarding against this possibility is to use the pcre2_pat-
       tern_info() function  to  check  the  compiled  pattern's  options  for
       PCRE2_UTF.  Alternatively,  you can set the PCRE2_NEVER_UTF option when
       calling pcre2_compile(). This causes a compile time error if  the  pat-
       tern contains a UTF-setting sequence.

       The  use  of Unicode properties for character types such as \d can also
       be enabled from within the pattern, by specifying "(*UCP)".  This  fea-
       ture can be disallowed by setting the PCRE2_NEVER_UCP option.

       If  your  application  is one that supports UTF, be aware that validity
       checking can take time. If the same data string is to be  matched  many
       times,  you  can  use  the PCRE2_NO_UTF_CHECK option for the second and
       subsequent matches to avoid running redundant checks.

       The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
       to  problems,  because  it  may leave the current matching point in the
       middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C  op-
       tion can be used by an application to lock out the use of \C, causing a
       compile-time error if it is encountered. It is also possible  to  build
       PCRE2 with the use of \C permanently disabled.

       Another  way  that  performance can be hit is by running a pattern that
       has a very large search tree against a string that  will  never  match.
       Nested  unlimited repeats in a pattern are a common example. PCRE2 pro-
       vides some protection against  this:  see  the  pcre2_set_match_limit()
       function  in  the  pcre2api  page.  There  is a similar function called
       pcre2_set_depth_limit() that can be used to restrict the amount of mem-
       ory that is used.


USER DOCUMENTATION

       The  user  documentation for PCRE2 comprises a number of different sec-
       tions. In the "man" format, each of these is a separate "man page".  In
       the  HTML  format, each is a separate page, linked from the index page.
       In the plain  text  format,  the  descriptions  of  the  pcre2grep  and
       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
       respectively. The remaining sections, except for the pcre2demo  section
       (which  is a program listing), and the short pages for individual func-
       tions, are concatenated in pcre2.txt, for ease of searching.  The  sec-
       tions are as follows:

         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
         pcre2callout       details of the pattern callout feature
         pcre2compat        discussion of Perl compatibility
         pcre2convert       details of pattern conversion functions
         pcre2demo          a demonstration C program that uses PCRE2
         pcre2grep          description of the pcre2grep command (8-bit only)
         pcre2jit           discussion of just-in-time optimization support
         pcre2limits        details of size and other limits
         pcre2matching      discussion of the two matching algorithms
         pcre2partial       details of the partial matching facility
         pcre2pattern       syntax and semantics of supported regular
                              expression patterns
         pcre2perform       discussion of performance issues
         pcre2posix         the POSIX-compatible C API for the 8-bit library
         pcre2sample        discussion of the pcre2demo program
         pcre2serialize     details of pattern serialization
         pcre2syntax        quick syntax reference
         pcre2test          description of the pcre2test command
         pcre2unicode       discussion of Unicode and UTF support

       In  the  "man"  and HTML formats, there is also a short page for each C
       library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.

       Putting an actual email address here is a spam magnet. If you  want  to
       email me, use my two names separated by a dot at gmail.com.


REVISION

       Last updated: 27 August 2021
       Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------


PCRE2API(3)                Library Functions Manual                PCRE2API(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

       #include <pcre2.h>

       PCRE2  is  a  new API for PCRE, starting at release 10.0. This document
       contains a description of all its native functions. See the pcre2 docu-
       ment for an overview of all the PCRE2 documentation.


PCRE2 NATIVE API BASIC FUNCTIONS

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       void pcre2_match_data_free(pcre2_match_data *match_data);


PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);


PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       void pcre2_general_context_free(pcre2_general_context *gcontext);


PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const uint8_t *tables);

       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
         uint32_t extra_options);

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
         uint32_t value);


PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       void pcre2_substring_list_free(PCRE2_SPTR *list);

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);


PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacementz,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);


PCRE2 NATIVE API JIT FUNCTIONS

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);


PCRE2 NATIVE API SERIALIZATION FUNCTIONS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(const pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);


PCRE2 NATIVE API AUXILIARY FUNCTIONS

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);

       void pcre2_maketables_free(pcre2_general_context *gcontext,
         const uint8_t *tables);

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       int pcre2_config(uint32_t what, void *where);


PCRE2 NATIVE API OBSOLETE FUNCTIONS

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       These  functions became obsolete at release 10.30 and are retained only
       for backward compatibility. They should not be used in  new  code.  The
       first  is  replaced by pcre2_set_depth_limit(); the second is no longer
       needed and has no effect (it always returns zero).


PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS

       pcre2_convert_context *pcre2_convert_context_create(
         pcre2_general_context *gcontext);

       pcre2_convert_context *pcre2_convert_context_copy(
         pcre2_convert_context *cvcontext);

       void pcre2_convert_context_free(pcre2_convert_context *cvcontext);

       int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
         uint32_t escape_char);

       int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
         uint32_t separator_char);

       int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, PCRE2_UCHAR **buffer,
         PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);

       void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);

       These functions provide a way of  converting  non-PCRE2  patterns  into
       patterns that can be processed by pcre2_compile(). This facility is ex-
       perimental and may be changed in future releases. At  present,  "globs"
       and  POSIX  basic  and  extended patterns can be converted. Details are
       given in the pcre2convert documentation.


PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

       There are three PCRE2 libraries, supporting 8-bit, 16-bit,  and  32-bit
       code  units,  respectively.  However,  there  is  just one header file,
       pcre2.h.  This contains the function prototypes and  other  definitions
       for all three libraries. One, two, or all three can be installed simul-
       taneously. On Unix-like systems the libraries  are  called  libpcre2-8,
       libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
       inal PCRE libraries.

       Character strings are passed to and from a PCRE2 library as a  sequence
       of  unsigned  integers  in  code  units of the appropriate width. Every
       PCRE2 function comes in three different forms, one  for  each  library,
       for example:

         pcre2_compile_8()
         pcre2_compile_16()
         pcre2_compile_32()

       There are also three different sets of data types:

         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32

       The  UCHAR  types define unsigned code units of the appropriate widths.
       For example, PCRE2_UCHAR16 is usually defined as `uint16_t'.  The  SPTR
       types  are  constant  pointers  to the equivalent UCHAR types, that is,
       they are pointers to vectors of unsigned code units.

       Many applications use only one code unit width. For their  convenience,
       macros are defined whose names are the generic forms such as pcre2_com-
       pile() and  PCRE2_SPTR.  These  macros  use  the  value  of  the  macro
       PCRE2_CODE_UNIT_WIDTH  to generate the appropriate width-specific func-
       tion and macro names.  PCRE2_CODE_UNIT_WIDTH is not defined by default.
       An  application  must  define  it  to  be 8, 16, or 32 before including
       pcre2.h in order to make use of the generic names.

       Applications that use more than one code unit width can be linked  with
       more  than  one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
       be 0 before including pcre2.h, and then use the  real  function  names.
       Any  code  that  is to be included in an environment where the value of
       PCRE2_CODE_UNIT_WIDTH is unknown should  also  use  the  real  function
       names. (Unfortunately, it is not possible in C code to save and restore
       the value of a macro.)

       If PCRE2_CODE_UNIT_WIDTH is not defined  before  including  pcre2.h,  a
       compiler error occurs.

       When  using  multiple  libraries  in an application, you must take care
       when processing any particular pattern to use  only  functions  from  a
       single  library.   For example, if you want to run a match using a pat-
       tern that was compiled with pcre2_compile_16(), you  must  do  so  with
       pcre2_match_16(), not pcre2_match_8() or pcre2_match_32().

       In  the  function summaries above, and in the rest of this document and
       other PCRE2 documents, functions and data  types  are  described  using
       their generic names, without the _8, _16, or _32 suffix.


PCRE2 API OVERVIEW

       PCRE2  has  its  own  native  API, which is described in this document.
       There are also some wrapper functions for the 8-bit library that corre-
       spond  to the POSIX regular expression API, but they do not give access
       to all the functionality of PCRE2. They are described in the pcre2posix
       documentation. Both these APIs define a set of C function calls.

       The  native  API  C data types, function prototypes, option values, and
       error codes are defined in the header file pcre2.h, which also contains
       definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
       numbers for the library. Applications can use these to include  support
       for different releases of PCRE2.

       In a Windows environment, if you want to statically link an application
       program against a non-dll PCRE2 library, you must  define  PCRE2_STATIC
       before including pcre2.h.

       The  functions pcre2_compile() and pcre2_match() are used for compiling
       and matching regular expressions in a Perl-compatible manner. A  sample
       program that demonstrates the simplest way of using them is provided in
       the file called pcre2demo.c in the PCRE2 source distribution. A listing
       of  this  program  is  given  in  the  pcre2demo documentation, and the
       pcre2sample documentation describes how to compile and run it.

       The compiling and matching functions recognize various options that are
       passed as bits in an options argument. There are also some more compli-
       cated parameters such as custom memory  management  functions  and  re-
       source  limits  that  are  passed  in "contexts" (which are just memory
       blocks, described below). Simple applications do not need to  make  use
       of contexts.

       Just-in-time  (JIT)  compiler  support  is an optional feature of PCRE2
       that can be built in  appropriate  hardware  environments.  It  greatly
       speeds  up  the matching performance of many patterns. Programs can re-
       quest that it be used if available by calling pcre2_jit_compile() after
       a  pattern has been successfully compiled by pcre2_compile(). This does
       nothing if JIT support is not available.

       More complicated programs might need to  make  use  of  the  specialist
       functions    pcre2_jit_stack_create(),    pcre2_jit_stack_free(),   and
       pcre2_jit_stack_assign() in order to control the JIT code's memory  us-
       age.

       JIT matching is automatically used by pcre2_match() if it is available,
       unless the PCRE2_NO_JIT option is set. There is also a direct interface
       for  JIT  matching,  which gives improved performance at the expense of
       less sanity checking. The JIT-specific functions are discussed  in  the
       pcre2jit documentation.

       A  second  matching function, pcre2_dfa_match(), which is not Perl-com-
       patible, is also provided. This uses  a  different  algorithm  for  the
       matching.  The  alternative  algorithm finds all possible matches (at a
       given point in the subject), and scans the subject  just  once  (unless
       there  are lookaround assertions). However, this algorithm does not re-
       turn captured substrings. A description of the two matching  algorithms
       and  their  advantages  and disadvantages is given in the pcre2matching
       documentation. There is no JIT support for pcre2_dfa_match().

       In addition to the main compiling and  matching  functions,  there  are
       convenience functions for extracting captured substrings from a subject
       string that has been matched by pcre2_match(). They are:

         pcre2_substring_copy_byname()
         pcre2_substring_copy_bynumber()
         pcre2_substring_get_byname()
         pcre2_substring_get_bynumber()
         pcre2_substring_list_get()
         pcre2_substring_length_byname()
         pcre2_substring_length_bynumber()
         pcre2_substring_nametable_scan()
         pcre2_substring_number_from_name()

       pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
       vided,  to  free  memory used for extracted strings. If either of these
       functions is called with a NULL argument, the function returns  immedi-
       ately without doing anything.

       The  function  pcre2_substitute()  can be called to match a pattern and
       return a copy of the subject string with substitutions for  parts  that
       were matched.

       Functions  whose  names begin with pcre2_serialize_ are used for saving
       compiled patterns on disc or elsewhere, and reloading them later.

       Finally, there are functions for finding out information about  a  com-
       piled  pattern  (pcre2_pattern_info()) and about the configuration with
       which PCRE2 was built (pcre2_config()).

       Functions with names ending with _free() are used  for  freeing  memory
       blocks  of  various  sorts.  In all cases, if one of these functions is
       called with a NULL argument, it does nothing.


STRING LENGTHS AND OFFSETS

       The PCRE2 API uses string lengths and  offsets  into  strings  of  code
       units  in  several  places. These values are always of type PCRE2_SIZE,
       which is an unsigned integer type, currently always defined as  size_t.
       The  largest  value  that  can  be  stored  in  such  a  type  (that is
       ~(PCRE2_SIZE)0) is reserved as a special indicator for  zero-terminated
       strings  and  unset offsets.  Therefore, the longest string that can be
       handled is one less than this maximum.


NEWLINES

       PCRE2 supports five different conventions for indicating line breaks in
       strings:  a  single  CR (carriage return) character, a single LF (line-
       feed) character, the two-character sequence CRLF, any of the three pre-
       ceding,  or any Unicode newline sequence. The Unicode newline sequences
       are the three just mentioned, plus the single characters  VT  (vertical
       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
       separator, U+2028), and PS (paragraph separator, U+2029).

       Each of the first three conventions is used by at least  one  operating
       system as its standard newline sequence. When PCRE2 is built, a default
       can be specified.  If it is not, the default is set to LF, which is the
       Unix standard. However, the newline convention can be changed by an ap-
       plication when calling pcre2_compile(), or it can be specified by  spe-
       cial  text at the start of the pattern itself; this overrides any other
       settings. See the pcre2pattern page for details of the special  charac-
       ter sequences.
650
651       In  the  PCRE2  documentation  the  word "newline" is used to mean "the
652       character or pair of characters that indicate a line break". The choice
653       of  newline convention affects the handling of the dot, circumflex, and
654       dollar metacharacters, the handling of #-comments in /x mode, and, when
655       CRLF  is a recognized line ending sequence, the match position advance-
656       ment for a non-anchored pattern. There is more detail about this in the
657       section on pcre2_match() options below.
658
659       The  choice of newline convention does not affect the interpretation of
660       the \n or \r escape sequences, nor does it affect what \R matches; this
661       has its own separate convention.


MULTITHREADING

       In  a multithreaded application it is important to keep thread-specific
       data separate from data that can be shared between threads.  The  PCRE2
       library  code  itself  is  thread-safe: it contains no static or global
       variables. The API is designed to be fairly simple for non-threaded ap-
       plications  while at the same time ensuring that multithreaded applica-
       tions can use it.

       There are several different blocks of data that are used to pass infor-
       mation between the application and the PCRE2 libraries.

   The compiled pattern

       A  pointer  to  the  compiled form of a pattern is returned to the user
       when pcre2_compile() is successful. The data in the compiled pattern is
       fixed,  and  does not change when the pattern is matched. Therefore, it
       is thread-safe, that is, the same compiled pattern can be used by  more
       than one thread simultaneously. For example, an application can compile
       all its patterns at the start, before forking off multiple threads that
       use  them.  However,  if the just-in-time (JIT) optimization feature is
       being used, it needs separate memory stack areas for each  thread.  See
       the pcre2jit documentation for more details.

       In  a more complicated situation, where patterns are compiled only when
       they are first needed, but are still shared between  threads,  pointers
       to  compiled  patterns  must  be protected from simultaneous writing by
       multiple threads. This is somewhat tricky to do correctly. If you  know
       that  writing  to  a pointer is atomic in your environment, you can use
       logic like this:

         Get a read-only (shared) lock (mutex) for pointer
         if (pointer == NULL)
           {
           Get a write (unique) lock for pointer
           if (pointer == NULL) pointer = pcre2_compile(...
           }
         Release the lock
         Use pointer in pcre2_match()

       Of course, testing for compilation errors should also  be  included  in
       the code.

       The  reason  for checking the pointer a second time is as follows: Sev-
       eral threads may have acquired the shared lock and tested  the  pointer
       for being NULL, but only one of them will be given the write lock, with
       the rest kept waiting. The winning thread will compile the pattern  and
       store  the  result.  After this thread releases the write lock, another
       thread will get it, and if it does not retest pointer for  being  NULL,
       will recompile the pattern and overwrite the pointer, creating a memory
       leak and possibly causing other issues.

       In an environment where writing to a pointer may  not  be  atomic,  the
       above  logic  is not sufficient. The thread that is doing the compiling
       may be descheduled after writing only part of the pointer, which  could
       cause  other  threads  to use an invalid value. Instead of checking the
       pointer itself, a separate "pointer is valid" flag (that can be updated
       atomically) must be used:

         Get a read-only (shared) lock (mutex) for pointer
         if (!pointer_is_valid)
           {
           Get a write (unique) lock for pointer
           if (!pointer_is_valid)
             {
             pointer = pcre2_compile(...
             pointer_is_valid = TRUE
             }
           }
         Release the lock
         Use pointer in pcre2_match()

       If JIT is being used, but the JIT compilation is not being done immedi-
       ately (perhaps waiting to see if the pattern  is  used  often  enough),
       similar  logic  is required. JIT compilation updates a value within the
       compiled code block, so a thread must gain unique write access  to  the
       pointer     before    calling    pcre2_jit_compile().    Alternatively,
       pcre2_code_copy() or pcre2_code_copy_with_tables() can be used  to  ob-
       tain  a  private  copy of the compiled code before calling the JIT com-
       piler.

   Context blocks

       The next main section below introduces the idea of "contexts" in  which
       PCRE2 functions are called. A context is nothing more than a collection
       of parameters that control the way PCRE2 operates. Grouping a number of
       parameters together in a context is a convenient way of passing them to
       a PCRE2 function without using lots of arguments. The  parameters  that
       are  stored  in  contexts  are in some sense "advanced features" of the
       API. Many straightforward applications will not need to use contexts.

       In a multithreaded application, if the parameters in a context are val-
       ues  that  are  never  changed, the same context can be used by all the
       threads. However, if any thread needs to change any value in a context,
       it must make its own thread-specific copy.

   Match blocks

       The  matching  functions need a block of memory for storing the results
       of a match. This includes details of what was matched, as well as addi-
       tional  information  such as the name of a (*MARK) setting. Each thread
       must provide its own copy of this memory.


PCRE2 CONTEXTS

       Some PCRE2 functions have a lot of parameters, many of which  are  used
       only  by  specialist  applications,  for example, those that use custom
       memory management or non-standard character tables.  To  keep  function
       argument  lists  at a reasonable size, and at the same time to keep the
       API extensible, "uncommon" parameters are passed to  certain  functions
       in  a  context instead of directly. A context is just a block of memory
       that holds the parameter values.  Applications that do not need to  ad-
       just any of the context parameters can pass NULL when a context pointer
       is required.

       There are three different types of context: a general context  that  is
       relevant  for  several  PCRE2 operations, a compile-time context, and a
       match-time context.

   The general context

       At present, this context just contains pointers to (and data  for)  ex-
       ternal  memory management functions that are called from several places
       in the PCRE2 library.  The  context  is  named  "general"  rather  than
       specifically  "memory"  because in future other fields may be added. If
       you do not want to supply your own custom memory management  functions,
       you  do not need to bother with a general context. A general context is
       created by:

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       The two function pointers specify custom memory  management  functions,
       whose prototypes are:

         void *private_malloc(PCRE2_SIZE, void *);
         void  private_free(void *, void *);

       Whenever code in PCRE2 calls these functions, the final argument is the
       value of memory_data. Either of the first two arguments of the creation
       function  may be NULL, in which case the system memory management func-
       tions malloc() and free() are used. (This is not currently  useful,  as
       there  are  no  other  fields in a general context, but in future there
       might be.)  The private_malloc() function is used (if supplied) to  ob-
       tain  memory for storing the context, and all three values are saved as
       part of the context.

       Whenever PCRE2 creates a data block of any kind, the block  contains  a
       pointer  to the free() function that matches the malloc() function that
       was used. When the time comes to  free  the  block,  this  function  is
       called.

       A general context can be copied by calling:

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       The memory used for a general context should be freed by calling:

       void pcre2_general_context_free(pcre2_general_context *gcontext);

       If  this  function  is  passed  a NULL argument, it returns immediately
       without doing anything.

   The compile context

       A compile context is required if you want to provide an external  func-
       tion  for  stack  checking  during compilation or to change the default
       values of any of the following compile-time parameters:

         What \R matches (Unicode newlines or CR, LF, CRLF only)
         PCRE2's character tables
         The newline character sequence
         The compile time nested parentheses limit
         The maximum length of the pattern string
         The extra options bits (none set by default)

       A compile context is also required if you are using custom memory  man-
       agement.   If  none of these apply, just pass NULL as the context argu-
       ment of pcre2_compile().

       A compile context is created, copied, and freed by the following  func-
       tions:

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       A  compile  context  is created with default values for its parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       The  value  must  be PCRE2_BSR_ANYCRLF, to specify that \R matches only
       CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R  matches  any
       Unicode line ending sequence. The value is used by the JIT compiler and
       by  the  two  interpreted   matching   functions,   pcre2_match()   and
       pcre2_dfa_match().

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const uint8_t *tables);

       The  value  must  be  the result of a call to pcre2_maketables(), whose
       only argument is a general context. This function builds a set of char-
       acter tables in the current locale.

       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
         uint32_t extra_options);

       As  PCRE2  has developed, almost all the 32 option bits that are avail-
       able in the options argument of pcre2_compile() have been used  up.  To
       avoid  running  out, the compile context contains a set of extra option
       bits which are used for some newer, assumed rarer, options. This  func-
       tion  sets  those bits. It always sets all the bits (either on or off);
       that is, it replaces any existing setting rather than modifying it. The
       available  options  are defined in the section entitled "Extra compile
       options" below.

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       This  sets a maximum length, in code units, for any pattern string that
       is compiled with this context. If the pattern is longer,  an  error  is
       generated.   This facility is provided so that applications that accept
       patterns from external sources can limit their size. The default is the
       largest  number  that  a  PCRE2_SIZE variable can hold, which is effec-
       tively unlimited.

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       This specifies which characters or character sequences are to be recog-
       nized  as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage
       return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
       two-character  sequence  CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
       of the above), PCRE2_NEWLINE_ANY (any  Unicode  newline  sequence),  or
       PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).

       A pattern can override the value set in the compile context by starting
       with a sequence such as (*CRLF). See the pcre2pattern page for details.

       When a  pattern  is  compiled  with  the  PCRE2_EXTENDED  or  PCRE2_EX-
       TENDED_MORE  option,  the newline convention affects the recognition of
       the end of internal comments starting with #. The value is  saved  with
       the  compiled pattern for subsequent use by the JIT compiler and by the
       two    interpreted    matching     functions,     pcre2_match()     and
       pcre2_dfa_match().

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       This  parameter  adjusts  the  limit,  set when PCRE2 is built (default
       250), on the depth of parenthesis nesting  in  a  pattern.  This  limit
       stops  rogue  patterns  using  up too much system stack when being com-
       piled. The limit applies to parentheses of all kinds, not just  captur-
       ing parentheses.

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);

       There  is at least one application that runs PCRE2 in threads with very
       limited system stack, where running out of stack is to  be  avoided  at
       all  costs. The parenthesis limit above cannot take account of how much
       stack is actually available during compilation. For  a  finer  control,
       you  can  supply  a  function  that  is called whenever pcre2_compile()
       starts to compile a parenthesized part of a pattern. This function  can
       check  the  actual  stack  size  (or anything else that it wants to, of
       course).

       The first argument to the guard function gives the  current  depth  of
       nesting,  and  the second is user data that is set up by the last argu-
       ment  of  pcre2_set_compile_recursion_guard().  The   guard   function
       should return zero if all is well, or non-zero to force an error.

   The match context

       A match context is required if you want to:

         Set up a callout function
         Set an offset limit for matching an unanchored pattern
         Change the limit on the amount of heap used when matching
         Change the backtracking match limit
         Change the backtracking depth limit
         Set custom memory management specifically for the match

       If  none  of  these  apply,  just  pass NULL as the context argument of
       pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().

       A match context is created, copied, and freed by  the  following  func-
       tions:

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       A  match  context  is  created  with default values for its parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       This  sets  up a callout function for PCRE2 to call at specified points
       during a matching operation. Details are given in the pcre2callout doc-
       umentation.

       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);

       This  sets up a callout function for PCRE2 to call after each substitu-
       tion made by pcre2_substitute(). Details are given in the section enti-
       tled "Creating a new string with substitutions" below.

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       The  offset_limit parameter limits how far an unanchored search can ad-
       vance in the subject string. The  default  value  is  PCRE2_UNSET.  The
       pcre2_match()  and  pcre2_dfa_match()  functions return PCRE2_ERROR_NO-
       MATCH if a match with a starting point before or at the given offset is
       not found. The pcre2_substitute() function makes no more substitutions.

       For  example,  if the pattern /abc/ is matched against "123abc" with an
       offset limit less than 3, the result is  PCRE2_ERROR_NOMATCH.  A  match
       can  never  be  found  if  the  startoffset  argument of pcre2_match(),
       pcre2_dfa_match(), or pcre2_substitute() is  greater  than  the  offset
       limit set in the match context.

       When  using  this facility, you must set the PCRE2_USE_OFFSET_LIMIT op-
       tion when calling pcre2_compile() so that when JIT is in use, different
       code can be compiled. If a match is started with a non-default  offset
       limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.

       The offset limit facility can be used to track progress when  searching
       large  subject  strings or to limit the extent of global substitutions.
       See also the PCRE2_FIRSTLINE option, which requires a  match  to  start
       before  or  at  the first newline that follows the start of matching in
       the subject. If this is set with an offset limit, a match must occur in
       the first line and also within the offset limit. In other words, which-
       ever limit comes first is used.

       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The heap_limit parameter specifies, in units of kibibytes (1024 bytes),
       the  maximum  amount  of heap memory that pcre2_match() may use to hold
       backtracking information when running an interpretive match. This limit
       also applies to pcre2_dfa_match(), which may use the heap when process-
       ing patterns with a lot of nested pattern recursion or  lookarounds  or
       atomic groups. This limit does not apply to matching with the JIT opti-
       mization, which has  its  own  memory  control  arrangements  (see  the
       pcre2jit  documentation for more details). If the limit is reached, the
       negative error code  PCRE2_ERROR_HEAPLIMIT  is  returned.  The  default
       limit  can be set when PCRE2 is built; if it is not, the default is set
       very large and is essentially unlimited.

       A value for the heap limit may also be supplied by an item at the start
       of a pattern of the form

         (*LIMIT_HEAP=ddd)

       where  ddd  is a decimal number. However, such a setting is ignored un-
       less ddd is less than the limit set by the caller of pcre2_match()  or,
       if no such limit is set, less than the default.

       The  pcre2_match() function always needs some heap memory, so setting a
       value of zero guarantees a "heap limit exceeded" error. Details of  how
       pcre2_match()  uses  the  heap are given in the pcre2perform documenta-
       tion.

       For pcre2_dfa_match(), a vector on the system stack is used  when  pro-
       cessing  pattern recursions, lookarounds, or atomic groups, and only if
       this is not big enough is heap memory used. In  this  case,  setting  a
       value of zero disables the use of the heap.

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The match_limit parameter provides a means of preventing PCRE2 from us-
       ing up too many computing resources when processing patterns  that  are
       not going to match, but which have a very large number of possibilities
       in their search trees. The classic  example  is  a  pattern  that  uses
       nested unlimited repeats.

       There  is an internal counter in pcre2_match() that is incremented each
       time round its main matching loop. If  this  value  reaches  the  match
       limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
       This has the effect of limiting the amount  of  backtracking  that  can
       take place. For patterns that are not anchored, the count restarts from
       zero for each position in the subject string. This limit  also  applies
       to pcre2_dfa_match(), though the counting is done in a different way.

       When  pcre2_match() is called with a pattern that was successfully pro-
       cessed by pcre2_jit_compile(), the way in which matching is executed is
       entirely  different. However, there is still the possibility of runaway
       matching that goes on for a very long  time,  and  so  the  match_limit
       value  is  also used in this case (but in a different way) to limit how
       long the matching can continue.

       The default value for the limit can be set when PCRE2 is built;  if  it
       is  not,  the  default is 10 million, which handles all but the most ex-
       treme cases. A value for the match limit may also  be  supplied  by  an
       item at the start of a pattern of the form

         (*LIMIT_MATCH=ddd)

       where  ddd  is a decimal number. However, such a setting is ignored un-
       less ddd is less than the limit set by the caller of  pcre2_match()  or
       pcre2_dfa_match() or, if no such limit is set, less than the default.

       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
         uint32_t value);

       This   parameter   limits   the   depth   of   nested  backtracking  in
       pcre2_match().  Each time a nested backtracking point is passed, a  new
       memory  frame  is used to remember the state of matching at that point.
       Thus, this parameter indirectly limits the amount  of  memory  that  is
       used in a match. However, because the size of each memory frame depends
       on the number of capturing parentheses, the actual memory limit  varies
       from  pattern to pattern. This limit was more useful in versions before
       10.30, where function recursion was used for backtracking.

       The depth limit is not relevant, and is ignored, when matching is  done
       using JIT compiled code. However, it is supported by pcre2_dfa_match(),
       which uses it to limit the depth of nested internal recursive  function
       calls  that implement atomic groups, lookaround assertions, and pattern
       recursions. This limits, indirectly, the amount of system stack that is
       used.  It  was  more useful in versions before 10.32, when stack memory
       was used for local workspace vectors for recursive function calls. From
       version  10.32,  only local variables are allocated on the stack and as
       each call uses only a few hundred bytes, even a small stack can support
       quite a lot of recursion.

       If  the depth of internal recursive function calls is great enough, lo-
       cal workspace vectors are allocated on the heap from version 10.32  on-
       wards,  so  the  depth  limit also indirectly limits the amount of heap
       memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when
       matched  to a very long string using pcre2_dfa_match(), can use a great
       deal of memory. However, it is probably better to limit heap usage  di-
       rectly by calling pcre2_set_heap_limit().

       The  default  value for the depth limit can be set when PCRE2 is built;
       if it is not, the default is set to the same value as the  default  for
       the   match   limit.   If  the  limit  is  exceeded,  pcre2_match()  or
       pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth
       limit  may also be supplied by an item at the start of a pattern of the
       form

         (*LIMIT_DEPTH=ddd)

       where ddd is a decimal number. However, such a setting is  ignored  un-
       less  ddd  is less than the limit set by the caller of pcre2_match() or
       pcre2_dfa_match() or, if no such limit is set, less than the default.


CHECKING BUILD-TIME OPTIONS

       int pcre2_config(uint32_t what, void *where);

       The function pcre2_config() makes it possible for  a  PCRE2  client  to
       find  the  value  of  certain  configuration parameters and to discover
       which optional features have been compiled into the PCRE2 library.  The
       pcre2build documentation has more details about these features.

       The  first  argument  for pcre2_config() specifies which information is
       required. The second argument is a pointer to memory into which the in-
       formation is placed. If NULL is passed, the function returns the amount
       of memory that is needed for the requested information. For calls  that
       return  numerical  values, the value is in bytes; when requesting these
       values, where should point to appropriately aligned memory.  For  calls
       that  return  strings,  the required length is given in code units, not
       counting the terminating zero.

       When requesting information, the returned value from pcre2_config()  is
       non-negative  on success, or the negative error code PCRE2_ERROR_BADOP-
       TION if the value in the first argument is not recognized. The  follow-
       ing information is available:

         PCRE2_CONFIG_BSR

       The  output  is a uint32_t integer whose value indicates what character
       sequences the \R  escape  sequence  matches  by  default.  A  value  of
       PCRE2_BSR_UNICODE  means  that  \R  matches any Unicode line ending se-
       quence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF,
       or CRLF. The default can be overridden when a pattern is compiled.

         PCRE2_CONFIG_COMPILED_WIDTHS

       The  output  is a uint32_t integer whose lower bits indicate which code
       unit widths were selected when PCRE2 was  built.  The  1-bit  indicates
       8-bit  support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
       port, respectively.

         PCRE2_CONFIG_DEPTHLIMIT

       The output is a uint32_t integer that gives the default limit  for  the
       depth  of  nested  backtracking in pcre2_match() or the depth of nested
       recursions, lookarounds, and atomic groups in  pcre2_dfa_match().  Fur-
       ther details are given with pcre2_set_depth_limit() above.

         PCRE2_CONFIG_HEAPLIMIT

       The  output is a uint32_t integer that gives, in kibibytes, the default
       limit  for  the  amount  of  heap  memory  used  by  pcre2_match()   or
       pcre2_dfa_match().      Further      details     are     given     with
       pcre2_set_heap_limit() above.

         PCRE2_CONFIG_JIT

       The output is a uint32_t integer that is set  to  one  if  support  for
       just-in-time compiling is available; otherwise it is set to zero.

         PCRE2_CONFIG_JITTARGET

       The  where  argument  should point to a buffer that is at least 48 code
       units long.  (The  exact  length  required  can  be  found  by  calling
       pcre2_config()  with  where  set  to NULL.) The buffer is filled with a
       string that contains the name of the architecture  for  which  the  JIT
       compiler  is  configured,  for  example "x86 32bit (little endian + un-
       aligned)". If JIT support is not  available,  PCRE2_ERROR_BADOPTION  is
       returned,  otherwise the number of code units used is returned. This is
       the length of the string, plus one unit for the terminating zero.

         PCRE2_CONFIG_LINKSIZE

       The output is a uint32_t integer that contains the number of bytes used
       for  internal  linkage  in  compiled regular expressions. When PCRE2 is
       configured, the value can be set to 2, 3, or 4, with the default  being
       2.  This is the value that is returned by pcre2_config(). However, when
       the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
       when  the  32-bit  library  is compiled, internal linkages always use 4
       bytes, so the configured value is not relevant.

       The default value of 2 for the 8-bit and 16-bit libraries is sufficient
       for  all but the most massive patterns, since it allows the size of the
       compiled pattern to be up to 65535  code  units.  Larger  values  allow
       larger  regular  expressions to be compiled by those two libraries, but
       at the expense of slower matching.

         PCRE2_CONFIG_MATCHLIMIT

       The output is a uint32_t integer that gives the default match limit for
       pcre2_match().  Further  details are given with pcre2_set_match_limit()
       above.

         PCRE2_CONFIG_NEWLINE

       The output is a uint32_t integer  whose  value  specifies  the  default
       character  sequence that is recognized as meaning "newline". The values
       are:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
         PCRE2_NEWLINE_NUL      The NUL character (binary zero)

       The default should normally correspond to  the  standard  sequence  for
       your operating system.

         PCRE2_CONFIG_NEVER_BACKSLASH_C

       The  output  is  a uint32_t integer that is set to one if the use of \C
       was permanently disabled when PCRE2 was built; otherwise it is  set  to
       zero.

         PCRE2_CONFIG_PARENSLIMIT

       The  output is a uint32_t integer that gives the maximum depth of nest-
       ing of parentheses (of any kind) in a pattern. This limit is imposed to
       cap  the  amount of system stack used when a pattern is compiled. It is
       specified when PCRE2 is built; the default is 250. This limit does  not
       take into account the stack that may already be used by the calling ap-
       plication.  For  finer  control  over  compilation  stack  usage,   see
       pcre2_set_compile_recursion_guard().

         PCRE2_CONFIG_STACKRECURSE

       This parameter is obsolete and should not be used in new code. The out-
       put is a uint32_t integer that is always set to zero.

         PCRE2_CONFIG_TABLES_LENGTH

       The output is a uint32_t integer that gives the length of PCRE2's char-
       acter  processing  tables in bytes. For details of these tables see the
       section on locale support below.

         PCRE2_CONFIG_UNICODE_VERSION

       The where argument should point to a buffer that is at  least  24  code
       units  long.  (The  exact  length  required  can  be  found  by calling
       pcre2_config() with where set to NULL.)  If  PCRE2  has  been  compiled
       without  Unicode  support,  the buffer is filled with the text "Unicode
       not supported". Otherwise, the Unicode  version  string  (for  example,
       "8.0.0")  is  inserted. The number of code units used is returned. This
       is the length of the string plus one unit for the terminating zero.

         PCRE2_CONFIG_UNICODE

       The output is a uint32_t integer that is set to one if Unicode  support
       is  available; otherwise it is set to zero. Unicode support implies UTF
       support.
1280
1281         PCRE2_CONFIG_VERSION
1282
1283       The where argument should point to a buffer that is at  least  24  code
1284       units  long.  (The  exact  length  required  can  be  found  by calling
1285       pcre2_config() with where set to NULL.) The buffer is filled  with  the
1286       PCRE2 version string, zero-terminated. The number of code units used is
1287       returned. This is the length of the string plus one unit for the termi-
1288       nating zero.
1289
1290
1291COMPILING A PATTERN
1292
1293       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
1294         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
1295         pcre2_compile_context *ccontext);
1296
1297       void pcre2_code_free(pcre2_code *code);
1298
1299       pcre2_code *pcre2_code_copy(const pcre2_code *code);
1300
1301       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
1302
1303       The  pcre2_compile() function compiles a pattern into an internal form.
1304       The pattern is defined by a pointer to a string of  code  units  and  a
1305       length  (in  code units). If the pattern is zero-terminated, the length
1306       can be specified  as  PCRE2_ZERO_TERMINATED.  The  function  returns  a
1307       pointer to a block of memory that contains the compiled pattern and re-
1308       lated data, or NULL if an error occurred.
1309
1310       If the compile context argument ccontext is NULL, memory for  the  com-
1311       piled  pattern  is  obtained  by calling malloc(). Otherwise, it is ob-
1312       tained from the same memory function that was used for the compile con-
1313       text. The caller must free the memory by calling pcre2_code_free() when
1314       it is no longer needed.  If pcre2_code_free() is called with a NULL ar-
1315       gument, it returns immediately, without doing anything.
1316
1317       The function pcre2_code_copy() makes a copy of the compiled code in new
1318       memory, using the same memory allocator as was used for  the  original.
1319       However,  if  the  code has been processed by the JIT compiler (see be-
1320       low), the JIT information cannot be copied (because it is  position-de-
1321       pendent).   The  new copy can initially be used only for non-JIT match-
1322       ing, though it can be passed to  pcre2_jit_compile()  if  required.  If
1323       pcre2_code_copy() is called with a NULL argument, it returns NULL.
1324
1325       The pcre2_code_copy() function provides a way for individual threads in
1326       a multithreaded application to acquire a private copy  of  shared  com-
1327       piled  code.   However, it does not make a copy of the character tables
1328       used by the compiled pattern; the new pattern code points to  the  same
1329       tables  as  the original code.  (See "Locale Support" below for details
1330       of these character tables.) In many applications the  same  tables  are
1331       used  throughout, so this behaviour is appropriate. Nevertheless, there
1332       are occasions when a copy of a compiled pattern and the relevant tables
1333       are needed. The pcre2_code_copy_with_tables() function does this.
1334       Copies of both the code and the tables are  made,  with  the  new  code
1335       pointing  to the new tables. The memory for the new tables is automati-
1336       cally freed when pcre2_code_free() is called for the new  copy  of  the
1337       compiled  code.  If pcre2_code_copy_with_tables() is called with a NULL
1338       argument, it returns NULL.
1339
1340       NOTE: When one of the matching functions is  called,  pointers  to  the
1341       compiled pattern and the subject string are set in the match data block
1342       so that they can be referenced by the  substring  extraction  functions
1343       after  a  successful match.  After running a match, you must not free a
1344       compiled pattern or a subject string until after all operations on  the
1345       match  data  block have taken place, unless, in the case of the subject
1346       string, you have used the PCRE2_COPY_MATCHED_SUBJECT option,  which  is
1347       described  in  the section entitled "Option bits for pcre2_match()" be-
1348       low.
1349
1350       The options argument for pcre2_compile() contains various bit  settings
1351       that  affect the compilation. It should be zero if none of them are re-
1352       quired. The available options are described below.  Some  of  them  (in
1353       particular,  those  that  are  compatible with Perl, but some others as
1354       well) can also be set and unset from within the pattern  (see  the  de-
1355       tailed description in the pcre2pattern documentation).
1356
1357       For  those options that can be different in different parts of the pat-
1358       tern, the contents of the options argument specifies their settings  at
1359       the  start  of  compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
1360       PCRE2_NO_UTF_CHECK options can be set at the time of matching  as  well
1361       as at compile time.
1362
1363       Some  additional  options and less frequently required compile-time pa-
1364       rameters (for example, the newline setting) can be provided in  a  com-
1365       pile context (as described above).
1366
1367       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1368       diately. Otherwise, the variables to which these point are  set  to  an
1369       error code and an offset (number of code units) within the pattern, re-
1370       spectively, when pcre2_compile() returns NULL because a compilation er-
1371       ror has occurred.
1372
1373       There  are nearly 100 positive error codes that pcre2_compile() may re-
1374       turn if it finds an error in the pattern. There are also some  negative
1375       error  codes that are used for invalid UTF strings when validity check-
1376       ing is in force. These are the  same  as  given  by  pcre2_match()  and
1377       pcre2_dfa_match(), and are described in the pcre2unicode documentation.
1378       There is no separate documentation for the positive  error  codes,  be-
1379       cause  the  textual  error  messages  that  are obtained by calling the
1380       pcre2_get_error_message() function (see "Obtaining a textual error mes-
1381       sage"  below)  should  be  self-explanatory.  Macro names starting with
1382       PCRE2_ERROR_ are defined for both positive and negative error codes  in
1383       pcre2.h.  When  compilation  is  successful errorcode is set to a value
1384       that returns the message "no error" if passed  to  pcre2_get_error_mes-
1385       sage().
1386
1387       The value returned in erroroffset is an indication of where in the pat-
1388       tern an error occurred. When there is no error,  zero  is  returned.  A
1389       non-zero  value  is  not  necessarily the furthest point in the pattern
1390       that was read. For example, after the error  "lookbehind  assertion  is
1391       not  fixed length", the error offset points to the start of the failing
1392       assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of
1393       the first code unit of the failing character.
1394
1395       Some  errors are not detected until the whole pattern has been scanned;
1396       in these cases, the offset passed back is the length  of  the  pattern.
1397       Note  that  the  offset is in code units, not characters, even in a UTF
1398       mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1399       acter.
1400
1401       This  code  fragment shows a typical straightforward call to pcre2_com-
1402       pile():
1403
1404         pcre2_code *re;
1405         PCRE2_SIZE erroffset;
1406         int errorcode;
1407         re = pcre2_compile(
1408           (PCRE2_SPTR)"^A.*Z",    /* the pattern */
1409           PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
1410           0,                      /* default options */
1411           &errorcode,             /* for error code */
1412           &erroffset,             /* for error offset */
1413           NULL);                  /* no compile context */
1414
1415
1416   Main compile options
1417
1418       The following names for option bits are defined in the  pcre2.h  header
1419       file:
1420
1421         PCRE2_ANCHORED
1422
1423       If this bit is set, the pattern is forced to be "anchored", that is, it
1424       is constrained to match only at the first matching point in the  string
1425       that  is being searched (the "subject string"). This effect can also be
1426       achieved by appropriate constructs in the pattern itself, which is  the
1427       only way to do it in Perl.
1428
1429         PCRE2_ALLOW_EMPTY_CLASS
1430
1431       By  default, for compatibility with Perl, a closing square bracket that
1432       immediately follows an opening one is treated as a data  character  for
1433       the  class.  When  PCRE2_ALLOW_EMPTY_CLASS  is  set,  it terminates the
1434       class, which therefore contains no characters and so can never match.
1435
1436         PCRE2_ALT_BSUX
1437
1438       This option requests alternative handling of three  escape  sequences,
1439       which  makes  PCRE2's  behaviour more like ECMAScript (aka JavaScript).
1440       When it is set:
1441
1442       (1) \U matches an upper case "U" character; by default \U causes a com-
1443       pile time error (Perl uses \U to upper case subsequent characters).
1444
1445       (2) \u matches a lower case "u" character unless it is followed by four
1446       hexadecimal digits, in which case the hexadecimal  number  defines  the
1447       code  point  to match. By default, \u causes a compile time error (Perl
1448       uses it to upper case the following character).
1449
1450       (3) \x matches a lower case "x" character unless it is followed by  two
1451       hexadecimal  digits,  in  which case the hexadecimal number defines the
1452       code point to match. By default, as in Perl, a  hexadecimal  number  is
1453       always expected after \x, but it may have zero, one, or two digits (so,
1454       for example, \xz matches a binary zero character followed by z).
1455
1456       ECMAScript 6 added additional functionality to \u. This can be accessed
1457       using  the  PCRE2_EXTRA_ALT_BSUX  extra  option (see "Extra compile op-
1458       tions" below).  Note that this alternative escape handling applies only
1459       to  patterns.  Neither  of  these options affects the processing of re-
1460       placement strings passed to pcre2_substitute().
1461
1462         PCRE2_ALT_CIRCUMFLEX
1463
1464       In  multiline  mode  (when  PCRE2_MULTILINE  is  set),  the  circumflex
1465       metacharacter  matches at the start of the subject (unless PCRE2_NOTBOL
1466       is set), and also after any internal  newline.  However,  it  does  not
1467       match after a newline at the end of the subject, for compatibility with
1468       Perl. If you want a multiline circumflex also to match after  a  termi-
1469       nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
1470
1471         PCRE2_ALT_VERBNAMES
1472
1473       By  default, for compatibility with Perl, the name in any verb sequence
1474       such as (*MARK:NAME) is any sequence of characters that  does  not  in-
1475       clude  a closing parenthesis. The name is not processed in any way, and
1476       it is not possible to include a closing parenthesis in the  name.  How-
1477       ever,  if  the PCRE2_ALT_VERBNAMES option is set, normal backslash pro-
1478       cessing is applied to verb names and only an unescaped  closing  paren-
1479       thesis  terminates the name. A closing parenthesis can be included in a
1480       name either as \) or between  \Q  and  \E.  If  the  PCRE2_EXTENDED  or
1481       PCRE2_EXTENDED_MORE  option  is set with PCRE2_ALT_VERBNAMES, unescaped
1482       whitespace in verb names is skipped and #-comments are recognized,  ex-
1483       actly as in the rest of the pattern.
1484
1485         PCRE2_AUTO_CALLOUT
1486
1487       If  this  bit  is  set,  pcre2_compile()  automatically inserts callout
1488       items, all with number 255, before each pattern  item,  except  immedi-
1489       ately  before  or after an explicit callout in the pattern. For discus-
1490       sion of the callout facility, see the pcre2callout documentation.
1491
1492         PCRE2_CASELESS
1493
1494       If this bit is set, letters in the pattern match both upper  and  lower
1495       case  letters in the subject. It is equivalent to Perl's /i option, and
1496       it can be changed within a pattern by a (?i) option setting. If  either
1497       PCRE2_UTF  or  PCRE2_UCP  is  set,  Unicode properties are used for all
1498       characters with more than one other case, and for all characters  whose
1499       code  points  are  greater  than  U+007F. Note that there are two ASCII
1500       characters, K and S, that, in addition to their lower case ASCII equiv-
1501       alents,  are case-equivalent with U+212A (Kelvin sign) and U+017F (long
1502       S) respectively. For lower valued characters with only one other  case,
1503       a  lookup table is used for speed. When neither PCRE2_UTF nor PCRE2_UCP
1504       is set, a lookup table is used for all code points less than  256,  and
1505       higher  code  points  (available  only  in  16-bit  or 32-bit mode) are
1506       treated as not having another case.
1507
1508         PCRE2_DOLLAR_ENDONLY
1509
1510       If this bit is set, a dollar metacharacter in the pattern matches  only
1511       at  the  end  of the subject string. Without this option, a dollar also
1512       matches immediately before a newline at the end of the string (but  not
1513       before  any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
1514       if PCRE2_MULTILINE is set. There is no equivalent  to  this  option  in
1515       Perl, and no way to set it within a pattern.
1516
1517         PCRE2_DOTALL
1518
1519       If  this  bit  is  set,  a dot metacharacter in the pattern matches any
1520       character, including one that indicates a  newline.  However,  it  only
1521       ever matches one character, even if newlines are coded as CRLF. Without
1522       this option, a dot does not match when the current position in the sub-
1523       ject  is  at  a newline. This option is equivalent to Perl's /s option,
1524       and it can be changed within a pattern by a (?s) option setting. A neg-
1525       ative  class such as [^a] always matches newline characters, and the \N
1526       escape sequence always matches a non-newline character, independent  of
1527       the setting of PCRE2_DOTALL.
1528
1529         PCRE2_DUPNAMES
1530
1531       If  this  bit is set, names used to identify capture groups need not be
1532       unique.  This can be helpful for certain types of pattern  when  it  is
1533       known  that  only  one instance of the named group can ever be matched.
1534       There are more details of named capture  groups  below;  see  also  the
1535       pcre2pattern documentation.
1536
1537         PCRE2_ENDANCHORED
1538
1539       If  this  bit is set, the end of any pattern match must be right at the
1540       end of the string being searched (the "subject string"). If the pattern
1541       match succeeds by reaching (*ACCEPT), but does not reach the end of the
1542       subject, the match fails at the current starting point. For  unanchored
1543       patterns,  a  new  match is then tried at the next starting point. How-
1544       ever, if the match succeeds by reaching the end of the pattern, but not
1545       the  end  of  the subject, backtracking occurs and an alternative match
1546       may be found. Consider these two patterns:
1547
1548         .(*ACCEPT)|..
1549         .|..
1550
1551       If matched against "abc" with PCRE2_ENDANCHORED set, the first  matches
1552       "c"  whereas  the  second matches "bc". The effect of PCRE2_ENDANCHORED
1553       can also be achieved by appropriate constructs in the  pattern  itself,
1554       which is the only way to do it in Perl.
1555
1556       For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
1557       to the first (that is, the  longest)  matched  string.  Other  parallel
1558       matches,  which are necessarily substrings of the first one, must obvi-
1559       ously end before the end of the subject.
1560
1561         PCRE2_EXTENDED
1562
1563       If this bit is set, most white space characters in the pattern are  to-
1564       tally ignored except when escaped or inside a character class. However,
1565       white space is not allowed within sequences such as (?> that  introduce
1566       various  parenthesized groups, nor within numerical quantifiers such as
1567       {1,3}. Ignorable white space is permitted between an item and a follow-
1568       ing  quantifier  and  between a quantifier and a following + that indi-
1569       cates possessiveness. PCRE2_EXTENDED is equivalent to Perl's /x option,
1570       and it can be changed within a pattern by a (?x) option setting.
1571
1572       When  PCRE2  is compiled without Unicode support, PCRE2_EXTENDED recog-
1573       nizes as white space only those characters with code points  less  than
1574       256 that are flagged as white space in its low-character table. The ta-
1575       ble is normally created by pcre2_maketables(), which uses the isspace()
1576       function  to identify space characters. In most ASCII environments, the
1577       relevant characters are those with code  points  0x0009  (tab),  0x000A
1578       (linefeed),  0x000B (vertical tab), 0x000C (formfeed), 0x000D (carriage
1579       return), and 0x0020 (space).
1580
1581       When PCRE2 is compiled with Unicode support, in addition to these char-
1582       acters,  five  more Unicode "Pattern White Space" characters are recog-
1583       nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1584       right  mark), U+200F (right-to-left mark), U+2028 (line separator), and
1585       U+2029 (paragraph separator). This set of characters  is  the  same  as
1586       recognized  by  Perl's /x option. Note that the horizontal and vertical
1587       space characters that are matched by the \h and \v escapes in  patterns
1588       are a much bigger set.
1589
1590       As  well as ignoring most white space, PCRE2_EXTENDED also causes char-
1591       acters between an unescaped # outside a character class  and  the  next
1592       newline,  inclusive,  to be ignored, which makes it possible to include
1593       comments inside complicated patterns. Note that the end of this type of
1594       comment  is a literal newline sequence in the pattern; escape sequences
1595       that happen to represent a newline do not count.
1596
1597       Which characters are interpreted as newlines can be specified by a set-
1598       ting  in  the compile context that is passed to pcre2_compile() or by a
1599       special sequence at the start of the pattern, as described in the  sec-
1600       tion  entitled "Newline conventions" in the pcre2pattern documentation.
1601       A default is defined when PCRE2 is built.
1602
1603         PCRE2_EXTENDED_MORE
1604
1605       This option has the effect of PCRE2_EXTENDED,  but,  in  addition,  un-
1606       escaped  space and horizontal tab characters are ignored inside a char-
1607       acter class. Note: only these two characters are ignored, not the  full
1608       set  of pattern white space characters that are ignored outside a char-
1609       acter class. PCRE2_EXTENDED_MORE is equivalent to  Perl's  /xx  option,
1610       and it can be changed within a pattern by a (?xx) option setting.
1611
1612         PCRE2_FIRSTLINE
1613
1614       If this option is set, the start of an unanchored pattern match must be
1615       before or at the first newline in  the  subject  string  following  the
1616       start  of  matching, though the matched text may continue over the new-
1617       line. If startoffset is non-zero, the limiting newline is not necessar-
1618       ily  the  first  newline  in  the  subject. For example, if the subject
1619       string is "abc\nxyz" (where \n represents a single-character newline) a
1620       pattern  match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is
1621       greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a  more
1622       general  limiting  facility.  If  PCRE2_FIRSTLINE is set with an offset
1623       limit, a match must occur in the first line and also within the  offset
1624       limit. In other words, whichever limit comes first is used.
1625
1626         PCRE2_LITERAL
1627
1628       If this option is set, all meta-characters in the pattern are disabled,
1629       and it is treated as a literal string. Matching literal strings with  a
1630       regular expression engine is not the most efficient way of doing it. If
1631       you are doing a lot of literal matching and  are  worried  about  effi-
1632       ciency, you should consider using other approaches. The only other main
1633       options  that  are  allowed  with  PCRE2_LITERAL  are:  PCRE2_ANCHORED,
1634       PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
1635       PCRE2_MATCH_INVALID_UTF,  PCRE2_NO_START_OPTIMIZE,  PCRE2_NO_UTF_CHECK,
1636       PCRE2_UTF,  and  PCRE2_USE_OFFSET_LIMIT.  The  extra  options PCRE2_EX-
1637       TRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also supported. Any other
1638       options cause an error.
1639
1640         PCRE2_MATCH_INVALID_UTF
1641
1642       This  option  forces PCRE2_UTF (see below) and also enables support for
1643       matching by pcre2_match() in subject strings that contain  invalid  UTF
1644       sequences.   This  facility  is not supported for DFA matching. For de-
1645       tails, see the pcre2unicode documentation.
1646
1647         PCRE2_MATCH_UNSET_BACKREF
1648
1649       If this option is set,  a  backreference  to  an  unset  capture  group
1650       matches  an  empty  string (by default this causes the current matching
1651       alternative to fail).  A pattern such as (\1)(a) succeeds when this op-
1652       tion  is  set  (assuming it can find an "a" in the subject), whereas it
1653       fails by default, for Perl compatibility.  Setting  this  option  makes
1654       PCRE2 behave more like ECMAScript (aka JavaScript).
1655
1656         PCRE2_MULTILINE
1657
1658       By  default,  for  the purposes of matching "start of line" and "end of
1659       line", PCRE2 treats the subject string as consisting of a  single  line
1660       of  characters,  even  if  it actually contains newlines. The "start of
1661       line" metacharacter (^) matches only at the start of  the  string,  and
1662       the  "end  of  line"  metacharacter  ($) matches only at the end of the
1663       string, or before a terminating newline (except  when  PCRE2_DOLLAR_EN-
1664       DONLY is set). Note, however, that unless PCRE2_DOTALL is set, the "any
1665       character" metacharacter (.) does not match at a newline.  This  behav-
1666       iour (for ^, $, and dot) is the same as Perl.
1667
1668       When PCRE2_MULTILINE is set, the "start of line"  and  "end  of  line"
1669       constructs match immediately following or immediately  before  internal
1670       newlines  in  the  subject string, respectively, as well as at the very
1671       start and end. This is equivalent to Perl's /m option, and  it  can  be
1672       changed within a pattern by a (?m) option setting. Note that the "start
1673       of line" metacharacter does not match after a newline at the end of the
1674       subject,  for compatibility with Perl.  However, you can change this by
1675       setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in  a
1676       subject  string,  or  no  occurrences  of  ^ or $ in a pattern, setting
1677       PCRE2_MULTILINE has no effect.
1678
1679         PCRE2_NEVER_BACKSLASH_C
1680
1681       This option locks out the use of \C in the pattern that is  being  com-
1682       piled.   This  escape  can  cause  unpredictable  behaviour in UTF-8 or
1683       UTF-16 modes, because it may leave the current matching  point  in  the
1684       middle of a multi-code-unit character. This option may be useful in ap-
1685       plications that process patterns from external sources. Note that there
1686       is also a build-time option that permanently locks out the use of \C.
1687
1688         PCRE2_NEVER_UCP
1689
1690       This  option  locks  out the use of Unicode properties for handling \B,
1691       \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
1692       described  for  the  PCRE2_UCP option below. In particular, it prevents
1693       the creator of the pattern from enabling this facility by starting  the
1694       pattern  with  (*UCP).  This  option may be useful in applications that
1695       process patterns from external sources. The combination of PCRE2_UCP
1696       and PCRE2_NEVER_UCP causes an error.
1697
1698         PCRE2_NEVER_UTF
1699
1700       This  option  locks out interpretation of the pattern as UTF-8, UTF-16,
1701       or UTF-32, depending on which library is in use. In particular, it pre-
1702       vents  the  creator of the pattern from switching to UTF interpretation
1703       by starting the pattern with (*UTF). This option may be useful  in  ap-
1704       plications that process patterns from external sources. The combination
1705       of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
1706
1707         PCRE2_NO_AUTO_CAPTURE
1708
1709       If this option is set, it disables the use of numbered capturing paren-
1710       theses  in the pattern. Any opening parenthesis that is not followed by
1711       ? behaves as if it were followed by ?: but named parentheses can  still
1712       be used for capturing (and they acquire numbers in the usual way). This
1713       is the same as Perl's /n option.  Note that, when this option  is  set,
1714       references  to  capture  groups (backreferences or recursion/subroutine
1715       calls) may only refer to named groups, though the reference can  be  by
1716       name or by number.
1717
1718         PCRE2_NO_AUTO_POSSESS
1719
1720       If this option is set, it disables "auto-possessification", which is an
1721       optimization that, for example, turns a+b into a++b in order  to  avoid
1722       backtracks  into  a+ that can never be successful. However, if callouts
1723       are in use, auto-possessification means that some  callouts  are  never
1724       taken. You can set this option if you want the matching functions to do
1725       a full unoptimized search and run all the callouts, but  it  is  mainly
1726       provided for testing purposes.
1727
1728         PCRE2_NO_DOTSTAR_ANCHOR
1729
1730       If this option is set, it disables an optimization that is applied when
1731       .* is the first significant item in a top-level branch  of  a  pattern,
1732       and  all  the  other branches also start with .* or with \A or \G or ^.
1733       The optimization is automatically disabled for .* if it  is  inside  an
1734       atomic group or a capture group that is the subject of a backreference,
1735       or if the pattern contains (*PRUNE) or (*SKIP). When  the  optimization
1736       is   not   disabled,  such  a  pattern  is  automatically  anchored  if
1737       PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
1738       for  any  ^ items. Otherwise, the fact that any match must start either
1739       at the start of the subject or following a newline is remembered.  Like
1740       other optimizations, this can cause callouts to be skipped.
1741
1742         PCRE2_NO_START_OPTIMIZE
1743
1744       This  is  an  option whose main effect is at matching time. It does not
1745       change what pcre2_compile() generates, but it does affect the output of
1746       the JIT compiler.
1747
1748       There  are  a  number of optimizations that may occur at the start of a
1749       match, in order to speed up the process. For example, if  it  is  known
1750       that  an  unanchored  match must start with a specific code unit value,
1751       the matching code searches the subject for that value, and fails  imme-
1752       diately  if it cannot find it, without actually running the main match-
1753       ing function. This means that a special item such as (*COMMIT)  at  the
1754       start  of  a  pattern is not considered until after a suitable starting
1755       point for the match has been found.  Also,  when  callouts  or  (*MARK)
1756       items  are  in use, these "start-up" optimizations can cause them to be
1757       skipped if the pattern is never actually used. The  start-up  optimiza-
1758       tions  are  in effect a pre-scan of the subject that takes place before
1759       the pattern is run.
1760
1761       The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1762       possibly  causing  performance  to  suffer,  but ensuring that in cases
1763       where the result is "no match", the callouts do occur, and  that  items
1764       such as (*COMMIT) and (*MARK) are considered at every possible starting
1765       position in the subject string.
1766
1767       Setting PCRE2_NO_START_OPTIMIZE may change the outcome  of  a  matching
1768       operation.  Consider the pattern
1769
1770         (*COMMIT)ABC
1771
1772       When  this  is compiled, PCRE2 records the fact that a match must start
1773       with the character "A". Suppose the subject  string  is  "DEFABC".  The
1774       start-up  optimization  scans along the subject, finds "A" and runs the
1775       first match attempt from there. The (*COMMIT) item means that the  pat-
1776       tern  must  match the current starting position, which in this case, it
1777       does. However, if the same match is  run  with  PCRE2_NO_START_OPTIMIZE
1778       set,  the  initial  scan  along the subject string does not happen. The
1779       first match attempt is run starting  from  "D"  and  when  this  fails,
1780       (*COMMIT)  prevents any further matches being tried, so the overall re-
1781       sult is "no match".
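
       The difference can be illustrated with a toy model in C. This is only a
       sketch of the reasoning above, not PCRE2's matching code: commit_match()
       is a hypothetical helper that treats the pattern as (*COMMIT)  followed
       by a literal string, and models the start-up scan with strchr().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model of matching (*COMMIT) followed by a literal string.
   Returns the offset of the match, or -1 for "no match".
   Hypothetical helper for illustration; not part of PCRE2. */
int commit_match(const char *subject, const char *literal,
                 bool start_optimize)
{
    assert(subject != NULL && literal != NULL && literal[0] != '\0');
    const char *start = subject;

    if (start_optimize) {
        /* Start-up scan: skip to the first occurrence of the first
           code unit of the literal, or give up at once. */
        start = strchr(subject, literal[0]);
        if (start == NULL) return -1;
    }

    /* One match attempt at the chosen starting position. If it fails,
       (*COMMIT) forbids trying any later starting position. */
    if (strncmp(start, literal, strlen(literal)) == 0)
        return (int)(start - subject);
    return -1;
}
```

       With the subject "DEFABC" and the literal "ABC", the model  reports  a
       match at offset 3 when the scan is enabled and "no match" when it is
       disabled, mirroring the two outcomes described above.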
1782
1783       Another start-up optimization makes use  of  a  minimum  length  for  a
1784       matching subject, which is recorded when possible. Consider the pattern
1785
1786         (*MARK:1)B(*MARK:2)(X|Y)
1787
1788       The  minimum  length  for  a match is two characters. If the subject is
1789       "XXBB", the "starting character" optimization skips "XX", then tries to
1790       match  "BB", which is long enough. In the process, (*MARK:2) is encoun-
1791       tered and remembered. When the match attempt fails,  the  next  "B"  is
1792       found,  but  there is only one character left, so there are no more at-
1793       tempts, and "no match" is returned with the "last  mark  seen"  set  to
1794       "2". If PCRE2_NO_START_OPTIMIZE is set, however, matches  are  tried  at
1795       every possible starting position, including at the end of  the  subject,
1796       where (*MARK:1) is encountered, but there is no "B", so  the "last  mark
1797       seen" that is returned is "1". In this case, the  optimizations  do  not
1798       affect the overall match result, which is still  "no  match",  but  they
1799       do affect the auxiliary information that is returned.
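
       The effect on the "last mark seen" can be traced with  a  small  model
       written in C. It is a sketch under stated assumptions,  not  the  PCRE2
       implementation: last_mark_seen() is a hypothetical helper that hard-
       codes the pattern (*MARK:1)B(*MARK:2)(X|Y), its first code unit "B" and
       its minimum length of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model of (*MARK:1)B(*MARK:2)(X|Y). Returns the number of the
   last mark encountered (0 if no match attempt was made). The scan
   stops at the first successful attempt.
   Hypothetical helper for illustration; not PCRE2 code. */
int last_mark_seen(const char *subject, bool start_optimize)
{
    assert(subject != NULL);
    size_t len = strlen(subject);
    int mark = 0;

    for (size_t i = 0; i <= len; i++) {
        if (start_optimize) {
            if (subject[i] != 'B') continue;  /* first code unit scan */
            if (len - i < 2) break;           /* minimum length is 2 */
        }
        mark = 1;                             /* (*MARK:1) */
        if (subject[i] != 'B') continue;      /* attempt fails here */
        mark = 2;                             /* (*MARK:2) */
        if (subject[i + 1] == 'X' || subject[i + 1] == 'Y')
            return mark;                      /* an actual match */
    }
    return mark;
}
```

       For the subject "XXBB" the model yields 2 with the optimizations and  1
       without them, as in the description above.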
1800
1801         PCRE2_NO_UTF_CHECK
1802
1803       When PCRE2_UTF is set, the validity of the pattern as a UTF  string  is
1804       automatically  checked.  There  are  discussions  about the validity of
1805       UTF-8 strings, UTF-16 strings, and UTF-32 strings in  the  pcre2unicode
1806       document.  If an invalid UTF sequence is found, pcre2_compile() returns
1807       a negative error code.
1808
1809       If you know that your pattern is a valid UTF string, and  you  want  to
1810       skip   this   check   for   performance   reasons,   you  can  set  the
1811       PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in-
1812       valid  UTF  string as a pattern is undefined. It may cause your program
1813       to crash or loop.
1814
1815       Note  that  this  option  can  also  be  passed  to  pcre2_match()  and
1816       pcre2_dfa_match(),  to  suppress  UTF  validity checking of the subject
1817       string.
1818
1819       Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1820       able  the error that is given if an escape sequence for an invalid Uni-
1821       code code point is encountered in the pattern. In particular,  the  so-
1822       called  "surrogate"  code points (0xd800 to 0xdfff) are invalid. If you
1823       want to allow escape  sequences  such  as  \x{d800}  you  can  set  the
1824       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  extra  option, as described in the
1825       section entitled "Extra compile options" below.  However, this is  pos-
1826       sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1827       resentable in UTF-16.
1828
1829         PCRE2_UCP
1830
1831       This option has two effects. First, it changes the way PCRE2  processes
1832       \B,  \b,  \D,  \d,  \S,  \s,  \W,  \w,  and some of the POSIX character
1833       classes. By default, only  ASCII  characters  are  recognized,  but  if
1834       PCRE2_UCP is set, Unicode properties are used instead to classify char-
1835       acters. More details are given in  the  section  on  generic  character
1836       types  in  the pcre2pattern page. If you set PCRE2_UCP, matching one of
1837       the items it affects takes much longer.
1838
1839       The second effect of PCRE2_UCP is to force the use of  Unicode  proper-
1840       ties  for  upper/lower casing operations on characters with code points
1841       greater than 127, even when PCRE2_UTF is not set. This makes it  possi-
1842       ble, for example, to process strings in the 16-bit UCS-2 code. This op-
1843       tion is available only if PCRE2 has been compiled with Unicode  support
1844       (which is the default).
1845
1846         PCRE2_UNGREEDY
1847
1848       This  option  inverts  the "greediness" of the quantifiers so that they
1849       are not greedy by default, but become greedy if followed by "?". It  is
1850       not  compatible  with Perl. It can also be set by a (?U) option setting
1851       within the pattern.
1852
1853         PCRE2_USE_OFFSET_LIMIT
1854
1855       This option must be set for pcre2_compile() if pcre2_set_offset_limit()
1856       is  going  to be used to set a non-default offset limit in a match con-
1857       text for matches that use this pattern. An error  is  generated  if  an
1858       offset  limit is set without this option. For more details, see the de-
1859       scription of pcre2_set_offset_limit() in  the  section  that  describes
1860       match contexts. See also the PCRE2_FIRSTLINE option above.
1861
1862         PCRE2_UTF
1863
1864       This  option  causes  PCRE2  to regard both the pattern and the subject
1865       strings that are subsequently processed as strings  of  UTF  characters
1866       instead  of  single-code-unit  strings.  It  is available when PCRE2 is
1867       built to include Unicode support (which is  the  default).  If  Unicode
1868       support is not available, the use of this option provokes an error. De-
1869       tails of how PCRE2_UTF changes the behaviour of PCRE2 are given in  the
1870       pcre2unicode  page.  In  particular,  note  that  it  changes  the  way
1871       PCRE2_CASELESS handles characters with code points greater than 127.
1872
1873   Extra compile options
1874
1875       The option bits that can be set in a compile  context  by  calling  the
1876       pcre2_set_compile_extra_options() function are as follows:
1877
1878         PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
1879
1880       Since release 10.38 PCRE2 has forbidden the use of \K within lookaround
1881       assertions, following Perl's lead. This option is provided to re-enable
1882       the previous behaviour (act in positive lookarounds, ignore in negative
1883       ones) in case anybody is relying on it.
1884
1885         PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
1886
1887       This option applies when compiling a pattern in UTF-8 or  UTF-32  mode.
1888       It  is  forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
1889       "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
1890       in  UTF-16  to  encode  code points with values in the range 0x10000 to
1891       0x10ffff. The surrogates cannot therefore  be  represented  in  UTF-16.
1892       They can be represented in UTF-8 and UTF-32, but are defined as invalid
1893       code points, and cause errors if  encountered  in  a  UTF-8  or  UTF-32
1894       string that is being checked for validity by PCRE2.
1895
1896       These  values also cause errors if encountered in escape sequences such
1897       as \x{d912} within a pattern. However, it seems that some applications,
1898       when using PCRE2 to check for unwanted characters in UTF-8 strings, ex-
1899       plicitly  test  for  the  surrogates  using   escape   sequences.   The
1900       PCRE2_NO_UTF_CHECK  option  does not disable the error that occurs, be-
1901       cause it applies only to the testing of input strings for UTF validity.
1902
1903       If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set,  surro-
1904       gate  code  point values in UTF-8 and UTF-32 patterns no longer provoke
1905       errors and are incorporated in the compiled pattern. However, they  can
1906       only  match  subject characters if the matching function is called with
1907       PCRE2_NO_UTF_CHECK set.
1908
1909         PCRE2_EXTRA_ALT_BSUX
1910
1911       The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u,  and
1912       \x  in  the way that ECMAScript (aka JavaScript) does. Additional func-
1913       tionality was defined by ECMAScript 6; setting PCRE2_EXTRA_ALT_BSUX has
1914       the  effect  of PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..}
1915       as a hexadecimal character code, where hhh.. is any number of hexadeci-
1916       mal digits.
1917
1918         PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
1919
1920       This  is a dangerous option. Use with care. By default, an unrecognized
1921       escape such as \j or a malformed one such as \x{2z} causes  a  compile-
1922       time error when detected by pcre2_compile(). Perl is somewhat inconsis-
1923       tent in handling such items: for example, \j is treated  as  a  literal
1924       "j",  and non-hexadecimal digits in \x{} are just ignored, though warn-
1925       ings are given in both cases if Perl's warning switch is enabled.  How-
1926       ever,  a  malformed  octal  number  after \o{ always causes an error in
1927       Perl.
1928
1929       If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL  extra  option  is  passed  to
1930       pcre2_compile(),  all  unrecognized  or  malformed escape sequences are
1931       treated as single-character escapes. For example, \j is a  literal  "j"
1932       and  \x{2z}  is treated as the literal string "x{2z}". Setting this op-
1933       tion means that typos in patterns may go undetected and have unexpected
1934       results.  Also  note  that a sequence such as [\N{] is interpreted as a
1935       malformed attempt at [\N{...}] and so is treated as [N{]  whereas  [\N]
1936       gives an error because an unqualified \N is a valid escape sequence but
1937       is not supported in a character class. To reiterate: this is a  danger-
1938       ous option. Use with great care.
1939
1940         PCRE2_EXTRA_ESCAPED_CR_IS_LF
1941
1942       There  are  some  legacy applications where the escape sequence \r in a
1943       pattern is expected to match a newline. If this option is set, \r in  a
1944       pattern  is  converted to \n so that it matches a LF (linefeed) instead
1945       of a CR (carriage return) character. The option does not affect a  lit-
1946       eral  CR in the pattern, nor does it affect CR specified as an explicit
1947       code point such as \x{0D}.
1948
1949         PCRE2_EXTRA_MATCH_LINE
1950
1951       This option is provided for use by  the  -x  option  of  pcre2grep.  It
1952       causes  the  pattern  only to match complete lines. This is achieved by
1953       automatically inserting the code for "^(?:" at the start  of  the  com-
1954       piled  pattern  and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
1955       the matched line may be in the middle of the subject string.  This  op-
1956       tion can be used with PCRE2_LITERAL.
1957
1958         PCRE2_EXTRA_MATCH_WORD
1959
1960       This  option  is  provided  for  use  by the -w option of pcre2grep. It
1961       causes the pattern only to match strings that have a word  boundary  at
1962       the  start and the end. This is achieved by automatically inserting the
1963       code for "\b(?:" at the start of the compiled pattern and ")\b" at  the
1964       end.  The option may be used with PCRE2_LITERAL. However, it is ignored
1965       if PCRE2_EXTRA_MATCH_LINE is also set.
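
       A rough textual equivalent of these two options  is  sketched  below  in
       C. This is an approximation for illustration only: PCRE2  inserts  the
       extra items into the compiled code, not into the pattern text,  so  the
       textual form is not suitable for use with PCRE2_LITERAL or for patterns
       that begin with (*XXX) option settings. wrap_pattern() is a hypotheti-
       cal helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of `pattern` wrapped the way the
   MATCH_LINE or MATCH_WORD extra option is described as wrapping it.
   MATCH_LINE takes precedence, as MATCH_WORD is ignored when both are
   set. The caller frees the result. Returns NULL if malloc() fails.
   Hypothetical helper; not part of the PCRE2 API. */
char *wrap_pattern(const char *pattern, bool match_line, bool match_word)
{
    assert(pattern != NULL);
    const char *prefix = "";
    const char *suffix = "";

    if (match_line) {
        prefix = "^(?:";
        suffix = ")$";
    } else if (match_word) {
        prefix = "\\b(?:";
        suffix = ")\\b";
    }

    size_t size = strlen(prefix) + strlen(pattern) + strlen(suffix) + 1;
    char *out = malloc(size);
    if (out != NULL)
        snprintf(out, size, "%s%s%s", prefix, pattern, suffix);
    return out;
}
```

       The non-capturing group matters: without it, a pattern such as  cat|dog
       would anchor only "cat" at the start and only "dog" at the end.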
1966
1967
1968JUST-IN-TIME (JIT) COMPILATION
1969
1970       int pcre2_jit_compile(pcre2_code *code, uint32_t options);
1971
1972       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
1973         PCRE2_SIZE length, PCRE2_SIZE startoffset,
1974         uint32_t options, pcre2_match_data *match_data,
1975         pcre2_match_context *mcontext);
1976
1977       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
1978
1979       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
1980         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
1981
1982       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
1983         pcre2_jit_callback callback_function, void *callback_data);
1984
1985       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
1986
1987       These functions provide support for  JIT  compilation,  which,  if  the
1988       just-in-time  compiler  is available, further processes a compiled pat-
1989       tern into machine code that executes much faster than the pcre2_match()
1990       interpretive  matching function. Full details are given in the pcre2jit
1991       documentation.
1992
1993       JIT compilation is a heavyweight optimization. It can  take  some  time
1994       for  patterns  to  be analyzed, and for one-off matches and simple pat-
1995       terns the benefit of faster execution might be offset by a much  slower
1996       compilation  time.  Most (but not all) patterns can be optimized by the
1997       JIT compiler.
1998
1999
2000LOCALE SUPPORT
2001
2002       const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);
2003
2004       void pcre2_maketables_free(pcre2_general_context *gcontext,
2005         const uint8_t *tables);
2006
2007       PCRE2 handles caseless matching, and determines whether characters  are
2008       letters,  digits, or the like, by reference to a set of tables, indexed
2009       by character code point. However, this applies only to characters whose
2010       code  points  are  less than 256. By default, higher-valued code points
2011       never match escapes such as \w or \d.
2012
2013       When PCRE2 is built with Unicode support (the default), certain Unicode
2014       character  properties  can be tested with \p and \P, or, alternatively,
2015       the PCRE2_UCP option can be set when a pattern is compiled; this causes
2016       \w  and friends to use Unicode property support instead of the built-in
2017       tables.  PCRE2_UCP also causes upper/lower casing operations on charac-
2018       ters with code points greater than 127 to use Unicode properties. These
2019       effects apply even when PCRE2_UTF is not set.
2020
2021       The use of locales with Unicode is discouraged.  If  you  are  handling
2022       characters  with  code  points  greater than 127, you should either use
2023       Unicode support, or use locales, but not try to mix the two.
2024
2025       PCRE2 contains a built-in set of character tables that are used by  de-
2026       fault.   These  are sufficient for many applications. Normally, the in-
2027       ternal tables recognize only ASCII characters. However, when  PCRE2  is
2028       built, it is possible to cause the internal tables to be rebuilt in the
2029       default "C" locale of the local system, which may cause them to be dif-
2030       ferent.
2031
2032       The  built-in tables can be overridden by tables supplied by the appli-
2033       cation that calls PCRE2. These may be created  in  a  different  locale
2034       from  the  default.  As more and more applications change to using Uni-
2035       code, the need for this locale support is expected to die away.
2036
2037       External tables are built by calling the  pcre2_maketables()  function,
2038       in the relevant locale. The only argument to this function is a general
2039       context, which can be used to pass a custom memory  allocator.  If  the
2040       argument is NULL, the system malloc() is used. The result can be passed
2041       to pcre2_compile() as often as necessary, by creating a compile context
2042       and  calling  pcre2_set_character_tables()  to  set  the tables pointer
2043       therein.
2044
2045       For example, to build and use  tables  that  are  appropriate  for  the
2046       French  locale  (where accented characters with values greater than 127
2047       are treated as letters), the following code could be used:
2048
2049         setlocale(LC_CTYPE, "fr_FR");
2050         tables = pcre2_maketables(NULL);
2051         ccontext = pcre2_compile_context_create(NULL);
2052         pcre2_set_character_tables(ccontext, tables);
2053         re = pcre2_compile(..., ccontext);
2054
2055       The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
2056       if you are using Windows, the name for the French locale is "french".
2057
2058       The pointer that is passed (via the compile context) to pcre2_compile()
2059       is saved with the compiled pattern, and the same tables are used by the
2060       matching  functions.  Thus,  for  any  single  pattern, compilation and
2061       matching both happen in the same locale, but different patterns can  be
2062       processed in different locales.
2063
2064       It  is the caller's responsibility to ensure that the memory containing
2065       the tables remains available while they are still in use. When they are
2066       no  longer  needed, you can discard them using pcre2_maketables_free(),
2067       which should pass as its first parameter the same general context  that
2068       was used to create the tables.
2069
2070   Saving locale tables
2071
2072       The  tables  described above are just a sequence of binary bytes, which
2073       makes them independent of hardware characteristics such  as  endianness
2074       or  whether  the processor is 32-bit or 64-bit. A copy of the result of
2075       pcre2_maketables() can therefore be saved in a file  or  elsewhere  and
2076       re-used  later, even in a different program or on another computer. The
2077       size of the tables (number  of  bytes)  must  be  obtained  by  calling
2078       pcre2_config()   with  the  PCRE2_CONFIG_TABLES_LENGTH  option  because
2079       pcre2_maketables()  does  not  return  this  value.   Note   that   the
2080       pcre2_dftables program, which is part of the PCRE2 build system, can be
2081       used stand-alone to create a file that contains a set of binary tables.
2082       See the pcre2build documentation for details.
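
       Since the tables are plain bytes, saving and restoring them  needs  no
       conversion. The following C sketch assumes that the length has  already
       been obtained from pcre2_config() with PCRE2_CONFIG_TABLES_LENGTH;  the
       helper names save_tables and load_tables are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Write `length` bytes of table data to an open binary stream.
   Returns 0 on success, -1 on a short write. `length` is assumed to
   come from pcre2_config() with PCRE2_CONFIG_TABLES_LENGTH.
   Hypothetical helper; not part of the PCRE2 API. */
int save_tables(FILE *fp, const uint8_t *tables, size_t length)
{
    assert(fp != NULL && tables != NULL);
    return fwrite(tables, 1, length, fp) == length ? 0 : -1;
}

/* Read `length` bytes of table data back into a caller-supplied
   buffer. Returns 0 on success, -1 on a short read. */
int load_tables(FILE *fp, uint8_t *buffer, size_t length)
{
    assert(fp != NULL && buffer != NULL);
    return fread(buffer, 1, length, fp) == length ? 0 : -1;
}
```

       A buffer filled by load_tables()  can  then  be  passed  to  pcre2_set_
       character_tables(), and must remain available for as long as  any  pat-
       tern compiled with it is in use.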
2083
2084
2085INFORMATION ABOUT A COMPILED PATTERN
2086
2087       int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
2088
2089       The  pcre2_pattern_info()  function returns general information about a
2090       compiled pattern. For information about callouts, see the next section.
2091       The  first  argument  for pcre2_pattern_info() is a pointer to the com-
2092       piled pattern. The second argument specifies which piece of information
2093       is  required,  and the third argument is a pointer to a variable to re-
2094       ceive the data. If the third argument is NULL, the  first  argument  is
2095       ignored,  and  the  function  returns the size in bytes of the variable
2096       that is required for the information requested. Otherwise, the yield of
2097       the function is zero for success, or one of the following negative num-
2098       bers:
2099
2100         PCRE2_ERROR_NULL           the argument code was NULL
2101         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
2102         PCRE2_ERROR_BADOPTION      the value of what was invalid
2103         PCRE2_ERROR_UNSET          the requested field is not set
2104
2105       The "magic number" is placed at the start of each compiled pattern as a
2106       simple  check  against  passing  an arbitrary memory pointer. Here is a
2107       typical call of pcre2_pattern_info(), to obtain the length of the  com-
2108       piled pattern:
2109
2110         int rc;
2111         size_t length;
2112         rc = pcre2_pattern_info(
2113           re,               /* result of pcre2_compile() */
2114           PCRE2_INFO_SIZE,  /* what is required */
2115           &length);         /* where to put the data */
2116
2117       The possible values for the second argument are defined in pcre2.h, and
2118       are as follows:
2119
2120         PCRE2_INFO_ALLOPTIONS
2121         PCRE2_INFO_ARGOPTIONS
2122         PCRE2_INFO_EXTRAOPTIONS
2123
2124       Return copies of the pattern's options. The third argument should point
2125       to  a  uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the op-
2126       tions that were passed to  pcre2_compile(),  whereas  PCRE2_INFO_ALLOP-
2127       TIONS  returns  the compile options as modified by any top-level (*XXX)
2128       option settings such as (*UTF) at the  start  of  the  pattern  itself.
2129       PCRE2_INFO_EXTRAOPTIONS  returns the extra options that were set in the
2130       compile context by calling the pcre2_set_compile_extra_options()  func-
2131       tion.
2132
2133       For  example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EX-
2134       TENDED option, the result for PCRE2_INFO_ALLOPTIONS  is  PCRE2_EXTENDED
2135       and  PCRE2_UTF.   Option settings such as (?i) that can change within a
2136       pattern do not affect the result of PCRE2_INFO_ALLOPTIONS, even if they
2137       appear  right  at the start of the pattern. (This was different in some
2138       earlier releases.)
2139
2140       A pattern compiled without PCRE2_ANCHORED is automatically anchored  by
2141       PCRE2 if the first significant item in every top-level branch is one of
2142       the following:
2143
2144         ^     unless PCRE2_MULTILINE is set
2145         \A    always
2146         \G    always
2147         .*    sometimes - see below
2148
2149       When .* is the first significant item, anchoring is possible only  when
2150       all the following are true:
2151
2152         .* is not in an atomic group
2153         .* is not in a capture group that is the subject
2154              of a backreference
2155         PCRE2_DOTALL is in force for .*
2156         Neither (*PRUNE) nor (*SKIP) appears in the pattern
2157         PCRE2_NO_DOTSTAR_ANCHOR is not set
2158
2159       For  patterns  that are auto-anchored, the PCRE2_ANCHORED bit is set in
2160       the options returned for PCRE2_INFO_ALLOPTIONS.
2161
2162         PCRE2_INFO_BACKREFMAX
2163
2164       Return the number of the highest  backreference  in  the  pattern.  The
2165       third  argument  should  point  to  a  uint32_t variable. Named capture
2166       groups acquire numbers as well as names, and these  count  towards  the
2167       highest  backreference.  Backreferences  such as \4 or \g{12} match the
2168       captured characters of the given group, but in addition, the check that
2169       a capture group is set in a conditional group such as (?(3)a|b) is also
2170       a backreference.  Zero is returned if there are no backreferences.
2171
2172         PCRE2_INFO_BSR
2173
2174       The output is a uint32_t integer whose value indicates  what  character
2175       sequences  the \R escape sequence matches. A value of PCRE2_BSR_UNICODE
2176       means that \R matches any Unicode line  ending  sequence;  a  value  of
2177       PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.
2178
2179         PCRE2_INFO_CAPTURECOUNT
2180
2181       Return  the  highest  capture  group number in the pattern. In patterns
2182       where (?| is not used, this is also the total number of capture groups.
2183       The third argument should point to a uint32_t variable.
2184
2185         PCRE2_INFO_DEPTHLIMIT
2186
2187       If  the  pattern set a backtracking depth limit by including an item of
2188       the form (*LIMIT_DEPTH=nnnn) at the start, the value is  returned.  The
2189       third argument should point to a uint32_t integer. If no such value has
2190       been set, the call to pcre2_pattern_info() returns the error  PCRE2_ER-
2191       ROR_UNSET. Note that this limit will only be used during matching if it
2192       is less than the limit set or defaulted by  the  caller  of  the  match
2193       function.
2194
2195         PCRE2_INFO_FIRSTBITMAP
2196
2197       In  the absence of a single first code unit for a non-anchored pattern,
2198       pcre2_compile() may construct a 256-bit table that defines a fixed  set
2199       of  values for the first code unit in any match. For example, a pattern
2200       that starts with [abc] results in a table with  three  bits  set.  When
2201       code  unit  values greater than 255 are supported, the flag bit for 255
2202       means "any code unit of value 255 or above". If such a table  was  con-
2203       structed,  a pointer to it is returned. Otherwise NULL is returned. The
2204       third argument should point to a const uint8_t * variable.
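
       A test against such a table might look like the following C sketch. The
       bit layout used here (bit c%8 of byte c/8) is an  assumption  that  is
       not stated above, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Test whether a code unit can start a match according to a 256-bit
   (32-byte) first-code-unit table. ASSUMED layout: bit (c % 8) of
   byte (c / 8). Values above 255 share the flag bit for 255.
   Hypothetical helper; not part of the PCRE2 API. */
bool can_start_match(const uint8_t bitmap[32], uint32_t code_unit)
{
    assert(bitmap != NULL);
    uint32_t c = code_unit > 255 ? 255 : code_unit;
    return (bitmap[c / 8] & (1u << (c % 8))) != 0;
}

/* Set one bit, using the same assumed layout; used here only to
   build a demonstration table by hand. */
void set_start_bit(uint8_t bitmap[32], uint8_t c)
{
    assert(bitmap != NULL);
    bitmap[c / 8] |= (uint8_t)(1u << (c % 8));
}
```

       A table built by hand with the bits for "a", "b" and "c" set  behaves
       like the one described for a pattern that starts with [abc].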
2205
2206         PCRE2_INFO_FIRSTCODETYPE
2207
2208       Return information about the first code unit of any matched string, for
2209       a  non-anchored  pattern. The third argument should point to a uint32_t
2210       variable. If there is a fixed first value, for example, the letter  "c"
2211       from  a  pattern such as (cat|cow|coyote), 1 is returned, and the value
2212       can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is  no  fixed
2213       first  value,  but it is known that a match can occur only at the start
2214       of the subject or following a newline in the subject,  2  is  returned.
2215       Otherwise, and for anchored patterns, 0 is returned.
2216
2217         PCRE2_INFO_FIRSTCODEUNIT
2218
2219       Return  the  value  of  the first code unit of any matched string for a
2220       pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise  return  0.
2221       The  third  argument  should point to a uint32_t variable. In the 8-bit
2222       library, the value is always less than 256. In the 16-bit  library  the
2223       value  can  be  up  to 0xffff. In the 32-bit library in UTF-32 mode the
2224       value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2225       mode.
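
       The upper bounds just listed can be summarized in a few lines of C. The
       function name is hypothetical; the values are those given above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest first-code-unit value that can be reported for a library
   of the given code unit width (8, 16 or 32). The `utf` flag only
   matters for the 32-bit library.
   Hypothetical helper; not part of the PCRE2 API. */
uint32_t max_first_code_unit(int width, bool utf)
{
    assert(width == 8 || width == 16 || width == 32);
    if (width == 8)  return 0xffu;       /* always less than 256 */
    if (width == 16) return 0xffffu;
    return utf ? 0x10ffffu : 0xffffffffu;
}
```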
2226
2227         PCRE2_INFO_FRAMESIZE
2228
2229       Return the size (in bytes) of the data frames that are used to remember
2230       backtracking positions when the pattern is processed  by  pcre2_match()
2231       without  the  use  of  JIT. The third argument should point to a size_t
2232       variable. The frame size depends on the number of capturing parentheses
2233       in the pattern. Each additional capture group adds two PCRE2_SIZE vari-
2234       ables.
2235
2236         PCRE2_INFO_HASBACKSLASHC
2237
2238       Return 1 if the pattern contains any instances of \C, otherwise 0.  The
2239       third argument should point to a uint32_t variable.
2240
2241         PCRE2_INFO_HASCRORLF
2242
2243       Return  1  if  the  pattern  contains any explicit matches for CR or LF
2244       characters, otherwise 0. The third argument should point to a  uint32_t
2245       variable.  An explicit match is either a literal CR or LF character, or
2246       \r or \n or one of the  equivalent  hexadecimal  or  octal  escape  se-
2247       quences.
2248
2249         PCRE2_INFO_HEAPLIMIT
2250
2251       If the pattern set a heap memory limit by including an item of the form
2252       (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2253       ment should point to a uint32_t integer. If no such value has been set,
2254       the call to pcre2_pattern_info() returns the  error  PCRE2_ERROR_UNSET.
2255       Note  that  this  limit will only be used during matching if it is less
2256       than the limit set or defaulted by the caller of the match function.
2257
2258         PCRE2_INFO_JCHANGED
2259
2260       Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
2261       otherwise  0.  The  third argument should point to a uint32_t variable.
2262       (?J) and (?-J) set and unset the local PCRE2_DUPNAMES  option,  respec-
2263       tively.
2264
2265         PCRE2_INFO_JITSIZE
2266
2267       If  the  compiled  pattern was successfully processed by pcre2_jit_com-
2268       pile(), return the size of the  JIT  compiled  code,  otherwise  return
2269       zero. The third argument should point to a size_t variable.
2270
2271         PCRE2_INFO_LASTCODETYPE
2272
2273       Returns  1 if there is a rightmost literal code unit that must exist in
2274       any matched string, other than at its start. The third argument  should
2275       point to a uint32_t variable. If there is no such value, 0 is returned.
2276       When 1 is returned, the code unit value itself can be  retrieved  using
2277       PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
2278       recorded only if it follows something of variable length. For  example,
2279       for  the pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned
2280       from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value  is
2281       0.
2282
2283         PCRE2_INFO_LASTCODEUNIT
2284
2285       Return  the value of the rightmost literal code unit that must exist in
2286       any matched string, other than  at  its  start,  for  a  pattern  where
2287       PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2288       ment should point to a uint32_t variable.
2289
2290         PCRE2_INFO_MATCHEMPTY
2291
2292       Return 1 if the pattern might match an empty string, otherwise  0.  The
2293       third argument should point to a uint32_t variable. When a pattern con-
2294       tains recursive subroutine calls it is not always possible to determine
2295       whether or not it can match an empty string. PCRE2 takes a cautious ap-
2296       proach and returns 1 in such cases.
2297
2298         PCRE2_INFO_MATCHLIMIT
2299
2300       If the pattern set a match limit by  including  an  item  of  the  form
2301       (*LIMIT_MATCH=nnnn)  at the start, the value is returned. The third ar-
2302       gument should point to a uint32_t integer. If no such  value  has  been
2303       set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UN-
2304       SET. Note that this limit will only be used during matching  if  it  is
2305       less  than  the limit set or defaulted by the caller of the match func-
2306       tion.
2307
2308         PCRE2_INFO_MAXLOOKBEHIND
2309
2310       A lookbehind assertion moves back a certain number of  characters  (not
2311       code  units)  when  it starts to process each of its branches. This re-
2312       quest returns the largest of these backward moves. The  third  argument
2313       should point to a uint32_t integer. The simple assertions \b and \B re-
2314       quire a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND  to
2315       return  1  in  the absence of anything longer. \A also registers a one-
2316       character lookbehind, though it does not actually inspect the  previous
2317       character.
2318
2319       Note that this information is useful for multi-segment matching only if
2320       the pattern contains no nested lookbehinds. For  example,  the  pattern
2321       (?<=a(?<=ba)c)  returns  a maximum lookbehind of 2, but when it is pro-
2322       cessed, the first lookbehind moves back by two characters, matches  one
2323       character,  then  the  nested lookbehind also moves back by two charac-
2324       ters. This puts the matching point three characters earlier than it was
2325       at  the start.  PCRE2_INFO_MAXLOOKBEHIND is really only useful as a de-
2326       bugging tool. See the pcre2partial documentation for  a  discussion  of
2327       multi-segment matching.
2328
2329         PCRE2_INFO_MINLENGTH
2330
2331       If  a  minimum  length  for  matching subject strings was computed, its
2332       value is returned. Otherwise the returned value is 0. This value is not
2333       computed  when PCRE2_NO_START_OPTIMIZE is set. The value is a number of
2334       characters, which in UTF mode may be different from the number of  code
2335       units.  The  third  argument  should  point to a uint32_t variable. The
2336       value is a lower bound to the length of any matching string. There  may
2337       not  be  any  strings  of that length that do actually match, but every
2338       string that does match is at least that long.
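
       The difference between characters and code units can be illustrated
       with a small sketch for the 8-bit library. It does not call PCRE2: the
       function names are hypothetical, minlength stands for the value
       obtained via PCRE2_INFO_MINLENGTH, and the subject is assumed to be
       valid UTF-8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Count the characters in a valid UTF-8 string. Every byte that is not a
 * continuation byte (bit pattern 10xxxxxx) starts a new character. */
static size_t utf8_char_count(const uint8_t *s, size_t length)
{
    size_t count = 0;
    assert(s != NULL || length == 0);
    for (size_t i = 0; i < length; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical pre-filter: false means the subject has fewer characters
 * than the pattern's minimum length, so it cannot match. minlength is the
 * uint32_t value returned for PCRE2_INFO_MINLENGTH; 0 rules nothing out. */
static bool long_enough_to_match(const uint8_t *s, size_t length,
                                 uint32_t minlength)
{
    return utf8_char_count(s, length) >= minlength;
}
```

       A three-byte subject such as "n" followed by the two-byte encoding of
       e-acute contains two characters, so it passes a minimum length of 2
       but not of 3.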
2339
2340         PCRE2_INFO_NAMECOUNT
2341         PCRE2_INFO_NAMEENTRYSIZE
2342         PCRE2_INFO_NAMETABLE
2343
2344       PCRE2 supports the use of named as well as numbered capturing parenthe-
2345       ses.  The names are just an additional way of identifying the parenthe-
2346       ses, which still acquire numbers. Several convenience functions such as
2347       pcre2_substring_get_byname()  are provided for extracting captured sub-
2348       strings by name. It is also possible to extract the data  directly,  by
2349       first  converting  the  name to a number in order to access the correct
2350       pointers in the output vector (described with pcre2_match() below).  To
2351       do the conversion, you need to use the name-to-number map, which is de-
2352       scribed by these three values.
2353
2354       The map consists of a number of  fixed-size  entries.  PCRE2_INFO_NAME-
2355       COUNT  gives  the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
2356       the size of each entry in code units; both of these return  a  uint32_t
2357       value. The entry size depends on the length of the longest name.
2358
2359       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
2360       This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li-
2361       brary,  the first two bytes of each entry are the number of the captur-
2362       ing parenthesis, most significant byte first. In  the  16-bit  library,
2363       the  pointer  points  to 16-bit code units, the first of which contains
2364       the parenthesis number. In the 32-bit library, the  pointer  points  to
2365       32-bit  code units, the first of which contains the parenthesis number.
2366       The rest of the entry is the corresponding name, zero terminated.
2367
2368       The names are in alphabetical order. If (?| is used to create  multiple
2369       capture groups with the same number, as described in the section on du-
2370       plicate group numbers in the pcre2pattern page, the groups may be given
2371       the  same  name,  but  there  is only one entry in the table. Different
2372       names for groups of the same number are not permitted.
2373
2374       Duplicate names for capture groups with different numbers  are  permit-
2375       ted, but only if PCRE2_DUPNAMES is set. They appear in the table in the
2376       order in which they were found in the pattern. In the  absence  of  (?|
2377       this  is  the  order of increasing number; when (?| is used this is not
2378       necessarily the case because later capture groups may have  lower  num-
2379       bers.
2380
2381       As  a  simple  example of the name/number table, consider the following
2382       pattern after compilation by the 8-bit library  (assume  PCRE2_EXTENDED
2383       is set, so white space - including newlines - is ignored):
2384
2385         (?<date> (?<year>(\d\d)?\d\d) -
2386         (?<month>\d\d) - (?<day>\d\d) )
2387
2388       There are four named capture groups, so the table has four entries, and
2389       each entry in the table is eight bytes long. The table is  as  follows,
2390       with non-printing bytes shown in hexadecimal, and undefined bytes shown
2391       as ??:
2392
2393         00 01 d  a  t  e  00 ??
2394         00 05 d  a  y  00 ?? ??
2395         00 04 m  o  n  t  h  00
2396         00 02 y  e  a  r  00 ??
2397
2398       When writing code to extract data from named capture groups  using  the
2399       name-to-number  map,  remember that the length of the entries is likely
2400       to be different for each compiled pattern.
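
       A minimal sketch of a name-to-number lookup for the 8-bit layout
       described above. It does not call PCRE2: in a real program the table
       pointer, entry count, and entry size would come from
       pcre2_pattern_info() with PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT,
       and PCRE2_INFO_NAMEENTRYSIZE. The names group_number_for_name and
       example_table are illustrative only; the table reproduces the four
       entries shown above, with 0xEE standing in for the undefined bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The table from the example above: four entries of eight bytes each.
 * 0xEE marks the bytes whose contents are undefined. */
static const uint8_t example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0xEE,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0xEE, 0xEE,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0xEE
};

/* Return the group number recorded for name, or 0 if the name is absent.
 * If duplicate names are present, the first matching entry wins. */
static uint32_t group_number_for_name(const uint8_t *table, uint32_t count,
                                      uint32_t entry_size, const char *name)
{
    /* Two bytes of group number plus at least a terminating zero. */
    assert(entry_size >= 3);
    for (uint32_t i = 0; i < count; i++) {
        /* The stride is the entry size, which varies between patterns. */
        const uint8_t *entry = table + (size_t)i * entry_size;
        /* Bytes 0 and 1: group number, most significant byte first. */
        uint32_t number = ((uint32_t)entry[0] << 8) | entry[1];
        /* Bytes 2 onwards: the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return 0;
}
```

       With the example table, looking up "month" yields 4 and looking up
       "day" yields 5. The sketch uses a linear scan for simplicity; the
       library function pcre2_substring_number_from_name() performs the
       lookup for you in real code.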
2401
2402         PCRE2_INFO_NEWLINE
2403
2404       The output is one of the following uint32_t values:
2405
2406         PCRE2_NEWLINE_CR       Carriage return (CR)
2407         PCRE2_NEWLINE_LF       Linefeed (LF)
2408         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
2409         PCRE2_NEWLINE_ANY      Any Unicode line ending
2410         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
2411         PCRE2_NEWLINE_NUL      The NUL character (binary zero)
2412
2413       This identifies the character sequence that will be recognized as mean-
2414       ing "newline" while matching.
2415
2416         PCRE2_INFO_SIZE
2417
2418       Return  the  size  of  the compiled pattern in bytes (for all three li-
2419       braries). The third argument should point to a  size_t  variable.  This
2420       value  includes  the  size  of the general data block that precedes the
2421       code units of the compiled pattern itself. The value that is used  when
2422       pcre2_compile()  is  getting memory in which to place the compiled pat-
2423       tern may be slightly larger than the value returned by this option, be-
2424       cause  there  are  cases where the code that calculates the size has to
2425       over-estimate. Processing a pattern with the JIT compiler does not  al-
2426       ter the value returned by this option.
2427
2428
2429INFORMATION ABOUT A PATTERN'S CALLOUTS
2430
2431       int pcre2_callout_enumerate(const pcre2_code *code,
2432         int (*callback)(pcre2_callout_enumerate_block *, void *),
2433         void *user_data);
2434
2435       A script language that supports the use of string arguments in callouts
2436       might like to scan all the callouts in a  pattern  before  running  the
2437       match. This can be done by calling pcre2_callout_enumerate(). The first
2438       argument is a pointer to a compiled pattern, the  second  points  to  a
2439       callback  function,  and the third is arbitrary user data. The callback
2440       function is called for every callout in the pattern  in  the  order  in
2441       which they appear. Its first argument is a pointer to a callout enumer-
2442       ation block, and its second argument is the user_data  value  that  was
2443       passed  to  pcre2_callout_enumerate(). The contents of the callout enu-
2444       meration block are described in the pcre2callout  documentation,  which
2445       also gives further details about callouts.
2446
2447
2448SERIALIZATION AND PRECOMPILING
2449
2450       It  is  possible  to  save  compiled patterns on disc or elsewhere, and
2451       reload them later, subject to a number of  restrictions.  The  host  on
2452       which  the  patterns  are  reloaded must be running the same version of
2453       PCRE2, with the same code unit width, and must also have the same endi-
2454       anness,  pointer  width,  and PCRE2_SIZE type. Before compiled patterns
2455       can be saved, they must be converted to a "serialized" form,  which  in
2456       the  case of PCRE2 is really just a bytecode dump.  The functions whose
2457       names begin with pcre2_serialize_ are used for converting to  and  from
2458       the  serialized form. They are described in the pcre2serialize documen-
2459       tation. Note that PCRE2 serialization does not  convert  compiled  pat-
2460       terns to an abstract format like Java or .NET serialization.
2461
2462
2463THE MATCH DATA BLOCK
2464
2465       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
2466         pcre2_general_context *gcontext);
2467
2468       pcre2_match_data *pcre2_match_data_create_from_pattern(
2469         const pcre2_code *code, pcre2_general_context *gcontext);
2470
2471       void pcre2_match_data_free(pcre2_match_data *match_data);
2472
2473       Information  about  a  successful  or unsuccessful match is placed in a
2474       match data block, which is an opaque  structure  that  is  accessed  by
2475       function  calls.  In particular, the match data block contains a vector
2476       of offsets into the subject string that define the matched parts of the
2477       subject. This is known as the ovector.
2478
2479       Before  calling  pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
2480       you must create a match data block by calling one of the creation func-
2481       tions  above.  For pcre2_match_data_create(), the first argument is the
2482       number of pairs of offsets in the ovector.
2483
2484       When using pcre2_match(), one pair of offsets is required  to  identify
2485       the  string that matched the whole pattern, with an additional pair for
2486       each captured substring. For example, a value of 4 creates enough space
2487       to  record  the matched portion of the subject plus three captured sub-
2488       strings.
2489
2490       When using pcre2_dfa_match() there may be multiple  matched  substrings
2491       of  different  lengths  at  the  same point in the subject. The ovector
2492       should be made large enough to hold as many as are expected.
2493
2494       A minimum of 1 pair is imposed by pcre2_match_data_create(), so it is
2495       always possible to return the overall matched string in the case of
2496       pcre2_match() or the longest match in the case of pcre2_dfa_match().
2497       The maximum number of pairs is 65535; if the
2498       first argument of pcre2_match_data_create() is greater than this, 65535
2499       is used.
2500
2501       The second argument of pcre2_match_data_create() is a pointer to a gen-
2502       eral context, which can specify custom memory management for  obtaining
2503       the memory for the match data block. If you are not using custom memory
2504       management, pass NULL, which causes malloc() to be used.
2505
2506       For pcre2_match_data_create_from_pattern(), the  first  argument  is  a
2507       pointer to a compiled pattern. The ovector is created to be exactly the
2508       right size to hold all the substrings  a  pattern  might  capture  when
2509       matched using pcre2_match(). You should not use this call when matching
2510       with pcre2_dfa_match(). The second argument is again  a  pointer  to  a
2511       general  context, but in this case if NULL is passed, the memory is ob-
2512       tained using the same allocator that was used for the compiled  pattern
2513       (custom or default).
2514
2515       A  match  data block can be used many times, with the same or different
2516       compiled patterns. You can extract information from a match data  block
2517       after  a  match  operation  has  finished, using functions that are de-
2518       scribed in the sections on matched strings and other match data below.
2519
2520       When a call of pcre2_match() fails, valid  data  is  available  in  the
2521       match  block  only  when  the  error  is PCRE2_ERROR_NOMATCH, PCRE2_ER-
2522       ROR_PARTIAL, or one of the error codes for an invalid UTF  string.  Ex-
2523       actly what is available depends on the error, and is detailed below.
2524
2525       When  one of the matching functions is called, pointers to the compiled
2526       pattern and the subject string are set in the match data block so  that
2527       they  can  be referenced by the extraction functions after a successful
2528       match. After running a match, you must not free a compiled pattern or a
2529       subject  string until after all operations on the match data block (for
2530       that match) have taken place,  unless,  in  the  case  of  the  subject
2531       string,  you  have used the PCRE2_COPY_MATCHED_SUBJECT option, which is
2532       described in the section entitled "Option bits for  pcre2_match()"  be-
2533       low.
2534
2535       When  a match data block itself is no longer needed, it should be freed
2536       by calling pcre2_match_data_free(). If this function is called  with  a
2537       NULL argument, it returns immediately, without doing anything.
2538
2539
2540MATCHING A PATTERN: THE TRADITIONAL FUNCTION
2541
2542       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
2543         PCRE2_SIZE length, PCRE2_SIZE startoffset,
2544         uint32_t options, pcre2_match_data *match_data,
2545         pcre2_match_context *mcontext);
2546
2547       The  function pcre2_match() is called to match a subject string against
2548       a compiled pattern, which is passed in the code argument. You can  call
2549       pcre2_match() with the same code argument as many times as you like, in
2550       order to find multiple matches in the subject string or to  match  dif-
2551       ferent subject strings with the same pattern.
2552
2553       This  function is the main matching facility of the library, and it op-
2554       erates in a Perl-like manner. For specialist use there is also  an  al-
2555       ternative  matching  function,  which is described below in the section
2556       about the pcre2_dfa_match() function.
2557
2558       Here is an example of a simple call to pcre2_match():
2559
2560         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
2561         int rc = pcre2_match(
2562           re,             /* result of pcre2_compile() */
2563           "some string",  /* the subject string */
2564           11,             /* the length of the subject string */
2565           0,              /* start at offset 0 in the subject */
2566           0,              /* default options */
2567           md,             /* the match data block */
2568           NULL);          /* a match context; NULL means use defaults */
2569
2570       If the subject string is zero-terminated, the length can  be  given  as
2571       PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
2572       common matching parameters are to be changed. For details, see the sec-
2573       tion on the match context above.
2574
2575   The string to be matched by pcre2_match()
2576
2577       The  subject string is passed to pcre2_match() as a pointer in subject,
2578       a length in length, and a starting offset in  startoffset.  The  length
2579       and  offset  are  in  code units, not characters.  That is, they are in
2580       bytes for the 8-bit library, 16-bit code units for the 16-bit  library,
2581       and  32-bit  code units for the 32-bit library, whether or not UTF pro-
2582       cessing is enabled. As a special case, if subject is NULL and length is
2583       zero,  the  subject is assumed to be an empty string. If length is non-
2584       zero, an error occurs if subject is NULL.
2585
2586       If startoffset is greater than the length of the subject, pcre2_match()
2587       returns  PCRE2_ERROR_BADOFFSET.  When  the starting offset is zero, the
2588       search for a match starts at the beginning of the subject, and this  is
2589       by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2590       set must point to the start of a character, or to the end of  the  sub-
2591       ject  (in  UTF-32 mode, one code unit equals one character, so all off-
2592       sets are valid). Like the pattern string, the subject may  contain  bi-
2593       nary zeros.
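
       The offset rules above can be sketched for the 8-bit library in UTF-8
       mode as follows. This is a hypothetical helper, not part of PCRE2: it
       inspects only the byte at the offset and does not validate the rest
       of the subject.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if startoffset is acceptable for a UTF-8 subject: it must
 * not lie beyond the end, and it must point either to the end of the
 * subject or to the first byte of a character. */
static bool utf8_start_offset_ok(const uint8_t *subject, size_t length,
                                 size_t startoffset)
{
    assert(subject != NULL || length == 0);
    if (startoffset > length)
        return false;   /* pcre2_match() gives PCRE2_ERROR_BADOFFSET here */
    if (startoffset == length)
        return true;    /* the end of the subject is always allowed */
    /* A continuation byte (10xxxxxx) is the middle of a character. */
    return (subject[startoffset] & 0xC0) != 0x80;
}
```

       For the four-byte subject "a", e-acute (two bytes), "b", offsets 0, 1,
       3, and 4 are accepted, while offset 2 falls inside the two-byte
       character and offset 5 is beyond the end.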
2594
2595       A  non-zero  starting offset is useful when searching for another match
2596       in the same subject by calling pcre2_match()  again  after  a  previous
2597       success.   Setting  startoffset  differs  from passing over a shortened
2598       string and setting PCRE2_NOTBOL in the case of a  pattern  that  begins
2599       with any kind of lookbehind. For example, consider the pattern
2600
2601         \Biss\B
2602
2603       which  finds  occurrences  of "iss" in the middle of words. (\B matches
2604       only if the current position in the subject is not  a  word  boundary.)
2605       When   applied   to   the   string  "Mississippi"  the  first  call  to
2606       pcre2_match() finds the first occurrence. If  pcre2_match()  is  called
2607       again with just the remainder of the subject, namely "issippi", it does
2608       not match, because \B is always false at  the  start  of  the  subject,
2609       which  is  deemed  to  be a word boundary. However, if pcre2_match() is
2610       passed the entire string again, but with startoffset set to 4, it finds
2611       the  second  occurrence  of "iss" because it is able to look behind the
2612       starting point to discover that it is preceded by a letter.
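
       The effect can be reproduced without the library by modelling \B
       directly. The sketch below assumes ASCII word characters (letters,
       digits, and underscore) in the "C" locale; the set that PCRE2 really
       uses depends on its character tables and on PCRE2_UCP. The function
       names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed definition of a word character for this sketch. */
static bool is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* True when \B would succeed at pos: the characters on either side are
 * both word characters or both non-word characters. Anything outside
 * the subject counts as a non-word character. */
static bool not_word_boundary(const char *subject, size_t length, size_t pos)
{
    assert(pos <= length);
    bool before = pos > 0 && is_word_char(subject[pos - 1]);
    bool after = pos < length && is_word_char(subject[pos]);
    return before == after;
}
```

       At position 0 of "issippi" there is nothing before the "i", so the
       function reports a boundary and \B fails. At position 4 of
       "Mississippi" the preceding "s" is visible, so \B succeeds.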
2613
2614       Finding all the matches in a subject is tricky  when  the  pattern  can
2615       match an empty string. It is possible to emulate Perl's /g behaviour by
2616       first  trying  the  match  again  at  the   same   offset,   with   the
2617       PCRE2_NOTEMPTY_ATSTART  and  PCRE2_ANCHORED  options,  and then if that
2618       fails, advancing the starting  offset  and  trying  an  ordinary  match
2619       again.  There  is  some  code  that  demonstrates how to do this in the
2620       pcre2demo sample program. In the most general case, you have  to  check
2621       to  see  if the newline convention recognizes CRLF as a newline, and if
2622       so, and the current character is CR followed by LF, advance the  start-
2623       ing offset by two characters instead of one.
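
       The advance step described above can be sketched as a stand-alone
       function for the 8-bit library. This is a simplified illustration,
       not the pcre2demo code. The function name and both flags are
       assumptions: crlf_is_newline would be derived from the
       PCRE2_INFO_NEWLINE value (CRLF, ANYCRLF, or ANY), and utf8 from
       whether the pattern was compiled with PCRE2_UTF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Compute the next starting offset after the anchored, not-empty retry
 * at offset has failed. The caller must already have stopped if offset
 * is at the end of the subject. */
static size_t advance_start_offset(const uint8_t *subject, size_t length,
                                   size_t offset, bool crlf_is_newline,
                                   bool utf8)
{
    assert(offset < length);
    size_t next = offset + 1;               /* normally one code unit */
    if (crlf_is_newline && next < length &&
        subject[offset] == '\r' && subject[next] == '\n') {
        next++;                             /* step over CR and LF together */
    } else if (utf8) {
        /* Skip continuation bytes so that next is a character start. */
        while (next < length && (subject[next] & 0xC0) == 0x80)
            next++;
    }
    return next;
}
```

       An ordinary (unanchored) match is then attempted at the returned
       offset.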
2624
2625       If a non-zero starting offset is passed when the pattern is anchored, a
2626       single attempt to match at the given offset is made. This can only suc-
2627       ceed  if  the  pattern does not require the match to be at the start of
2628       the subject. In other words, the anchoring must be the result  of  set-
2629       ting  the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL, not
2630       by starting the pattern with ^ or \A.
2631
2632   Option bits for pcre2_match()
2633
2634       The unused bits of the options argument for pcre2_match() must be zero.
2635       The    only    bits    that    may    be    set   are   PCRE2_ANCHORED,
2636       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL,  PCRE2_NO-
2637       TEOL,     PCRE2_NOTEMPTY,     PCRE2_NOTEMPTY_ATSTART,     PCRE2_NO_JIT,
2638       PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and  PCRE2_PARTIAL_SOFT.  Their
2639       action is described below.
2640
2641       Setting  PCRE2_ANCHORED  or PCRE2_ENDANCHORED at match time is not sup-
2642       ported by the just-in-time (JIT) compiler. If it is set,  JIT  matching
2643       is  disabled  and  the interpretive code in pcre2_match() is run. Apart
2644       from PCRE2_NO_JIT (obviously), the remaining options are supported  for
2645       JIT matching.
2646
2647         PCRE2_ANCHORED
2648
2649       The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
2650       matching position. If a pattern was compiled  with  PCRE2_ANCHORED,  or
2651       turned  out to be anchored by virtue of its contents, it cannot be made
2652       unanchored at matching time. Note that setting the option at match time
2653       disables JIT matching.
2654
2655         PCRE2_COPY_MATCHED_SUBJECT
2656
2657       By  default,  a  pointer to the subject is remembered in the match data
2658       block so that, after a successful match, it can be  referenced  by  the
2659       substring  extraction  functions.  This means that the subject's memory
2660       must not be freed until all such operations are complete. For some  ap-
2661       plications  where the lifetime of the subject string is not guaranteed,
2662       it may be necessary to make a copy of the subject  string,  but  it  is
2663       wasteful  to do this unless the match is successful. After a successful
2664       match, if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is copied  and
2665       the  new  pointer  is remembered in the match data block instead of the
2666       original subject pointer. The memory allocator that was  used  for  the
2667       match  block  itself  is  used.  The  copy  is automatically freed when
2668       pcre2_match_data_free() is called to free the match data block.  It  is
2669       also automatically freed if the match data block is re-used for another
2670       match operation.
2671
2672         PCRE2_ENDANCHORED
2673
2674       If the PCRE2_ENDANCHORED option is set, any string  that  pcre2_match()
2675       matches  must be right at the end of the subject string. Note that set-
2676       ting the option at match time disables JIT matching.
2677
2678         PCRE2_NOTBOL
2679
2680       This option specifies that the first character of the subject string is
2681       not the beginning of a line, so the circumflex metacharacter should not
2682       match before it. Setting this without  having  set  PCRE2_MULTILINE  at
2683       compile time causes circumflex never to match. This option affects only
2684       the behaviour of the circumflex metacharacter. It does not affect \A.
2685
2686         PCRE2_NOTEOL
2687
2688       This option specifies that the end of the subject string is not the end
2689       of  a line, so the dollar metacharacter should not match it nor (except
2690       in multiline mode) a newline immediately before it. Setting this  with-
2691       out  having  set PCRE2_MULTILINE at compile time causes dollar never to
2692       match. This option affects only the behaviour of the dollar metacharac-
2693       ter. It does not affect \Z or \z.
2694
2695         PCRE2_NOTEMPTY
2696
2697       An empty string is not considered to be a valid match if this option is
2698       set. If there are alternatives in the pattern, they are tried.  If  all
2699       the  alternatives  match  the empty string, the entire match fails. For
2700       example, if the pattern
2701
2702         a?b?
2703
2704       is applied to a string not beginning with "a" or  "b",  it  matches  an
2705       empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
2706       match is not valid, so pcre2_match() searches further into  the  string
2707       for occurrences of "a" or "b".
2708
2709         PCRE2_NOTEMPTY_ATSTART
2710
2711       This  is  like PCRE2_NOTEMPTY, except that it locks out an empty string
2712       match only at the first matching position, that is, at the start of the
2713       subject  plus  the  starting offset. An empty string match later in the
2714       subject is permitted.  If the pattern is anchored, such a match can oc-
2715       cur only if the pattern contains \K.
2716
2717         PCRE2_NO_JIT
2718
2719       By   default,   if   a  pattern  has  been  successfully  processed  by
2720       pcre2_jit_compile(), JIT is automatically used  when  pcre2_match()  is
2721       called  with  options  that JIT supports. Setting PCRE2_NO_JIT disables
2722       the use of JIT; it forces matching to be done by the interpreter.
2723
2724         PCRE2_NO_UTF_CHECK
2725
2726       When PCRE2_UTF is set at compile time, the validity of the subject as a
2727       UTF   string   is   checked  unless  PCRE2_NO_UTF_CHECK  is  passed  to
2728       pcre2_match() or PCRE2_MATCH_INVALID_UTF was passed to pcre2_compile().
2729       The latter special case is discussed in detail in the pcre2unicode doc-
2730       umentation.
2731
2732       In the default case, if a non-zero starting offset is given, the  check
2733       is  applied  only  to  that part of the subject that could be inspected
2734       during matching, and there is a check that the starting  offset  points
2735       to  the first code unit of a character or to the end of the subject. If
2736       there are no lookbehind assertions in the pattern, the check starts  at
2737       the starting offset.  Otherwise, it starts at the length of the longest
2738       lookbehind before the starting offset, or at the start of  the  subject
2739       if  there are not that many characters before the starting offset. Note
2740       that the sequences \b and \B are one-character lookbehinds.
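
       The rule for where the check begins can be sketched for UTF-8 as
       below. This is an illustration of the description above, not the
       library's own code; the function name is hypothetical, and
       max_lookbehind stands for the value reported by
       PCRE2_INFO_MAXLOOKBEHIND.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the byte offset at which a UTF-8 validity check would begin:
 * max_lookbehind characters before startoffset, or the start of the
 * subject if there are fewer characters than that. */
static size_t utf8_check_start(const uint8_t *subject, size_t startoffset,
                               uint32_t max_lookbehind)
{
    assert(subject != NULL || startoffset == 0);
    size_t pos = startoffset;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Step back over continuation bytes to the character's start. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```

       With no lookbehind the function returns startoffset unchanged. A
       pattern that uses \b or \B has a lookbehind of one character, so the
       check begins one character earlier.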
2741
2742       The check is carried out before any other processing takes place, and a
2743       negative  error  code is returned if the check fails. There are several
2744       UTF error codes for each code unit width,  corresponding  to  different
2745       problems  with  the code unit sequence. There are discussions about the
2746       validity of UTF-8 strings, UTF-16 strings, and UTF-32  strings  in  the
2747       pcre2unicode documentation.
2748
2749       If you know that your subject is valid, and you want to skip this check
2750       for performance reasons, you can set the PCRE2_NO_UTF_CHECK option when
2751       calling  pcre2_match().  You  might  want to do this for the second and
2752       subsequent calls to pcre2_match() if you are making repeated  calls  to
2753       find multiple matches in the same subject string.
2754
2755       Warning:  Unless  PCRE2_MATCH_INVALID_UTF was set at compile time, when
2756       PCRE2_NO_UTF_CHECK is set at match time the effect of  passing  an  in-
2757       valid string as a subject, or an invalid value of startoffset, is unde-
2758       fined.  Your program may crash or loop indefinitely or give  wrong  re-
2759       sults.
2760
2761         PCRE2_PARTIAL_HARD
2762         PCRE2_PARTIAL_SOFT
2763
2764       These options turn on the partial matching feature. A partial match oc-
2765       curs if the end of the subject  string  is  reached  successfully,  but
2766       there are not enough subject characters to complete the match. In addi-
2767       tion, either at least one character must have  been  inspected  or  the
2768       pattern  must  contain  a  lookbehind,  or the pattern must be one that
2769       could match an empty string.
2770
2771       If this situation arises when PCRE2_PARTIAL_SOFT  (but  not  PCRE2_PAR-
2772       TIAL_HARD) is set, matching continues by testing any remaining alterna-
2773       tives. Only if no complete match can be  found  is  PCRE2_ERROR_PARTIAL
2774       returned  instead  of  PCRE2_ERROR_NOMATCH.  In other words, PCRE2_PAR-
2775       TIAL_SOFT specifies that the caller is prepared  to  handle  a  partial
2776       match, but only if no complete match can be found.
2777
2778       If  PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
2779       case, if a partial match is found,  pcre2_match()  immediately  returns
2780       PCRE2_ERROR_PARTIAL,  without  considering  any  other alternatives. In
2781       other words, when PCRE2_PARTIAL_HARD is set, a partial match is
2782       considered to be more important than an alternative complete match.
2783
2784       There is a more detailed discussion of partial and multi-segment match-
2785       ing, with examples, in the pcre2partial documentation.
2786
2787
2788NEWLINE HANDLING WHEN MATCHING
2789
2790       When PCRE2 is built, a default newline convention is set; this is  usu-
2791       ally  the standard convention for the operating system. The default can
2792       be overridden in a compile context by calling  pcre2_set_newline().  It
2793       can  also be overridden by starting a pattern string with, for example,
2794       (*CRLF), as described in the section  on  newline  conventions  in  the
2795       pcre2pattern  page. During matching, the newline choice affects the be-
2796       haviour of the dot, circumflex, and dollar metacharacters. It may  also
2797       alter  the  way  the  match starting position is advanced after a match
2798       failure for an unanchored pattern.
2799
2800       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
2801       set  as  the  newline convention, and a match attempt for an unanchored
2802       pattern fails when the current starting position is at a CRLF sequence,
2803       and  the  pattern contains no explicit matches for CR or LF characters,
2804       the match position is advanced by two characters  instead  of  one,  in
2805       other words, to after the CRLF.
2806
2807       The above rule is a compromise that makes the most common cases work as
2808       expected. For example, if the pattern is .+A (and the PCRE2_DOTALL  op-
2809       tion  is  not set), it does not match the string "\r\nA" because, after
2810       failing at the start, it skips both the CR and the LF before  retrying.
2811       However,  the  pattern  [\r\n]A does match that string, because it con-
2812       tains an explicit CR or LF reference, and so advances only by one char-
2813       acter after the first failure.
2814
2815       An explicit match for CR or LF is either a literal appearance of one of
2816       those characters in the pattern, or one of the \r or \n  or  equivalent
2817       octal or hexadecimal escape sequences. Implicit matches such as [^X] do
2818       not count, nor does \s, even though it includes CR and LF in the  char-
2819       acters that it matches.
2820
2821       Notwithstanding  the above, anomalous effects may still occur when CRLF
2822       is a valid newline sequence and explicit \r or \n escapes appear in the
2823       pattern.
2824
2825
2826HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
2827
2828       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
2829
2830       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
2831
2832       In  general, a pattern matches a certain portion of the subject, and in
2833       addition, further substrings from the subject  may  be  picked  out  by
2834       parenthesized  parts  of  the  pattern.  Following the usage in Jeffrey
2835       Friedl's book, this is called "capturing"  in  what  follows,  and  the
2836       phrase  "capture  group" (Perl terminology) is used for a fragment of a
2837       pattern that picks out a substring. PCRE2 supports several other  kinds
2838       of parenthesized group that do not cause substrings to be captured. The
2839       pcre2_pattern_info() function can be used to find out how many  capture
2840       groups there are in a compiled pattern.
2841
2842       You  can  use  auxiliary functions for accessing captured substrings by
2843       number or by name, as described in sections below.
2844
2845       Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2846       ues,  called  the  ovector,  which  contains  the  offsets  of captured
2847       strings.  It  is  part  of  the  match  data   block.    The   function
2848       pcre2_get_ovector_pointer()  returns  the  address  of the ovector, and
2849       pcre2_get_ovector_count() returns the number of pairs of values it con-
2850       tains.
2851
2852       Within the ovector, the first in each pair of values is set to the off-
2853       set of the first code unit of a substring, and the second is set to the
2854       offset  of the first code unit after the end of a substring. These val-
2855       ues are always code unit offsets, not character offsets. That is,  they
2856       are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit li-
2857       brary, and 32-bit offsets in the 32-bit library.
2858
2859       After a partial match  (error  return  PCRE2_ERROR_PARTIAL),  only  the
2860       first  pair  of  offsets  (that is, ovector[0] and ovector[1]) are set.
2861       They identify the part of the subject that was partially  matched.  See
2862       the pcre2partial documentation for details of partial matching.
2863
2864       After  a  fully  successful match, the first pair of offsets identifies
2865       the portion of the subject string that was matched by the  entire  pat-
2866       tern.  The  next  pair is used for the first captured substring, and so
2867       on. The value returned by pcre2_match() is one more  than  the  highest
2868       numbered  pair  that  has been set. For example, if two substrings have
2869       been captured, the returned value is 3. If there are no  captured  sub-
2870       strings, the return value from a successful match is 1, indicating that
2871       just the first pair of offsets has been set.
2872
2873       If a pattern uses the \K escape sequence within a  positive  assertion,
2874       the reported start of a successful match can be greater than the end of
2875       the match.  For example, if the pattern  (?=ab\K)  is  matched  against
2876       "ab", the start and end offset values for the match are 2 and 0.
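
       As a minimal sketch of how an application might guard against this
       case, the hypothetical helper below computes the length of the whole
       match from the first offset pair and treats an inverted span as
       empty. Plain size_t stands in for PCRE2_SIZE so that the fragment
       compiles without the library; the function name is illustrative and
       is not part of PCRE2.

```c
#include <assert.h>
#include <stddef.h>

/* Length in code units of the overall match described by ovector[0] and
   ovector[1]. If \K inside a lookahead has pushed the start past the end,
   the span is treated as empty rather than producing a huge unsigned
   difference. */
size_t match_span_length(const size_t *ovector)
{
    assert(ovector != NULL);
    size_t start = ovector[0];
    size_t end = ovector[1];
    return (start > end) ? 0 : end - start;
}
```

       For the (?=ab\K) example above, the pair {2, 0} gives a length of 0.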
2877
2878       If  a  capture group is matched repeatedly within a single match opera-
2879       tion, it is the last portion of the subject that it matched that is re-
2880       turned.
2881
2882       If the ovector is too small to hold all the captured substring offsets,
2883       as much as possible is filled in, and the function returns a  value  of
2884       zero.  If captured substrings are not of interest, pcre2_match() may be
2885       called with a match data block whose ovector is of minimum length (that
2886       is, one pair).
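
       The return value rules described so far can be summarized in a small
       helper. This is only a sketch: the function name is hypothetical, rc
       is assumed to be the value returned by pcre2_match(), and oveccount
       is assumed to be the value from pcre2_get_ovector_count(). Negative
       returns, including a partial match, are deliberately not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Number of offset pairs that hold match results.
   rc > 0  : rc pairs were set (pair 0 is the whole match).
   rc == 0 : the ovector was too small, so every pair it has was filled.
   rc < 0  : treated here as "nothing to read". */
size_t pairs_available(int rc, size_t oveccount)
{
    assert(oveccount > 0);      /* an ovector always has at least one pair */
    if (rc < 0)
        return 0;
    if (rc == 0)
        return oveccount;
    return (size_t)rc;
}
```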
2887
2888       It  is  possible for capture group number n+1 to match some part of the
2889       subject when group n has not been used at  all.  For  example,  if  the
2890       string "abc" is matched against the pattern (a|(z))(bc) the return from
2891       the function is 4, and groups 1 and 3 are matched, but 2 is  not.  When
2892       this  happens,  both values in the offset pairs corresponding to unused
2893       groups are set to PCRE2_UNSET.
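
       A possible way to test whether a given group took part in the match
       is sketched below. SKETCH_UNSET is a local stand-in with the same
       definition that pcre2.h gives PCRE2_UNSET, and size_t stands in for
       PCRE2_SIZE, so that the fragment compiles on its own; group_is_set()
       is a hypothetical name. For simplicity a return of zero from
       pcre2_match() (ovector too small) is treated as "not set".

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PCRE2_UNSET, which is the all-ones PCRE2_SIZE value. */
#define SKETCH_UNSET (~(size_t)0)

/* Returns 1 if capture group n has a usable offset pair, 0 otherwise.
   rc is the positive value returned by pcre2_match(). */
int group_is_set(const size_t *ovector, int rc, unsigned n)
{
    assert(ovector != NULL);
    if (rc <= 0 || n >= (unsigned)rc)
        return 0;                /* beyond the highest pair that was set */
    return ovector[2 * n] != SKETCH_UNSET;
}
```

       With the (a|(z))(bc) example above, rc is 4 and the ovector holds
       {0,3, 0,1, UNSET,UNSET, 1,3}, so groups 1 and 3 are reported as set
       and group 2 is not.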
2894
2895       Offset values that correspond to unused groups at the end  of  the  ex-
2896       pression  are also set to PCRE2_UNSET. For example, if the string "abc"
2897       is matched against the pattern (abc)(x(yz)?)? groups 2 and  3  are  not
2898       matched.  The  return  from the function is 2, because the highest used
2899       capture group number is 1. The offsets for  the  second  and  third
2900       capture  groups  (assuming  the  vector is large enough, of course)
2901       are set to PCRE2_UNSET.
2902
2903       Elements in the ovector that do not correspond to capturing parentheses
2904       in the pattern are never changed. That is, if a pattern contains n cap-
2905       turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
2906       pcre2_match().  The  other  elements retain whatever values they previ-
2907       ously had. After a failed match attempt, the contents  of  the  ovector
2908       are unchanged.
2909
2910
2911OTHER INFORMATION ABOUT A MATCH
2912
2913       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
2914
2915       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
2916
2917       As  well as the offsets in the ovector, other information about a match
2918       is retained in the match data block and can be retrieved by  the  above
2919       functions  in  appropriate  circumstances.  If they are called at other
2920       times, the result is undefined.
2921
2922       After a successful match, a partial match (PCRE2_ERROR_PARTIAL),  or  a
2923       failure  to  match (PCRE2_ERROR_NOMATCH), a mark name may be available.
2924       The function pcre2_get_mark() can be called to access this name,  which
2925       can  be  specified  in  the  pattern by any of the backtracking control
2926       verbs, not just (*MARK). The same function applies to all the verbs. It
2927       returns a pointer to the zero-terminated name, which is within the com-
2928       piled pattern. If no name is available, NULL is returned. The length of
2929       the  name  (excluding  the terminating zero) is stored in the code unit
2930       that precedes the name. You should use this length instead  of  relying
2931       on the terminating zero if the name might contain a binary zero.
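
       Reading the stored length might look like the sketch below. It
       assumes the 8-bit library, where the pointer returned by
       pcre2_get_mark() addresses 8-bit code units; uint8_t is used in place
       of the library's own types so that the fragment is self-contained,
       and mark_name_length() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the length of a mark name, taken from the code unit that
   precedes the name rather than from the terminating zero. */
size_t mark_name_length(const uint8_t *mark)
{
    assert(mark != NULL);
    return (size_t)mark[-1];
}
```

       For a name stored as the code units 3, 'A', 0, 'B', 0, the helper
       reports 3, whereas scanning for a zero would stop after one code
       unit.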
2932
2933       After  a  successful  match, the name that is returned is the last mark
2934       name encountered on the matching path through the pattern. Instances of
2935       backtracking  verbs  without  names do not count. Thus, for example, if
2936       the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned.
2937       After a "no match" or a partial match, the last encountered name is re-
2938       turned. For example, consider this pattern:
2939
2940         ^(*MARK:A)((*MARK:B)a|b)c
2941
2942       When it matches "bc", the returned name is A. The B mark is  "seen"  in
2943       the  first  branch of the group, but it is not on the matching path. On
2944       the other hand, when this pattern fails to  match  "bx",  the  returned
2945       name is B.
2946
2947       Warning:  By  default, certain start-of-match optimizations are used to
2948       give a fast "no match" result in some situations. For example,  if  the
2949       anchoring  is removed from the pattern above, there is an initial check
2950       for the presence of "c" in the subject before running the matching  en-
2951       gine. This check fails for "bx", causing a match failure without seeing
2952       any marks. You can disable the start-of-match optimizations by  setting
2953       the  PCRE2_NO_START_OPTIMIZE  option for pcre2_compile() or by starting
2954       the pattern with (*NO_START_OPT).
2955
2956       After a successful match, a partial match, or one of  the  invalid  UTF
2957       errors  (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
2958       be called. After a successful or partial match it returns the code unit
2959       offset  of  the character at which the match started. For a non-partial
2960       match, this can be different to the value of ovector[0] if the  pattern
2961       contains  the  \K escape sequence. After a partial match, however, this
2962       value is always the same as ovector[0] because \K does not  affect  the
2963       result of a partial match.
2964
2965       After  a UTF check failure, pcre2_get_startchar() can be used to obtain
2966       the code unit offset of the invalid UTF character. Details are given in
2967       the pcre2unicode page.
2968
2969
2970ERROR RETURNS FROM pcre2_match()
2971
2972       If  pcre2_match() fails, it returns a negative number. This can be con-
2973       verted to a text string by calling the pcre2_get_error_message()  func-
2974       tion  (see  "Obtaining a textual error message" below).  Negative error
2975       codes are also returned by other functions,  and  are  documented  with
2976       them.  The codes are given names in the header file. If UTF checking is
2977       in force and an invalid UTF subject string is detected, one of a number
2978       of  UTF-specific negative error codes is returned. Details are given in
2979       the pcre2unicode page. The following are the other errors that  may  be
2980       returned by pcre2_match():
2981
2982         PCRE2_ERROR_NOMATCH
2983
2984       The subject string did not match the pattern.
2985
2986         PCRE2_ERROR_PARTIAL
2987
2988       The  subject  string did not match, but it did match partially. See the
2989       pcre2partial documentation for details of partial matching.
2990
2991         PCRE2_ERROR_BADMAGIC
2992
2993       PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
2994       to  catch  the case when it is passed a junk pointer. This is the error
2995       that is returned when the magic number is not present.
2996
2997         PCRE2_ERROR_BADMODE
2998
2999       This error is given when a compiled pattern is passed to a function  in
3000       a  library  of a different code unit width, for example, a pattern com-
3001       piled by the 8-bit library is passed to  a  16-bit  or  32-bit  library
3002       function.
3003
3004         PCRE2_ERROR_BADOFFSET
3005
3006       The value of startoffset was greater than the length of the subject.
3007
3008         PCRE2_ERROR_BADOPTION
3009
3010       An unrecognized bit was set in the options argument.
3011
3012         PCRE2_ERROR_BADUTFOFFSET
3013
3014       The UTF code unit sequence that was passed as a subject was checked and
3015       found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but  the
3016       value  of startoffset did not point to the beginning of a UTF character
3017       or the end of the subject.
3018
3019         PCRE2_ERROR_CALLOUT
3020
3021       This error is never generated by pcre2_match() itself. It  is  provided
3022       for  use  by  callout  functions  that  want  to cause pcre2_match() or
3023       pcre2_callout_enumerate() to return a distinctive error code.  See  the
3024       pcre2callout documentation for details.
3025
3026         PCRE2_ERROR_DEPTHLIMIT
3027
3028       The nested backtracking depth limit was reached.
3029
3030         PCRE2_ERROR_HEAPLIMIT
3031
3032       The heap limit was reached.
3033
3034         PCRE2_ERROR_INTERNAL
3035
3036       An  unexpected  internal error has occurred. This error could be caused
3037       by a bug in PCRE2 or by overwriting of the compiled pattern.
3038
3039         PCRE2_ERROR_JIT_STACKLIMIT
3040
3041       This error is returned when a pattern that was successfully processed
3042       by the JIT compiler is being matched, but the memory available for the
3043       just-in-time processing stack is not large enough. See  the  pcre2jit
3044       documentation for more details.
3045
3046         PCRE2_ERROR_MATCHLIMIT
3047
3048       The backtracking match limit was reached.
3049
3050         PCRE2_ERROR_NOMEMORY
3051
3052       Heap  memory  is  used  to  remember backtracking points. This error is
3053       given when the memory allocation function (default  or  custom)  fails.
3054       Note  that  a  different  error, PCRE2_ERROR_HEAPLIMIT, is given if the
3055       amount of memory needed exceeds the heap limit. PCRE2_ERROR_NOMEMORY is
3056       also  returned  if PCRE2_COPY_MATCHED_SUBJECT is set and memory alloca-
3057       tion fails.
3058
3059         PCRE2_ERROR_NULL
3060
3061       Either the code, subject, or match_data argument was passed as NULL.
3062
3063         PCRE2_ERROR_RECURSELOOP
3064
3065       This error is returned when  pcre2_match()  detects  a  recursion  loop
3066       within  the  pattern. Specifically, it means that either the whole pat-
3067       tern or a capture group has been called recursively for the second time
3068       at  the  same position in the subject string. Some simple patterns that
3069       might do this are detected and faulted at compile time, but  more  com-
3070       plicated  cases,  in particular mutual recursions between two different
3071       groups, cannot be detected until matching is attempted.
3072
3073
3074OBTAINING A TEXTUAL ERROR MESSAGE
3075
3076       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
3077         PCRE2_SIZE bufflen);
3078
3079       A text message for an error code  from  any  PCRE2  function  (compile,
3080       match,  or  auxiliary)  can be obtained by calling pcre2_get_error_mes-
3081       sage(). The code is passed as the first argument,  with  the  remaining
3082       two  arguments  specifying  a  code  unit buffer and its length in code
3083       units, into which the text message is placed. The message  is  returned
3084       in  code  units  of the appropriate width for the library that is being
3085       used.
3086
3087       The returned message is terminated with a trailing zero, and the  func-
3088       tion  returns  the  number  of  code units used, excluding the trailing
3089       zero. If the error number is unknown, the negative error code PCRE2_ER-
3090       ROR_BADDATA  is  returned.  If  the buffer is too small, the message is
3091       truncated (but still with a trailing zero), and the negative error code
3092       PCRE2_ERROR_NOMEMORY  is returned.  None of the messages are very long;
3093       a buffer size of 120 code units is ample.
3094
3095
3096EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
3097
3098       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
3099         uint32_t number, PCRE2_SIZE *length);
3100
3101       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
3102         uint32_t number, PCRE2_UCHAR *buffer,
3103         PCRE2_SIZE *bufflen);
3104
3105       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
3106         uint32_t number, PCRE2_UCHAR **bufferptr,
3107         PCRE2_SIZE *bufflen);
3108
3109       void pcre2_substring_free(PCRE2_UCHAR *buffer);
3110
3111       Captured substrings can be accessed directly by using  the  ovector  as
3112       described above.  For convenience, auxiliary functions are provided for
3113       extracting  captured  substrings  as  new,  separate,   zero-terminated
3114       strings. A substring that contains a binary zero is correctly extracted
3115       and has a further zero added on the end, but  the  result  is  not,  of
3116       course, a C string.
3117
3118       The functions in this section identify substrings by number. The number
3119       zero refers to the entire matched substring, with higher numbers refer-
3120       ring  to  substrings  captured by parenthesized groups. After a partial
3121       match, only substring zero is available.  An  attempt  to  extract  any
3122       other  substring  gives the error PCRE2_ERROR_PARTIAL. The next section
3123       describes similar functions for extracting captured substrings by name.
3124
3125       If a pattern uses the \K escape sequence within a  positive  assertion,
3126       the reported start of a successful match can be greater than the end of
3127       the match.  For example, if the pattern  (?=ab\K)  is  matched  against
3128       "ab",  the  start  and  end offset values for the match are 2 and 0. In
3129       this situation, calling these functions with a  zero  substring  number
3130       extracts a zero-length empty string.
3131
3132       You  can  find the length in code units of a captured substring without
3133       extracting it by calling pcre2_substring_length_bynumber().  The  first
3134       argument  is a pointer to the match data block, the second is the group
3135       number, and the third is a pointer to a variable into which the  length
3136       is  placed.  If  you just want to know whether or not the substring has
3137       been captured, you can pass the third argument as NULL.
3138
3139       The pcre2_substring_copy_bynumber() function  copies  a  captured  sub-
3140       string  into  a supplied buffer, whereas pcre2_substring_get_bynumber()
3141       copies it into new memory, obtained using the  same  memory  allocation
3142       function  that  was  used for the match data block. The first two argu-
3143       ments of these functions are a pointer to the match data  block  and  a
3144       capture group number.
3145
3146       The final arguments of pcre2_substring_copy_bynumber() are a pointer to
3147       the buffer and a pointer to a variable that contains its length in code
3148       units.  This is updated to contain the actual number of code units used
3149       for the extracted substring, excluding the terminating zero.
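
       The length contract just described can be modelled in a few lines.
       The sketch below is not the library's code: copy_group() is a
       hypothetical function working on a char subject and a size_t offset
       vector, SKETCH_ERROR_NOMEMORY is a stand-in that does not use the
       library's numeric value, and the requested group is assumed to be
       set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_ERROR_NOMEMORY (-1)   /* stand-in, not the library's value */

/* Copies capture group n into buffer. On entry *bufflen is the buffer
   size in code units; on success it becomes the substring length,
   excluding the terminating zero that is always written. */
int copy_group(const char *subject, const size_t *ovector, unsigned n,
               char *buffer, size_t *bufflen)
{
    assert(subject != NULL && ovector != NULL);
    assert(buffer != NULL && bufflen != NULL);
    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];
    size_t size = (start > end) ? 0 : end - start;
    if (size + 1 > *bufflen)         /* room is needed for the zero too */
        return SKETCH_ERROR_NOMEMORY;
    memcpy(buffer, subject + start, size);
    buffer[size] = '\0';
    *bufflen = size;
    return 0;
}
```

       Note that a buffer whose size equals the substring length is one code
       unit too small, because of the terminating zero.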
3150
3151       For pcre2_substring_get_bynumber() the third and fourth arguments point
3152       to  variables that are updated with a pointer to the new memory and the
3153       number of code units that comprise the substring, again  excluding  the
3154       terminating  zero.  When  the substring is no longer needed, the memory
3155       should be freed by calling pcre2_substring_free().
3156
3157       The return value from all these functions is zero  for  success,  or  a
3158       negative  error  code.  If  the pattern match failed, the match failure
3159       code is returned.  If a substring number greater than zero is used  af-
3160       ter  a  partial  match, PCRE2_ERROR_PARTIAL is returned. Other possible
3161       error codes are:
3162
3163         PCRE2_ERROR_NOMEMORY
3164
3165       The buffer was too small for  pcre2_substring_copy_bynumber(),  or  the
3166       attempt to get memory failed for pcre2_substring_get_bynumber().
3167
3168         PCRE2_ERROR_NOSUBSTRING
3169
3170       There  is  no  substring  with that number in the pattern, that is, the
3171       number is greater than the number of capturing parentheses.
3172
3173         PCRE2_ERROR_UNAVAILABLE
3174
3175       The substring number, though not greater than the number of captures in
3176       the pattern, is greater than the number of slots in the ovector, so the
3177       substring could not be captured.
3178
3179         PCRE2_ERROR_UNSET
3180
3181       The substring did not participate in the match.  For  example,  if  the
3182       pattern  is  (abc)|(def) and the subject is "def", and the ovector con-
3183       tains at least two capturing slots, substring number 1 is unset.
3184
3185
3186EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
3187
3188       int pcre2_substring_list_get(pcre2_match_data *match_data,
3189         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
3190
3191       void pcre2_substring_list_free(PCRE2_SPTR *list);
3192
3193       The pcre2_substring_list_get() function  extracts  all  available  sub-
3194       strings  and  builds  a  list of pointers to them. It also (optionally)
3195       builds a second list that contains their lengths (in code  units),  ex-
3196       cluding  a  terminating zero that is added to each of them. All this is
3197       done in a single block of memory that is obtained using the same memory
3198       allocation function that was used to get the match data block.
3199
3200       This  function  must be called only after a successful match. If called
3201       after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
3202
3203       The address of the memory block is returned via listptr, which is  also
3204       the start of the list of string pointers. The end of the list is marked
3205       by a NULL pointer. The address of the list of lengths is  returned  via
3206       lengthsptr.  If your strings do not contain binary zeros and you do not
3207       therefore need the lengths, you may supply NULL as the lengthsptr argu-
3208       ment  to  disable  the  creation of a list of lengths. The yield of the
3209       function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the  mem-
3210       ory  block could not be obtained. When the list is no longer needed, it
3211       should be freed by calling pcre2_substring_list_free().
3212
3213       If this function encounters a substring that is unset, which can happen
3214       when  capture  group  number  n+1 matches some part of the subject, but
3215       group n has not been used at all, it returns an empty string. This  can
3216       be distinguished from a genuine zero-length substring by inspecting the
3217       appropriate offset in the ovector, which contains PCRE2_UNSET for unset
3218       substrings, or by calling pcre2_substring_length_bynumber().
3219
3220
3221EXTRACTING CAPTURED SUBSTRINGS BY NAME
3222
3223       int pcre2_substring_number_from_name(const pcre2_code *code,
3224         PCRE2_SPTR name);
3225
3226       int pcre2_substring_length_byname(pcre2_match_data *match_data,
3227         PCRE2_SPTR name, PCRE2_SIZE *length);
3228
3229       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
3230         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
3231
3232       int pcre2_substring_get_byname(pcre2_match_data *match_data,
3233         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
3234
3235       void pcre2_substring_free(PCRE2_UCHAR *buffer);
3236
3237       To extract a substring by name, you first have to find  the  associated
3238       number.  For example, for this pattern:
3239
3240         (a+)b(?<xxx>\d+)...
3241
3242       the number of the capture group called "xxx" is 2. If the name is known
3243       to be unique (PCRE2_DUPNAMES was not set), you can find the number from
3244       the name by calling pcre2_substring_number_from_name(). The first argu-
3245       ment  is the compiled pattern, and the second is the name. The yield of
3246       the function is the group number, PCRE2_ERROR_NOSUBSTRING if  there  is
3247       no  group  with that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is
3248       more than one group with that name.  Given the number, you can  extract
3249       the  substring  directly from the ovector, or use one of the "bynumber"
3250       functions described above.
3251
3252       For convenience, there are also "byname" functions that  correspond  to
3253       the "bynumber" functions, the only difference being that the second ar-
3254       gument is a name instead of a number.  If  PCRE2_DUPNAMES  is  set  and
3255       there are duplicate names, these functions scan all the groups with the
3256       given name, and return the captured  substring  from  the  first  named
3257       group that is set.
3258
3259       If  there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
3260       returned. If all groups with the name have  numbers  that  are  greater
3261       than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re-
3262       turned. If there is at least one group with a slot in the ovector,  but
3263       no group is found to be set, PCRE2_ERROR_UNSET is returned.
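
       The selection rule for duplicate names and its three error outcomes
       can be sketched as follows. Everything here is illustrative: the
       function name, the idea of passing the candidate group numbers as an
       array, and the SKETCH_ constants (whose values are not the library's)
       are assumptions made so that the fragment compiles without PCRE2.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_UNSET (~(size_t)0)    /* same definition as PCRE2_UNSET */

enum {                               /* stand-ins, not the real values */
    SKETCH_ERROR_NOSUBSTRING = -1,
    SKETCH_ERROR_UNAVAILABLE = -2,
    SKETCH_ERROR_UNSET = -3
};

/* groups[] holds the numbers of all groups that share one name, and
   oveccount is the number of pairs in the ovector. Returns the number of
   the first such group that is set, or a negative stand-in error. */
int pick_named_group(const unsigned *groups, size_t ngroups,
                     const size_t *ovector, size_t oveccount)
{
    assert(ovector != NULL);
    if (ngroups == 0)
        return SKETCH_ERROR_NOSUBSTRING;   /* no group has this name */
    assert(groups != NULL);
    int result = SKETCH_ERROR_UNAVAILABLE; /* until a slot is found */
    for (size_t i = 0; i < ngroups; i++) {
        unsigned n = groups[i];
        if (n >= oveccount)
            continue;                      /* no slot in the ovector */
        if (ovector[2 * n] != SKETCH_UNSET)
            return (int)n;                 /* first group that is set */
        result = SKETCH_ERROR_UNSET;       /* had a slot, but not set */
    }
    return result;
}
```

       For instance, if groups 1 and 2 share a name and only group 2 matched,
       the helper returns 2 when the ovector has three pairs, and the
       "unavailable" stand-in when it has only one.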
3264
3265       Warning: If the pattern uses the (?| feature to set up multiple capture
3266       groups with the same number, as described in the section  on  duplicate
3267       group numbers in the pcre2pattern page, you cannot use names to distin-
3268       guish the different capture groups, because names are not  included  in
3269       the  compiled  code.  The  matching process uses only numbers. For this
3270       reason, the use of different names for  groups  with  the  same  number
3271       causes an error at compile time.
3272
3273
3274CREATING A NEW STRING WITH SUBSTITUTIONS
3275
3276       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
3277         PCRE2_SIZE length, PCRE2_SIZE startoffset,
3278         uint32_t options, pcre2_match_data *match_data,
3279         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
3280         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
3281         PCRE2_SIZE *outlengthptr);
3282
3283       This  function  optionally calls pcre2_match() and then makes a copy of
3284       the subject string in outputbuffer, replacing parts that  were  matched
3285       with the replacement string, whose length is supplied in rlength, which
3286       can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.  As
3287       a  special  case,  if  replacement is NULL and rlength is zero, the re-
3288       placement is assumed to be an empty string. If rlength is non-zero,  an
3289       error occurs if replacement is NULL.
3290
3291       There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re-
3292       turn just the replacement string(s). The default action is  to  perform
3293       just  one  replacement  if  the pattern matches, but there is an option
3294       that requests multiple replacements  (see  PCRE2_SUBSTITUTE_GLOBAL  be-
3295       low).
3296
3297       If  successful,  pcre2_substitute() returns the number of substitutions
3298       that were carried out. This may be zero if no match was found,  and  is
3299       never  greater  than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A nega-
3300       tive value is returned if an error is detected.
3301
3302       Matches in which a \K item in a lookahead in  the  pattern  causes  the
3303       match  to  end  before it starts are not supported, and give rise to an
3304       error return. For global replacements, matches in which \K in a lookbe-
3305       hind  causes the match to start earlier than the point that was reached
3306       in the previous iteration are also not supported.
3307
3308       The first seven arguments of pcre2_substitute() are  the  same  as  for
3309       pcre2_match(), except that the partial matching options are not permit-
3310       ted, and match_data may be passed as NULL, in which case a  match  data
3311       block  is obtained and freed within this function, using memory manage-
3312       ment functions from the match context, if provided, or else those  that
3313       were used to allocate memory for the compiled code.
3314
3315       If  match_data is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the
3316       provided block is used for all calls to pcre2_match(), and its contents
3317       afterwards  are  the result of the final call. For global changes, this
3318       will always be a no-match error. The contents of the ovector within the
3319       match data block may or may not have been changed.
3320
3321       As  well as the usual options for pcre2_match(), a number of additional
3322       options can be set in the options argument of pcre2_substitute().   One
3323       such  option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an external
3324       match_data block must be provided, and it must have already  been  used
3325       for an external call to pcre2_match() with the same pattern and subject
3326       arguments. The data in the match_data block (return code,  offset  vec-
3327       tor)  is  then  used  for  the  first  substitution  instead of calling
3328       pcre2_match() from within pcre2_substitute(). This allows  an  applica-
3329       tion to check for a match before choosing to substitute, without having
3330       to repeat the match.
3331
3332       The contents of the  externally  supplied  match  data  block  are  not
3333       changed   when   PCRE2_SUBSTITUTE_MATCHED   is  set.  If  PCRE2_SUBSTI-
3334       TUTE_GLOBAL is also set, pcre2_match() is called after the  first  sub-
3335       stitution  to  check for further matches, but this is done using an in-
3336       ternally obtained match data block, thus always  leaving  the  external
3337       block unchanged.
3338
3339       The  code  argument is not used for matching before the first substitu-
3340       tion when PCRE2_SUBSTITUTE_MATCHED is set, but  it  must  be  provided,
3341       even  when  PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains in-
3342       formation such as the UTF setting and the number of capturing parenthe-
3343       ses in the pattern.
3344
3345       The  default  action  of  pcre2_substitute() is to return a copy of the
3346       subject string with matched substrings replaced. However, if PCRE2_SUB-
3347       STITUTE_REPLACEMENT_ONLY  is  set,  only the replacement substrings are
3348       returned. In the global case, multiple replacements are concatenated in
3349       the  output  buffer.  Substitution  callouts (see below) can be used to
3350       separate them if necessary.
3351
3352       The outlengthptr argument of pcre2_substitute() must point to  a  vari-
3353       able  that contains the length, in code units, of the output buffer. If
3354       the function is successful, the value is updated to contain the  length
3355       in  code  units  of the new string, excluding the trailing zero that is
3356       automatically added.
3357
3358       If the function is not successful, the value set via  outlengthptr  de-
3359       pends  on  the  type  of  error.  For  syntax errors in the replacement
3360       string, the value is the offset in the replacement string where the er-
3361       ror  was  detected.  For  other errors, the value is PCRE2_UNSET by de-
3362       fault. This includes the case of the output buffer being too small, un-
3363       less PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.
3364
3365       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH  changes  what happens when the output
3366       buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3367       ORY  immediately.  If  this  option is set, however, pcre2_substitute()
3368       continues to go through the motions of matching and substituting (with-
3369       out,  of course, writing anything) in order to compute the size of buf-
3370       fer that is needed. This value is  passed  back  via  the  outlengthptr
3371       variable,  with  the  result  of  the  function  still  being PCRE2_ER-
3372       ROR_NOMEMORY.
3373
3374       Passing a buffer size of zero is a permitted way  of  finding  out  how
3375       much memory is needed for a given substitution. However, this does mean
3376       that the entire operation is carried out twice. Depending on the appli-
3377       cation,  it  may  be more efficient to allocate a large buffer and free
3378       the  excess  afterwards,  instead   of   using   PCRE2_SUBSTITUTE_OVER-
3379       FLOW_LENGTH.
3380
3381       The  replacement  string,  which  is interpreted as a UTF string in UTF
3382       mode, is checked for UTF validity unless PCRE2_NO_UTF_CHECK is set.  An
3383       invalid UTF replacement string causes an immediate return with the rel-
3384       evant UTF error code.
3385
3386       If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is  not  in-
3387       terpreted in any way. By default, however, a dollar character is an es-
3388       cape character that can specify the insertion of characters  from  cap-
3389       ture  groups  and names from (*MARK) or other control verbs in the pat-
3390       tern. The following forms are always recognized:
3391
3392         $$                  insert a dollar character
3393         $<n> or ${<n>}      insert the contents of group <n>
3394         $*MARK or ${*MARK}  insert a control verb name
3395
3396       Either a group number or a group name  can  be  given  for  <n>.  Curly
3397       brackets  are  required only if the following character would be inter-
3398       preted as part of the number or name. The number may be zero to include
3399       the  entire  matched  string.   For  example,  if  the pattern a(b)c is
3400       matched with "=abc=" and the replacement string "+$1$0$1+", the  result
3401       is "=+babcb+=".
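
       To make the expansion rules concrete, here is a much simplified model
       of a single substitution. It is not how pcre2_substitute() is
       implemented: substitute_once() and append() are hypothetical names,
       only $$, $<digits> and ${<digits>} are recognized (no group names, no
       $*MARK), the subject is assumed to be a zero-terminated char string,
       and every failure is reported as -1.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Appends n bytes of src to out and keeps out zero-terminated. cap is the
   buffer size including the terminating zero. Returns -1 on overflow. */
static int append(char *out, size_t cap, size_t *used,
                  const char *src, size_t n)
{
    if (*used + n + 1 > cap)
        return -1;
    memcpy(out + *used, src, n);
    *used += n;
    out[*used] = '\0';
    return 0;
}

/* Writes subject to out with the matched part replaced by the expanded
   replacement. ovector and rc describe one successful match. */
int substitute_once(const char *subject, const size_t *ovector, int rc,
                    const char *replacement, char *out, size_t cap)
{
    assert(subject != NULL && ovector != NULL && rc > 0);
    assert(replacement != NULL && out != NULL);
    assert(ovector[0] <= ovector[1]);
    size_t used = 0;

    /* Copy the part of the subject that precedes the match. */
    if (append(out, cap, &used, subject, ovector[0]) != 0)
        return -1;

    for (const char *p = replacement; *p != '\0'; p++) {
        if (*p != '$') {                    /* ordinary character */
            if (append(out, cap, &used, p, 1) != 0)
                return -1;
            continue;
        }
        p++;                                /* look at what follows '$' */
        if (*p == '$') {                    /* "$$" inserts one dollar */
            if (append(out, cap, &used, "$", 1) != 0)
                return -1;
            continue;
        }
        int braced = (*p == '{');
        if (braced)
            p++;
        if (!isdigit((unsigned char)*p))
            return -1;                      /* names are not handled here */
        int n = 0;
        while (isdigit((unsigned char)*p)) {
            n = n * 10 + (*p - '0');
            if (n >= rc)
                return -1;                  /* no such pair was set */
            p++;
        }
        if (braced) {
            if (*p != '}')
                return -1;
        } else {
            p--;                            /* loop increment re-advances */
        }
        size_t start = ovector[2 * n];
        size_t end = ovector[2 * n + 1];
        if (start == ~(size_t)0 || start > end)
            return -1;                      /* unset or inverted group */
        if (append(out, cap, &used, subject + start, end - start) != 0)
            return -1;
    }

    /* Copy the part of the subject that follows the match. */
    size_t length = strlen(subject);
    return append(out, cap, &used, subject + ovector[1],
                  length - ovector[1]);
}
```

       With the subject "=abc=", the offsets {1,4, 2,3} that a(b)c would
       produce, and the replacement "+$1$0$1+", this model yields
       "=+babcb+=", in agreement with the example above.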
3402
3403       $*MARK  inserts the name from the last encountered backtracking control
3404       verb on the matching path that has a name. (*MARK) must always  include
3405       a  name,  but  the  other  verbs  need not. For example, in the case of
3406       (*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B)
3407       the  relevant  name is "B". This facility can be used to perform simple
3408       simultaneous substitutions, as this pcre2test example shows:
3409
3410         /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
3411             apple lemon
3412          2: pear orange
3413
3414       PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
3415       string,  replacing every matching substring. If this option is not set,
3416       only the first matching substring is replaced. The search  for  matches
3417       takes  place in the original subject string (that is, previous replace-
3418       ments do not affect it).  Iteration is  implemented  by  advancing  the
3419       startoffset  value  for  each search, which is always passed the entire
3420       subject string. If an offset limit is set in the match context, search-
3421       ing stops when that limit is reached.
3422
3423       You  can  restrict  the effect of a global substitution to a portion of
3424       the subject string by setting either or both of startoffset and an off-
3425       set limit. Here is a pcre2test example:
3426
3427         /B/g,replace=!,use_offset_limit
3428         ABC ABC ABC ABC\=offset=3,offset_limit=12
3429          2: ABC A!C A!C ABC
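
       The effect of the two settings in this example can be imitated
       without the library by a toy function that replaces a single
       character. This is only a model under stated assumptions: the
       function name is hypothetical, the "pattern" is one literal
       character, and the offset limit is taken to bound the position at
       which a match may start.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replaces from with to in s, but only for occurrences that start at or
   after startoffset and no later than offset_limit. Returns the number of
   replacements made. */
int replace_in_window(char *s, char from, char to,
                      size_t startoffset, size_t offset_limit)
{
    assert(s != NULL);
    size_t length = strlen(s);
    int count = 0;
    for (size_t i = startoffset; i < length && i <= offset_limit; i++) {
        if (s[i] == from) {
            s[i] = to;
            count++;
        }
    }
    return count;
}
```

       Applied to "ABC ABC ABC ABC" with a start offset of 3 and a limit of
       12, the occurrences of B at offsets 5 and 9 are replaced, while those
       at offsets 1 and 13 fall outside the window and are left alone.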
3430
3431       When  continuing  with  global substitutions after matching a substring
3432       with zero length, an attempt to find a non-empty match at the same off-
3433       set is performed.  If this is not successful, the offset is advanced by
3434       one character except when CRLF is a valid newline sequence and the next
3435       two  characters are CR, LF. In this case, the offset is advanced by two
3436       characters.
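
       The advance rule after an empty match could be written along these
       lines. The sketch assumes the 8-bit library in non-UTF mode, where
       one character is one code unit; the function name and the
       crlf_is_newline flag are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the offset at which to resume searching after an empty match
   at offset, when no non-empty match was found there. */
size_t advance_after_empty_match(const char *subject, size_t length,
                                 size_t offset, int crlf_is_newline)
{
    assert(subject != NULL && offset < length);
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;          /* step over CR LF as one unit */
    return offset + 1;              /* step over a single character */
}
```

       In UTF mode the single-character step would have to skip a whole
       multi-unit character rather than one code unit.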
3437
3438       PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that
3439       do not appear in the pattern to be treated as unset groups. This option
3440       should be used with care, because it means that a typo in a group  name
3441       or number no longer causes the PCRE2_ERROR_NOSUBSTRING error.
3442
3443       PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un-
3444       known groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be  treated
3445       as  empty  strings  when inserted as described above. If this option is
3446       not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN-
3447       SET  error.  This  option  does not influence the extended substitution
3448       syntax described below.
3449
3450       PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to  the
3451       replacement  string.  Without this option, only the dollar character is
3452       special, and only the group insertion forms  listed  above  are  valid.
3453       When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
3454
3455       Firstly,  backslash in a replacement string is interpreted as an escape
3456       character. The usual forms such as \n or \x{ddd} can be used to specify
3457       particular  character codes, and backslash followed by any non-alphanu-
3458       meric character quotes that character. Extended quoting  can  be  coded
3459       using \Q...\E, exactly as in pattern strings.
3460
3461       There  are  also four escape sequences for forcing the case of inserted
3462       letters.  The insertion mechanism has three states:  no  case  forcing,
3463       force upper case, and force lower case. The escape sequences change the
3464       current state: \U and \L change to upper or lower case forcing, respec-
3465       tively,  and  \E (when not terminating a \Q quoted sequence) reverts to
3466       no case forcing. The sequences \u and \l force the next  character  (if
3467       it  is  a  letter)  to  upper or lower case, respectively, and then the
3468       state automatically reverts to no case forcing. Case forcing applies to
3469       all  inserted  characters, including those from capture groups and let-
3470       ters within \Q...\E quoted sequences. If either PCRE2_UTF or  PCRE2_UCP
3471       was  set when the pattern was compiled, Unicode properties are used for
3472       case forcing characters whose code points are greater than 127.
3473
3474       Note that case forcing sequences such as \U...\E do not nest. For exam-
3475       ple,  the  result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
3476       \E has no effect. Note  also  that  the  PCRE2_ALT_BSUX  and  PCRE2_EX-
3477       TRA_ALT_BSUX options do not apply to replacement strings.
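
       The three-state mechanism can be illustrated with a small C model.
       It is a sketch under simplifying assumptions, not the library's
       implementation: it handles ASCII text only, recognizes just \U, \L,
       \E, \u and \l, and copies every other character (including any
       other escape) through unchanged. The names used are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum force { FORCE_NONE, FORCE_UPPER, FORCE_LOWER };

/* Apply case forcing escapes to `repl`, writing the result to `out`.
   Returns the output length, or (size_t)-1 if `out` is too small. */
size_t force_case(const char *repl, char *out, size_t outsize)
{
  enum force state = FORCE_NONE;  /* set by \U and \L, cleared by \E */
  enum force once = FORCE_NONE;   /* pending \u or \l for one character */
  size_t o = 0;
  assert(repl != NULL && out != NULL);

  for (; *repl != '\0'; repl++) {
    int c = (unsigned char)*repl;

    if (c == '\\') {
      switch (repl[1]) {
        case 'U': state = FORCE_UPPER; once = FORCE_NONE; repl++; continue;
        case 'L': state = FORCE_LOWER; once = FORCE_NONE; repl++; continue;
        case 'E': state = FORCE_NONE;  once = FORCE_NONE; repl++; continue;
        case 'u': once = FORCE_UPPER; repl++; continue;
        case 'l': once = FORCE_LOWER; repl++; continue;
        default: break;  /* not modelled: copied through unchanged */
      }
    }

    if (once != FORCE_NONE) {
      c = (once == FORCE_UPPER) ? toupper(c) : tolower(c);
      once = FORCE_NONE;
      state = FORCE_NONE;  /* \u and \l revert to no case forcing */
    } else if (state == FORCE_UPPER) {
      c = toupper(c);
    } else if (state == FORCE_LOWER) {
      c = tolower(c);
    }

    if (o + 2 > outsize) return (size_t)-1;
    out[o++] = (char)c;
  }
  if (o + 1 > outsize) return (size_t)-1;
  out[o] = '\0';
  return o;
}
```

       Because there is a single state variable rather than a stack, the
       \L in "\Uaa\LBB\Ecc\E" simply replaces the upper case state, the
       first \E clears it, and the final \E finds nothing left to undo.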
3478
3479       The  second  effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
3480       flexibility to capture group substitution. The  syntax  is  similar  to
3481       that used by Bash:
3482
3483         ${<n>:-<string>}
3484         ${<n>:+<string1>:<string2>}
3485
3486       As  before,  <n> may be a group number or a name. The first form speci-
3487       fies a default value. If group <n> is set, its value  is  inserted;  if
3488       not,  <string>  is  expanded  and  the result inserted. The second form
3489       specifies strings that are expanded and inserted when group <n> is  set
3490       or  unset,  respectively. The first form is just a convenient shorthand
3491       for
3492
3493         ${<n>:+${<n>}:<string>}
3494
3495       Backslash can be used to escape colons and closing  curly  brackets  in
3496       the  replacement  strings.  A change of the case forcing state within a
3497       replacement string remains  in  force  afterwards,  as  shown  in  this
3498       pcre2test example:
3499
3500         /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
3501             body
3502          1: hello
3503             somebody
3504          1: HELLO
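
       The choice made by the two extended forms can be sketched in C as
       follows. This is only a model of the selection rule: a NULL pointer
       stands for an unset group, the chosen string is returned without
       being expanded further, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ${<n>:-<string>} : the group's value if it is set, else the default.
   `value` is NULL when the group is unset. */
const char *expand_default(const char *value, const char *dflt)
{
  assert(dflt != NULL);
  return (value != NULL) ? value : dflt;
}

/* ${<n>:+<string1>:<string2>} : string1 if the group is set, string2 if
   it is unset. The group's own value is not inserted by this form. */
const char *expand_conditional(const char *value, const char *if_set,
                               const char *if_unset)
{
  assert(if_set != NULL && if_unset != NULL);
  return (value != NULL) ? if_set : if_unset;
}
```

       In this model a group that matched the empty string is still set
       (its value is "" rather than NULL), so its empty value is inserted
       instead of the default. This follows the wording above, which tests
       only whether the group is set; Bash's ${var:-word} differs here,
       because it also substitutes the default for an empty variable.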
3505
3506       The  PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
3507       substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does  cause  un-
3508       known groups in the extended syntax forms to be treated as unset.
3509
3510       If  PCRE2_SUBSTITUTE_LITERAL  is  set,  PCRE2_SUBSTITUTE_UNKNOWN_UNSET,
3511       PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele-
3512       vant and are ignored.
3513
3514   Substitution errors
3515
3516       In  the  event of an error, pcre2_substitute() returns a negative error
3517       code. Except for PCRE2_ERROR_NOMATCH (which is never returned),  errors
3518       from pcre2_match() are passed straight back.
3519
3520       PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3521       tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
3522
3523       PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3524       ing  an  unknown  substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
3525       when the simple (non-extended) syntax is used and  PCRE2_SUBSTITUTE_UN-
3526       SET_EMPTY is not set.
3527
3528       PCRE2_ERROR_NOMEMORY  is  returned  if  the  output  buffer  is not big
3529       enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
3530       of  buffer  that is needed is returned via outlengthptr. Note that this
3531       does not happen by default.
3532
3533       PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the
3534       match_data  argument is NULL or if the subject or replacement arguments
3535       are NULL. For backward compatibility reasons an exception is  made  for
3536       the replacement argument if the rlength argument is also 0.
3537
3538       PCRE2_ERROR_BADREPLACEMENT  is  used for miscellaneous syntax errors in
3539       the replacement string, with more  particular  errors  being  PCRE2_ER-
3540       ROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE
3541       (closing curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION  (syntax
3542       error  in  extended group substitution), and PCRE2_ERROR_BADSUBSPATTERN
3543       (the pattern match ended before it started or the match started earlier
3544       than  the  current  position  in the subject, which can happen if \K is
3545       used in an assertion).
3546
3547       As for all PCRE2 errors, a text message that describes the error can be
3548       obtained  by  calling  the pcre2_get_error_message() function (see "Ob-
3549       taining a textual error message" above).
3550
3551   Substitution callouts
3552
       int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_substitute_callout_block *, void *),
         void *callout_data);
3556
       The pcre2_set_substitute_callout() function can be used to specify a
3558       callout  function for pcre2_substitute(). This information is passed in
3559       a match context. The callout function is called after each substitution
3560       has been processed, but it can cause the replacement not to happen. The
3561       callout function is not called for simulated substitutions that  happen
3562       as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option.
3563
3564       The first argument of the callout function is a pointer to a substitute
3565       callout block structure, which contains the following fields, not  nec-
3566       essarily in this order:
3567
         uint32_t    version;
         uint32_t    subscount;
         PCRE2_SPTR  input;
         PCRE2_SPTR  output;
         PCRE2_SIZE *ovector;
         uint32_t    oveccount;
         PCRE2_SIZE  output_offsets[2];
3575
3576       The  version field contains the version number of the block format. The
3577       current version is 0. The version number will  increase  in  future  if
3578       more  fields are added, but the intention is never to remove any of the
3579       existing fields.
3580
3581       The subscount field is the number of the current match. It is 1 for the
3582       first callout, 2 for the second, and so on. The input and output point-
3583       ers are copies of the values passed to pcre2_substitute().
3584
3585       The ovector field points to the ovector, which contains the  result  of
3586       the most recent match. The oveccount field contains the number of pairs
3587       that are set in the ovector, and is always greater than zero.
3588
3589       The output_offsets vector contains the offsets of  the  replacement  in
3590       the  output  string. This has already been processed for dollar and (if
3591       requested) backslash substitutions as described above.
3592
3593       The second argument of the callout function  is  the  value  passed  as
3594       callout_data  when  the  function was registered. The value returned by
3595       the callout function is interpreted as follows:
3596
3597       If the value is zero, the replacement is accepted, and,  if  PCRE2_SUB-
3598       STITUTE_GLOBAL  is set, processing continues with a search for the next
3599       match. If the value is not zero, the current  replacement  is  not  ac-
3600       cepted.  If  the  value is greater than zero, processing continues when
3601       PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than  zero
       or PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is
3603       copied to the output and the call to pcre2_substitute() exits,  return-
3604       ing the number of matches so far.
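
       The return value rules can be summarized as a small decision
       function. The C sketch below is a model of the rules as stated, not
       PCRE2 code; the struct, the function names, and the sample policy
       are hypothetical. A real callout would receive the
       pcre2_substitute_callout_block pointer and read its subscount
       field, which the sample policy takes as a plain argument here.

```c
#include <assert.h>

struct callout_outcome {
  int accepted;    /* 1 if the replacement is kept in the output */
  int keep_going;  /* 1 if the search for further matches continues */
};

/* Interpret a substitution callout's return value `rc`. `global` is
   nonzero when PCRE2_SUBSTITUTE_GLOBAL is set. */
struct callout_outcome interpret_callout(int rc, int global)
{
  struct callout_outcome r;
  r.accepted = (rc == 0);             /* any nonzero value rejects */
  r.keep_going = (global && rc >= 0); /* negative value stops at once */
  return r;
}

/* Sample policy: accept odd-numbered substitutions, skip even-numbered
   ones but let a global substitution carry on. */
int reject_even(unsigned int subscount)
{
  assert(subscount >= 1);
  return (subscount % 2 == 0) ? 1 : 0;
}
```

       When keep_going is 0, the rest of the input is copied to the output
       unchanged and pcre2_substitute() returns the number of matches so
       far, as described above.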
3605
3606
3607DUPLICATE CAPTURE GROUP NAMES
3608
       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
3611
3612       When  a  pattern  is compiled with the PCRE2_DUPNAMES option, names for
3613       capture groups are not required to be unique. Duplicate names  are  al-
3614       ways  allowed for groups with the same number, created by using the (?|
3615       feature. Indeed, if such groups are named, they are required to use the
3616       same names.
3617
3618       Normally,  patterns  that  use duplicate names are such that in any one
3619       match, only one of each set of identically-named  groups  participates.
3620       An example is shown in the pcre2pattern documentation.
3621
3622       When   duplicates   are   present,   pcre2_substring_copy_byname()  and
3623       pcre2_substring_get_byname() return the first  substring  corresponding
       to the given name that is set. Only if none are set is
       PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
       function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there
       are duplicate
3627       names.
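
       The "first substring that is set" rule can be modelled in C as
       shown below. This is a sketch, not the library's code: the ovector
       is a plain array of offset pairs, SIZE_MAX stands in for
       PCRE2_UNSET, the candidate group numbers are tried in the order
       given, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNSET_OFFSET SIZE_MAX  /* stand-in for PCRE2_UNSET in this sketch */

/* `groups` lists the numbers of all groups that share one name. Return
   the first of them whose ovector pair is set, or 0 if none is set (the
   case in which PCRE2_ERROR_UNSET would be reported). */
unsigned int first_set_group(const unsigned int *groups, size_t ngroups,
                             const size_t *ovector)
{
  size_t i;
  assert(groups != NULL && ovector != NULL);

  for (i = 0; i < ngroups; i++) {
    if (ovector[2 * groups[i]] != UNSET_OFFSET)
      return groups[i];
  }
  return 0;
}
```

       For example, if groups 1 and 2 share a name and only group 2 took
       part in the match, the function returns 2, and the captured text is
       the part of the subject between ovector[4] and ovector[5].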
3628
3629       If you want to get full details of all captured substrings for a  given
3630       name,  you  must use the pcre2_substring_nametable_scan() function. The
3631       first argument is the compiled pattern, and the second is the name.  If
3632       the  third  and fourth arguments are NULL, the function returns a group
3633       number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
3634
3635       When the third and fourth arguments are not NULL, they must be pointers
3636       to  variables  that are updated by the function. After it has run, they
3637       point to the first and last entries in the name-to-number table for the
3638       given  name,  and the function returns the length of each entry in code
3639       units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there  are
3640       no entries for the given name.
3641
3642       The format of the name table is described above in the section entitled
       "Information about a pattern". Given all the relevant entries for the
3644       name,  you  can  extract  each of their numbers, and hence the captured
3645       data.
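
       Walking the entries between first and last might look like the C
       sketch below. It assumes the 8-bit library's table layout, in which
       each entry starts with the group number in two bytes, most
       significant byte first, followed by the zero-terminated name. The
       function name is hypothetical, and the caller is expected to pass
       the pointers and entry length that were obtained from
       pcre2_substring_nametable_scan().

```c
#include <assert.h>
#include <stddef.h>

/* Collect the group numbers of all name table entries from `first` to
   `last` inclusive. `entrysize` is the length of one entry in code units.
   At most `max` numbers are stored; the count stored is returned. */
size_t groups_for_name(const unsigned char *first,
                       const unsigned char *last, size_t entrysize,
                       unsigned int *numbers, size_t max)
{
  const unsigned char *entry;
  size_t n = 0;
  assert(first != NULL && last != NULL && numbers != NULL);
  assert(entrysize >= 3 && first <= last);

  for (entry = first; entry <= last && n < max; entry += entrysize) {
    /* Two-byte group number, most significant byte first. */
    numbers[n++] = ((unsigned int)entry[0] << 8) | entry[1];
  }
  return n;
}
```

       Each number obtained in this way can then be used to look up the
       corresponding pair of offsets in the ovector, or be passed to one
       of the functions that extract substrings by number.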
3646
3647
3648FINDING ALL POSSIBLE MATCHES AT ONE POSITION
3649
3650       The traditional matching function uses a  similar  algorithm  to  Perl,
3651       which  stops when it finds the first match at a given point in the sub-
3652       ject. If you want to find all possible matches, or the longest possible
3653       match  at  a  given  position,  consider using the alternative matching
3654       function (see below) instead. If you cannot use the  alternative  func-
3655       tion, you can kludge it up by making use of the callout facility, which
3656       is described in the pcre2callout documentation.
3657
3658       What you have to do is to insert a callout right at the end of the pat-
3659       tern.   When your callout function is called, extract and save the cur-
3660       rent matched substring. Then return 1, which  forces  pcre2_match()  to
3661       backtrack  and  try other alternatives. Ultimately, when it runs out of
3662       matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
3663
3664
3665MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
3666
       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);
3672
3673       The function pcre2_dfa_match() is called  to  match  a  subject  string
3674       against  a  compiled pattern, using a matching algorithm that scans the
3675       subject string just once (not counting lookaround assertions), and does
3676       not  backtrack (except when processing lookaround assertions). This has
3677       different characteristics to the normal algorithm, and is not  compati-
3678       ble  with  Perl.  Some  of  the features of PCRE2 patterns are not sup-
3679       ported. Nevertheless, there are times when this kind of matching can be
3680       useful.  For a discussion of the two matching algorithms, and a list of
3681       features that pcre2_dfa_match() does not support, see the pcre2matching
3682       documentation.
3683
3684       The  arguments  for  the pcre2_dfa_match() function are the same as for
3685       pcre2_match(), plus two extras. The ovector within the match data block
3686       is used in a different way, and this is described below. The other com-
3687       mon arguments are used in the same way as for pcre2_match(),  so  their
3688       description is not repeated here.
3689
3690       The  two  additional  arguments provide workspace for the function. The
3691       workspace vector should contain at least 20 elements. It  is  used  for
3692       keeping  track  of  multiple  paths  through  the  pattern  tree.  More
3693       workspace is needed for patterns and subjects where there are a lot  of
3694       potential matches.
3695
3696       Here is an example of a simple call to pcre2_dfa_match():
3697
         int wspace[20];
         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_dfa_match(
           re,                         /* result of pcre2_compile() */
           (PCRE2_SPTR)"some string",  /* the subject string */
           11,                         /* the length of the subject string */
           0,                          /* start at offset 0 in the subject */
           0,                          /* default options */
           md,                         /* the match data block */
           NULL,                       /* match context; NULL means defaults */
           wspace,                     /* working space vector */
           20);                        /* number of elements (NOT bytes) */
3710
3711   Option bits for pcre2_dfa_match()
3712
3713       The  unused  bits of the options argument for pcre2_dfa_match() must be
3714       zero.  The  only   bits   that   may   be   set   are   PCRE2_ANCHORED,
3715       PCRE2_COPY_MATCHED_SUBJECT,  PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NO-
3716       TEOL,   PCRE2_NOTEMPTY,   PCRE2_NOTEMPTY_ATSTART,   PCRE2_NO_UTF_CHECK,
3717       PCRE2_PARTIAL_HARD,    PCRE2_PARTIAL_SOFT,    PCRE2_DFA_SHORTEST,   and
3718       PCRE2_DFA_RESTART. All but the last four of these are exactly the  same
3719       as for pcre2_match(), so their description is not repeated here.
3720
3721         PCRE2_PARTIAL_HARD
3722         PCRE2_PARTIAL_SOFT
3723
3724       These  have  the  same general effect as they do for pcre2_match(), but
3725       the details are slightly different. When PCRE2_PARTIAL_HARD is set  for
3726       pcre2_dfa_match(),  it  returns  PCRE2_ERROR_PARTIAL  if the end of the
3727       subject is reached and there is still at least one matching possibility
3728       that requires additional characters. This happens even if some complete
3729       matches have already been found. When PCRE2_PARTIAL_SOFT  is  set,  the
3730       return  code  PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
3731       if the end of the subject is  reached,  there  have  been  no  complete
3732       matches, but there is still at least one matching possibility. The por-
3733       tion of the string that was inspected when the  longest  partial  match
3734       was found is set as the first matching string in both cases. There is a
3735       more detailed discussion of partial and  multi-segment  matching,  with
3736       examples, in the pcre2partial documentation.
3737
3738         PCRE2_DFA_SHORTEST
3739
3740       Setting  the PCRE2_DFA_SHORTEST option causes the matching algorithm to
3741       stop as soon as it has found one match. Because of the way the alterna-
3742       tive  algorithm  works, this is necessarily the shortest possible match
3743       at the first possible matching point in the subject string.
3744
3745         PCRE2_DFA_RESTART
3746
3747       When pcre2_dfa_match() returns a partial match, it is possible to  call
3748       it again, with additional subject characters, and have it continue with
3749       the same match. The PCRE2_DFA_RESTART option requests this action; when
       it is set, the workspace and wscount arguments must reference the same
3751       vector as before because data about the match so far is  left  in  them
3752       after a partial match. There is more discussion of this facility in the
3753       pcre2partial documentation.
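
       The difference between PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT
       described above can be condensed into one decision function. The C
       sketch below models only the choice of return value, not the
       matching itself; the enum and function names are hypothetical,
       while -1 and -2 are the values of PCRE2_ERROR_NOMATCH and
       PCRE2_ERROR_PARTIAL.

```c
#include <assert.h>

enum { RC_NOMATCH = -1, RC_PARTIAL = -2 };
enum partial_mode { PARTIAL_NONE, PARTIAL_SOFT, PARTIAL_HARD };

/* `complete` is the number of complete matches found. `pending` is
   nonzero if, when the end of the subject was reached, at least one
   matching possibility still needed more characters. */
int dfa_result(int complete, int pending, enum partial_mode mode)
{
  assert(complete >= 0);

  /* Hard: a pending possibility wins even over complete matches. */
  if (mode == PARTIAL_HARD && pending) return RC_PARTIAL;

  if (complete > 0) return complete;

  /* Soft: only a result that would be "no match" is converted. */
  if (mode == PARTIAL_SOFT && pending) return RC_PARTIAL;

  return RC_NOMATCH;
}
```

       The two options therefore give different results only when there
       are complete matches and a pending possibility at the same time:
       the hard option reports a partial match, whereas the soft option
       reports the complete matches.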
3754
3755   Successful returns from pcre2_dfa_match()
3756
3757       When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3758       string in the subject. Note, however, that all the matches from one run
3759       of the function start at the same point in  the  subject.  The  shorter
3760       matches  are all initial substrings of the longer matches. For example,
3761       if the pattern
3762
3763         <.*>
3764
3765       is matched against the string
3766
3767         This is <something> <something else> <something further> no more
3768
3769       the three matched strings are
3770
3771         <something> <something else> <something further>
3772         <something> <something else>
3773         <something>
3774
3775       On success, the yield of the function is a number  greater  than  zero,
3776       which  is  the  number  of  matched substrings. The offsets of the sub-
3777       strings are returned in the ovector, and can be extracted by number  in
3778       the  same way as for pcre2_match(), but the numbers bear no relation to
3779       any capture groups that may exist in the pattern, because DFA  matching
3780       does not support capturing.
3781
3782       Calls  to the convenience functions that extract substrings by name re-
3783       turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af-
3784       ter  a  DFA match. The convenience functions that extract substrings by
3785       number never return PCRE2_ERROR_NOSUBSTRING.
3786
3787       The matched strings are stored in  the  ovector  in  reverse  order  of
3788       length;  that  is,  the longest matching string is first. If there were
3789       too many matches to fit into the ovector, the yield of the function  is
3790       zero, and the vector is filled with the longest matches.
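
       Reading the results out of the ovector can be sketched in C as
       follows. To keep the example free of library calls, the ovector is
       a hand-built array of offset pairs rather than one obtained from a
       match data block, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of ovector pairs that hold a match. `rc` is the value returned
   by the matching function and `capacity` is the number of pairs the
   ovector can hold. A return of zero means the ovector was filled. */
size_t usable_pairs(int rc, size_t capacity)
{
  if (rc > 0) return (size_t)rc;
  return (rc == 0) ? capacity : 0;
}

/* Copy match number `i` (0 is the longest) into `buf` as a C string.
   Returns its length, or (size_t)-1 if `buf` is too small. */
size_t copy_dfa_match(const char *subject, const size_t *ovector, int i,
                      char *buf, size_t bufsize)
{
  size_t start, end, len;
  assert(subject != NULL && ovector != NULL && buf != NULL && i >= 0);

  start = ovector[2 * i];
  end = ovector[2 * i + 1];
  assert(end >= start);
  len = end - start;

  if (len + 1 > bufsize) return (size_t)-1;
  memcpy(buf, subject + start, len);
  buf[len] = '\0';
  return len;
}
```

       For the <.*> example above, all three matches start at offset 8, so
       the ovector would hold the pairs (8,56), (8,36) and (8,19). Every
       pair shares its first value, and match number 2 is the shortest
       string, <something>.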
3791
3792       NOTE:  PCRE2's  "auto-possessification" optimization usually applies to
3793       character repeats at the end of a pattern (as well as internally).  For
3794       example,  the pattern "a\d+" is compiled as if it were "a\d++". For DFA
3795       matching, this means that only one possible match is found. If you  re-
3796       ally do want multiple matches in such cases, either use an ungreedy re-
3797       peat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when  com-
3798       piling.
3799
3800   Error returns from pcre2_dfa_match()
3801
3802       The pcre2_dfa_match() function returns a negative number when it fails.
3803       Many of the errors are the same  as  for  pcre2_match(),  as  described
3804       above.  There are in addition the following errors that are specific to
3805       pcre2_dfa_match():
3806
3807         PCRE2_ERROR_DFA_UITEM
3808
3809       This return is given if pcre2_dfa_match() encounters  an  item  in  the
3810       pattern  that it does not support, for instance, the use of \C in a UTF
3811       mode or a backreference.
3812
3813         PCRE2_ERROR_DFA_UCOND
3814
3815       This return is given if pcre2_dfa_match() encounters a  condition  item
3816       that uses a backreference for the condition, or a test for recursion in
3817       a specific capture group. These are not supported.
3818
3819         PCRE2_ERROR_DFA_UINVALID_UTF
3820
3821       This return is given if pcre2_dfa_match() is called for a pattern  that
3822       was  compiled  with  PCRE2_MATCH_INVALID_UTF. This is not supported for
3823       DFA matching.
3824
3825         PCRE2_ERROR_DFA_WSSIZE
3826
3827       This return is given if pcre2_dfa_match() runs  out  of  space  in  the
3828       workspace vector.
3829
3830         PCRE2_ERROR_DFA_RECURSE
3831
3832       When a recursion or subroutine call is processed, the matching function
3833       calls itself recursively, using private  memory  for  the  ovector  and
3834       workspace.   This  error  is given if the internal ovector is not large
3835       enough. This should be extremely rare, as a  vector  of  size  1000  is
3836       used.
3837
3838         PCRE2_ERROR_DFA_BADRESTART
3839
3840       When  pcre2_dfa_match()  is  called  with the PCRE2_DFA_RESTART option,
3841       some plausibility checks are made on the  contents  of  the  workspace,
3842       which  should  contain data about the previous partial match. If any of
3843       these checks fail, this error is given.
3844
3845
3846SEE ALSO
3847
3848       pcre2build(3),   pcre2callout(3),    pcre2demo(3),    pcre2matching(3),
3849       pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).
3850
3851
3852AUTHOR
3853
3854       Philip Hazel
3855       Retired from University Computing Service
3856       Cambridge, England.
3857
3858
3859REVISION
3860
3861       Last updated: 27 July 2022
3862       Copyright (c) 1997-2022 University of Cambridge.
3863------------------------------------------------------------------------------
3864
3865
3866PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)
3867
3868
3869
3870NAME
3871       PCRE2 - Perl-compatible regular expressions (revised API)
3872
3873BUILDING PCRE2
3874
3875       PCRE2  is distributed with a configure script that can be used to build
3876       the library in Unix-like environments using the applications  known  as
3877       Autotools. Also in the distribution are files to support building using
3878       CMake instead of configure. The text file README contains  general  in-
3879       formation  about building with Autotools (some of which is repeated be-
3880       low), and also has some comments about building  on  various  operating
3881       systems.  There  is a lot more information about building PCRE2 without
3882       using Autotools (including information about using CMake  and  building
3883       "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
3884       consult this file as well as the README file if you are building  in  a
3885       non-Unix-like environment.
3886
3887
3888PCRE2 BUILD-TIME OPTIONS
3889
3890       The rest of this document describes the optional features of PCRE2 that
3891       can be selected when the library is compiled. It  assumes  use  of  the
3892       configure  script,  where  the  optional features are selected or dese-
3893       lected by providing options to configure before running the  make  com-
3894       mand.  However,  the same options can be selected in both Unix-like and
3895       non-Unix-like environments if you are using CMake instead of  configure
3896       to build PCRE2.
3897
3898       If  you  are not using Autotools or CMake, option selection can be done
3899       by editing the config.h file, or by passing parameter settings  to  the
3900       compiler, as described in NON-AUTOTOOLS-BUILD.
3901
3902       The complete list of options for configure (which includes the standard
3903       ones such as the selection of the installation directory)  can  be  ob-
3904       tained by running
3905
3906         ./configure --help
3907
3908       The  following  sections include descriptions of "on/off" options whose
3909       names begin with --enable or --disable. Because of the way that config-
3910       ure  works, --enable and --disable always come in pairs, so the comple-
3911       mentary option always exists as well, but as it specifies the  default,
3912       it is not described.  Options that specify values have names that start
3913       with --with. At the end of a configure run, a summary of the configura-
3914       tion is output.
3915
3916
3917BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
3918
3919       By  default, a library called libpcre2-8 is built, containing functions
3920       that take string arguments contained in arrays  of  bytes,  interpreted
3921       either  as single-byte characters, or UTF-8 strings. You can also build
3922       two other libraries, called libpcre2-16 and libpcre2-32, which  process
3923       strings  that  are contained in arrays of 16-bit and 32-bit code units,
3924       respectively. These can be interpreted either as single-unit characters
3925       or  UTF-16/UTF-32 strings. To build these additional libraries, add one
3926       or both of the following to the configure command:
3927
3928         --enable-pcre2-16
3929         --enable-pcre2-32
3930
3931       If you do not want the 8-bit library, add
3932
3933         --disable-pcre2-8
3934
3935       as well. At least one of the three libraries must be built.  Note  that
3936       the  POSIX wrapper is for the 8-bit library only, and that pcre2grep is
       an 8-bit program. Neither of these is built if you select only the
3938       16-bit or 32-bit libraries.
3939
3940
3941BUILDING SHARED AND STATIC LIBRARIES
3942
3943       The  Autotools PCRE2 building process uses libtool to build both shared
3944       and static libraries by default. You can suppress an  unwanted  library
3945       by adding one of
3946
3947         --disable-shared
3948         --disable-static
3949
3950       to the configure command.
3951
3952
3953UNICODE AND UTF SUPPORT
3954
3955       By  default,  PCRE2 is built with support for Unicode and UTF character
3956       strings.  To build it without Unicode support, add
3957
3958         --disable-unicode
3959
3960       to the configure command. This setting applies to all three  libraries.
3961       It  is  not  possible to build one library with Unicode support and an-
3962       other without in the same configuration.
3963
3964       Of itself, Unicode support does not make PCRE2 treat strings as  UTF-8,
3965       UTF-16 or UTF-32. To do that, applications that use the library can set
3966       the PCRE2_UTF option when they call pcre2_compile() to compile  a  pat-
3967       tern.   Alternatively,  patterns  may be started with (*UTF) unless the
3968       application has locked this out by setting PCRE2_NEVER_UTF.
3969
3970       UTF support allows the libraries to process character code points up to
3971       0x10ffff  in  the  strings that they handle. Unicode support also gives
3972       access to the Unicode properties of characters, using  pattern  escapes
3973       such as \P, \p, and \X. Only the general category properties such as Lu
3974       and Nd, script names, and some bi-directional properties are supported.
3975       Details are given in the pcre2pattern documentation.
3976
3977       Pattern escapes such as \d and \w do not by default make use of Unicode
3978       properties. The application can request that they  do  by  setting  the
3979       PCRE2_UCP  option.  Unless  the  application has set PCRE2_NEVER_UCP, a
3980       pattern may also request this by starting with (*UCP).
3981
3982
3983DISABLING THE USE OF \C
3984
3985       The \C escape sequence, which matches a single code unit, even in a UTF
3986       mode,  can  cause unpredictable behaviour because it may leave the cur-
3987       rent matching point in the middle of a multi-code-unit  character.  The
3988       application  can lock it out by setting the PCRE2_NEVER_BACKSLASH_C op-
3989       tion when calling pcre2_compile(). There is also a build-time option
3990
3991         --enable-never-backslash-C
3992
3993       (note the upper case C) which locks out the use of \C entirely.
3994
3995
3996JUST-IN-TIME COMPILER SUPPORT
3997
3998       Just-in-time (JIT) compiler support is included in the build by  speci-
3999       fying
4000
4001         --enable-jit
4002
4003       This  support  is available only for certain hardware architectures. If
       this option is set for an unsupported architecture, a build error
4005       occurs.  If in doubt, use
4006
4007         --enable-jit=auto
4008
4009       which  enables  JIT  only if the current hardware is supported. You can
4010       check if JIT is enabled in the configuration summary that is output  at
4011       the  end  of a configure run. If you are enabling JIT under SELinux you
4012       may also want to add
4013
4014         --enable-jit-sealloc
4015
4016       which enables the use of an execmem allocator in JIT that is compatible
4017       with  SELinux.  This  has  no  effect  if  JIT  is not enabled. See the
4018       pcre2jit documentation for a discussion of JIT usage. When JIT  support
4019       is enabled, pcre2grep automatically makes use of it, unless you add
4020
4021         --disable-pcre2grep-jit
4022
4023       to the configure command.
4024
4025
4026NEWLINE RECOGNITION
4027
4028       By  default, PCRE2 interprets the linefeed (LF) character as indicating
4029       the end of a line. This is the normal newline  character  on  Unix-like
4030       systems.  You can compile PCRE2 to use carriage return (CR) instead, by
4031       adding
4032
4033         --enable-newline-is-cr
4034
4035       to the configure command. There is also an  --enable-newline-is-lf  op-
4036       tion, which explicitly specifies linefeed as the newline character.
4037
4038       Alternatively, you can specify that line endings are to be indicated by
4039       the two-character sequence CRLF (CR immediately followed by LF). If you
4040       want this, add
4041
4042         --enable-newline-is-crlf
4043
4044       to the configure command. There is a fourth option, specified by
4045
4046         --enable-newline-is-anycrlf
4047
4048       which  causes  PCRE2 to recognize any of the three sequences CR, LF, or
4049       CRLF as indicating a line ending. A fifth option, specified by
4050
4051         --enable-newline-is-any
4052
4053       causes PCRE2 to recognize any Unicode  newline  sequence.  The  Unicode
4054       newline sequences are the three just mentioned, plus the single charac-
4055       ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
4056       U+0085),  LS  (line  separator,  U+2028),  and PS (paragraph separator,
4057       U+2029). The final option is
4058
4059         --enable-newline-is-nul
4060
4061       which causes NUL (binary zero) to be set  as  the  default  line-ending
4062       character.
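
       The recognition rules for the "anycrlf" and "any" conventions can be
       sketched in a few lines of C. This is an illustration only, not PCRE2
       code: the function name newline_length() and its any_unicode flag are
       made up for the example, and the subject is assumed to be an array of
       code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length, in code points, of the newline sequence starting at s[pos], or 0
   if none starts there. With any_unicode == 0 only CR, LF, and CRLF are
   recognized ("anycrlf"); otherwise VT, FF, NEL, LS, and PS count as well
   ("any"). Hypothetical helper, not part of the PCRE2 API. */
size_t newline_length(const uint32_t *s, size_t len, size_t pos,
                      int any_unicode)
{
    assert(s != NULL);
    if (pos >= len) return 0;
    switch (s[pos]) {
    case 0x000D:  /* CR, possibly followed by LF to form CRLF */
        return (pos + 1 < len && s[pos + 1] == 0x000A) ? 2 : 1;
    case 0x000A:  /* LF */
        return 1;
    case 0x000B:  /* VT */
    case 0x000C:  /* FF */
    case 0x0085:  /* NEL */
    case 0x2028:  /* LS */
    case 0x2029:  /* PS */
        return any_unicode ? 1 : 0;
    default:
        return 0;
    }
}
```

       Treating CRLF as a single two-unit sequence, rather than as CR followed
       by LF, is what distinguishes these conventions from the CR-only and
       LF-only settings.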
4063
4064       Whatever default line ending convention is selected when PCRE2 is built
4065       can be overridden by applications that use the library. At  build  time
4066       it is recommended to use the standard for your operating system.
4067
4068
4069WHAT \R MATCHES
4070
4071       By  default,  the  sequence \R in a pattern matches any Unicode newline
4072       sequence, independently of what has been selected as  the  line  ending
4073       sequence. If you specify
4074
4075         --enable-bsr-anycrlf
4076
4077       the  default  is changed so that \R matches only CR, LF, or CRLF. What-
4078       ever is selected when PCRE2 is built can be overridden by  applications
4079       that use the library.
4080
4081
4082HANDLING VERY LARGE PATTERNS
4083
4084       Within  a  compiled  pattern,  offset values are used to point from one
4085       part to another (for example, from an opening parenthesis to an  alter-
4086       nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
4087       two-byte values are used for these offsets, leading to a  maximum  size
4088       for a compiled pattern of around 64 thousand code units. This is suffi-
4089       cient to handle all but the most gigantic patterns. Nevertheless,  some
4090       people do want to process truly enormous patterns, so it is possible to
4091       compile PCRE2 to use three-byte or four-byte offsets by adding  a  set-
4092       ting such as
4093
4094         --with-link-size=3
4095
4096       to  the  configure command. The value given must be 2, 3, or 4. For the
4097       16-bit library, a value of 3 is rounded up to 4.  In  these  libraries,
4098       using  longer  offsets slows down the operation of PCRE2 because it has
4099       to load additional data when handling them. For the 32-bit library  the
4100       value  is  always 4 and cannot be overridden; the value of --with-link-
4101       size is ignored.
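
       The rounding rules above can be summarized as a small function. This is
       a sketch based only on the description in this section; the function
       names are invented for the example, and offset_values() simply counts
       the values an n-byte offset can hold, which is not necessarily the
       exact pattern size ceiling that the library enforces.

```c
#include <assert.h>

/* Link size actually used for a library with the given code unit width
   (8, 16, or 32 bits) when --with-link-size=requested is given.
   Hypothetical helper, not part of PCRE2. */
int effective_link_size(int code_unit_bits, int requested)
{
    assert(requested >= 2 && requested <= 4);
    if (code_unit_bits == 32) return 4;                   /* setting ignored */
    if (code_unit_bits == 16 && requested == 3) return 4; /* rounded up */
    return requested;
}

/* Number of distinct values that an offset of link_size bytes can hold. */
unsigned long long offset_values(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return 1ULL << (8 * link_size);
}
```

       With the default link size of 2, offset_values() gives 65536, which is
       the "around 64 thousand code units" figure mentioned above.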
4102
4103
4104LIMITING PCRE2 RESOURCE USAGE
4105
4106       The pcre2_match() function increments a counter each time it goes round
4107       its  main  loop. Putting a limit on this counter controls the amount of
4108       computing resource used by a single call to  pcre2_match().  The  limit
4109       can be changed at run time, as described in the pcre2api documentation.
4110       The default is 10 million, but this can be changed by adding a  setting
4111       such as
4112
4113         --with-match-limit=500000
4114
4115       to   the   configure   command.   This  setting  also  applies  to  the
4116       pcre2_dfa_match() matching function, and to JIT  matching  (though  the
4117       counting is done differently).
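
       The idea of a match limit can be illustrated with a toy backtracking
       matcher that counts its steps and gives up when a budget is exhausted.
       The sketch below is not PCRE2's matcher: it supports only literals,
       ".", and "*", it is anchored at the start of the subject, and all of
       the toy_ names and result codes are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Result codes for this sketch only (not PCRE2 error codes). */
enum { TOY_LIMIT = -1, TOY_NOMATCH = 0, TOY_MATCH = 1 };

typedef struct {
    unsigned long steps;  /* steps taken so far */
    unsigned long limit;  /* maximum number of steps allowed */
} toy_budget;

static int toy_here(const char *re, const char *s, toy_budget *b);

/* Match c* followed by re at s, trying the longest run of c first and
   backtracking one character at a time. */
static int toy_star(int c, const char *re, const char *s, toy_budget *b)
{
    const char *t = s;
    while (*t != '\0' && (*t == c || c == '.')) t++;
    for (;;) {
        int rc = toy_here(re, t, b);
        if (rc != TOY_NOMATCH) return rc;  /* a match, or the limit was hit */
        if (t == s) return TOY_NOMATCH;
        t--;
    }
}

/* Match re at s. Every call counts as one step against the budget. */
static int toy_here(const char *re, const char *s, toy_budget *b)
{
    if (++b->steps > b->limit) return TOY_LIMIT;
    if (re[0] == '\0') return TOY_MATCH;
    if (re[1] == '*') return toy_star(re[0], re + 2, s, b);
    if (*s != '\0' && (re[0] == '.' || re[0] == *s))
        return toy_here(re + 1, s + 1, b);
    return TOY_NOMATCH;
}

/* Anchored match of re against the start of s with a step limit. */
int toy_match(const char *re, const char *s, unsigned long limit)
{
    toy_budget b = { 0, limit };
    assert(re != NULL && s != NULL);
    return toy_here(re, s, &b);
}
```

       A pattern such as a*a*a*b applied to a run of "a" characters forces a
       lot of backtracking; with a small limit the sketch returns TOY_LIMIT
       instead of grinding through every possibility, which is the same
       trade-off that the match limit offers.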
4118
4119       The  pcre2_match()  function  uses  heap  memory to record backtracking
4120       points. The more nested backtracking points there  are  (that  is,  the
4121       deeper  the  search tree), the more memory is needed. There is an upper
4122       limit, specified in kibibytes (units of 1024 bytes). This limit can  be
4123       changed  at  run  time, as described in the pcre2api documentation. The
4124       default limit (in effect unlimited) is 20 million. You can change  this
4125       by a setting such as
4126
4127         --with-heap-limit=500
4128
4129       which  limits the amount of heap to 500 KiB. This limit applies only to
4130       interpretive matching in pcre2_match() and pcre2_dfa_match(), which may
4131       also  use  the  heap for internal workspace when processing complicated
4132       patterns. This limit does not apply when JIT (which has its own  memory
4133       arrangements) is used.
4134
4135       You  can  also explicitly limit the depth of nested backtracking in the
4136       pcre2_match() interpreter. This limit defaults to the value that is set
4137       for  --with-match-limit.  You  can set a lower default limit by adding,
4138       for example,
4139
4140         --with-match-limit-depth=10000
4141
4142       to the configure command. This value can be  overridden  at  run  time.
4143       This  depth  limit  indirectly limits the amount of heap memory that is
4144       used, but because the size of each backtracking "frame" depends on  the
4145       number  of  capturing parentheses in a pattern, the amount of heap that
4146       is used before the limit is reached varies  from  pattern  to  pattern.
4147       This limit was more useful in versions before 10.30, where function re-
4148       cursion was used for backtracking.
4149
4150       As well as applying to pcre2_match(), the depth limit also controls the
4151       depth  of recursive function calls in pcre2_dfa_match(). These are used
4152       for lookaround assertions, atomic groups,  and  recursion  within  pat-
4153       terns.  The limit does not apply to JIT matching.
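
       The relationship between the heap limit, the frame size, and the depth
       limit can be made concrete with some arithmetic. The following is a
       rough sketch: the function names are invented, and frame_bytes is a
       made-up figure standing in for the real per-pattern frame size, which
       this section says varies with the number of capturing parentheses.

```c
#include <assert.h>

/* Rough upper bound on how many backtracking frames of frame_bytes each fit
   under a heap limit given in kibibytes (units of 1024 bytes). */
unsigned long long frames_under_heap_limit(unsigned long long heap_limit_kib,
                                           unsigned long long frame_bytes)
{
    assert(frame_bytes > 0);
    return (heap_limit_kib * 1024ULL) / frame_bytes;
}

/* Default depth limit in force at build time: the explicit setting if one
   was given, otherwise the match limit. */
unsigned long effective_depth_limit(unsigned long match_limit,
                                    unsigned long depth_limit,
                                    int depth_is_set)
{
    return depth_is_set ? depth_limit : match_limit;
}
```

       For example, under --with-heap-limit=500 a pattern whose frames were
       128 bytes each could nest to roughly 4000 frames before the heap limit
       was reached, whatever the depth limit says.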
4154
4155
4156CREATING CHARACTER TABLES AT BUILD TIME
4157
4158       PCRE2 uses fixed tables for processing characters whose code points are
4159       less than 256. By default, PCRE2 is built with a set of tables that are
4160       distributed  in  the file src/pcre2_chartables.c.dist. These tables are
4161       for ASCII codes only. If you add
4162
4163         --enable-rebuild-chartables
4164
4165       to the configure command, the distributed tables are  no  longer  used.
4166       Instead, a program called pcre2_dftables is compiled and run. It
4167       outputs the source for a new set of tables, created in the default
4168       locale of your C run-time system. This method of replacing the tables
4169       does not work if you are cross compiling, because pcre2_dftables must
4170       run on the local host and so cannot be built with the cross compiler.
4171
4172       If you need to create alternative tables when cross compiling, you will
4173       have to do so "by hand". There may also be other reasons  for  creating
4174       tables  manually.   To  cause  pcre2_dftables  to be built on the local
4175       host, run a normal compiling command, and then run the program with the
4176       output file as its argument, for example:
4177
4178         cc src/pcre2_dftables.c -o pcre2_dftables
4179         ./pcre2_dftables src/pcre2_chartables.c
4180
4181       This  builds the tables in the default locale of the local host. If you
4182       want to specify a locale, you must use the -L option:
4183
4184         LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
4185
4186       You can also specify -b (with or without -L). This causes the tables to
4187       be  written in binary instead of as source code. A set of binary tables
4188       can be loaded into memory by an application and  passed  to  pcre2_com-
4189       pile() in the same way as tables created by calling pcre2_maketables().
4190       The tables are just a string of bytes, independent of hardware  charac-
4191       teristics  such  as  endianness. This means they can be bundled with an
4192       application that runs in different environments, to  ensure  consistent
4193       behaviour.
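
       Because binary tables are just a string of bytes, loading them needs
       nothing more than ordinary file input. The sketch below reads a whole
       stream into memory; the function name read_table_bytes() is made up,
       and no assumption is made here about the length of the table data.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read everything from an open binary stream into a malloc'd buffer.
   Returns NULL on failure. On success *len holds the number of bytes read
   and the caller must free() the buffer. Hypothetical helper. */
unsigned char *read_table_bytes(FILE *f, size_t *len)
{
    size_t cap = 2048, used = 0;
    unsigned char *buf, *tmp;

    assert(f != NULL && len != NULL);
    buf = malloc(cap);
    if (buf == NULL) return NULL;
    for (;;) {
        used += fread(buf + used, 1, cap - used, f);
        if (used < cap) break;              /* short read: EOF or error */
        tmp = realloc(buf, cap * 2);        /* buffer full: double it */
        if (tmp == NULL) { free(buf); return NULL; }
        buf = tmp;
        cap *= 2;
    }
    if (ferror(f)) { free(buf); return NULL; }
    *len = used;
    return buf;
}
```

       The resulting buffer would then be handed to the library in the same
       way as tables from pcre2_maketables(), for example through
       pcre2_set_character_tables() on a compile context, and must stay
       allocated for as long as patterns compiled with it are in use.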
4194
4195
4196USING EBCDIC CODE
4197
4198       PCRE2  assumes  by default that it will run in an environment where the
4199       character code is ASCII or Unicode, which is a superset of ASCII.  This
4200       is the case for most computer operating systems. PCRE2 can, however, be
4201       compiled to run in an 8-bit EBCDIC environment by adding
4202
4203         --enable-ebcdic --disable-unicode
4204
4205       to the configure command. This setting implies --enable-rebuild-charta-
4206       bles.  You should only use it if you know that you are in an EBCDIC en-
4207       vironment (for example, an IBM mainframe operating system).
4208
4209       It is not possible to support both EBCDIC and UTF-8 codes in  the  same
4210       version  of  the  library. Consequently, --enable-unicode and --enable-
4211       ebcdic are mutually exclusive.
4212
4213       The EBCDIC character that corresponds to an ASCII LF is assumed to have
4214       the  value  0x15 by default. However, in some EBCDIC environments, 0x25
4215       is used. In such an environment you should use
4216
4217         --enable-ebcdic-nl25
4218
4219       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
4220       has  the  same  value  as in ASCII, namely, 0x0d. Whichever of 0x15 and
4221       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
4222       acter (which, in Unicode, is 0x85).
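
       The assignment of values described above can be written out as follows.
       This is a sketch for illustration; the structure and function names are
       invented and do not appear in the PCRE2 source.

```c
#include <assert.h>

/* EBCDIC code values used for the three newline-related characters. */
typedef struct {
    unsigned char cr;
    unsigned char lf;
    unsigned char nel;
} ebcdic_newlines;

/* nl25 is nonzero when the build uses --enable-ebcdic-nl25. */
ebcdic_newlines ebcdic_newline_values(int nl25)
{
    ebcdic_newlines v;
    assert(nl25 == 0 || nl25 == 1);
    v.cr  = 0x0d;                 /* same value as in ASCII */
    v.lf  = nl25 ? 0x25 : 0x15;   /* the value chosen as LF */
    v.nel = nl25 ? 0x15 : 0x25;   /* the other one corresponds to NEL */
    return v;
}
```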
4223
4224       The options that select newline behaviour, such as --enable-newline-is-
4225       cr, and equivalent run-time options, refer to these character values in
4226       an EBCDIC environment.
4227
4228
4229PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS
4230
4231       By default pcre2grep supports the use of callouts with string arguments
4232       within the patterns it is matching. There are two kinds: one that  gen-
4233       erates output using local code, and another that calls an external pro-
4234       gram or script.  If --disable-pcre2grep-callout-fork is  added  to  the
4235       configure  command,  only  the  first  kind of callout is supported; if
4236       --disable-pcre2grep-callout is used, all callouts  are  completely  ig-
4237       nored.  For more details of pcre2grep callouts, see the pcre2grep docu-
4238       mentation.
4239
4240
4241PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
4242
4243       By default, pcre2grep reads all files as plain text. You can  build  it
4244       so  that  it recognizes files whose names end in .gz or .bz2, and reads
4245       them with libz or libbz2, respectively, by adding one or both of
4246
4247         --enable-pcre2grep-libz
4248         --enable-pcre2grep-libbz2
4249
4250       to the configure command. These options naturally require that the rel-
4251       evant  libraries  are installed on your system. Configuration will fail
4252       if they are not.
4253
4254
4255PCRE2GREP BUFFER SIZE
4256
4257       pcre2grep uses an internal buffer to hold a "window" on the file it  is
4258       scanning, in order to be able to output "before" and "after" lines when
4259       it finds a match. The default starting size of the buffer is 20KiB. The
4260       buffer  itself  is  three times this size, but because of the way it is
4261       used for holding "before" lines, the longest line that is guaranteed to
4262       be processable is the notional buffer size. If a longer line is encoun-
4263       tered, pcre2grep automatically expands the buffer, up  to  a  specified
4264       maximum  size, whose default is 1MiB or the starting size, whichever is
4265       the larger. You can change the default parameter values by adding,  for
4266       example,
4267
4268         --with-pcre2grep-bufsize=51200
4269         --with-pcre2grep-max-bufsize=2097152
4270
4271       to  the  configure  command. The caller of pcre2grep can override these
4272       values by using --buffer-size  and  --max-buffer-size  on  the  command
4273       line.
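
       The sizes involved can be worked out directly from the description
       above. The following sketch uses invented names and covers only the
       arithmetic stated in this section; how pcre2grep grows the buffer
       between the starting size and the maximum is not described here and is
       not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define GREP_DEFAULT_BUFSIZE ((size_t)20 * 1024)    /* 20KiB starting size */
#define GREP_DEFAULT_MAX     ((size_t)1024 * 1024)  /* 1MiB default maximum */

/* Memory actually allocated for a given notional buffer size. */
size_t grep_allocation(size_t bufsize)
{
    return 3 * bufsize;
}

/* Default maximum: 1MiB or the starting size, whichever is the larger. */
size_t grep_default_max(size_t bufsize)
{
    assert(bufsize > 0);
    return bufsize > GREP_DEFAULT_MAX ? bufsize : GREP_DEFAULT_MAX;
}
```

       With the default settings the allocation is 61440 bytes, while the
       longest line that is guaranteed to be processable without expansion is
       still 20480 bytes.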
4274
4275
4276PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
4277
4278       If you add one of
4279
4280         --enable-pcre2test-libreadline
4281         --enable-pcre2test-libedit
4282
4283       to the configure command, pcre2test is linked with the libreadline or
4284       libedit library, respectively, and when its input is from a terminal,
4285       it  reads  it using the readline() function. This provides line-editing
4286       and history facilities. Note that libreadline is  GPL-licensed,  so  if
4287       you  distribute  a binary of pcre2test linked in this way, there may be
4288       licensing issues. These can be avoided by linking instead with libedit,
4289       which has a BSD licence.
4290
4291       Setting  --enable-pcre2test-libreadline causes the -lreadline option to
4292       be added to the pcre2test build. In many operating environments with  a
4293       system-installed readline library this is sufficient. However, in some
4294       environments (e.g. if an unmodified distribution version of readline is
4295       in  use),  some  extra configuration may be necessary. The INSTALL file
4296       for libreadline says this:
4297
4298         "Readline uses the termcap functions, but does not link with
4299         the termcap or curses library itself, allowing applications
4300         which link with readline the to choose an appropriate library."
4301
4302       If your environment has not been set up so that an appropriate  library
4303       is automatically included, you may need to add something like
4304
4305         LIBS="-lncurses"
4306
4307       immediately before the configure command.
4308
4309
4310INCLUDING DEBUGGING CODE
4311
4312       If you add
4313
4314         --enable-debug
4315
4316       to  the configure command, additional debugging code is included in the
4317       build. This feature is intended for use by the PCRE2 maintainers.
4318
4319
4320DEBUGGING WITH VALGRIND SUPPORT
4321
4322       If you add
4323
4324         --enable-valgrind
4325
4326       to the configure command, PCRE2 will use valgrind annotations  to  mark
4327       certain  memory  regions as unaddressable. This allows it to detect in-
4328       valid memory accesses, and is mostly useful for debugging PCRE2 itself.
4329
4330
4331CODE COVERAGE REPORTING
4332
4333       If your C compiler is gcc, you can build a version of  PCRE2  that  can
4334       generate a code coverage report for its test suite. To enable this, you
4335       must install lcov version 1.6 or above. Then specify
4336
4337         --enable-coverage
4338
4339       to the configure command and build PCRE2 in the usual way.
4340
4341       Note that using ccache (a caching C compiler) is incompatible with code
4342       coverage  reporting. If you have configured ccache to run automatically
4343       on your system, you must set the environment variable
4344
4345         CCACHE_DISABLE=1
4346
4347       before running make to build PCRE2, so that ccache is not used.
4348
4349       When --enable-coverage is used, the following additional targets are
4350       added to the Makefile:
4351
4352         make coverage
4353
4354       This  creates  a  fresh coverage report for the PCRE2 test suite. It is
4355       equivalent to running "make coverage-reset", "make  coverage-baseline",
4356       "make check", and then "make coverage-report".
4357
4358         make coverage-reset
4359
4360       This zeroes the coverage counters, but does nothing else.
4361
4362         make coverage-baseline
4363
4364       This captures baseline coverage information.
4365
4366         make coverage-report
4367
4368       This creates the coverage report.
4369
4370         make coverage-clean-report
4371
4372       This  removes the generated coverage report without cleaning the cover-
4373       age data itself.
4374
4375         make coverage-clean-data
4376
4377       This removes the captured coverage data without removing  the  coverage
4378       files created at compile time (*.gcno).
4379
4380         make coverage-clean
4381
4382       This  cleans all coverage data including the generated coverage report.
4383       For more information about code coverage, see the gcov and  lcov  docu-
4384       mentation.
4385
4386
4387DISABLING THE Z AND T FORMATTING MODIFIERS
4388
4389       The  C99  standard  defines formatting modifiers z and t for size_t and
4390       ptrdiff_t values, respectively. By default, PCRE2 uses these  modifiers
4391       in environments other than old versions of Microsoft Visual Studio when
4392       __STDC_VERSION__ is defined and has a value greater than  or  equal  to
4393       199901L  (indicating  support for C99).  However, there is at least one
4394       environment that claims to be C99 but does not support these modifiers.
4395       If
4396
4397         --disable-percent-zt
4398
4399       is specified, no use is made of the z or t modifiers. Instead of %td or
4400       %zu, a suitable format is used depending on the size of long for the
4401       platform.
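
       The difference between the two approaches looks roughly like this. The
       sketch is an assumption about the general technique, not a copy of what
       PCRE2 does: the function names are invented, and the fallback simply
       casts to long types, which is adequate only where long is at least as
       wide as size_t and ptrdiff_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format a size_t either with the C99 z modifier or with a fallback that
   casts to unsigned long. Returns what snprintf() returns. */
int format_size(char *buf, size_t cap, size_t value, int use_z)
{
    assert(buf != NULL && cap > 0);
    if (use_z)
        return snprintf(buf, cap, "%zu", value);
    return snprintf(buf, cap, "%lu", (unsigned long)value);
}

/* The same for ptrdiff_t, with the t modifier or a cast to long. */
int format_ptrdiff(char *buf, size_t cap, ptrdiff_t value, int use_t)
{
    assert(buf != NULL && cap > 0);
    if (use_t)
        return snprintf(buf, cap, "%td", value);
    return snprintf(buf, cap, "%ld", (long)value);
}
```

       Both paths produce the same text for values that fit in a long, so the
       choice affects only portability, not output.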
4402
4403
4404SUPPORT FOR FUZZERS
4405
4406       There  is  a  special  option for use by people who want to run fuzzing
4407       tests on PCRE2:
4408
4409         --enable-fuzz-support
4410
4411       At present this applies only to the 8-bit library. If set, it causes an
4412       extra  library  called  libpcre2-fuzzsupport.a to be built, but not in-
4413       stalled. This contains a single  function  called  LLVMFuzzerTestOneIn-
4414       put()  whose  arguments are a pointer to a string and the length of the
4415       string. When called, this function tries to compile  the  string  as  a
4416       pattern,  and if that succeeds, to match it.  This is done both with no
4417       options and with some random option bits that are generated from the
4418       string.
4419
4420       Setting  --enable-fuzz-support  also  causes  a binary called pcre2fuz-
4421       zcheck to be created. This is normally run under valgrind or used  when
4422       PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing
4423       function and outputs information about what  it  is  doing.  The  input
4424       strings  are specified by arguments: if an argument starts with "=" the
4425       rest of it is a literal input string. Otherwise, it is assumed to be  a
4426       file name, and the contents of the file are the test string.
4427
4428
4429OBSOLETE OPTION
4430
4431       In  versions  of  PCRE2 prior to 10.30, there were two ways of handling
4432       backtracking in the pcre2_match() function. The default was to use  the
4433       system stack, but if
4434
4435         --disable-stack-for-recursion
4436
4437       was  set,  memory on the heap was used. From release 10.30 onwards this
4438       has changed (the stack is no longer used)  and  this  option  now  does
4439       nothing except give a warning.
4440
4441
4442SEE ALSO
4443
4444       pcre2api(3), pcre2-config(3).
4445
4446
4447AUTHOR
4448
4449       Philip Hazel
4450       Retired from University Computing Service
4451       Cambridge, England.
4452
4453
4454REVISION
4455
4456       Last updated: 27 July 2022
4457       Copyright (c) 1997-2022 University of Cambridge.
4458------------------------------------------------------------------------------
4459
4460
4461PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)
4462
4463
4464
4465NAME
4466       PCRE2 - Perl-compatible regular expressions (revised API)
4467
4468SYNOPSIS
4469
4470       #include <pcre2.h>
4471
4472       int (*pcre2_callout)(pcre2_callout_block *, void *);
4473
4474       int pcre2_callout_enumerate(const pcre2_code *code,
4475         int (*callback)(pcre2_callout_enumerate_block *, void *),
4476         void *user_data);
4477
4478
4479DESCRIPTION
4480
4481       PCRE2  provides  a feature called "callout", which is a means of tempo-
4482       rarily passing control to the caller of PCRE2 in the middle of  pattern
4483       matching.  The caller of PCRE2 provides an external function by putting
4484       its entry point in a match  context  (see  pcre2_set_callout()  in  the
4485       pcre2api documentation).
4486
4487       When  using the pcre2_substitute() function, an additional callout fea-
4488       ture is available. This does a callout after each change to the subject
4489       string and is described in the pcre2api documentation; the rest of this
4490       document is concerned with callouts during pattern matching.
4491
4492       Within a regular expression, (?C<arg>) indicates a point at  which  the
4493       external  function  is  to  be  called. Different callout points can be
4494       identified by putting a number less than 256 after the  letter  C.  The
4495       default  value is zero.  Alternatively, the argument may be a delimited
4496       string. The starting delimiter must be one of ` ' " ^ % # $ {  and  the
4497       ending delimiter is the same as the start, except for {, where the end-
4498       ing delimiter is }. If  the  ending  delimiter  is  needed  within  the
4499       string,  it  must be doubled. For example, this pattern has two callout
4500       points:
4501
4502         (?C1)abc(?C"some ""arbitrary"" text")def
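
       The delimiter rules can be sketched as a small decoder. This is an
       illustration of the syntax described above, not PCRE2's own parser; the
       function name decode_callout_string() is made up, and the input is
       assumed to be a zero-terminated string of 8-bit characters that starts
       at the opening delimiter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode a delimited callout string argument. src points at the starting
   delimiter. On success the decoded text is written to dst as a
   zero-terminated string and the number of source characters consumed,
   including both delimiters, is returned. Returns 0 if the delimiter is not
   one of the permitted characters, the ending delimiter is missing, or dst
   is too small. */
size_t decode_callout_string(const char *src, char *dst, size_t cap)
{
    char close;
    size_t i = 1, n = 0;

    assert(src != NULL && dst != NULL);
    if (src[0] == '\0' || strchr("`'\"^%#${", src[0]) == NULL) return 0;
    close = (src[0] == '{') ? '}' : src[0];

    while (src[i] != '\0') {
        if (src[i] == close) {
            if (src[i + 1] != close) {   /* single delimiter: end of string */
                if (n >= cap) return 0;
                dst[n] = '\0';
                return i + 1;
            }
            i++;                         /* doubled delimiter: keep one */
        }
        if (n + 1 >= cap) return 0;      /* leave room for the terminator */
        dst[n++] = src[i++];
    }
    return 0;                            /* ran off the end of the pattern */
}
```

       Applied to the text that follows (?C in the second callout of the
       example pattern, the decoder yields: some "arbitrary" text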
4503
4504       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
4505       PCRE2  automatically inserts callouts, all with number 255, before each
4506       item in the pattern except for immediately before or after an  explicit
4507       callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
4508
4509         A(?C3)B
4510
4511       it is processed as if it were
4512
4513         (?C255)A(?C3)B(?C255)
4514
4515       Here is a more complicated example:
4516
4517         A(\d{2}|--)
4518
4519       With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
4520
4521         (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4522
4523       Notice  that  there  is a callout before and after each parenthesis and
4524       alternation bar. If the pattern contains a conditional group whose con-
4525       dition  is  an  assertion, an automatic callout is inserted immediately
4526       before the condition. Such a callout may also be  inserted  explicitly,
4527       for example:
4528
4529         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)
4530
4531       This  applies only to assertion conditions (because they are themselves
4532       independent groups).
4533
4534       Callouts can be useful for tracking the progress of  pattern  matching.
4535       The pcre2test program has a pattern qualifier (/auto_callout) that sets
4536       automatic callouts.  When any callouts are  present,  the  output  from
4537       pcre2test  indicates  how  the pattern is being matched. This is useful
4538       information when you are trying to optimize the performance of  a  par-
4539       ticular pattern.
4540
4541
4542MISSING CALLOUTS
4543
4544       You  should  be  aware  that, because of optimizations in the way PCRE2
4545       compiles and matches patterns, callouts sometimes do not happen exactly
4546       as you might expect.
4547
4548   Auto-possessification
4549
4550       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4551       that what follows cannot be part of the repeat. For example, a+[bc]  is
4552       compiled  as if it were a++[bc]. The pcre2test output when this pattern
4553       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
4554       to the string "aaaa" is:
4555
4556         --->aaaa
4557          +0 ^        a+
4558          +2 ^   ^    [bc]
4559         No match
4560
4561       This  indicates that when matching [bc] fails, there is no backtracking
4562       into a+ (because it is being treated as a++) and therefore the callouts
4563       that  would  be  taken for the backtracks do not occur. You can disable
4564       the  auto-possessify  feature  by  passing   PCRE2_NO_AUTO_POSSESS   to
4565       pcre2_compile(),  or  starting  the pattern with (*NO_AUTO_POSSESS). In
4566       this case, the output changes to this:
4567
4568         --->aaaa
4569          +0 ^        a+
4570          +2 ^   ^    [bc]
4571          +2 ^  ^     [bc]
4572          +2 ^ ^      [bc]
4573          +2 ^^       [bc]
4574         No match
4575
4576       This time, when matching [bc] fails, the matcher backtracks into a+ and
4577       tries again, repeatedly, until a+ itself fails.
4578
4579   Automatic .* anchoring
4580
4581       By default, an optimization is applied when .* is the first significant
4582       item in a pattern. If PCRE2_DOTALL is set, so that the  dot  can  match
4583       any  character,  the pattern is automatically anchored. If PCRE2_DOTALL
4584       is not set, a match can start only after an internal newline or at  the
4585       beginning of the subject, and pcre2_compile() remembers this. If a pat-
4586       tern has more than one top-level branch, automatic anchoring occurs  if
4587       all branches are anchorable.
4588
4589       This  optimization is disabled, however, if .* is in an atomic group or
4590       if there is a backreference to the capture group in which  it  appears.
4591       It  is  also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
4592       ever, the presence of callouts does not affect it.
4593
4594       For example, if the pattern .*\d is  compiled  with  PCRE2_AUTO_CALLOUT
4595       and applied to the string "aa", the pcre2test output is:
4596
4597         --->aa
4598          +0 ^      .*
4599          +2 ^ ^    \d
4600          +2 ^^     \d
4601          +2 ^      \d
4602         No match
4603
4604       This  shows  that all match attempts start at the beginning of the sub-
4605       ject. In other words, the pattern is anchored. You can disable this op-
4606       timization  by  passing  PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
4607       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the  out-
4608       put changes to:
4609
4610         --->aa
4611          +0 ^      .*
4612          +2 ^ ^    \d
4613          +2 ^^     \d
4614          +2 ^      \d
4615          +0  ^     .*
4616          +2  ^^    \d
4617          +2  ^     \d
4618         No match
4619
4620       This  shows more match attempts, starting at the second subject charac-
4621       ter.  Another optimization, described in the next section,  means  that
4622       there is no subsequent attempt to match with an empty subject.
4623
4624   Other optimizations
4625
4626       Other  optimizations  that  provide fast "no match" results also affect
4627       callouts.  For example, if the pattern is
4628
4629         ab(?C4)cd
4630
4631       PCRE2 knows that any matching string must contain the  letter  "d".  If
4632       the  subject  string  is  "abyz",  the  lack of "d" means that matching
4633       doesn't ever start, and the callout is  never  reached.  However,  with
4634       "abyd", though the result is still no match, the callout is obeyed.
4635
4636       For  most  patterns  PCRE2  also knows the minimum length of a matching
4637       string, and will immediately give a "no match" return without  actually
4638       running  a  match if the subject is not long enough, or, for unanchored
4639       patterns, if it has been scanned far enough.
4640
4641       You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4642       MIZE  option  to  pcre2_compile(),  or  by  starting  the  pattern with
4643       (*NO_START_OPT). This slows down the matching process, but does  ensure
4644       that callouts such as the example above are obeyed.
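
       The effect of these start-of-match checks can be pictured as a cheap
       test that runs before any real matching. The sketch below is only a
       model of the idea, with an invented function name; the minimum length
       and required character are supplied by the caller, whereas PCRE2
       derives such information itself when it compiles a pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if a full match attempt is worth running, or 0 if a fast
   "no match" can be given because the subject is too short or lacks a
   character that every matching string must contain. */
int worth_matching(const char *subject, size_t length,
                   size_t min_length, char required)
{
    assert(subject != NULL);
    if (length < min_length) return 0;
    if (memchr(subject, required, length) == NULL) return 0;
    return 1;
}
```

       For the pattern ab(?C4)cd, with a minimum length of 4 and the required
       character "d", the subject "abyz" is rejected before matching starts,
       so the callout is never reached, whereas "abyd" passes the check and
       the callout is taken even though the match then fails.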
4645
4646
4647THE CALLOUT INTERFACE
4648
4649       During  matching,  when  PCRE2  reaches a callout point, if an external
4650       function is provided in the match context, it is called. This applies
4651       to normal, DFA, and JIT matching alike. The first argument to the call-
4652       out function is a pointer to a pcre2_callout_block. The second argument
4653       is  the  void * callout data that was supplied when the callout was set
4654       up by calling pcre2_set_callout() (see the pcre2api documentation). The
4655       callout  block structure contains the following fields, not necessarily
4656       in this order:
4657
4658         uint32_t      version;
4659         uint32_t      callout_number;
4660         uint32_t      capture_top;
4661         uint32_t      capture_last;
4662         uint32_t      callout_flags;
4663         PCRE2_SIZE   *offset_vector;
4664         PCRE2_SPTR    mark;
4665         PCRE2_SPTR    subject;
4666         PCRE2_SIZE    subject_length;
4667         PCRE2_SIZE    start_match;
4668         PCRE2_SIZE    current_position;
4669         PCRE2_SIZE    pattern_position;
4670         PCRE2_SIZE    next_item_length;
4671         PCRE2_SIZE    callout_string_offset;
4672         PCRE2_SIZE    callout_string_length;
4673         PCRE2_SPTR    callout_string;
4674
4675       The version field contains the version number of the block format.  The
4676       current  version  is  2; the three callout string fields were added for
4677       version 1, and the callout_flags field for version 2. If you are  writ-
4678       ing  an  application  that  might  use an earlier release of PCRE2, you
4679       should check the version number before accessing any of  these  fields.
4680       The  version  number  will increase in future if more fields are added,
4681       but the intention is never to remove any of the existing fields.
4682
4683   Fields for numerical callouts
4684
4685       For a numerical callout, callout_string  is  NULL,  and  callout_number
4686       contains  the  number  of  the callout, in the range 0-255. This is the
4687       number that follows (?C for callouts that are part of the pattern; it is
4688       255 for automatically generated callouts.
4689
4690   Fields for string callouts
4691
4692       For  callouts with string arguments, callout_number is always zero, and
4693       callout_string points to the string that is contained within  the  com-
4694       piled pattern. Its length is given by callout_string_length. Duplicated
4695       ending delimiters that were present in the original pattern string have
4696       been turned into single characters, but there is no other processing of
4697       the callout string argument. An additional code unit containing  binary
4698       zero  is  present  after the string, but is not included in the length.
4699       The delimiter that was used to start the string is also  stored  within
4700       the  pattern, immediately before the string itself. You can access this
4701       delimiter as callout_string[-1] if you need it.
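
       A callout function that wants to report the argument in its original
       quoted form can recover both delimiters from the stored one. The sketch
       below uses plain char for simplicity and an invented function name; it
       relies on the layout described above, in which the starting delimiter
       immediately precedes the string.

```c
#include <assert.h>
#include <stddef.h>

/* Ending delimiter implied by the starting delimiter stored just before the
   callout string: '}' for '{', otherwise the same character. Reading
   callout_string[-1] is valid only for the layout described in the text. */
char callout_ending_delimiter(const char *callout_string)
{
    char start;
    assert(callout_string != NULL);
    start = callout_string[-1];
    return (start == '{') ? '}' : start;
}
```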
4702
4703       The callout_string_offset field is the code unit offset to the start of
4704       the callout argument string within the original pattern string. This is
4705       provided for the benefit of applications such as script languages  that
4706       might need to report errors in the callout string within the pattern.
4707
4708   Fields for all callouts
4709
4710       The  remaining  fields in the callout block are the same for both kinds
4711       of callout.
4712
4713       The offset_vector field is a pointer to a vector of  capturing  offsets
4714       (the "ovector"). You may read the elements in this vector, but you must
4715       not change any of them.
4716
4717       For calls to pcre2_match(), the offset_vector field is not  (since  re-
4718       lease  10.30)  a  pointer  to the actual ovector that was passed to the
4719       matching function in the match data block. Instead it points to an  in-
4720       ternal  ovector  of  a  size large enough to hold all possible captured
4721       substrings in the pattern. Note that whenever a recursion or subroutine
4722       call  within  a pattern completes, the capturing state is reset to what
4723       it was before.
4724
4725       The capture_last field contains the number of the  most  recently  cap-
4726       tured  substring,  and the capture_top field contains one more than the
4727       number of the highest numbered captured substring so far.  If  no  sub-
4728       strings  have yet been captured, the value of capture_last is 0 and the
4729       value of capture_top is 1. The values of these  fields  do  not  always
4730       differ   by   one;  for  example,  when  the  callout  in  the  pattern
4731       ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
4732
4733       The contents of ovector[2] to  ovector[<capture_top>*2-1]  can  be  in-
4734       spected  in  order to extract substrings that have been matched so far,
4735       in the same way as extracting substrings after a match  has  completed.
4736       The  values in ovector[0] and ovector[1] are always PCRE2_UNSET because
4737       the match is by definition not complete. Substrings that have not  been
4738       captured  but whose numbers are less than capture_top also have both of
4739       their ovector slots set to PCRE2_UNSET.
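
       The slot layout described above can be illustrated with  a  small
       helper.  This is only a sketch: ex_count_set_captures and EX_UNSET
       are hypothetical names, and EX_UNSET is a stand-in for the
       PCRE2_UNSET macro that real code would take from pcre2.h. In a real
       callout the two arguments would come from the offset_vector and
       capture_top fields of the callout block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PCRE2_UNSET; real code should use the macro from pcre2.h. */
#define EX_UNSET (~(size_t)0)

/* Counts the capture groups that have been set so far. The ovector holds
   one (start, end) pair of offsets per group, so group n uses slots 2n
   and 2n+1. */
uint32_t ex_count_set_captures(const size_t *ovector, uint32_t capture_top)
{
    uint32_t set = 0;

    assert(ovector != NULL && capture_top >= 1);

    /* Group 0 (slots 0 and 1) is always unset during a callout, so the
       scan starts at group 1 and stops below capture_top. */
    for (uint32_t n = 1; n < capture_top; n++) {
        size_t start = ovector[2 * n];
        size_t end = ovector[2 * n + 1];

        /* A group numbered below capture_top may still be uncaptured. */
        if (start == EX_UNSET || end == EX_UNSET)
            continue;
        set++;
    }
    return set;
}
```

       For the pattern ((a)(b))(?C2) matched against "ab", capture_top is
       4 and groups 1 to 3 are all set, so the helper would report 3.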
4740
4741       For DFA matching, the offset_vector field points to  the  ovector  that
4742       was  passed  to the matching function in the match data block for call-
4743       outs at the top level, but to an internal ovector during the processing
4744       of  pattern  recursions, lookarounds, and atomic groups. However, these
4745       ovectors hold no useful information because pcre2_dfa_match() does  not
4746       support  substring  capturing. The value of capture_top is always 1 and
4747       the value of capture_last is always 0 for DFA matching.
4748
4749       The subject and subject_length fields contain copies of the values that
4750       were passed to the matching function.
4751
4752       The  start_match  field normally contains the offset within the subject
4753       at which the current match attempt started. However, if the escape  se-
4754       quence  \K  has  been encountered, this value is changed to reflect the
4755       modified starting point. If the pattern is not  anchored,  the  callout
4756       function may be called several times from the same point in the pattern
4757       for different starting points in the subject.
4758
4759       The current_position field contains the offset within  the  subject  of
4760       the current match pointer.
4761
4762       The pattern_position field contains the offset in the pattern string to
4763       the next item to be matched.
4764
4765       The next_item_length field contains the length of the next item  to  be
4766       processed  in the pattern string. When the callout is at the end of the
4767       pattern, the length is zero.  When  the  callout  precedes  an  opening
4768       parenthesis, the length includes meta characters that follow the paren-
4769       thesis. For example, in a callout before an assertion  such  as  (?=ab)
4770       the  length  is  3.  For  an alternation bar or a closing parenthesis,
4771       the length is one, unless a closing parenthesis is followed by a  quan-
4772       tifier, in which case its length is included.  (This changed in release
4773       10.23. In earlier releases, before an opening  parenthesis  the  length
4774       was  that of the entire group, and before an alternation bar or a clos-
4775       ing parenthesis the length was zero.)
4776
4777       The pattern_position and next_item_length fields are intended  to  help
4778       in  distinguishing between different automatic callouts, which all have
4779       the same callout number. However, they are set for  all  callouts,  and
4780       are used by pcre2test to show the next item to be matched when display-
4781       ing callout information.
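
       An application can use these two fields to show the next item in
       much the same way. The following is a minimal sketch for 8-bit
       patterns; ex_next_item is a hypothetical helper, not part of the
       PCRE2 API, and its arguments are assumed to be the original pattern
       string plus the pattern_position and next_item_length values taken
       from a callout block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the next pattern item into buf as a zero-terminated string,
   truncating it if buf is too small. Returns the number of code units
   copied, not counting the terminator. */
size_t ex_next_item(const char *pattern, size_t pattern_position,
                    size_t next_item_length, char *buf, size_t bufsize)
{
    size_t n = next_item_length;

    assert(pattern != NULL && buf != NULL && bufsize > 0);

    if (n > bufsize - 1)
        n = bufsize - 1;            /* leave room for the terminator */
    memcpy(buf, pattern + pattern_position, n);
    buf[n] = '\0';
    return n;
}
```

       With the pattern (?=ab)c, position 0 and length 3 give "(?=", and a
       callout at the end of the pattern (length zero) gives an empty
       string.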
4782
4783       In callouts from pcre2_match() the mark field contains a pointer to the
4784       zero-terminated  name of the most recently passed (*MARK), (*PRUNE), or
4785       (*THEN) item in the match, or NULL if no such items have  been  passed.
4786       Instances  of  (*PRUNE)  or  (*THEN) without a name do not obliterate a
4787       previous (*MARK). In callouts from the DFA matching function this field
4788       always contains NULL.
4789
4790       The   callout_flags   field   is   always   zero   in   callouts   from
4791       pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
4792       JIT is used, the following bits may be set:
4793
4794         PCRE2_CALLOUT_STARTMATCH
4795
4796       This  is set for the first callout after the start of matching for each
4797       new starting position in the subject.
4798
4799         PCRE2_CALLOUT_BACKTRACK
4800
4801       This is set if there has been a matching backtrack since  the  previous
4802       callout,  or  since  the start of matching if this is the first callout
4803       from a pcre2_match() run.
4804
4805       Both bits are set when a backtrack has caused a "bumpalong"  to  a  new
4806       starting  position in the subject. Output from pcre2test does not indi-
4807       cate the presence of these bits unless the  callout_extra  modifier  is
4808       set.
4809
4810       The information in the callout_flags field is provided so that applica-
4811       tions can track and tell their users how matching with backtracking  is
4812       done.  This  can be useful when trying to optimize patterns, or just to
4813       understand how PCRE2 works. There is no  support  in  pcre2_dfa_match()
4814       because there is no backtracking in DFA matching, and there is no
4815       support in JIT because JIT is all about maximizing matching performance.
4816       In both these cases the callout_flags field is always zero.
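
       The three combinations described above can be decoded as in the
       following sketch. The EX_ names and ex_classify_flags are
       hypothetical; the two bit values are stand-ins chosen for
       illustration, and real code should test the PCRE2_CALLOUT_STARTMATCH
       and PCRE2_CALLOUT_BACKTRACK macros from pcre2.h instead.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK. */
#define EX_CALLOUT_STARTMATCH 0x1u
#define EX_CALLOUT_BACKTRACK  0x2u

typedef enum {
    EX_EVENT_NONE,        /* no bits set (always the case for DFA or JIT) */
    EX_EVENT_NEW_START,   /* first callout for a new starting position */
    EX_EVENT_BACKTRACK,   /* a backtrack happened since the last callout */
    EX_EVENT_BUMPALONG    /* a backtrack moved on to a new starting position */
} ex_event;

/* Maps a callout_flags value onto one of the cases in the text. */
ex_event ex_classify_flags(uint32_t callout_flags)
{
    int start = (callout_flags & EX_CALLOUT_STARTMATCH) != 0;
    int back = (callout_flags & EX_CALLOUT_BACKTRACK) != 0;

    assert((callout_flags & ~(EX_CALLOUT_STARTMATCH | EX_CALLOUT_BACKTRACK)) == 0);

    if (start && back)
        return EX_EVENT_BUMPALONG;
    if (start)
        return EX_EVENT_NEW_START;
    if (back)
        return EX_EVENT_BACKTRACK;
    return EX_EVENT_NONE;
}
```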
4817
4818
4819RETURN VALUES FROM CALLOUTS
4820
4821       The external callout function returns an integer to PCRE2. If the value
4822       is zero, matching proceeds as normal. If  the  value  is  greater  than
4823       zero,  matching  fails  at  the current point, but the testing of other
4824       matching possibilities goes ahead, just as if a lookahead assertion had
4825       failed. If the value is less than zero, the match is abandoned, and the
4826       matching function returns the negative value.
4827
4828       Negative values should normally be chosen from  the  set  of  PCRE2_ER-
4829       ROR_xxx  values.  In  particular, PCRE2_ERROR_NOMATCH forces a standard
4830       "no match" failure. The error number  PCRE2_ERROR_CALLOUT  is  reserved
4831       for use by callout functions; it will never be used by PCRE2 itself.
4832
4833
4834CALLOUT ENUMERATION
4835
4836       int pcre2_callout_enumerate(const pcre2_code *code,
4837         int (*callback)(pcre2_callout_enumerate_block *, void *),
4838         void *user_data);
4839
4840       A script language that supports the use of string arguments in callouts
4841       might like to scan all the callouts in a  pattern  before  running  the
4842       match. This can be done by calling pcre2_callout_enumerate(). The first
4843       argument is a pointer to a compiled pattern, the  second  points  to  a
4844       callback  function,  and the third is arbitrary user data. The callback
4845       function is called for every callout in the pattern  in  the  order  in
4846       which they appear. Its first argument is a pointer to a callout enumer-
4847       ation block, and its second argument is the user_data  value  that  was
4848       passed  to  pcre2_callout_enumerate(). The data block contains the fol-
4849       lowing fields:
4850
4851         version                Block version number
4852         pattern_position       Offset to next item in pattern
4853         next_item_length       Length of next item in pattern
4854         callout_number         Number for numbered callouts
4855         callout_string_offset  Offset to string within pattern
4856         callout_string_length  Length of callout string
4857         callout_string         Points to callout string or is NULL
4858
4859       The version number is currently 0. It will increase if new  fields  are
4860       ever  added  to  the  block. The remaining fields are the same as their
4861       namesakes in the pcre2_callout block that is used for  callouts  during
4862       matching, as described above.
4863
4864       Note  that  the  value  of pattern_position is unique for each callout.
4865       However, if a callout occurs inside a group that is quantified  with  a
4866       non-zero minimum or a fixed maximum, the group is replicated inside the
4867       compiled pattern. For example, a pattern such as /(a){2}/  is  compiled
4868       as  if it were /(a)(a)/. This means that the callout will be enumerated
4869       more than once, but with the same value for  pattern_position  in  each
4870       case.
4871
4872       The callback function should normally return zero. If it returns a non-
4873       zero value, scanning the pattern stops, and that value is returned from
4874       pcre2_callout_enumerate().
4875
4876
4877AUTHOR
4878
4879       Philip Hazel
4880       University Computing Service
4881       Cambridge, England.
4882
4883
4884REVISION
4885
4886       Last updated: 03 February 2019
4887       Copyright (c) 1997-2019 University of Cambridge.
4888------------------------------------------------------------------------------
4889
4890
4891PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)
4892
4893
4894
4895NAME
4896       PCRE2 - Perl-compatible regular expressions (revised API)
4897
4898DIFFERENCES BETWEEN PCRE2 AND PERL
4899
4900       This  document describes some of the differences in the ways that PCRE2
4901       and Perl handle regular expressions. The differences described here are
4902       with  respect  to  Perl  version 5.34.0, but as both Perl and PCRE2 are
4903       continually changing, the information may at times be out of date.
4904
4905       1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier)  is  not  set,
4906       the behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.'
4907       matches the next character unless it is the  start  of  a  newline  se-
4908       quence.  This  means  that, if the newline setting is CR, CRLF, or NUL,
4909       '.' will match the code point LF (0x0A) in ASCII/Unicode  environments,
4910       and  NL  (either  0x15 or 0x25) when using EBCDIC. In Perl, '.' appears
4911       never to match LF, even when 0x0A is not a newline indicator.
4912
4913       2. PCRE2 has only a subset of Perl's Unicode support. Details  of  what
4914       it does have are given in the pcre2unicode page.
4915
4916       3.  Like  Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
4917       tions, but they do not mean what you might think. For example, (?!a){3}
4918       does not assert that the next three characters are not "a". It just as-
4919       serts that the next character is not "a"  three  times  (in  principle;
4920       PCRE2  optimizes this to run the assertion just once). Perl allows some
4921       repeat quantifiers on other assertions, for example, \b* , but these do
4922       not  seem  to have any use. PCRE2 does not allow any kind of quantifier
4923       on non-lookaround assertions.
4924
4925       4. Capture groups that occur inside negative lookaround assertions  are
4926       counted,  but  their  entries in the offsets vector are set only when a
4927       negative assertion is a condition that has a matching branch (that  is,
4928       the  condition  is  false).   Perl may set such capture groups in other
4929       circumstances.
4930
4931       5. The following Perl escape sequences are not supported: \F,  \l,  \L,
4932       \u, \U, and \N when followed by a character name. \N on its own, match-
4933       ing a non-newline character, and \N{U+dd..}, matching  a  Unicode  code
4934       point,  are  supported.  The  escapes that modify the case of following
4935       letters are implemented by Perl's general string-handling and  are  not
4936       part of its pattern matching engine. If any of these are encountered by
4937       PCRE2, an error is generated by default.  However,  if  either  of  the
4938       PCRE2_ALT_BSUX  or  PCRE2_EXTRA_ALT_BSUX  options is set, \U and \u are
4939       interpreted as ECMAScript interprets them.
4940
4941       6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
4942       is built with Unicode support (the default). The properties that can be
4943       tested with \p and \P are limited to the  general  category  properties
4944       such  as  Lu  and  Nd,  script  names such as Greek or Han, Bidi_Class,
4945       Bidi_Control, and the derived properties Any and LC (synonym L&).  Both
4946       PCRE2  and  Perl  support the Cs (surrogate) property, but in PCRE2 its
4947       use is limited. See the pcre2pattern  documentation  for  details.  The
4948       long  synonyms  for  property names that Perl supports (such as \p{Let-
4949       ter}) are not supported by PCRE2, nor is it permitted to prefix any  of
4950       these properties with "Is".
4951
4952       7. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
4953       in between are treated as literals. However, this is slightly different
4954       from  Perl  in  that  $  and  @ are also handled as literals inside the
4955       quotes. In Perl, they cause variable interpolation (PCRE2 does not have
4956       variables). Also, Perl does "double-quotish backslash interpolation" on
4957       any backslashes between \Q and \E which, its documentation  says,  "may
4958       lead  to confusing results". PCRE2 treats a backslash between \Q and \E
4959       just like any other character. Note the following examples:
4960
4961           Pattern            PCRE2 matches     Perl matches
4962
4963           \Qabc$xyz\E        abc$xyz           abc followed by the
4964                                                  contents of $xyz
4965           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
4966           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
4967           \QA\B\E            A\B               A\B
4968           \Q\\E              \                 \\E
4969
4970       The \Q...\E sequence is recognized both inside  and  outside  character
4971       classes by both PCRE2 and Perl.
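
       Because PCRE2 treats everything between \Q and \E as literal, an
       application can quote an arbitrary zero-terminated string by
       wrapping it in \Q...\E. The only sequence needing care is a \E
       inside the string itself, which would end the quoting early. The
       sketch below shows one common way of handling that: close the
       quote, emit an escaped backslash and a literal E, then reopen the
       quote. ex_quote_literal is a hypothetical helper, not a PCRE2
       function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated pattern that matches the given string
   literally, or NULL if memory runs out. The caller frees the result. */
char *ex_quote_literal(const char *literal)
{
    size_t len;
    char *out, *p;

    assert(literal != NULL);
    len = strlen(literal);

    /* Each 2-character "\E" grows to 7 characters, so 4 * len is a safe
       bound; 5 more cover the outer \Q, \E, and the terminator. */
    out = malloc(4 * len + 5);
    if (out == NULL)
        return NULL;

    p = out;
    *p++ = '\\';
    *p++ = 'Q';
    for (size_t i = 0; i < len; i++) {
        if (literal[i] == '\\' && literal[i + 1] == 'E') {
            /* Emit \E \\ E \Q : end quote, literal backslash and E, reopen. */
            memcpy(p, "\\E\\\\E\\Q", 7);
            p += 7;
            i++;                    /* the E has been consumed as well */
        } else {
            *p++ = literal[i];
        }
    }
    *p++ = '\\';
    *p++ = 'E';
    *p = '\0';
    return out;
}
```

       For example, the string abc$xyz becomes \Qabc$xyz\E, which PCRE2
       matches literally as shown in the table above.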
4972
4973       8.   Fairly  obviously,  PCRE2  does  not  support  the  (?{code})  and
4974       (??{code}) constructions. However, PCRE2 does have a "callout" feature,
4975       which allows an external function to be called during pattern matching.
4976       See the pcre2callout documentation for details.
4977
4978       9. Subroutine calls (whether recursive or not) were treated  as  atomic
4979       groups  up to PCRE2 release 10.23, but from release 10.30 this changed,
4980       and backtracking into subroutine calls is now supported, as in Perl.
4981
4982       10. In PCRE2, if any of the backtracking control verbs are  used  in  a
4983       group  that  is  called  as  a subroutine (whether or not recursively),
4984       their effect is confined to that group; it does not extend to the  sur-
4985       rounding  pattern.  This is not always the case in Perl. In particular,
4986       if (*THEN) is present in a group that is called as  a  subroutine,  its
4987       action is limited to that group, even if the group does not contain any
4988       | characters. Note that such groups are processed as  anchored  at  the
4989       point where they are tested.
4990
4991       11.  If a pattern contains more than one backtracking control verb, the
4992       first one that is backtracked onto acts. For example,  in  the  pattern
4993       A(*COMMIT)B(*PRUNE)C  a  failure in B triggers (*COMMIT), but a failure
4994       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
4995       it is the same as PCRE2, but there are cases where it differs.
4996
4997       12.  There are some differences that are concerned with the settings of
4998       captured strings when part of  a  pattern  is  repeated.  For  example,
4999       matching  "aba"  against the pattern /^(a(b)?)+$/ in Perl leaves $2 un-
5000       set, but in PCRE2 it is set to "b".
5001
5002       13. PCRE2's handling of duplicate capture group numbers  and  names  is
5003       not as general as Perl's. This is a consequence of the fact that PCRE2
5004       works internally just with numbers, using an external table  to  trans-
5005       late  between  numbers  and  names.  In  particular,  a pattern such as
5006       (?|(?<a>A)|(?<b>B)), where the two capture groups have the same  number
5007       but  different  names, is not supported, and causes an error at compile
5008       time. If it were allowed, it would not be possible to distinguish which
5009       group  matched,  because  both  names map to capture group number 1. To
5010       avoid this confusing situation, an error is given at compile time.
5011
5012       14. Perl used to recognize comments in some places that PCRE2 does not,
5013       for  example,  between  the  ( and ? at the start of a group. If the /x
5014       modifier is set, Perl allowed white space between ( and  ?  though  the
5015       latest  Perls give an error (for a while it was just deprecated). There
5016       may still be some cases where Perl behaves differently.
5017
5018       15. Perl, when in warning mode, gives warnings  for  character  classes
5019       such  as  [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
5020       als. PCRE2 has no warning features, so it gives an error in these cases
5021       because they are almost certainly user mistakes.
5022
5023       16.  In  PCRE2, the upper/lower case character properties Lu and Ll are
5024       not affected when case-independent matching is specified. For  example,
5025       \p{Lu} always matches an upper case letter. I think Perl has changed in
5026       this respect; in the release at the time of writing (5.34), \p{Lu}  and
5027       \p{Ll} match all letters, regardless of case, when case independence is
5028       specified.
5029
5030       17. From release 5.32.0, Perl locks out the use of \K in lookaround as-
5031       sertions.  From  release 10.38 PCRE2 does the same by default. However,
5032       there is an option for re-enabling the previous  behaviour.  When  this
5033       option  is  set,  \K is acted on when it occurs in positive assertions,
5034       but is ignored in negative assertions.
5035
5036       18. PCRE2 provides some extensions to the Perl regular  expression  fa-
5037       cilities.   Perl  5.10  included  new features that were not in earlier
5038       versions of Perl, some of which (such as  named  parentheses)  were  in
5039       PCRE2 for some time before. This list is with respect to Perl 5.34:
5040
5041       (a)  Although  lookbehind  assertions  in PCRE2 must match fixed length
5042       strings, each alternative toplevel branch of a lookbehind assertion can
5043       match  a  different  length of string. Perl used to require them all to
5044       have the same length, but the latest version has some  variable  length
5045       support.
5046
5047       (b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
5048       ported in lookbehinds, provided that there is no possibility of  refer-
5049       encing  a  non-unique  number or name. Perl does not support backrefer-
5050       ences in lookbehinds.
5051
5052       (c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set,  the
5053       $ meta-character matches only at the very end of the string.
5054
5055       (d)  A  backslash  followed  by  a  letter  with  no special meaning is
5056       faulted. (Perl can be made to issue a warning.)
5057
5058       (e) If PCRE2_UNGREEDY is set, the greediness of the repetition  quanti-
5059       fiers is inverted, that is, by default they are not greedy, but if fol-
5060       lowed by a question mark they are.
5061
5062       (f) PCRE2_ANCHORED can be used at matching time to force a  pattern  to
5063       be tried only at the first matching position in the subject string.
5064
5065       (g)     The     PCRE2_NOTBOL,    PCRE2_NOTEOL,    PCRE2_NOTEMPTY    and
5066       PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
5067
5068       (h) The \R escape sequence can be restricted to match only CR,  LF,  or
5069       CRLF by the PCRE2_BSR_ANYCRLF option.
5070
5071       (i)  The  callout  facility is PCRE2-specific. Perl supports codeblocks
5072       and variable interpolation, but not general hooks on every match.
5073
5074       (j) The partial matching facility is PCRE2-specific.
5075
5076       (k) The alternative matching function (pcre2_dfa_match()) matches in  a
5077       different way and is not Perl-compatible.
5078
5079       (l)  PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
5080       at the start of a pattern. These set overall  options  that  cannot  be
5081       changed within the pattern.
5082
5083       (m)  PCRE2  supports non-atomic positive lookaround assertions. This is
5084       an extension to the lookaround facilities. The default, Perl-compatible
5085       lookarounds are atomic.
5086
5087       19. The Perl /a modifier restricts \d digit matching to pure ASCII, and
5088       the /aa modifier restricts /i case-insensitive matching to pure ASCII,
5089       ignoring Unicode rules. This separation cannot be represented with
5090       PCRE2_UCP.
5091
5092       20. Perl has different limits than PCRE2. See the pcre2limits
5093       documentation for details. In release 5.10, Perl changed from recursion
5094       to iteration, keeping the intermediate matches on the heap, which is
5095       about 10% slower but cannot overflow the machine stack. PCRE2 made a
5096       similar change at release 10.30, and also has many build-time and
5097       run-time customizable limits.
5098
5099
5100AUTHOR
5101
5102       Philip Hazel
5103       Retired from University Computing Service
5104       Cambridge, England.
5105
5106
5107REVISION
5108
5109       Last updated: 08 December 2021
5110       Copyright (c) 1997-2021 University of Cambridge.
5111------------------------------------------------------------------------------
5112
5113
5114PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)
5115
5116
5117
5118NAME
5119       PCRE2 - Perl-compatible regular expressions (revised API)
5120
5121PCRE2 JUST-IN-TIME COMPILER SUPPORT
5122
5123       Just-in-time  compiling  is a heavyweight optimization that can greatly
5124       speed up pattern matching. However, it comes at the cost of extra  pro-
5125       cessing  before  the  match is performed, so it is of most benefit when
5126       the same pattern is going to be matched many times. This does not  nec-
5127       essarily  mean many calls of a matching function; if the pattern is not
5128       anchored, matching attempts may take place many times at various  posi-
5129       tions in the subject, even for a single call. Therefore, if the subject
5130       string is very long, it may still pay  to  use  JIT  even  for  one-off
5131       matches.  JIT  support  is  available  for all of the 8-bit, 16-bit and
5132       32-bit PCRE2 libraries.
5133
5134       JIT support applies only to the  traditional  Perl-compatible  matching
5135       function.   It  does  not apply when the DFA matching function is being
5136       used. The code for this support was written by Zoltan Herczeg.
5137
5138
5139AVAILABILITY OF JIT SUPPORT
5140
5141       JIT support is an optional feature of  PCRE2.  The  "configure"  option
5142       --enable-jit  (or  equivalent  CMake  option) must be set when PCRE2 is
5143       built if you want to use JIT. The support is limited to  the  following
5144       hardware platforms:
5145
5146         ARM 32-bit (v5, v7, and Thumb2)
5147         ARM 64-bit
5148         IBM s390x 64 bit
5149         Intel x86 32-bit and 64-bit
5150         MIPS 32-bit and 64-bit
5151         Power PC 32-bit and 64-bit
5152         SPARC 32-bit
5153
5154       If --enable-jit is set on an unsupported platform, compilation fails.
5155
5156       A  program  can  tell if JIT support is available by calling pcre2_con-
5157       fig() with the PCRE2_CONFIG_JIT option. The result is  1  when  JIT  is
5158       available,  and 0 otherwise. However, a simple program does not need to
5159       check this in order to use JIT. The API is implemented in  a  way  that
5160       falls  back  to the interpretive code if JIT is not available. For pro-
5161       grams that need the best possible performance, there is  also  a  "fast
5162       path" API that is JIT-specific.
5163
5164
5165SIMPLE USE OF JIT
5166
5167       To  make use of the JIT support in the simplest way, all you have to do
5168       is to call pcre2_jit_compile() after successfully compiling  a  pattern
5169       with pcre2_compile(). This function has two arguments: the first is the
5170       compiled pattern pointer that was returned by pcre2_compile(), and  the
5171       second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
5172       PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
5173
5174       If JIT support is not available, a  call  to  pcre2_jit_compile()  does
5175       nothing  and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
5176       pattern is passed to the JIT compiler, which turns it into machine code
5177       that executes much faster than the normal interpretive code, but yields
5178       exactly the same results. The returned value  from  pcre2_jit_compile()
5179       is zero on success, or a negative error code.
5180
5181       There  is  a limit to the size of pattern that JIT supports, imposed by
5182       the size of machine stack that it uses. The exact rules are  not  docu-
5183       mented because they may change at any time, in particular, when new op-
5184       timizations are introduced.  If  a  pattern  is  too  big,  a  call  to
5185       pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.
5186
5187       PCRE2_JIT_COMPLETE  requests the JIT compiler to generate code for com-
5188       plete matches. If you want to run partial matches using the  PCRE2_PAR-
5189       TIAL_HARD  or  PCRE2_PARTIAL_SOFT  options of pcre2_match(), you should
5190       set one or both of  the  other  options  as  well  as,  or  instead  of
5191       PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
5192       for each of the three modes (normal, soft partial, hard partial).  When
5193       pcre2_match()  is  called,  the appropriate code is run if it is avail-
5194       able. Otherwise, the pattern is matched using interpretive code.
5195
5196       You can call pcre2_jit_compile() multiple times for the  same  compiled
5197       pattern.  It does nothing if it has previously compiled code for any of
5198       the option bits. For example, you can call it once with  PCRE2_JIT_COM-
5199       PLETE  and  (perhaps  later,  when  you find you need partial matching)
5200       again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time  it
5201       will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
5202       ing. If pcre2_jit_compile() is called with no option bits set, it imme-
5203       diately returns zero. This is an alternative way of testing whether JIT
5204       is available.
5205
5206       At present, it is not possible to free JIT compiled  code  except  when
5207       the entire compiled pattern is freed by calling pcre2_code_free().
5208
5209       In  some circumstances you may need to call additional functions. These
5210       are described in the section entitled "Controlling the JIT  stack"  be-
5211       low.
5212
5213       There are some pcre2_match() options that are not supported by JIT, and
5214       there are also some pattern items that JIT cannot handle.  Details  are
5215       given  below.  In  both cases, matching automatically falls back to the
5216       interpretive code. If you want to know whether JIT  was  actually  used
5217       for  a particular match, you should arrange for a JIT callback function
5218       to be set up as described in the section entitled "Controlling the  JIT
5219       stack"  below,  even  if  you  do  not need to supply a non-default JIT
5220       stack. Such a callback function is called whenever JIT code is about to
5221       be  obeyed.  If the match-time options are not right for JIT execution,
5222       the callback function is not called.
5223
5224       If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
5225       ated.  You  can find out if JIT matching is available after compiling a
5226       pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op-
5227       tion.  A  non-zero  result means that JIT compilation was successful. A
5228       result of 0 means that JIT support is not available, or the pattern was
5229       not  processed by pcre2_jit_compile(), or the JIT compiler was not able
5230       to handle the pattern.
5231
5232
5233MATCHING SUBJECTS CONTAINING INVALID UTF
5234
5235       When a pattern is compiled with the PCRE2_UTF option,  subject  strings
5236       are  normally expected to be a valid sequence of UTF code units. By de-
5237       fault, this is checked at the start of matching and an error is  gener-
5238       ated  if  invalid UTF is detected. The PCRE2_NO_UTF_CHECK option can be
5239       passed to pcre2_match() to skip the check (for improved performance) if
5240       you  are  sure  that  a subject string is valid. If this option is used
5241       with an invalid string, the result is undefined.
5242
5243       However, a way of running matches on strings that may  contain  invalid
5244       UTF   sequences   is   available.   Calling  pcre2_compile()  with  the
5245       PCRE2_MATCH_INVALID_UTF option has two effects:  it  tells  the  inter-
5246       preter  in pcre2_match() to support invalid UTF, and, if pcre2_jit_com-
5247       pile() is called, the compiled JIT code also supports invalid UTF.  De-
5248       tails  of  how this support works, in both the JIT and the interpretive
5249       cases, are given in the pcre2unicode documentation.
5250
5251       There  is  also  an  obsolete  option  for  pcre2_jit_compile()  called
5252       PCRE2_JIT_INVALID_UTF, which currently exists only for backward compat-
5253       ibility.    It   is   superseded   by   the   pcre2_compile()    option
5254       PCRE2_MATCH_INVALID_UTF and should no longer be used. It may be removed
5255       in future.
5256
5257
5258UNSUPPORTED OPTIONS AND PATTERN ITEMS
5259
5260       The pcre2_match() options that  are  supported  for  JIT  matching  are
5261       PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
5262       PCRE2_NOTEMPTY_ATSTART,  PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,   and
5263       PCRE2_PARTIAL_SOFT.  The  PCRE2_ANCHORED  and PCRE2_ENDANCHORED options
5264       are not supported at match time.
5265
5266       If the PCRE2_NO_JIT option is passed to pcre2_match() it  disables  the
5267       use of JIT, forcing matching by the interpreter code.
5268
5269       The  only  unsupported  pattern items are \C (match a single data unit)
5270       when running in a UTF mode, and a callout immediately before an  asser-
5271       tion condition in a conditional group.
5272
5273
5274RETURN VALUES FROM JIT MATCHING
5275
5276       When a pattern is matched using JIT matching, the return values are the
5277       same as those given by the interpretive pcre2_match()  code,  with  the
5278       addition  of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
5279       that the memory used for the JIT stack was insufficient. See  "Control-
5280       ling the JIT stack" below for a discussion of JIT stack usage.
5281
5282       The  error  code  PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
5283       searching a very large pattern tree goes on for too long, as it  is  in
5284       the  same circumstance when JIT is not used, but the details of exactly
5285       what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
5286       is never returned when JIT matching is used.
5287
5288
5289CONTROLLING THE JIT STACK
5290
5291       When the compiled JIT code runs, it needs a block of memory to use as a
5292       stack.  By default, it uses 32KiB on the machine stack.  However,  some
5293       large  or complicated patterns need more than this. The error PCRE2_ER-
5294       ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func-
5295       tions are provided for managing blocks of memory for use as JIT stacks.
5296       There is further discussion about the use of JIT stacks in the  section
5297       entitled "JIT stack FAQ" below.
5298
5299       The  pcre2_jit_stack_create()  function  creates a JIT stack. Its argu-
5300       ments are a starting size, a maximum size, and a general  context  (for
5301       memory  allocation  functions, or NULL for standard memory allocation).
5302       It returns a pointer to an opaque structure of type pcre2_jit_stack, or
5303       NULL  if there is an error. The pcre2_jit_stack_free() function is used
5304       to free a stack that is no longer needed. If its argument is NULL, this
5305       function  returns immediately, without doing anything. (For the techni-
5306       cally minded: the address space is allocated by mmap or  VirtualAlloc.)
5307       A  maximum  stack size of 512KiB to 1MiB should be more than enough for
5308       any pattern.
5309
5310       The pcre2_jit_stack_assign() function specifies which  stack  JIT  code
5311       should use. Its arguments are as follows:
5312
5313         pcre2_match_context  *mcontext
5314         pcre2_jit_callback    callback
5315         void                 *data
5316
5317       The first argument is a pointer to a match context. When this is subse-
5318       quently passed to a matching function, its information determines which
5319       JIT stack is used. If this argument is NULL, the function returns imme-
5320       diately, without doing anything. There are three cases for  the  values
5321       of the other two arguments:
5322
5323         (1) If callback is NULL and data is NULL, an internal 32KiB block
5324             on the machine stack is used. This is the default when a match
5325             context is created.
5326
5327         (2) If callback is NULL and data is not NULL, data must be
5328             a pointer to a valid JIT stack, the result of calling
5329             pcre2_jit_stack_create().
5330
5331         (3) If callback is not NULL, it must point to a function that is
5332             called with data as an argument at the start of matching, in
5333             order to set up a JIT stack. If the return from the callback
5334             function is NULL, the internal 32KiB stack is used; otherwise the
5335             return value must be a valid JIT stack, the result of calling
5336             pcre2_jit_stack_create().
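
       The three cases above can be modelled in a few lines of C.  This  is
       only a sketch of the selection rule, not  PCRE2  code:  toy_stack,
       toy_callback, resolve_stack() and pass_through()  are  hypothetical
       stand-ins, and a NULL result stands for "use the internal 32KiB block
       on the machine stack".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for pcre2_jit_stack and pcre2_jit_callback. */
typedef struct toy_stack { int unused; } toy_stack;
typedef toy_stack *(*toy_callback)(void *);

/* Decide which stack a match would use for a given (callback, data) pair.
   A NULL result models the internal machine-stack block. */
toy_stack *resolve_stack(toy_callback callback, void *data)
{
    if (callback != NULL)
        return callback(data);      /* case (3): ask the callback */
    return (toy_stack *)data;       /* case (1) if NULL, case (2) otherwise */
}

/* Example callback: treats its data argument as the stack to use. */
toy_stack *pass_through(void *data)
{
    return (toy_stack *)data;
}
```

       With a callback such as pass_through(), passing NULL as data  selects
       the internal block, which is the same outcome as case (1).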
5337
5338       A  callback function is obeyed whenever JIT code is about to be run; it
5339       is not obeyed when pcre2_match() is called with options that are incom-
5340       patible  for JIT matching. A callback function can therefore be used to
5341       determine whether a match operation was executed by JIT or by  the  in-
5342       terpreter.
5343
5344       You may safely use the same JIT stack for more than one pattern (either
5345       by assigning directly or by callback), as  long  as  the  patterns  are
5346       matched sequentially in the same thread. Currently, the only way to set
5347       up non-sequential matches in one thread is to use callouts: if a  call-
5348       out  function starts another match, that match must use a different JIT
5349       stack to the one used for currently suspended match(es).
5350
5351       In a multithread application, if you do not specify a JIT stack, or  if
5352       you  assign or pass back NULL from a callback, that is thread-safe, be-
5353       cause each thread has its own machine stack. However, if you assign  or
5354       pass back a non-NULL JIT stack, this must be a different stack for each
5355       thread so that the application is thread-safe.
5356
5357       Strictly speaking, even more is allowed. You can assign the  same  non-
5358       NULL  stack  to a match context that is used by any number of patterns,
5359       as long as they are not used for matching by multiple  threads  at  the
5360       same  time.  For  example, you could use the same stack in all compiled
5361       patterns, with a global mutex in the callback to wait until  the  stack
5362       is available for use. However, this is an inefficient solution, and not
5363       recommended.
5364
5365       This is a suggestion for how a multithreaded program that needs to  set
5366       up non-default JIT stacks might operate:
5367
5368         During thread initialization
5369           thread_local_var = pcre2_jit_stack_create(...)
5370
5371         During thread exit
5372           pcre2_jit_stack_free(thread_local_var)
5373
5374         Use a one-line callback function
5375           return thread_local_var
5376
5377       All  the  functions  described in this section do nothing if JIT is not
5378       available.
5379
5380
5381JIT STACK FAQ
5382
5383       (1) Why do we need JIT stacks?
5384
5385       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
5386       where  the local data of the current node is pushed before checking its
5387       child nodes.  Allocating real machine stack on some platforms is diffi-
5388       cult. For example, on PowerPC the stack chain needs to be updated every
5389       time the stack is extended. Although this is possible, the overhead  of
5390       updating decreases performance. So we do the recursion in memory.
5391
5392       (2) Why don't we simply allocate blocks of memory with malloc()?
5393
5394       Modern  operating  systems have a nice feature: they can reserve an ad-
5395       dress space instead of allocating memory. We can safely allocate memory
5396       pages inside this address space, so the stack could grow without moving
5397       memory data (this is important because of pointers). Thus we can  allo-
5398       cate  1MiB  address  space,  and use only a single memory page (usually
5399       4KiB) if that is enough. However, we can still grow up to 1MiB  anytime
5400       if needed.
5401
5402       (3) Who "owns" a JIT stack?
5403
5404       The owner of the stack is the user program, not the  JIT-compiled  pat-
5405       tern or anything else. While a stack is being used by pcre2_match()
5406       (that is, it is assigned to a match context that is passed to the  pat-
5407       tern currently running), the user program must ensure that  it  is  not
5408       used by any other thread (to avoid overwriting the same  memory  area).
5409       The best practice for multithreaded programs is to allocate a stack for
5410       each thread, and return this stack through the JIT callback function.
5411
5412       (4) When should a JIT stack be freed?
5413
5414       You can free a JIT stack at any time, as long as it will not be used by
5415       pcre2_match() again. When you assign the stack to a match context, only
5416       a  pointer  is  set. There is no reference counting or any other magic.
5417       You can free compiled patterns, contexts, and stacks in any order, any-
5418       time.   Just do not call pcre2_match() with a match context pointing to
5419       an already freed stack, as that will cause SEGFAULT. (Also, do not free
5420       a  stack  currently  used  by pcre2_match() in another thread). You can
5421       also replace the stack in a context at any time when it is not in  use.
5422       You should free the previous stack before assigning a replacement.
5423
5424       (5)  Should  I  allocate/free  a  stack every time before/after calling
5425       pcre2_match()?
5426
5427       No, because this is too costly in  terms  of  resources.  However,  you
5428       could implement some scheme that releases the stack if it has not  been
5429       used for, say, two minutes. The JIT callback can help to  achieve  this
5430       without keeping a list of patterns.
5431
5432       (6)  OK, the stack is for long term memory allocation. But what happens
5433       if a pattern causes stack overflow with a stack of 1MiB? Is  that  1MiB
5434       kept until the stack is freed?
5435
5436       Especially on embedded systems, it might be a good idea to release mem-
5437       ory sometimes without freeing the stack. There is no API  for  this  at
5438       the moment. If someone needs this, a pair of functions would probably be
5439       a good idea: one returning the memory currently allocated for a  stack,
5440       and another releasing memory (shrinking the stack).
5441
5442       (7) This is too much of a headache. Isn't there any better solution for
5443       JIT stack handling?
5444
5445       No, thanks to Windows. If POSIX threads were used everywhere, we  could
5446       throw out this complicated API.
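
       FAQ item (2) above describes reserving address space and then making
       pages usable on demand. The following is a minimal  POSIX  sketch  of
       that general technique, assuming mmap() and mprotect() are available.
       The names reserve_region(), commit_prefix() and release_region()  are
       invented for illustration; this is not PCRE2's allocator code.

```c
#define _DEFAULT_SOURCE   /* asks glibc to expose MAP_ANONYMOUS in strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserve max_size bytes of address space without making them usable.
   Returns NULL on failure. */
void *reserve_region(size_t max_size)
{
    void *base = mmap(NULL, max_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (base == MAP_FAILED) ? NULL : base;
}

/* Make the first used_size bytes of a reserved region readable and
   writable (the system rounds the range up to whole pages).
   Returns 0 on success, -1 on failure. */
int commit_prefix(void *base, size_t used_size)
{
    assert(base != NULL);
    return mprotect(base, used_size, PROT_READ | PROT_WRITE);
}

/* Give the whole reservation back to the system. */
int release_region(void *base, size_t max_size)
{
    return munmap(base, max_size);
}
```

       A caller might reserve 1MiB with reserve_region(1024 * 1024),  commit
       the first 4KiB, and later commit a larger prefix.  The  base  address
       never changes, so pointers into the region stay valid as it grows.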
5447
5448
5449FREEING JIT SPECULATIVE MEMORY
5450
5451       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
5452
5453       The JIT executable allocator does not free all the  memory  that  it
5454       could. It expects new allocations, and keeps some free memory around to
5455       improve  allocation  speed. However, in low memory conditions, it might
5456       be better to free all possible memory. You can cause this to happen  by
5457       calling  pcre2_jit_free_unused_memory(). Its argument is a general con-
5458       text, for custom memory management, or NULL for standard memory manage-
5459       ment.
5460
5461
5462EXAMPLE CODE
5463
5464       This  is  a  single-threaded example that specifies a JIT stack without
5465       using a callback. A real program should include  error  checking  after
5466       all the function calls.
5467
5468         int rc;
5469         pcre2_code *re;
5470         pcre2_match_data *match_data;
5471         pcre2_match_context *mcontext;
5472         pcre2_jit_stack *jit_stack;
5473
5474         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
5475           &errornumber, &erroffset, NULL);
5476         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
5477         mcontext = pcre2_match_context_create(NULL);
5478         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
5479         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
5480         match_data = pcre2_match_data_create(re, 10);
5481         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
5482         /* Process result */
5483
5484         pcre2_code_free(re);
5485         pcre2_match_data_free(match_data);
5486         pcre2_match_context_free(mcontext);
5487         pcre2_jit_stack_free(jit_stack);
5488
5489
5490JIT FAST PATH API
5491
5492       Because the API described above falls back to interpreted matching when
5493       JIT is not available, it is convenient for programs  that  are  written
5494       for  general  use  in  many  environments.  However,  calling  JIT  via
5495       pcre2_match() does have a performance impact. Programs that are written
5496       for  use  where  JIT  is known to be available, and which need the best
5497       possible performance, can instead use a "fast path"  API  to  call  JIT
5498       matching  directly instead of calling pcre2_match() (obviously only for
5499       patterns that have been successfully processed by pcre2_jit_compile()).
5500
5501       The fast path function is called pcre2_jit_match(), and  it  takes  ex-
5502       actly  the same arguments as pcre2_match(). However, the subject string
5503       must be specified with a  length;  PCRE2_ZERO_TERMINATED  is  not  sup-
5504       ported. Unsupported option bits (for example, PCRE2_ANCHORED, PCRE2_EN-
5505       DANCHORED  and  PCRE2_COPY_MATCHED_SUBJECT)  are  ignored,  as  is  the
5506       PCRE2_NO_JIT  option.  The  return  values  are  also  the  same as for
5507       pcre2_match(), plus PCRE2_ERROR_JIT_BADOPTION if a matching mode  (par-
5508       tial or complete) is requested that was not compiled.
5509
5510       When  you call pcre2_match(), as well as testing for invalid options, a
5511       number of other sanity checks are performed on the arguments. For exam-
5512       ple,  if the subject pointer is NULL but the length is non-zero, an im-
5513       mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set,  a  UTF
5514       subject string is tested for validity. In the interests of speed, these
5515       checks do not happen on the JIT fast  path,  and  if  invalid  data  is
5516       passed, the result is undefined.
5517
5518       Bypassing  the  sanity  checks  and the pcre2_match() wrapping can give
5519       speedups of more than 10%.
5520
5521
5522SEE ALSO
5523
5524       pcre2api(3)
5525
5526
5527AUTHOR
5528
5529       Philip Hazel (FAQ by Zoltan Herczeg)
5530       University Computing Service
5531       Cambridge, England.
5532
5533
5534REVISION
5535
5536       Last updated: 30 November 2021
5537       Copyright (c) 1997-2021 University of Cambridge.
5538------------------------------------------------------------------------------
5539
5540
5541PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)
5542
5543
5544
5545NAME
5546       PCRE2 - Perl-compatible regular expressions (revised API)
5547
5548SIZE AND OTHER LIMITATIONS
5549
5550       There are some size limitations in PCRE2 but it is hoped that they will
5551       never in practice be relevant.
5552
5553       The maximum size of a compiled pattern  is  approximately  64  thousand
5554       code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5555       the default internal linkage size, which  is  2  bytes  for  these  li-
5556       braries.  If  you  want  to  process regular expressions that are truly
5557       enormous, you can compile PCRE2 with an internal linkage size of 3 or 4
5558       (when  building  the  16-bit  library,  3  is rounded up to 4). See the
5559       README file in the source distribution and the pcre2build documentation
5560       for  details.  In  these cases the limit is substantially larger.  How-
5561       ever, the speed of execution is slower. In the 32-bit library, the  in-
5562       ternal linkage size is always 4.
5563
5564       The maximum length of a source pattern string is essentially unlimited;
5565       it is the largest number a PCRE2_SIZE variable can hold.  However,  the
5566       program that calls pcre2_compile() can specify a smaller limit.
5567
5568       The maximum length (in code units) of a subject string is one less than
5569       the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un-
5570       signed integer type, usually defined as size_t. Its maximum value (that
5571       is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-termi-
5572       nated strings and unset offsets.
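
       The arithmetic behind the subject length limit can be written out  as
       follows. This sketch assumes the usual case where PCRE2_SIZE is size_t
       and uses a stand-in type, toy_size, so that it compiles  without  the
       PCRE2 headers; TOY_SPECIAL and the two functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef size_t toy_size;              /* stand-in for PCRE2_SIZE */
#define TOY_SPECIAL (~(toy_size)0)    /* the reserved all-ones value */

/* The all-ones value is the largest value the unsigned type can hold. */
static_assert(TOY_SPECIAL == (toy_size)-1, "all ones is the maximum value");

/* The longest subject is one code unit shorter than the reserved value. */
toy_size max_subject_length(void)
{
    return TOY_SPECIAL - 1;
}

/* A length can describe a real subject only if it is not the reserved value. */
int is_valid_subject_length(toy_size length)
{
    return length != TOY_SPECIAL;
}
```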
5573
5574       All values in repeating quantifiers must be less than 65536.
5575
5576       The maximum length of a lookbehind assertion is 65535 characters.
5577
5578       There  is no limit to the number of parenthesized groups, but there can
5579       be no more than 65535 capture groups, and there is a limit to the depth
5580       of  nesting  of parenthesized subpatterns of all kinds. This is imposed
5581       in order to limit the amount of system stack used at compile time.  The
5582       default limit can be specified when PCRE2 is built; if not, the default
5583       is set to  250.  An  application  can  change  this  limit  by  calling
5584       pcre2_set_parens_nest_limit() to set the limit in a compile context.
5585
5586       The maximum length of a name for a named capture group is 32 code units,
5587       and the maximum number of such groups is 10000.
5588
5589       The maximum length of a  name  in  a  (*MARK),  (*PRUNE),  (*SKIP),  or
5590       (*THEN)  verb  is  255  code units for the 8-bit library and 65535 code
5591       units for the 16-bit and 32-bit libraries.
5592
5593       The maximum length of a string argument to a  callout  is  the  largest
5594       number a 32-bit unsigned integer can hold.
5595
5596       The  maximum  amount  of heap memory used for matching is controlled by
5597       the heap limit, which can be set in a pattern or in  a  match  context.
5598       The default is a very large number, effectively unlimited.
5599
5600
5601AUTHOR
5602
5603       Philip Hazel
5604       Retired from University Computing Service
5605       Cambridge, England.
5606
5607
5608REVISION
5609
5610       Last updated: 26 July 2022
5611       Copyright (c) 1997-2022 University of Cambridge.
5612------------------------------------------------------------------------------
5613
5614
5615PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)
5616
5617
5618
5619NAME
5620       PCRE2 - Perl-compatible regular expressions (revised API)
5621
5622PCRE2 MATCHING ALGORITHMS
5623
5624       This document describes the two different algorithms that are available
5625       in PCRE2 for matching a compiled regular  expression  against  a  given
5626       subject  string.  The  "standard"  algorithm is the one provided by the
5627       pcre2_match() function. This works in the same way as  Perl's  matching
5628       function, and provides a Perl-compatible matching operation. The  just-
5629       in-time (JIT) optimization that is described in the pcre2jit documenta-
5630       tion is compatible with this function.
5631
5632       An alternative algorithm is provided by the pcre2_dfa_match() function;
5633       it operates in a different way, and is not Perl-compatible. This alter-
5634       native  has advantages and disadvantages compared with the standard al-
5635       gorithm, and these are described below.
5636
5637       When there is only one possible way in which a given subject string can
5638       match  a pattern, the two algorithms give the same answer. A difference
5639       arises, however, when there are multiple possibilities. For example, if
5640       the pattern
5641
5642         ^<.*>
5643
5644       is matched against the string
5645
5646         <something> <something else> <something further>
5647
5648       there are three possible answers. The standard algorithm finds only one
5649       of them, whereas the alternative algorithm finds all three.
5650
5651
5652REGULAR EXPRESSIONS AS TREES
5653
5654       The set of strings that are matched by a regular expression can be rep-
5655       resented  as  a  tree structure. An unlimited repetition in the pattern
5656       makes the tree of infinite size, but it is still a tree.  Matching  the
5657       pattern  to a given subject string (from a given starting point) can be
5658       thought of as a search of the tree.  There are two  ways  to  search  a
5659       tree:  depth-first  and  breadth-first, and these correspond to the two
5660       matching algorithms provided by PCRE2.
5661
5662
5663THE STANDARD MATCHING ALGORITHM
5664
5665       In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
5666       sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
5667       depth-first search of the pattern tree. That is, it  proceeds  along  a
5668       single path through the tree, checking that the subject matches what is
5669       required. When there is a mismatch, the algorithm  tries  any  alterna-
5670       tives  at  the  current point, and if they all fail, it backs up to the
5671       previous branch point in the  tree,  and  tries  the  next  alternative
5672       branch  at  that  level.  This often involves backing up (moving to the
5673       left) in the subject string as well.  The  order  in  which  repetition
5674       branches  are  tried  is controlled by the greedy or ungreedy nature of
5675       the quantifier.
5676
5677       If a leaf node is reached, a matching string has  been  found,  and  at
5678       that  point the algorithm stops. Thus, if there is more than one possi-
5679       ble match, this algorithm returns the first one that it finds.  Whether
5680       this  is the shortest, the longest, or some intermediate length depends
5681       on the way the alternations and the greedy or ungreedy repetition quan-
5682       tifiers are specified in the pattern.
5683
5684       Because  it  ends  up  with a single path through the tree, it is rela-
5685       tively straightforward for this algorithm to keep  track  of  the  sub-
5686       strings  that  are  matched  by portions of the pattern in parentheses.
5687       This provides support for capturing parentheses and backreferences.
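
       The depth-first search can be illustrated with a toy backtracking
       matcher in the style of Kernighan and Pike's well-known example.  It
       is a sketch under strong assumptions, not PCRE2 code: it handles only
       literal characters, "." and a postfix "*", it matches only at the
       start of the text, and it has no captures. The names match_here() and
       match_star() and the greedy flag are illustrative.

```c
#include <assert.h>
#include <stddef.h>

long match_here(const char *pat, const char *text, int greedy);

/* Does one pattern item ('.' or a literal) match the character c? */
static int item_matches(char item, char c)
{
    return c != '\0' && (item == '.' || item == c);
}

/* Match item* followed by rest. The order in which repetition counts are
   tried depends on greedy; the first success ends the search. */
static long match_star(char item, const char *rest, const char *text,
                       int greedy)
{
    size_t max = 0, n;
    long tail;

    while (item_matches(item, text[max])) max++;   /* longest possible run */

    if (greedy) {
        /* Longest run first, then back up one character at a time. */
        for (n = max; ; n--) {
            tail = match_here(rest, text + n, greedy);
            if (tail >= 0) return (long)n + tail;
            if (n == 0) return -1;
        }
    }
    /* Ungreedy: shortest run first. */
    for (n = 0; n <= max; n++) {
        tail = match_here(rest, text + n, greedy);
        if (tail >= 0) return (long)n + tail;
    }
    return -1;
}

/* Return the length of the match of pat at the start of text, or -1. */
long match_here(const char *pat, const char *text, int greedy)
{
    long tail;

    assert(pat != NULL && text != NULL);
    if (pat[0] == '\0') return 0;       /* leaf reached: stop searching */
    if (pat[1] == '*') return match_star(pat[0], pat + 2, text, greedy);
    if (!item_matches(pat[0], text[0])) return -1;
    tail = match_here(pat + 1, text + 1, greedy);
    return (tail < 0) ? -1 : 1 + tail;
}
```

       For the pattern "<.*>" and a subject that starts with several
       bracketed items, the greedy setting returns the longest candidate and
       the ungreedy setting the shortest; in both cases the search stops at
       the first leaf it reaches.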
5688
5689
5690THE ALTERNATIVE MATCHING ALGORITHM
5691
5692       This algorithm conducts a breadth-first search of  the  tree.  Starting
5693       from  the  first  matching  point  in the subject, it scans the subject
5694       string from left to right, once, character by character, and as it does
5695       this,  it remembers all the paths through the tree that represent valid
5696       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
5697       though  it is not implemented as a traditional finite state machine (it
5698       keeps multiple states active simultaneously).
5699
5700       Although the general principle of this matching algorithm  is  that  it
5701       scans  the subject string only once, without backtracking, there is one
5702       exception: when a lookaround assertion is encountered,  the  characters
5703       following  or  preceding the current point have to be independently in-
5704       spected.
5705
5706       The scan continues until either the end of the subject is  reached,  or
5707       there  are  no more unterminated paths. At this point, terminated paths
5708       represent the different matching possibilities (if there are none,  the
5709       match  has  failed).   Thus,  if there is more than one possible match,
5710       this algorithm finds all of them, and in particular, it finds the long-
5711       est.  The matches are returned in the output vector in decreasing order
5712       of length. There is an option to stop the  algorithm  after  the  first
5713       match (which is necessarily the shortest) is found.
5714
5715       Note  that the size of vector needed to contain all the results depends
5716       on the number of simultaneous matches, not on the number of parentheses
5717       in  the pattern. Using pcre2_match_data_create_from_pattern() to create
5718       the match data block is therefore not advisable when doing  DFA  match-
5719       ing.
5720
5721       Note  also  that all the matches that are found start at the same point
5722       in the subject. If the pattern
5723
5724         cat(er(pillar)?)?
5725
5726       is matched against the string "the caterpillar catchment",  the  result
5727       is  the  three  strings "caterpillar", "cater", and "cat" that start at
5728       the fifth character of the subject. The algorithm  does  not  automati-
5729       cally move on to find matches that start at later positions.
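
       The caterpillar example can be reproduced with a toy  breadth-first
       scan. This is a sketch only: the pattern is expanded by hand into the
       three literal strings it can match, and each one is treated as a path
       that stays active while it agrees with the subject.  The  function
       all_matches_at() and the MAX_PATHS limit are hypothetical; the  real
       pcre2_dfa_match() works on the compiled pattern.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PATHS 8

/* Scan subject once from offset start (which must not be beyond the end of
   the subject), keeping every alternative that still agrees with the
   characters seen so far. Store the lengths of all completed alternatives
   in lengths[], longest first, and return how many were found. */
size_t all_matches_at(const char *const alts[], size_t nalts,
                      const char *subject, size_t start, size_t lengths[])
{
    int alive[MAX_PATHS];
    size_t found = 0, nalive = nalts, pos, i;

    assert(nalts <= MAX_PATHS);
    for (i = 0; i < nalts; i++) alive[i] = 1;

    for (pos = 0; nalive > 0; pos++) {
        char c = subject[start + pos];
        for (i = 0; i < nalts; i++) {
            if (!alive[i]) continue;
            if (alts[i][pos] == '\0') {
                lengths[found++] = pos;          /* path terminated: a match */
                alive[i] = 0;
                nalive--;
            } else if (c == '\0' || alts[i][pos] != c) {
                alive[i] = 0;                    /* path failed */
                nalive--;
            }
        }
    }

    /* Matches were recorded shortest first; reverse to longest first. */
    for (i = 0; i < found / 2; i++) {
        size_t tmp = lengths[i];
        lengths[i] = lengths[found - 1 - i];
        lengths[found - 1 - i] = tmp;
    }
    return found;
}
```

       Called with the alternatives "cat", "cater" and "caterpillar" on the
       subject "the caterpillar catchment" at offset 4, the sketch reports
       lengths 11, 5 and 3. Finding the later "cat" needs a second call with
       a later starting offset.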
5730
5731       PCRE2's "auto-possessification" optimization usually applies to charac-
5732       ter repeats at the end of a pattern (as well as internally). For  exam-
5733       ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
5734       is no point even considering the possibility of backtracking  into  the
5735       repeated  digits.  For  DFA matching, this means that only one possible
5736       match is found. If you really do want multiple matches in  such  cases,
5737       either  use  an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
5738       SESS option when compiling.
5739
5740       There are a number of features of PCRE2 regular  expressions  that  are
5741       not  supported  or behave differently in the alternative matching func-
5742       tion. Those that are not supported cause an error if encountered.
5743
5744       1. Because the algorithm finds all possible matches, the greedy or  un-
5745       greedy  nature of repetition quantifiers is not relevant (though it may
5746       affect auto-possessification,  as  just  described).  During  matching,
5747       greedy  and  ungreedy  quantifiers are treated in exactly the same way.
5748       However, possessive quantifiers can make a difference when what follows
5749       could  also  match  what  is  quantified, for example in a pattern like
5750       this:
5751
5752         ^a++\w!
5753
5754       This pattern matches "aaab!" but not "aaa!", which would be matched  by
5755       a  non-possessive quantifier. Similarly, if an atomic group is present,
5756       it is matched as if it were a standalone pattern at the current  point,
5757       and  the  longest match is then "locked in" for the rest of the overall
5758       pattern.
5759
5760       2. When dealing with multiple paths through the tree simultaneously, it
5761       is  not  straightforward  to  keep track of captured substrings for the
5762       different matching possibilities, and PCRE2's  implementation  of  this
5763       algorithm does not attempt to do this. This means that no captured sub-
5764       strings are available.
5765
5766       3. Because no substrings are captured, backreferences within  the  pat-
5767       tern are not supported.
5768
5769       4.  For  the same reason, conditional expressions that use a backrefer-
5770       ence as the condition or test for a specific group  recursion  are  not
5771       supported.
5772
5773       5. Again for the same reason, script runs are not supported.
5774
5775       6. Because many paths through the tree may be active, the \K escape se-
5776       quence, which resets the start of the match when encountered  (but  may
5777       be on some paths and not on others), is not supported.
5778
5779       7.  Callouts  are  supported, but the value of the capture_top field is
5780       always 1, and the value of the capture_last field is always 0.
5781
5782       8. The \C escape sequence, which (in  the  standard  algorithm)  always
5783       matches  a  single  code  unit, even in a UTF mode, is not supported in
5784       these modes, because the alternative algorithm moves through  the  sub-
5785       ject  string  one  character  (not code unit) at a time, for all active
5786       paths through the tree.
5787
5788       9. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
5789       are  not  supported.  (*FAIL)  is supported, and behaves like a failing
5790       negative assertion.
5791
5792       10. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is not  sup-
5793       ported by pcre2_dfa_match().
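
       Item 1 above notes that a possessive quantifier changes what matches.
       The difference can be seen in a hand-coded matcher for  the  example
       pattern. This is an illustrative sketch, not how PCRE2 evaluates  the
       pattern; match_a_plus_word_bang() is a hypothetical name, and \w  is
       approximated by ASCII letters, digits and underscore.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hand-coded matcher for ^a+\w! (possessive == 0) or ^a++\w!
   (possessive != 0). Returns 1 if the pattern matches at the start of s,
   0 otherwise. */
int match_a_plus_word_bang(const char *s, int possessive)
{
    size_t run = 0, keep, min_keep;

    assert(s != NULL);
    while (s[run] == 'a') run++;        /* the quantifier takes every 'a' */
    if (run == 0) return 0;

    /* A possessive quantifier never gives characters back; an ordinary
       one may give back all but one. */
    min_keep = possessive ? run : 1;
    for (keep = run; keep >= min_keep; keep--) {
        unsigned char next = (unsigned char)s[keep];
        if ((isalnum(next) || next == '_') && s[keep + 1] == '!')
            return 1;
    }
    return 0;
}
```

       With "aaa!" the possessive form fails because the final "a" is never
       released for \w to match, whereas the non-possessive form succeeds.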
5794
5795
5796ADVANTAGES OF THE ALTERNATIVE ALGORITHM
5797
5798       The  main  advantage  of the alternative algorithm is that all possible
5799       matches (at a single point in the subject) are automatically found, and
5800       in  particular, the longest match is found. To find more than one match
5801       at the same point using the standard algorithm, you have to  do  kludgy
5802       things with callouts.
5803
5804       Partial  matching  is  possible with this algorithm, though it has some
5805       limitations. The pcre2partial documentation gives  details  of  partial
5806       matching and discusses multi-segment matching.
5807
5808
5809DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
5810
5811       The alternative algorithm suffers from a number of disadvantages:
5812
5813       1.  It  is  substantially  slower  than the standard algorithm. This is
5814       partly because it has to search for all possible matches, but  is  also
5815       because it is less susceptible to optimization.
5816
5817       2.  Capturing  parentheses,  backreferences,  script runs, and matching
5818       within invalid UTF strings are not supported.
5819
5820       3. Although atomic groups are supported, their use does not provide the
5821       performance advantage that it does for the standard algorithm.
5822
5823       4. JIT optimization is not supported.
5824
5825
5826AUTHOR
5827
5828       Philip Hazel
5829       Retired from University Computing Service
5830       Cambridge, England.
5831
5832
5833REVISION
5834
5835       Last updated: 28 August 2021
5836       Copyright (c) 1997-2021 University of Cambridge.
5837------------------------------------------------------------------------------
5838
5839
5840PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)
5841
5842
5843
5844NAME
5845       PCRE2 - Perl-compatible regular expressions
5846
5847PARTIAL MATCHING IN PCRE2
5848
5849       In  normal use of PCRE2, if there is a match up to the end of a subject
5850       string, but more characters are needed to  match  the  entire  pattern,
5851       PCRE2_ERROR_NOMATCH  is  returned,  just  like any other failing match.
5852       There are circumstances where it might be helpful to  distinguish  this
5853       "partial match" case.
5854
5855       One  example  is  an application where the subject string is very long,
5856       and not all available at once. The requirement here is to be able to do
5857       the  matching  segment  by segment, but special action is needed when a
5858       matched substring spans the boundary between two segments.
5859
5860       Another example is checking a user input string as it is typed, to  en-
5861       sure  that  it conforms to a required format. Invalid characters can be
5862       immediately diagnosed and rejected, giving instant feedback.
5863
5864       Partial matching is a PCRE2-specific feature; it is  not  Perl-compati-
5865       ble.  It  is  requested  by  setting  one  of the PCRE2_PARTIAL_HARD or
5866       PCRE2_PARTIAL_SOFT options when calling a matching function.  The  dif-
5867       ference  between  the  two options is whether or not a partial match is
5868       preferred to an alternative complete match, though the  details  differ
5869       between  the  two  types of matching function. If both options are set,
5870       PCRE2_PARTIAL_HARD takes precedence.
5871
5872       If you want to use partial matching with just-in-time  optimized  code,
5873       as  well  as  setting a partial match option for the matching function,
5874       you must also call pcre2_jit_compile() with one or both  of  these  op-
5875       tions:
5876
5877         PCRE2_JIT_PARTIAL_HARD
5878         PCRE2_JIT_PARTIAL_SOFT
5879
5880       PCRE2_JIT_COMPLETE  should also be set if you are going to run non-par-
5881       tial matches on the same pattern. Separate code is  compiled  for  each
5882       mode.  If  the appropriate JIT mode has not been compiled, interpretive
5883       matching code is used.
5884
5885       Setting a partial matching option disables two of PCRE2's standard  op-
5886       timization  hints. PCRE2 remembers the last literal code unit in a pat-
5887       tern, and abandons matching immediately if it is  not  present  in  the
5888       subject  string.  This optimization cannot be used for a subject string
5889       that might match only partially. PCRE2 also remembers a minimum  length
5890       of  a matching string, and does not bother to run the matching function
5891       on shorter strings. This optimization  is  also  disabled  for  partial
5892       matching.
5893
5894
5895REQUIREMENTS FOR A PARTIAL MATCH
5896
5897       A  possible  partial  match  occurs during matching when the end of the
5898       subject string is reached successfully, but either more characters  are
5899       needed  to complete the match, or the addition of more characters might
5900       change what is matched.
5901
5902       Example 1: if the pattern is /abc/ and the subject is "ab", more  char-
5903       acters  are  definitely  needed  to complete a match. In this case both
5904       hard and soft matching options yield a partial match.
5905
5906       Example 2: if the pattern is /ab+/ and the subject is "ab", a  complete
5907       match  can  be  found, but the addition of more characters might change
5908       what is matched. In this case, only PCRE2_PARTIAL_HARD returns  a  par-
5909       tial match; PCRE2_PARTIAL_SOFT returns the complete match.
5910
5911       On  reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if
5912       the next pattern item is \z, \Z, \b, \B, or $ there is always a partial
5913       match.   Otherwise, for both options, the next pattern item must be one
5914       that inspects a character, and at least one of the  following  must  be
5915       true:
5916
5917       (1)  At  least  one  character has already been inspected. An inspected
5918       character need not form part of the final  matched  string;  lookbehind
5919       assertions  and the \K escape sequence provide ways of inspecting char-
5920       acters before the start of a matched string.
5921
5922       (2) The pattern contains one or more lookbehind assertions. This condi-
5923       tion  exists in case there is a lookbehind that inspects characters be-
5924       fore the start of the match.
5925
5926       (3) There is a special case when the whole pattern can match  an  empty
5927       string.   When  the  starting  point  is at the end of the subject, the
5928       empty string match is a possibility, and if PCRE2_PARTIAL_SOFT  is  set
5929       and  neither  of the above conditions is true, it is returned. However,
5930       because adding more characters  might  result  in  a  non-empty  match,
5931       PCRE2_PARTIAL_HARD  returns  a  partial match, which in this case means
5932       "there is going to be a match at this point, but until some more  char-
5933       acters are added, we do not know if it will be an empty string or some-
5934       thing longer".
5935
5936
5937PARTIAL MATCHING USING pcre2_match()
5938
5939       When  a  partial  matching  option  is  set,  the  result  of   calling
5940       pcre2_match() can be one of the following:
5941
5942       A successful match
5943         A complete match has been found, starting and ending within this sub-
5944         ject.
5945
5946       PCRE2_ERROR_NOMATCH
5947         No match can start anywhere in this subject.
5948
5949       PCRE2_ERROR_PARTIAL
5950         Adding more characters may result in a complete match that  uses  one
5951         or more characters from the end of this subject.
5952
5953       When a partial match is returned, the first two elements in the ovector
5954       point to the portion of the subject that was matched, but the values in
5955       the rest of the ovector are undefined. The appearance of \K in the pat-
5956       tern has no effect for a partial match. Consider this pattern:
5957
5958         /abc\K123/
5959
5960       If it is matched against "456abc123xyz" the result is a complete match,
5961       and  the ovector defines the matched string as "123", because \K resets
5962       the "start of match" point. However, if a partial  match  is  requested
5963       and  the subject string is "456abc12", a partial match is found for the
5964       string "abc12", because all these characters are needed  for  a  subse-
5965       quent re-match with additional characters.
5966
5967       If  there  is more than one partial match, the first one that was found
5968       provides the data that is returned. Consider this pattern:
5969
5970         /123\w+X|dogY/
5971
5972       If this is matched against the subject string "abc123dog", both  alter-
5973       natives  fail  to  match,  but the end of the subject is reached during
5974       matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to  3
5975       and  9, identifying "123dog" as the first partial match. (In this exam-
5976       ple, there are two partial matches, because "dog" on its own  partially
5977       matches the second alternative.)
5978
5979   How a partial match is processed by pcre2_match()
5980
5981       What happens when a partial match is identified depends on which of the
5982       two partial matching options is set.
5983
5984       If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned  as  soon
5985       as  a partial match is found, without continuing to search for possible
5986       complete matches. This option is "hard" because it prefers  an  earlier
5987       partial match over a later complete match. For this reason, the assump-
5988       tion is made that the end of the supplied subject  string  is  not  the
5989       true  end of the available data, which is why \z, \Z, \b, \B, and $ al-
5990       ways give a partial match.
5991
5992       If PCRE2_PARTIAL_SOFT is set, the  partial  match  is  remembered,  but
5993       matching continues as normal, and other alternatives in the pattern are
5994       tried. If no complete match can be found,  PCRE2_ERROR_PARTIAL  is  re-
5995       turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it
5996       prefers a complete match over a partial match. All the various matching
5997       items  in a pattern behave as if the subject string is potentially com-
5998       plete; \z, \Z, and $ match at the end of the subject,  as  normal,  and
5999       for \b and \B the end of the subject is treated as a non-alphanumeric.
6000
6001       The  difference  between the two partial matching options can be illus-
6002       trated by a pattern such as:
6003
6004         /dog(sbody)?/
6005
6006       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
6007       the  longer  string  if  possible). If it is matched against the string
6008       "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for  "dog".
6009       However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
6010       TIAL. On the other hand, if the pattern is made ungreedy the result  is
6011       different:
6012
6013         /dog(sbody)??/
6014
6015       In  this  case  the  result  is always a complete match because that is
6016       found first, and matching never  continues  after  finding  a  complete
6017       match. It might be easier to follow this explanation by thinking of the
6018       two patterns like this:
6019
6020         /dog(sbody)?/    is the same as  /dogsbody|dog/
6021         /dog(sbody)??/   is the same as  /dog|dogsbody/
6022
6023       The second pattern will never match "dogsbody", because it will  always
6024       find the shorter match first.
6025
6026   Example of partial matching using pcre2test
6027
6028       The  pcre2test data modifiers partial_hard (or ph) and partial_soft (or
6029       ps) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,  respectively,  when
6030       calling  pcre2_match(). Here is a run of pcre2test using a pattern that
6031       matches the whole subject in the form of a date:
6032
6033           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
6034         data> 25dec3\=ph
6035         Partial match: 25dec3
6036         data> 3ju\=ph
6037         Partial match: 3ju
6038         data> 3juj\=ph
6039         No match
6040
6041       This example gives the same results for  both  hard  and  soft  partial
6042       matching options. Here is an example where there is a difference:
6043
6044           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
6045         data> 25jun04\=ps
6046          0: 25jun04
6047          1: jun
6048         data> 25jun04\=ph
6049         Partial match: 25jun04
6050
6051       With   PCRE2_PARTIAL_SOFT,  the  subject  is  matched  completely.  For
6052       PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete,
6053       so there is only a partial match.
6054
6055
6056MULTI-SEGMENT MATCHING WITH pcre2_match()
6057
6058       PCRE  was  not originally designed with multi-segment matching in mind.
6059       However, over time, features (including  partial  matching)  that  make
6060       multi-segment matching possible have been added. A very long string can
6061       be searched segment by segment  by  calling  pcre2_match()  repeatedly,
6062       with the aim of achieving the same results that would happen if the en-
6063       tire string was available for searching all  the  time.  Normally,  the
6064       strings  that  are  being  sought are much shorter than each individual
6065       segment, and are in the middle of very long strings, so the pattern  is
6066       normally not anchored.
6067
6068       Special  logic  must  be implemented to handle a matched substring that
6069       spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it
6070       returns  a  partial match at the end of a segment whenever there is the
6071       possibility of changing  the  match  by  adding  more  characters.  The
6072       PCRE2_NOTBOL option should also be set for all but the first segment.
6073
6074       When a partial match occurs, the next segment must be added to the cur-
6075       rent subject and the match re-run, using the  startoffset  argument  of
6076       pcre2_match()  to  begin  at the point where the partial match started.
6077       For example:
6078
6079           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
6080         data> ...the date is 23ja\=ph
6081         Partial match: 23ja
6082         data> ...the date is 23jan19 and on that day...\=offset=15
6083          0: 23jan19
6084          1: jan
6085
6086       Note the use of the offset modifier to start the new  match  where  the
6087       partial match was found. In this example, the next segment was added to
6088       the one in which  the  partial  match  was  found.  This  is  the  most
6089       straightforward approach, typically using a memory buffer that is twice
6090       the size of each segment. After a partial match, the first half of  the
6091       buffer  is discarded, the second half is moved to the start of the buf-
6092       fer, and a new segment is added before repeating the match  as  in  the
6093       example above. After a no match, the entire buffer can be discarded.
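
       The double-size buffer scheme can be sketched in C as follows. This  is
       a  minimal  illustration  only, not code from the PCRE2 distribution: the
       names seg_window, window_push, and SEG_SIZE are  hypothetical,  and  a
       very  small  segment size is assumed to keep the example readable. The
       function returns the number of code units dropped from  the  front  of
       the  buffer,  so that the remembered start of a partial match can be ad-
       justed before it is passed as the startoffset argument of pcre2_match().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEG_SIZE 8                 /* hypothetical segment size (code units) */

typedef struct {
    char   data[2 * SEG_SIZE];     /* room for two segments */
    size_t len;                    /* code units currently held */
} seg_window;

/* Append a new segment. If more than one segment's worth of data is
   already held, the oldest data is dropped first, so that at most
   SEG_SIZE code units are kept in front of the new segment.
   Returns the number of code units dropped from the front. */
static size_t window_push(seg_window *w, const char *seg, size_t seg_len)
{
    size_t dropped = 0;

    assert(seg_len <= SEG_SIZE);   /* precondition of this sketch */
    if (w->len > SEG_SIZE) {
        dropped = w->len - SEG_SIZE;
        memmove(w->data, w->data + dropped, SEG_SIZE);
        w->len = SEG_SIZE;
    }
    memcpy(w->data + w->len, seg, seg_len);
    w->len += seg_len;
    return dropped;
}
```

       If a partial match started at offset p in the old buffer contents,  the
       re-run  starts  at p minus the returned value. The sketch assumes that p
       is not less than the returned value, which holds when the strings  be-
       ing sought are shorter than one segment.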
6094
6095       If there are memory constraints, you may want to discard text that pre-
6096       cedes a partial match before adding the  next  segment.  Unfortunately,
6097       this  is  not  at  present straightforward. In cases such as the above,
6098       where the pattern does not contain any lookbehinds, it is sufficient to
6099       retain  only  the  partially matched substring. However, if the pattern
6100       contains a lookbehind assertion, characters that precede the  start  of
6101       the  partial match may have been inspected during the matching process.
6102       When pcre2test displays a partial match, it indicates these  characters
6103       with '<' if the allusedtext modifier is set:
6104
6105           re> "(?<=123)abc"
6106         data> xx123ab\=ph,allusedtext
6107         Partial match: 123ab
6108                        <<<
6109
6110       However,  the  allusedtext  modifier is not available for JIT matching,
6111       because JIT matching does not record  the  first  (or  last)  consulted
6112       characters.  For this reason, this information is not available via the
6113       API. It is therefore not possible in general to obtain the exact number
6114       of characters that must be retained in order to get the right match re-
6115       sult. If you cannot retain the  entire  segment,  you  must  find  some
6116       heuristic way of choosing.
6117
6118       If  you know the approximate length of the matching substrings, you can
6119       use that to decide how much text to retain. The only lookbehind  infor-
6120       mation  that  is  currently  available via the API is the length of the
6121       longest individual lookbehind in a pattern, but this can be  misleading
6122       if  there  are  nested  lookbehinds.  The  value  returned  by  calling
6123       pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND  option  is  the
6124       maximum number of characters (not code units) that any individual look-
6125       behind  moves  back  when  it  is  processed.   A   pattern   such   as
6126       "(?<=(?<!b)a)"  has a maximum lookbehind value of one, but inspects two
6127       characters before its starting point.
6128
6129       In a non-UTF or a 32-bit case, moving back is just a  subtraction,  but
6130       in  UTF-8  or  UTF-16  you  have  to count characters while moving back
6131       through the code units.
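
       For UTF-8, counting characters while moving back might  look  like  the
       following  sketch.  The function name utf8_back is hypothetical, and the
       code assumes that the buffer holds valid UTF-8. The character count  to
       pass would typically be derived from the PCRE2_INFO_MAXLOOKBEHIND value,
       bearing in mind the caveat about nested lookbehinds described above.

```c
#include <assert.h>
#include <stddef.h>

/* Move back by up to nchars characters from code unit offset pos in a
   valid UTF-8 buffer, and return the new offset. The result is 0 if
   the start of the buffer is reached first. */
static size_t utf8_back(const unsigned char *buf, size_t pos, size_t nchars)
{
    while (nchars > 0 && pos > 0) {
        pos--;
        /* Continuation bytes have the form 10xxxxxx; skip over them to
           reach the first code unit of the character. */
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
            pos--;
        nchars--;
    }
    return pos;
}
```

       In a non-UTF 8-bit buffer the same result is simply pos  minus  nchars,
       limited to zero.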
6132
6133
6134PARTIAL MATCHING USING pcre2_dfa_match()
6135
6136       The DFA function moves along the subject string character by character,
6137       without  backtracking,  searching  for  all possible matches simultane-
6138       ously. If the end of the subject is reached before the end of the  pat-
6139       tern, there is the possibility of a partial match.
6140
6141       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
6142       there have been no complete matches. Otherwise,  the  complete  matches
6143       are  returned.   If  PCRE2_PARTIAL_HARD  is  set, a partial match takes
6144       precedence over any complete matches. The portion of  the  string  that
6145       was  matched  when  the  longest  partial match was found is set as the
6146       first matching string.
6147
6148       Because the DFA function always searches for all possible matches,  and
6149       there  is no difference between greedy and ungreedy repetition, its be-
6150       haviour  differs  from  that of pcre2_match(). Consider the string "dog"
6151       matched against this ungreedy pattern:
6152
6153         /dog(sbody)??/
6154
6155       Whereas  the  standard  function stops as soon as it finds the complete
6156       match for "dog", the DFA function also  finds  the  partial  match  for
6157       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
6158
6159
6160MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
6161
6162       When a partial match has been found using the DFA matching function, it
6163       is possible to continue the match by providing additional subject  data
6164       and  calling  the function again with the same compiled regular expres-
6165       sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
6166       same working space as before, because this is where details of the pre-
6167       vious partial match are stored. You can set the  PCRE2_PARTIAL_SOFT  or
6168       PCRE2_PARTIAL_HARD  options  with PCRE2_DFA_RESTART to continue partial
6169       matching over multiple segments. Here is an example using pcre2test:
6170
6171           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
6172         data> 23ja\=dfa,ps
6173         Partial match: 23ja
6174         data> n05\=dfa,dfa_restart
6175          0: n05
6176
6177       The first call has "23ja" as the subject, and requests  partial  match-
6178       ing;  the  second  call  has  "n05"  as  the  subject for the continued
6179       (restarted) match.  Notice that when the match is  complete,  only  the
6180       last  part  is  shown;  PCRE2 does not retain the previously partially-
6181       matched string. It is up to the calling program to do that if it  needs
6182       to.  This  means  that, for an unanchored pattern, if a continued match
6183       fails, it is not possible to try again at a  new  starting  point.  All
6184       this facility is capable of doing is continuing with the previous match
6185       attempt. For example, consider this pattern:
6186
6187         1234|3789
6188
6189       If the first part of the subject is "ABC123", a partial  match  of  the
6190       first  alternative  is found at offset 3. There is no partial match for
6191       the second alternative, because such a match does not start at the same
6192       point  in  the  subject  string. Attempting to continue with the string
6193       "7890" does not yield a match  because  only  those  alternatives  that
6194       match  at one point in the subject are remembered. Depending on the ap-
6195       plication, this may or may not be what you want.
6196
6197       If you do want to allow for starting again at the next  character,  one
6198       way  of  doing it is to retain some or all of the segment and try a new
6199       complete match, as described for pcre2_match() above. Another possibil-
6200       ity  is to work with two buffers. If a partial match at offset n in the
6201       first buffer is followed by "no match" when PCRE2_DFA_RESTART  is  used
6202       on  the  second buffer, you can then try a new match starting at offset
6203       n+1 in the first buffer.
6204
6205
6206AUTHOR
6207
6208       Philip Hazel
6209       University Computing Service
6210       Cambridge, England.
6211
6212
6213REVISION
6214
6215       Last updated: 04 September 2019
6216       Copyright (c) 1997-2019 University of Cambridge.
6217------------------------------------------------------------------------------
6218
6219
6220PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)
6221
6222
6223
6224NAME
6225       PCRE2 - Perl-compatible regular expressions (revised API)
6226
6227PCRE2 REGULAR EXPRESSION DETAILS
6228
6229       The  syntax and semantics of the regular expressions that are supported
6230       by PCRE2 are described in detail below. There is a quick-reference syn-
6231       tax  summary  in the pcre2syntax page. PCRE2 tries to match Perl syntax
6232       and semantics as closely as it can.  PCRE2 also supports some  alterna-
6233       tive  regular  expression syntax (which does not conflict with the Perl
6234       syntax) in order to provide some compatibility with regular expressions
6235       in Python, .NET, and Oniguruma.
6236
6237       Perl's  regular expressions are described in its own documentation, and
6238       regular expressions in general are covered in a number of  books,  some
6239       of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex-
6240       pressions", published by O'Reilly, covers regular expressions in  great
6241       detail.  This description of PCRE2's regular expressions is intended as
6242       reference material.
6243
6244       This document discusses the regular expression patterns that  are  sup-
6245       ported  by  PCRE2  when  its  main matching function, pcre2_match(), is
6246       used.   PCRE2   also   has   an    alternative    matching    function,
6247       pcre2_dfa_match(),  which  matches  using a different algorithm that is
6248       not Perl-compatible. Some of  the  features  discussed  below  are  not
6249       available  when  DFA matching is used. The advantages and disadvantages
6250       of the alternative function, and how it differs from the  normal  func-
6251       tion, are discussed in the pcre2matching page.
6252
6253
6254SPECIAL START-OF-PATTERN ITEMS
6255
6256       A  number  of options that can be passed to pcre2_compile() can also be
6257       set by special items at the start of a pattern. These are not Perl-com-
6258       patible,  but  are provided to make these options accessible to pattern
6259       writers who are not able to change the program that processes the  pat-
6260       tern.  Any  number  of these items may appear, but they must all be to-
6261       gether right at the start of the pattern string, and the  letters  must
6262       be in upper case.
6263
6264   UTF support
6265
6266       In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
6267       as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
6268       can  be  specified  for the 32-bit library, in which case it constrains
6269       the character values to valid  Unicode  code  points.  To  process  UTF
6270       strings,  PCRE2  must be built to include Unicode support (which is the
6271       default). When using UTF strings you must  either  call  the  compiling
6272       function  with  one or both of the PCRE2_UTF or PCRE2_MATCH_INVALID_UTF
6273       options, or the pattern must start with the  special  sequence  (*UTF),
6274       which  is  equivalent  to setting the PCRE2_UTF option.  How  setting  a
6275       UTF mode affects pattern matching is mentioned in several places below.
6276       There is also a summary of features in the pcre2unicode page.
6277
6278       Some applications that allow their users to supply patterns may wish to
6279       restrict  them  to  non-UTF  data  for   security   reasons.   If   the
6280       PCRE2_NEVER_UTF  option is passed to pcre2_compile(), (*UTF) is not al-
6281       lowed, and its appearance in a pattern causes an error.
6282
6283   Unicode property support
6284
6285       Another special sequence that may appear at the start of a  pattern  is
6286       (*UCP).   This  has the same effect as setting the PCRE2_UCP option: it
6287       causes sequences such as \d and \w to use Unicode properties to  deter-
6288       mine character types, instead of recognizing only characters with codes
6289       less than 256 via a lookup table. It also causes upper/lower casing op-
6290       erations  to  use  Unicode  properties  for characters with code points
6291       greater than 127, even when UTF is not set.
6292
6293       Some applications that allow their users to supply patterns may wish to
6294       restrict  them  for  security reasons. If the PCRE2_NEVER_UCP option is
6295       passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
6296       a pattern causes an error.
6297
6298   Locking out empty string matching
6299
6300       Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
6301       effect as passing the PCRE2_NOTEMPTY or  PCRE2_NOTEMPTY_ATSTART  option
6302       to whichever matching function is subsequently called to match the pat-
6303       tern. These options lock out the matching of empty strings, either  en-
6304       tirely, or only at the start of the subject.
6305
6306   Disabling auto-possessification
6307
6308       If  a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
6309       setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from  making
6310       quantifiers  possessive  when  what  follows  cannot match the repeated
6311       item. For example, by default a+b is treated as a++b. For more details,
6312       see the pcre2api documentation.
6313
6314   Disabling start-up optimizations
6315
6316       If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as
6317       setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
6318       mizations  for  quickly  reaching "no match" results. For more details,
6319       see the pcre2api documentation.
6320
6321   Disabling automatic anchoring
6322
6323       If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the  same  effect
6324       as  setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
6325       tions that apply to patterns whose top-level branches all start with .*
6326       (match  any  number of arbitrary characters). For more details, see the
6327       pcre2api documentation.
6328
6329   Disabling JIT compilation
6330
6331       If a pattern that starts with (*NO_JIT) is  successfully  compiled,  an
6332       attempt  by  the  application  to apply the JIT optimization by calling
6333       pcre2_jit_compile() is ignored.
6334
6335   Setting match resource limits
6336
6337       The pcre2_match() function contains a counter that is incremented every
6338       time it goes round its main loop. The caller of pcre2_match() can set a
6339       limit on this counter, which therefore limits the amount  of  computing
6340       resource used for a match. The maximum depth of nested backtracking can
6341       also be limited; this indirectly restricts the amount  of  heap  memory
6342       that  is  used,  but there is also an explicit memory limit that can be
6343       set.
6344
6345       These facilities are provided to catch runaway matches  that  are  pro-
6346       voked  by patterns with huge matching trees. A common example is a pat-
6347       tern with nested unlimited repeats applied to a long string  that  does
6348       not  match. When one of these limits is reached, pcre2_match() gives an
6349       error return. The limits can also be set by items at the start  of  the
6350       pattern of the form
6351
6352         (*LIMIT_HEAP=d)
6353         (*LIMIT_MATCH=d)
6354         (*LIMIT_DEPTH=d)
6355
6356       where d is any number of decimal digits. However, the value of the set-
6357       ting must be less than the value set (or defaulted) by  the  caller  of
6358       pcre2_match()  for  it  to have any effect. In other words, the pattern
6359       writer can lower the limits set by the programmer, but not raise  them.
6360       If  there  is  more than one setting of one of these limits, the lowest
6361       value is used. The heap limit is specified in kibibytes (units of  1024
6362       bytes).
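
       The way a caller's limit combines with values given in the pattern  can
       be  modelled  in  a  few  lines  of  C. This is an illustration of the
       documented rule only, not the library's implementation, and  the  name
       effective_limit is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the rule: settings in the pattern can lower the limit set
   (or defaulted) by the caller, but can never raise it. */
static uint32_t effective_limit(uint32_t caller_limit,
                                const uint32_t *pattern_limits, size_t count)
{
    uint32_t limit = caller_limit;
    size_t i;

    for (i = 0; i < count; i++) {
        if (pattern_limits[i] < limit)
            limit = pattern_limits[i];   /* a lower setting takes effect */
    }
    return limit;                        /* higher settings are ignored */
}
```

       For example, with a caller limit of 1000, a pattern that  starts  with
       (*LIMIT_MATCH=5000)(*LIMIT_MATCH=200) is modelled as running with a limit
       of 200.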
6363
6364       Prior  to  release  10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
6365       name is still recognized for backwards compatibility.
6366
6367       The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
6368       interpreters are used for matching. It does not apply to JIT. The match
6369       limit is used (but in a different way) when JIT is being used, or  when
6370       pcre2_dfa_match() is called, to limit computing resource usage by those
6371       matching functions. The depth limit is ignored by JIT but  is  relevant
6372       for  DFA  matching, which uses function recursion for recursions within
6373       the pattern and for lookaround assertions and atomic  groups.  In  this
6374       case, the depth limit controls the depth of such recursion.
6375
6376   Newline conventions
6377
6378       PCRE2  supports six different conventions for indicating line breaks in
6379       strings: a single CR (carriage return) character, a  single  LF  (line-
6380       feed) character, the two-character sequence CRLF, any of the three pre-
6381       ceding, any Unicode newline sequence,  or  the  NUL  character  (binary
6382       zero).  The  pcre2api  page  has further discussion about newlines, and
6383       shows how to set the newline convention when calling pcre2_compile().
6384
6385       It is also possible to specify a newline convention by starting a  pat-
6386       tern string with one of the following sequences:
6387
6388         (*CR)        carriage return
6389         (*LF)        linefeed
6390         (*CRLF)      carriage return, followed by linefeed
6391         (*ANYCRLF)   any of the three above
6392         (*ANY)       all Unicode newline sequences
6393         (*NUL)       the NUL character (binary zero)
6394
6395       These override the default and the options given to the compiling func-
6396       tion. For example, on a Unix system where LF is the default newline se-
6397       quence, the pattern
6398
6399         (*CR)a.b
6400
6401       changes the convention to CR. That pattern matches "a\nb" because LF is
6402       no longer a newline. If more than one of these settings is present, the
6403       last one is used.
6404
6405       The  newline  convention affects where the circumflex and dollar asser-
6406       tions are true. It also affects the interpretation of the dot metachar-
6407       acter  when  PCRE2_DOTALL  is not set, and the behaviour of \N when not
6408       followed by an opening brace. However, it does not affect what  the  \R
6409       escape  sequence  matches.  By default, this is any Unicode newline se-
6410       quence, for Perl compatibility. However, this can be changed;  see  the
6411       next section and the description of \R in the section entitled "Newline
6412       sequences" below. A change of \R setting can be combined with a  change
6413       of newline convention.
6414
6415   Specifying what \R matches
6416
6417       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
6418       the complete set  of  Unicode  line  endings)  by  setting  the  option
6419       PCRE2_BSR_ANYCRLF  at compile time. This effect can also be achieved by
6420       starting a pattern with (*BSR_ANYCRLF).  For  completeness,  (*BSR_UNI-
6421       CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
6422
6423
6424EBCDIC CHARACTER CODES
6425
6426       PCRE2  can be compiled to run in an environment that uses EBCDIC as its
6427       character code instead of ASCII or Unicode (typically a mainframe  sys-
6428       tem).  In  the  sections below, character code values are ASCII or Uni-
6429       code; in an EBCDIC environment these characters may have different code
6430       values, and there are no code points greater than 255.
6431
6432
6433CHARACTERS AND METACHARACTERS
6434
6435       A  regular  expression  is  a pattern that is matched against a subject
6436       string from left to right. Most characters stand for  themselves  in  a
6437       pattern,  and  match  the corresponding characters in the subject. As a
6438       trivial example, the pattern
6439
6440         The quick brown fox
6441
6442       matches a portion of a subject string that is identical to itself. When
6443       caseless  matching  is  specified  (the  PCRE2_CASELESS  option or (?i)
6444       within the pattern), letters are matched independently  of  case.  Note
6445       that  there  are  two  ASCII  characters, K and S, that, in addition to
6446       their lower case ASCII equivalents, are  case-equivalent  with  Unicode
6447       U+212A  (Kelvin  sign)  and  U+017F  (long  S) respectively when either
6448       PCRE2_UTF or PCRE2_UCP is set.
6449
6450       The power of regular expressions comes from the ability to include wild
6451       cards, character classes, alternatives, and repetitions in the pattern.
6452       These are encoded in the pattern by the use of metacharacters, which do
6453       not  stand  for  themselves but instead are interpreted in some special
6454       way.
6455
6456       There are two different sets of metacharacters: those that  are  recog-
6457       nized  anywhere in the pattern except within square brackets, and those
6458       that are recognized within square brackets.  Outside  square  brackets,
6459       the metacharacters are as follows:
6460
6461         \      general escape character with several uses
6462         ^      assert start of string (or line, in multiline mode)
6463         $      assert end of string (or line, in multiline mode)
6464         .      match any character except newline (by default)
6465         [      start character class definition
6466         |      start of alternative branch
6467         (      start group or control verb
6468         )      end group or control verb
6469         *      0 or more quantifier
6470         +      1 or more quantifier; also "possessive quantifier"
6471         ?      0 or 1 quantifier; also quantifier minimizer
6472         {      start min/max quantifier
6473
6474       Part  of  a  pattern  that is in square brackets is called a "character
6475       class". In a character class the only metacharacters are:
6476
6477         \      general escape character
6478         ^      negate the class, but only if the first character
6479         -      indicates character range
6480         [      POSIX character class (if followed by POSIX syntax)
6481         ]      terminates the character class
6482
6483       If a pattern is compiled with the  PCRE2_EXTENDED  option,  most  white
6484       space  in  the pattern, other than in a character class, and characters
6485       between a # outside a character class and the next newline,  inclusive,
6486       are ignored. An escaping backslash can be used to include a white space
6487       or a # character as part of the pattern. If the PCRE2_EXTENDED_MORE op-
6488       tion is set, the same applies, but in addition unescaped space and hor-
6489       izontal tab characters are ignored inside a character class. Note: only
6490       these  two  characters  are  ignored, not the full set of pattern white
6491       space characters that are ignored outside  a  character  class.  Option
6492       settings can be changed within a pattern; see the section entitled "In-
6493       ternal Option Setting" below.
6494
6495       The following sections describe the use of each of the metacharacters.
6496
6497
6498BACKSLASH
6499
6500       The backslash character has several uses. Firstly, if it is followed by
6501       a  character that is not a digit or a letter, it takes away any special
6502       meaning that character may have. This use of  backslash  as  an  escape
6503       character applies both inside and outside character classes.
6504
6505       For  example,  if you want to match a * character, you must write \* in
6506       the pattern. This escaping action applies whether or not the  following
6507       character  would  otherwise be interpreted as a metacharacter, so it is
6508       always safe to precede a non-alphanumeric  with  backslash  to  specify
6509       that it stands for itself.  In particular, if you want to match a back-
6510       slash, you write \\.
6511
6512       Only ASCII digits and letters have any special meaning  after  a  back-
6513       slash. All other characters (in particular, those whose code points are
6514       greater than 127) are treated as literals.
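
       As an illustration only, the escaping rule described above can be
       modelled in a few lines of Python. The helper name quote_literal is
       hypothetical and is not part of the PCRE2 API; it puts a backslash
       before every ASCII character that is not a letter or digit, and
       leaves characters whose code points are greater than 127 alone.

```python
def quote_literal(text):
    # Escape each ASCII non-alphanumeric so that it stands for itself.
    # Letters and digits are left alone because a backslash in front of
    # them would give them a special meaning instead of removing one.
    out = []
    for ch in text:
        if ord(ch) < 128 and not ch.isalnum():
            out.append("\\")
        out.append(ch)
    return "".join(out)
```

       For example, quote_literal("a*b") yields a\*b, and a single backslash
       becomes \\.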
6515
6516       If you want to treat all characters in a sequence as literals, you  can
6517       do so by putting them between \Q and \E. This is different from Perl in
6518       that $ and @ are handled as literals in  \Q...\E  sequences  in  PCRE2,
6519       whereas  in Perl, $ and @ cause variable interpolation. Also, Perl does
6520       "double-quotish backslash interpolation" on any backslashes between  \Q
6521       and  \E which, its documentation says, "may lead to confusing results".
6522       PCRE2 treats a backslash between \Q and \E just like any other  charac-
6523       ter. Note the following examples:
6524
6525         Pattern            PCRE2 matches  Perl matches
6526
6527         \Qabc$xyz\E        abc$xyz        abc followed by the
6528                                             contents of $xyz
6529         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
6530         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
6531         \QA\B\E            A\B            A\B
6532         \Q\\E              \              \\E
6533
6534       The  \Q...\E  sequence  is recognized both inside and outside character
6535       classes.  An isolated \E that is not preceded by \Q is ignored.  If  \Q
6536       is  not followed by \E later in the pattern, the literal interpretation
6537       continues to the end of the pattern (that is,  \E  is  assumed  at  the
6538       end).  If  the  isolated \Q is inside a character class, this causes an
6539       error, because the character class  is  not  terminated  by  a  closing
6540       square bracket.
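
       The following Python sketch shows one way an application might wrap
       arbitrary text in \Q...\E. The function name quote_with_qe is
       hypothetical. The handling of an embedded \E (closing the quoted
       run, adding an escaped backslash and a literal E, then reopening)
       is not described above; it is an assumption that follows from the
       rules given, since \E always ends the literal interpretation.

```python
def quote_with_qe(text):
    # Wrap text in \Q...\E. An embedded \E would end the quoted run early,
    # so each one is replaced by: \E (close), \\E (escaped backslash and a
    # literal E), \Q (reopen).
    return "\\Q" + text.replace("\\E", "\\E\\\\E\\Q") + "\\E"
```

       With this helper, abc$xyz becomes \Qabc$xyz\E, which PCRE2 treats as
       the literal string abc$xyz.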
6541
6542   Non-printing characters
6543
6544       A second use of backslash provides a way of encoding non-printing char-
6545       acters in patterns in a visible manner. There is no restriction on  the
6546       appearance  of non-printing characters in a pattern, but when a pattern
6547       is being prepared by text editing, it is often easier to use one of the
6548       following  escape  sequences  instead of the binary character it repre-
6549       sents. In an ASCII or Unicode environment, these escapes  are  as  fol-
6550       lows:
6551
6552         \a          alarm, that is, the BEL character (hex 07)
6553         \cx         "control-x", where x is any printable ASCII character
6554         \e          escape (hex 1B)
6555         \f          form feed (hex 0C)
6556         \n          linefeed (hex 0A)
6557         \r          carriage return (hex 0D) (but see below)
6558         \t          tab (hex 09)
6559         \0dd        character with octal code 0dd
6560         \ddd        character with octal code ddd, or backreference
6561         \o{ddd..}   character with octal code ddd..
6562         \xhh        character with hex code hh
6563         \x{hhh..}   character with hex code hhh..
6564         \N{U+hhh..} character with Unicode hex code point hhh..
6565
6566       By  default, after \x that is not followed by {, from zero to two hexa-
6567       decimal digits are read (letters can be in upper or  lower  case).  Any
6568       number of hexadecimal digits may appear between \x{ and }. If a charac-
6569       ter other than a hexadecimal digit appears between \x{  and  },  or  if
6570       there is no terminating }, an error occurs.
6571
6572       Characters whose code points are less than 256 can be defined by either
6573       of the two syntaxes for \x or by an octal sequence. There is no differ-
6574       ence in the way they are handled. For example, \xdc is exactly the same
6575       as \x{dc} or \334.  However, using the braced versions does  make  such
6576       sequences easier to read.
6577
6578       Support  is  available  for some ECMAScript (aka JavaScript) escape se-
6579       quences via two compile-time options. If PCRE2_ALT_BSUX is set, the se-
6580       quence  \x  followed  by { is not recognized. Only if \x is followed by
6581       two hexadecimal digits is it recognized as a character escape.
6582       Otherwise it is interpreted as a literal "x" character. In this
6583       mode, support for code points greater than 255 is provided by \u,
6584       which must be followed by four hexadecimal digits; otherwise it is
6585       interpreted as a literal "u" character.
6586
6587       PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in  ad-
6588       dition, \u{hhh..} is recognized as the character specified by hexadeci-
6589       mal code point.  There may be any number of  hexadecimal  digits.  This
6590       syntax is from ECMAScript 6.
6591
6592       The  \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper-
6593       ating in UTF mode. Perl also uses \N{name}  to  specify  characters  by
6594       Unicode  name;  PCRE2  does  not support this. Note that when \N is not
6595       followed by an opening brace (curly bracket) it has an entirely differ-
6596       ent meaning, matching any character that is not a newline.
6597
6598       There  are some legacy applications where the escape sequence \r is ex-
6599       pected to match a newline. If the  PCRE2_EXTRA_ESCAPED_CR_IS_LF  option
6600       is  set,  \r  in  a  pattern is converted to \n so that it matches a LF
6601       (linefeed) instead of a CR (carriage return) character.
6602
6603       The precise effect of \cx on ASCII characters is as follows: if x is  a
6604       lower  case  letter,  it  is converted to upper case. Then bit 6 of the
6605       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
6606       (A  is  41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
6607       hex 7B (; is 3B). If the code unit following \c has a value  less  than
6608       32 or greater than 126, a compile-time error occurs.
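
       The arithmetic for \cx in an ASCII environment can be checked with a
       short Python model. This is a sketch of the rule as stated above,
       not PCRE2's own code, and the function name is hypothetical.

```python
def control_escape_ascii(x):
    # x is the single character that follows \c in the pattern.
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("character after \\c must be in the range 32 to 126")
    if "a" <= x <= "z":
        code = ord(x.upper())      # lower case letters are upper-cased first
    return code ^ 0x40             # then bit 6 (hex 40) is inverted
```

       Thus control_escape_ascii("A") gives hex 01, and "{" gives hex 3B.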
6609
6610       When  PCRE2  is  compiled in EBCDIC mode, \N{U+hhh..} is not supported.
6611       \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
6612       The \c escape is processed as specified for Perl in the perlebcdic doc-
6613       ument. The only characters that are allowed after \c are A-Z,  a-z,  or
6614       one  of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
6615       time error. The sequence \c@ encodes character code  0;  after  \c  the
6616       letters  (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
6617       \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?  be-
6618       comes either 255 (hex FF) or 95 (hex 5F).
6619
6620       Thus,  apart  from  \c?, these escapes generate the same character code
6621       values as they do in an ASCII environment, though the meanings  of  the
6622       values  mostly  differ. For example, \cG always generates code value 7,
6623       which is BEL in ASCII but DEL in EBCDIC.
6624
6625       The sequence \c? generates DEL (127, hex 7F) in an  ASCII  environment,
6626       but  because  127  is  not a control character in EBCDIC, Perl makes it
6627       generate the APC character. Unfortunately, there are  several  variants
6628       of  EBCDIC.  In  most  of them the APC character has the value 255 (hex
6629       FF), but in the one Perl calls POSIX-BC its value is 95  (hex  5F).  If
6630       certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6631       95; otherwise it generates 255.
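
       The EBCDIC rules for \c can be summarized by the Python sketch below.
       It is a model of the mapping described above and nothing more: the
       function name is hypothetical, and the posix_bc argument stands in
       for PCRE2's own detection of the POSIX-BC variant, which is not
       reproduced here.

```python
def control_escape_ebcdic(x, posix_bc=False):
    # x is the single character that follows \c in the pattern.
    if len(x) != 1:
        raise ValueError("exactly one character must follow \\c")
    if x == "@":
        return 0
    if x.isascii() and x.isalpha():
        return ord(x.upper()) - ord("A") + 1      # letters give 1 to 26
    if x in "[\\]^_":
        return 27 + "[\\]^_".index(x)             # these give 27 to 31
    if x == "?":
        return 95 if posix_bc else 255            # the APC character
    raise ValueError("character not allowed after \\c in EBCDIC mode")
```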
6632
6633       After \0 up to two further octal digits are read. If  there  are  fewer
6634       than  two  digits,  just  those that are present are used. Thus the se-
6635       quence \0\x\015 specifies two binary zeros followed by a  CR  character
6636       (code value 13). Make sure you supply two digits after the initial zero
6637       if the pattern character that follows is itself an octal digit.
6638
6639       The escape \o must be followed by a sequence of octal digits,  enclosed
6640       in  braces.  An  error occurs if this is not the case. This escape is a
6641       recent addition to Perl; it provides a way of specifying character code
6642       points  as  octal  numbers  greater than 0777, and it also allows octal
6643       numbers and backreferences to be unambiguously specified.
6644
6645       For greater clarity and unambiguity, it is best to avoid following \ by
6646       a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
6647       cal character code points, and \g{} to specify backreferences. The fol-
6648       lowing paragraphs describe the old, ambiguous syntax.
6649
6650       The handling of a backslash followed by a digit other than 0 is compli-
6651       cated, and Perl has changed over time, causing PCRE2 also to change.
6652
6653       Outside a character class, PCRE2 reads the digit and any following dig-
6654       its as a decimal number. If the number is less than 10, begins with the
6655       digit 8 or 9, or if there are  at  least  that  many  previous  capture
6656       groups  in the expression, the entire sequence is taken as a backrefer-
6657       ence. A description of how this works is  given  later,  following  the
6658       discussion  of parenthesized groups.  Otherwise, up to three octal dig-
6659       its are read to form a character code.
6660
6661       Inside a character class, PCRE2 handles \8 and \9 as the literal  char-
6662       acters  "8"  and "9", and otherwise reads up to three octal digits fol-
6663       lowing the backslash, using them to generate a data character. Any sub-
6664       sequent  digits  stand for themselves. For example, outside a character
6665       class:
6666
6667         \040   is another way of writing an ASCII space
6668         \40    is the same, provided there are fewer than 40
6669                   previous capture groups
6670         \7     is always a backreference
6671         \11    might be a backreference, or another way of
6672                   writing a tab
6673         \011   is always a tab
6674         \0113  is a tab followed by the character "3"
6675         \113   might be a backreference, otherwise the
6676                   character with octal code 113
6677         \377   might be a backreference, otherwise
6678                   the value 255 (decimal)
6679         \81    is always a backreference
6680
6681       Note that octal values of 100 or greater that are specified using  this
6682       syntax  must  not be introduced by a leading zero, because no more than
6683       three octal digits are ever read.
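
       The decision procedure for a backslash followed by digits, outside a
       character class, can be modelled as follows. This Python sketch is
       an illustration of the rules above under stated assumptions: the
       function names are hypothetical, the caller supplies the number of
       previous capture groups, and no check is made that the resulting
       value is within the limit for the current mode.

```python
OCTAL_DIGITS = "01234567"

def _octal_prefix(digits, limit):
    # Length of the leading run of octal digits, capped at limit.
    end = 0
    while end < limit and end < len(digits) and digits[end] in OCTAL_DIGITS:
        end += 1
    return end

def interpret_backslash_digits(digits, previous_groups):
    # digits is the run of decimal digits that follows the backslash.
    # Returns (kind, value, rest), where rest holds digits that stand for
    # themselves after the escape.
    if not digits or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected a non-empty run of ASCII digits")
    if digits[0] == "0":
        end = _octal_prefix(digits, 3)      # the 0 plus up to two more
        return ("character", int(digits[:end], 8), digits[end:])
    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= previous_groups:
        return ("backreference", number, "")
    end = _octal_prefix(digits, 3)          # up to three octal digits
    return ("character", int(digits[:end], 8), digits[end:])
```

       For instance, interpret_backslash_digits("11", 0) reports a character
       with code 9 (a tab), whereas with 11 previous groups the same digits
       are reported as a backreference.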
6684
6685   Constraints on character values
6686
6687       Characters that are specified using octal or  hexadecimal  numbers  are
6688       limited to certain values, as follows:
6689
6690         8-bit non-UTF mode    no greater than 0xff
6691         16-bit non-UTF mode   no greater than 0xffff
6692         32-bit non-UTF mode   no greater than 0xffffffff
6693         All UTF modes         no greater than 0x10ffff and a valid code point
6694
6695       Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
6696       (the so-called "surrogate" code points). The check  for  these  can  be
6697       disabled  by  the  caller  of  pcre2_compile()  by  setting  the option
6698       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only  in
6699       UTF-8  and  UTF-32 modes, because these values are not representable in
6700       UTF-16.
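
       The limits in the table above, together with the surrogate rule, can
       be expressed as a small Python predicate. This is a sketch only; the
       function and parameter names are hypothetical, and the last argument
       models the effect of PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.

```python
def code_point_allowed(value, width, utf, allow_surrogate_escapes=False):
    # width is the code unit width in bits: 8, 16, or 32.
    if value < 0:
        return False
    if utf:
        if value > 0x10FFFF:
            return False
        if 0xD800 <= value <= 0xDFFF:
            # Surrogates can be permitted only in UTF-8 and UTF-32 modes.
            return allow_surrogate_escapes and width in (8, 32)
        return True
    limits = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}
    return value <= limits[width]
```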
6701
6702   Escape sequences in character classes
6703
6704       All the sequences that define a single character value can be used both
6705       inside  and  outside character classes. In addition, inside a character
6706       class, \b is interpreted as the backspace character (hex 08).
6707
6708       When not followed by an opening brace, \N is not allowed in a character
6709       class.   \B,  \R, and \X are not special inside a character class. Like
6710       other unrecognized alphabetic escape sequences, they  cause  an  error.
6711       Outside a character class, these sequences have different meanings.
6712
6713   Unsupported escape sequences
6714
6715       In  Perl,  the  sequences  \F, \l, \L, \u, and \U are recognized by its
6716       string handler and used to modify the case of following characters.  By
6717       default,  PCRE2  does  not  support these escape sequences in patterns.
6718       However, if either of the PCRE2_ALT_BSUX  or  PCRE2_EXTRA_ALT_BSUX  op-
6719       tions  is set, \U matches a "U" character, and \u can be used to define
6720       a character by code point, as described above.
6721
6722   Absolute and relative backreferences
6723
6724       The sequence \g followed by a signed or unsigned number, optionally en-
6725       closed  in  braces,  is  an absolute or relative backreference. A named
6726       backreference can be coded as \g{name}.  Backreferences  are  discussed
6727       later, following the discussion of parenthesized groups.
6728
6729   Absolute and relative subroutine calls
6730
6731       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
6732       name or a number enclosed either in angle brackets or single quotes, is
6733       an  alternative syntax for referencing a capture group as a subroutine.
6734       Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
6735       \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
6736       erence; the latter is a subroutine call.
6737
6738   Generic character types
6739
6740       Another use of backslash is for specifying generic character types:
6741
6742         \d     any decimal digit
6743         \D     any character that is not a decimal digit
6744         \h     any horizontal white space character
6745         \H     any character that is not a horizontal white space character
6746         \N     any character that is not a newline
6747         \s     any white space character
6748         \S     any character that is not a white space character
6749         \v     any vertical white space character
6750         \V     any character that is not a vertical white space character
6751         \w     any "word" character
6752         \W     any "non-word" character
6753
6754       The \N escape sequence has the same meaning as  the  "."  metacharacter
6755       when  PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
6756       the meaning of \N. Note that when \N is followed by an opening brace it
6757       has a different meaning. See the section entitled "Non-printing charac-
6758       ters" above for details. Perl also uses \N{name} to specify  characters
6759       by Unicode name; PCRE2 does not support this.
6760
6761       Each  pair of lower and upper case escape sequences partitions the com-
6762       plete set of characters into two disjoint  sets.  Any  given  character
6763       matches  one, and only one, of each pair. The sequences can appear both
6764       inside and outside character classes. They each match one character  of
6765       the  appropriate  type.  If the current matching point is at the end of
6766       the subject string, all of them fail, because there is no character  to
6767       match.
6768
6769       The  default  \s  characters  are HT (9), LF (10), VT (11), FF (12), CR
6770       (13), and space (32), which are defined as white space in the  "C"  lo-
6771       cale.  This  list may vary if locale-specific matching is taking place.
6772       For example, in some locales the "non-breaking space" character  (\xA0)
6773       is recognized as white space, and in others the VT character is not.
6774
6775       A  "word"  character is an underscore or any character that is a letter
6776       or digit.  By default, the definition of letters  and  digits  is  con-
6777       trolled by PCRE2's low-valued character tables, and may vary if locale-
6778       specific matching is taking place (see "Locale support" in the pcre2api
6779       page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
6780       systems, or "french" in Windows, some character codes greater than  127
6781       are  used  for  accented letters, and these are then matched by \w. The
6782       use of locales with Unicode is discouraged.
6783
6784       By default, characters whose code points are  greater  than  127  never
6785       match \d, \s, or \w, and always match \D, \S, and \W, although this may
6786       be different for characters in the range 128-255  when  locale-specific
6787       matching  is  happening.   These escape sequences retain their original
6788       meanings from before Unicode support was available,  mainly  for  effi-
6789       ciency  reasons.  If  the  PCRE2_UCP  option  is  set, the behaviour is
6790       changed so that Unicode properties  are  used  to  determine  character
6791       types, as follows:
6792
6793         \d  any character that matches \p{Nd} (decimal digit)
6794         \s  any character that matches \p{Z} or \h or \v
6795         \w  any character that matches \p{L} or \p{N}, plus underscore
6796
6797       The  upper case escapes match the inverse sets of characters. Note that
6798       \d matches only decimal digits, whereas \w matches any  Unicode  digit,
6799       as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
6800       affects \b and \B because they are defined in terms of \w and \W.
6801       Matching these sequences is noticeably slower when PCRE2_UCP is set.
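
       The difference that PCRE2_UCP makes to \d and \w can be illustrated
       with Python's unicodedata module. This is a model, not PCRE2 code:
       the function names are hypothetical, the default case assumes no
       locale-specific tables are in use, and Python's Unicode database may
       be for a different Unicode version from the one PCRE2 was built
       with.

```python
import unicodedata

def is_digit_default(ch):
    # Default behaviour: only the ASCII digits match \d.
    return "0" <= ch <= "9"

def is_word_default(ch):
    # Default behaviour: ASCII letters, ASCII digits, and underscore.
    return ch == "_" or (ch.isascii() and ch.isalnum())

def is_digit_ucp(ch):
    # With PCRE2_UCP, \d is any character that matches \p{Nd}.
    return unicodedata.category(ch) == "Nd"

def is_word_ucp(ch):
    # With PCRE2_UCP, \w is \p{L} or \p{N}, plus underscore.
    return ch == "_" or unicodedata.category(ch)[0] in "LN"
```

       U+00B2 (superscript two) has the No category, so in this model it is
       a word character under PCRE2_UCP but not a decimal digit.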
6802
6803       The  sequences  \h, \H, \v, and \V, in contrast to the other sequences,
6804       which match only ASCII characters by default, always match  a  specific
6805       list  of  code  points, whether or not PCRE2_UCP is set. The horizontal
6806       space characters are:
6807
6808         U+0009     Horizontal tab (HT)
6809         U+0020     Space
6810         U+00A0     Non-break space
6811         U+1680     Ogham space mark
6812         U+180E     Mongolian vowel separator
6813         U+2000     En quad
6814         U+2001     Em quad
6815         U+2002     En space
6816         U+2003     Em space
6817         U+2004     Three-per-em space
6818         U+2005     Four-per-em space
6819         U+2006     Six-per-em space
6820         U+2007     Figure space
6821         U+2008     Punctuation space
6822         U+2009     Thin space
6823         U+200A     Hair space
6824         U+202F     Narrow no-break space
6825         U+205F     Medium mathematical space
6826         U+3000     Ideographic space
6827
6828       The vertical space characters are:
6829
6830         U+000A     Linefeed (LF)
6831         U+000B     Vertical tab (VT)
6832         U+000C     Form feed (FF)
6833         U+000D     Carriage return (CR)
6834         U+0085     Next line (NEL)
6835         U+2028     Line separator
6836         U+2029     Paragraph separator
6837
6838       In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
6839       than 256 are relevant.
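
       For reference, the two lists above can be written as Python sets.
       The names used here are hypothetical; the code points are copied
       from the lists and nothing else is assumed.

```python
# Code points matched by \h, taken from the list above.
HORIZONTAL_SPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))          # U+2000 to U+200A inclusive
    + [0x202F, 0x205F, 0x3000]
)

# Code points matched by \v, taken from the list above.
VERTICAL_SPACE = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])

def matches_h(ch):
    return ord(ch) in HORIZONTAL_SPACE

def matches_v(ch):
    return ord(ch) in VERTICAL_SPACE
```

       Every character matches exactly one of \h and \H, and exactly one of
       \v and \V, so the negated forms are simply the complements of these
       sets.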
6840
6841   Newline sequences
6842
6843       Outside  a  character class, by default, the escape sequence \R matches
6844       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
6845       to the following:
6846
6847         (?>\r\n|\n|\x0b|\f|\r|\x85)
6848
6849       This is an example of an "atomic group", details of which are given be-
6850       low.  This particular group matches either the  two-character  sequence
6851       CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
6852       U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car-
6853       riage  return,  U+000D), or NEL (next line, U+0085). Because this is an
6854       atomic group, the two-character sequence is treated as  a  single  unit
6855       that cannot be split.
6856
6857       In other modes, two additional characters whose code points are greater
6858       than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
6859       rator,  U+2029).  Unicode support is not needed for these characters to
6860       be recognized.
6861
6862       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
6863       the  complete  set  of  Unicode  line  endings)  by  setting the option
6864       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation  for  "back-
6865       slash R".) This can be made the default when PCRE2 is built; if this is
6866       the case, the other behaviour can be requested via  the  PCRE2_BSR_UNI-
6867       CODE  option. It is also possible to specify these settings by starting
6868       a pattern string with one of the following sequences:
6869
6870         (*BSR_ANYCRLF)   CR, LF, or CRLF only
6871         (*BSR_UNICODE)   any Unicode newline sequence
6872
6873       These override the default and the options given to the compiling func-
6874       tion.  Note that these special settings, which are not Perl-compatible,
6875       are recognized only at the very start of a pattern, and that they  must
6876       be  in upper case. If more than one of them is present, the last one is
6877       used. They can be combined with a change of newline convention; for ex-
6878       ample, a pattern can start with:
6879
6880         (*ANY)(*BSR_ANYCRLF)
6881
6882       They  can also be combined with the (*UTF) or (*UCP) special sequences.
6883       Inside a character class, \R is treated as an unrecognized  escape  se-
6884       quence, and causes an error.
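
       The behaviour of \R outside a character class can be modelled by the
       Python sketch below. It is an illustration, not PCRE2 code. The
       function and parameter names are hypothetical: bsr_anycrlf stands
       for the effect of PCRE2_BSR_ANYCRLF, and eight_bit_non_utf for 8-bit
       non-UTF-8 mode.

```python
def match_newline_sequence(subject, pos, bsr_anycrlf=False, eight_bit_non_utf=False):
    # Returns the number of characters \R would consume at pos, or 0 if
    # it does not match there.
    if subject.startswith("\r\n", pos):
        return 2                      # CRLF is taken as a unit, never split
    if pos >= len(subject):
        return 0
    if bsr_anycrlf:
        singles = "\r\n"
    else:
        singles = "\n\x0b\x0c\r\x85"
        if not eight_bit_non_utf:
            singles += "\u2028\u2029"  # LS and PS, added in other modes
    return 1 if subject[pos] in singles else 0
```

       Returning 2 as soon as CRLF is seen mirrors the atomic group: once
       the two-character sequence has matched, the CR alone is never tried.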
6885
6886   Unicode character properties
6887
6888       When  PCRE2  is  built  with Unicode support (the default), three addi-
6889       tional escape sequences that match characters with specific  properties
6890       are available. They can be used in any mode, though in 8-bit and 16-bit
6891       non-UTF modes these sequences are of course limited to testing  charac-
6892       ters  whose code points are less than U+0100 and U+10000, respectively.
6893       In 32-bit non-UTF mode, code points greater than 0x10ffff (the  Unicode
6894       limit)  may  be  encountered. These are all treated as being in the Un-
6895       known script and with an unassigned type.
6896
6897       Matching characters by Unicode property is not fast, because PCRE2  has
6898       to  do  a  multistage table lookup in order to find a character's prop-
6899       erty. That is why the traditional escape sequences such as \d and \w do
6900       not  use  Unicode  properties  in PCRE2 by default, though you can make
6901       them do so by setting the PCRE2_UCP option or by starting  the  pattern
6902       with (*UCP).
6903
6904       The extra escape sequences that provide property support are:
6905
6906         \p{xx}   a character with the xx property
6907         \P{xx}   a character without the xx property
6908         \X       a Unicode extended grapheme cluster
6909
6910       The  property names represented by xx above are not case-sensitive, and
6911       in accordance with Unicode's "loose matching" rules,  spaces,  hyphens,
6912       and underscores are ignored. There is support for Unicode script names,
6913       Unicode general category properties, "Any", which matches any character
6914       (including  newline),  Bidi_Class,  a number of binary (yes/no) proper-
6915       ties, and some special PCRE2  properties  (described  below).   Certain
6916       other  Perl  properties such as "InMusicalSymbols" are not supported by
6917       PCRE2. Note that \P{Any} does  not  match  any  characters,  so  always
6918       causes a match failure.
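
       The loose matching of property names can be pictured as a simple
       normalization step. The Python function below is a hypothetical
       sketch of that idea and is not taken from PCRE2: it removes spaces,
       hyphens, and underscores and folds the name to lower case, so that
       two spellings of a name compare equal.

```python
def normalize_property_name(name):
    # Ignore spaces, hyphens, and underscores, and ignore case.
    return "".join(ch for ch in name if ch not in " -_").lower()
```

       Under this model, Bidi_Class, bidi class, and BIDICLASS all
       normalize to the same string.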
6919
6920   Script properties for \p and \P
6921
6922       There are three different syntax forms for matching a script. Each Uni-
6923       code character has a basic script and,  optionally,  a  list  of  other
6924       scripts ("Script Extensions") with which it is commonly used. Using the
6925       Adlam script as an example, \p{sc:Adlam} matches characters whose basic
6926       script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters
6927       that have Adlam in their extensions list. The full names  "script"  and
6928       "script extensions" for the property types are recognized, and an equals
6929       sign is an alternative to the colon. If a script name is given  without
6930       a  property  type,  for example, \p{Adlam}, it is treated as \p{scx:Ad-
6931       lam}. Perl changed to this interpretation at  release  5.26  and  PCRE2
6932       changed at release 10.40.
6933
6934       Unassigned characters (and in non-UTF 32-bit mode, characters with code
6935       points greater than 0x10FFFF) are assigned the "Unknown" script. Others
6936       that  are not part of an identified script are lumped together as "Com-
6937       mon". The current list of recognized script names and their 4-character
6938       abbreviations can be obtained by running this command:
6939
6940         pcre2test -LS
6941
6942
6943   The general category property for \p and \P
6944
6945       Each character has exactly one Unicode general category property, spec-
6946       ified by a two-letter abbreviation. For compatibility with Perl,  nega-
6947       tion  can  be  specified  by including a circumflex between the opening
6948       brace and the property name.  For  example,  \p{^Lu}  is  the  same  as
6949       \P{Lu}.
6950
6951       If only one letter is specified with \p or \P, it includes all the gen-
6952       eral category properties that start with that letter. In this case,  in
6953       the  absence of negation, the curly brackets in the escape sequence are
6954       optional; these two examples have the same effect:
6955
6956         \p{L}
6957         \pL
6958
6959       The following general category property codes are supported:
6960
6961         C     Other
6962         Cc    Control
6963         Cf    Format
6964         Cn    Unassigned
6965         Co    Private use
6966         Cs    Surrogate
6967
6968         L     Letter
6969         Ll    Lower case letter
6970         Lm    Modifier letter
6971         Lo    Other letter
6972         Lt    Title case letter
6973         Lu    Upper case letter
6974
6975         M     Mark
6976         Mc    Spacing mark
6977         Me    Enclosing mark
6978         Mn    Non-spacing mark
6979
6980         N     Number
6981         Nd    Decimal number
6982         Nl    Letter number
6983         No    Other number
6984
6985         P     Punctuation
6986         Pc    Connector punctuation
6987         Pd    Dash punctuation
6988         Pe    Close punctuation
6989         Pf    Final punctuation
6990         Pi    Initial punctuation
6991         Po    Other punctuation
6992         Ps    Open punctuation
6993
6994         S     Symbol
6995         Sc    Currency symbol
6996         Sk    Modifier symbol
6997         Sm    Mathematical symbol
6998         So    Other symbol
6999
7000         Z     Separator
7001         Zl    Line separator
7002         Zp    Paragraph separator
7003         Zs    Space separator
7004
7005       The special property LC, which has the synonym L&, is  also  supported:
7006       it  matches  a  character that has the Lu, Ll, or Lt property, in other
7007       words, a letter that is not classified as a modifier or "other".
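
       The way one-letter codes, two-letter codes, the LC property, and
       circumflex negation relate to each other can be illustrated with
       Python's unicodedata module. This is a model only: the function name
       is hypothetical, and Python's Unicode database may be for a
       different Unicode version from the one PCRE2 was built with.

```python
import unicodedata

def has_general_category(ch, prop):
    # prop is the text between the braces of \p{...}, for example "Lu",
    # "L", "LC", "L&", or "^Lu" for the negated form.
    negated = prop.startswith("^")
    name = (prop[1:] if negated else prop).lower()
    category = unicodedata.category(ch).lower()
    if name in ("lc", "l&"):
        result = category in ("lu", "ll", "lt")
    elif len(name) == 1:
        result = category.startswith(name)   # one letter covers the group
    else:
        result = category == name
    return result != negated
```

       In this model an ideograph such as U+4E2D has the Lo category, so it
       has the L property but not LC.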
7008
7009       The Cs (Surrogate) property  applies  only  to  characters  whose  code
7010       points  are in the range U+D800 to U+DFFF. These characters are no dif-
7011       ferent to any other character when PCRE2 is not in UTF mode (using  the
7012       16-bit  or  32-bit  library).   However,  they are not valid in Unicode
7013       strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid-
7014       ity   checking   has   been   turned   off   (see   the  discussion  of
7015       PCRE2_NO_UTF_CHECK in the pcre2api page).
7016
7017       The long synonyms for  property  names  that  Perl  supports  (such  as
7018       \p{Letter})  are  not supported by PCRE2, nor is it permitted to prefix
7019       any of these properties with "Is".
7020
7021       No character that is in the Unicode table has the Cn (unassigned) prop-
7022       erty.  Instead, this property is assumed for any code point that is not
7023       in the Unicode table.
7024
7025       Specifying caseless matching does not affect  these  escape  sequences.
7026       For  example,  \p{Lu}  always  matches only upper case letters. This is
7027       different from the behaviour of current versions of Perl.
7028
7029   Binary (yes/no) properties for \p and \P
7030
7031       Unicode defines a number of  binary  properties,  that  is,  properties
7032       whose  only  values  are  true or false. You can obtain a list of those
7033       that are recognized by \p and \P, along with  their  abbreviations,  by
7034       running this command:
7035
7036         pcre2test -LP
7037
7038
7039   The Bidi_Class property for \p and \P
7040
7041         \p{Bidi_Class:<class>}   matches a character with the given class
7042         \p{BC:<class>}           matches a character with the given class
7043
7044       The recognized classes are:
7045
7046         AL          Arabic letter
7047         AN          Arabic number
7048         B           paragraph separator
7049         BN          boundary neutral
7050         CS          common separator
7051         EN          European number
7052         ES          European separator
7053         ET          European terminator
7054         FSI         first strong isolate
7055         L           left-to-right
7056         LRE         left-to-right embedding
7057         LRI         left-to-right isolate
7058         LRO         left-to-right override
7059         NSM         non-spacing mark
7060         ON          other neutral
7061         PDF         pop directional format
7062         PDI         pop directional isolate
7063         R           right-to-left
7064         RLE         right-to-left embedding
7065         RLI         right-to-left isolate
7066         RLO         right-to-left override
7067         S           segment separator
7068         WS          white space
7069
7070       An  equals  sign  may  be  used instead of a colon. The class names are
7071       case-insensitive; only the short names listed above are recognized.
7072
7073   Extended grapheme clusters
7074
7075       The \X escape matches any number of Unicode  characters  that  form  an
7076       "extended grapheme cluster", and treats the sequence as an atomic group
7077       (see below).  Unicode supports various kinds of composite character  by
7078       giving  each  character  a grapheme breaking property, and having rules
7079       that use these properties to define the boundaries of extended grapheme
7080       clusters.  The rules are defined in Unicode Standard Annex 29, "Unicode
7081       Text Segmentation". Unicode 11.0.0 abandoned the use of  some  previous
7082       properties  that had been used for emojis.  Instead it introduced vari-
7083       ous emoji-specific properties. PCRE2  uses  only  the  Extended  Picto-
7084       graphic property.
7085
7086       \X  always  matches  at least one character. Then it decides whether to
7087       add additional characters according to the following rules for ending a
7088       cluster:
7089
7090       1. End at the end of the subject string.
7091
7092       2.  Do not end between CR and LF; otherwise end after any control char-
7093       acter.
7094
7095       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
7096       characters  are of five types: L, V, T, LV, and LVT. An L character may
7097       be followed by an L, V, LV, or LVT character; an LV or V character  may
7098       be  followed  by  a V or T character; an LVT or T character may be fol-
7099       lowed only by a T character.
7100
7101       4. Do not end before extending  characters  or  spacing  marks  or  the
7102       "zero-width  joiner" character. Characters with the "mark" property al-
7103       ways have the "extend" grapheme breaking property.
7104
7105       5. Do not end after prepend characters.
7106
7107       6. Do not break within emoji modifier sequences or emoji zwj sequences.
7108       That is, do not break between characters with the Extended_Pictographic
7109       property.  Extend and ZWJ characters are allowed  between  the  charac-
7110       ters.
7111
7112       7.  Do not break within emoji flag sequences. That is, do not break be-
7113       tween regional indicator (RI) characters if there are an odd number  of
7114       RI characters before the break point.
7115
7116       8. Otherwise, end the cluster.
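
       Rules 3 and 7 above can be modelled in a few lines of C. This is a
       sketch for illustration only, not PCRE2's implementation: the enum and
       function names are hypothetical, and the caller is assumed to have
       classified each character already. The regional indicator range
       U+1F1E6 to U+1F1FF comes from Unicode; counting only the contiguous
       run of RI characters before the break point is an assumption of this
       sketch.

```c
#include <stddef.h>

/* Hangul syllable types named in rule 3 (hypothetical enum). */
typedef enum { HG_L, HG_V, HG_T, HG_LV, HG_LVT } hangul_type;

/* Returns 1 if rule 3 forbids a break between prev and next. */
int hangul_no_break(hangul_type prev, hangul_type next)
{
    switch (prev) {
    case HG_L:
        return next == HG_L || next == HG_V ||
               next == HG_LV || next == HG_LVT;
    case HG_LV:
    case HG_V:
        return next == HG_V || next == HG_T;
    case HG_LVT:
    case HG_T:
        return next == HG_T;
    }
    return 0;
}

/* Regional indicator symbols occupy U+1F1E6..U+1F1FF. */
int is_regional_indicator(unsigned long cp)
{
    return cp >= 0x1F1E6UL && cp <= 0x1F1FFUL;
}

/* Returns 1 if rule 7 forbids a break just before cps[pos], that is,
   both neighbours are RI characters and an odd number of RI characters
   immediately precede the break point. */
int ri_no_break(const unsigned long *cps, size_t len, size_t pos)
{
    size_t count = 0, i = pos;

    if (pos == 0 || pos >= len) return 0;
    if (!is_regional_indicator(cps[pos - 1]) ||
        !is_regional_indicator(cps[pos])) return 0;

    while (i > 0 && is_regional_indicator(cps[i - 1])) {
        count++;
        i--;
    }
    return count % 2 == 1;
}
```

       With four consecutive RI characters, this model forbids a break at
       positions 1 and 3 and allows one at position 2, so the four characters
       form two flag clusters.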
7117
7118   PCRE2's additional properties
7119
7120       As  well as the standard Unicode properties described above, PCRE2 sup-
7121       ports four more that make it possible to convert traditional escape se-
7122       quences  such  as \w and \s to use Unicode properties. PCRE2 uses these
7123       non-standard, non-Perl properties internally  when  PCRE2_UCP  is  set.
7124       However, they may also be used explicitly. These properties are:
7125
7126         Xan   Any alphanumeric character
7127         Xps   Any POSIX space character
7128         Xsp   Any Perl space character
7129         Xwd   Any Perl "word" character
7130
7131       Xan  matches  characters that have either the L (letter) or the N (num-
7132       ber) property. Xps matches the characters tab, linefeed, vertical  tab,
7133       form  feed,  or carriage return, and any other character that has the Z
7134       (separator) property.  Xsp is the same as Xps; in PCRE1 it used to  ex-
7135       clude  vertical  tab,  for  Perl  compatibility,  but Perl changed. Xwd
7136       matches the same characters as Xan, plus underscore.
7137
7138       There is another non-standard property, Xuc, which matches any  charac-
7139       ter  that  can  be represented by a Universal Character Name in C++ and
7140       other programming languages. These are the characters $,  @,  `  (grave
7141       accent),  and  all  characters with Unicode code points greater than or
7142       equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note  that
7143       most  base  (ASCII) characters are excluded. (Universal Character Names
7144       are of the form \uHHHH or \UHHHHHHHH where H is  a  hexadecimal  digit.
7145       Note that the Xuc property does not match these sequences but the char-
7146       acters that they represent.)
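
       The set of characters that Xuc matches, as described above, can be
       written as a simple predicate. This is an illustrative sketch; the
       function name is hypothetical, and the upper limit of U+10FFFF is an
       assumption taken from the Unicode code space rather than from the text
       above.

```c
/* Returns 1 if the code point is in the set described for Xuc:
   '$', '@', '`', or any code point >= U+00A0 that is not a surrogate. */
int is_xuc(unsigned long cp)
{
    if (cp == 0x24 || cp == 0x40 || cp == 0x60)  /* $  @  ` */
        return 1;
    if (cp > 0x10FFFFUL)                         /* assumed upper limit */
        return 0;
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)        /* surrogates excluded */
        return 0;
    return cp >= 0xA0UL;
}
```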
7147
7148   Resetting the match start
7149
7150       In normal use, the escape sequence \K  causes  any  previously  matched
7151       characters not to be included in the final matched sequence that is re-
7152       turned. For example, the pattern:
7153
7154         foo\Kbar
7155
7156       matches "foobar", but reports that it has matched "bar".  \K  does  not
7157       interact with anchoring in any way. The pattern:
7158
7159         ^foo\Kbar
7160
7161       matches  only  when  the  subject  begins with "foobar" (in single line
7162       mode), though it again reports the matched string as "bar".  This  fea-
7163       ture  is similar to a lookbehind assertion (described below).  However,
7164       in this case, the part of the subject before the real  match  does  not
7165       have  to be of fixed length, as lookbehind assertions do. The use of \K
7166       does not interfere with the setting of captured substrings.  For  exam-
7167       ple, when the pattern
7168
7169         (foo)\Kbar
7170
7171       matches "foobar", the first substring is still set to "foo".
7172
7173       From  version  5.32.0  Perl  forbids the use of \K in lookaround asser-
7174       tions. From release 10.38 PCRE2 also forbids this by default.  However,
7175       the  PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK  option  can be used when calling
7176       pcre2_compile() to re-enable the previous behaviour. When  this  option
7177       is set, \K is acted upon when it occurs inside positive assertions, but
7178       is ignored in negative assertions. Note that when  a  pattern  such  as
7179       (?=ab\K)  matches,  the reported start of the match can be greater than
7180       the end of the match. Using \K in a lookbehind assertion at  the  start
7181       of  a  pattern can also lead to odd effects. For example, consider this
7182       pattern:
7183
7184         (?<=\Kfoo)bar
7185
7186       If the subject is "foobar", a call to  pcre2_match()  with  a  starting
7187       offset  of 3 succeeds and reports the matching string as "foobar", that
7188       is, the start of the reported match is earlier  than  where  the  match
7189       started.
7190
7191   Simple assertions
7192
7193       The  final use of backslash is for certain simple assertions. An asser-
7194       tion specifies a condition that has to be met at a particular point  in
7195       a  match, without consuming any characters from the subject string. The
7196       use of groups for more complicated assertions is described below.   The
7197       backslashed assertions are:
7198
7199         \b     matches at a word boundary
7200         \B     matches when not at a word boundary
7201         \A     matches at the start of the subject
7202         \Z     matches at the end of the subject
7203                 also matches before a newline at the end of the subject
7204         \z     matches only at the end of the subject
7205         \G     matches at the first matching position in the subject
7206
7207       Inside  a  character  class, \b has a different meaning; it matches the
7208       backspace character. If any other of  these  assertions  appears  in  a
7209       character class, an "invalid escape sequence" error is generated.
7210
7211       A  word  boundary is a position in the subject string where the current
7212       character and the previous character do not both match \w or  \W  (i.e.
7213       one  matches  \w  and the other matches \W), or the start or end of the
7214       string if the first or last character matches  \w,  respectively.  When
7215       PCRE2  is  built with Unicode support, the meanings of \w and \W can be
7216       changed by setting the PCRE2_UCP option. When this is done, it also af-
7217       fects  \b and \B. Neither PCRE2 nor Perl has a separate "start of word"
7218       or "end of word" metasequence. However, whatever  follows  \b  normally
7219       determines  which  it  is. For example, the fragment \ba matches "a" at
7220       the start of a word.
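
       The definition of a word boundary given above can be expressed as a
       small C function. This is a sketch only: it assumes \w means ASCII
       letters, digits, and underscore (that is, PCRE2_UCP is not set and the
       default "C" locale is in use), and the function names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* ASCII model of \w: letters, digits, and underscore. */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if position pos (0..len) in s is a word boundary: exactly
   one of the previous and current characters is a word character.
   Beyond either end of the string counts as "not a word character". */
int at_word_boundary(const char *s, size_t len, size_t pos)
{
    int prev = pos > 0   && is_word_char((unsigned char)s[pos - 1]);
    int curr = pos < len && is_word_char((unsigned char)s[pos]);
    return prev != curr;
}
```

       For the subject "foo bar", this model reports boundaries at offsets 0,
       3, 4, and 7.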
7221
7222       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
7223       and dollar (described in the next section) in that they only ever match
7224       at the very start and end of the subject string, whatever  options  are
7225       set.  Thus,  they are independent of multiline mode. These three asser-
7226       tions are not affected by the  PCRE2_NOTBOL  or  PCRE2_NOTEOL  options,
7227       which  affect only the behaviour of the circumflex and dollar metachar-
7228       acters. However, if the startoffset argument of pcre2_match()  is  non-
7229       zero,  indicating  that  matching is to start at a point other than the
7230       beginning of the subject, \A can never match.  The  difference  between
7231       \Z  and \z is that \Z matches before a newline at the end of the string
7232       as well as at the very end, whereas \z matches only at the end.
7233
7234       The \G assertion is true only when the current matching position is  at
7235       the  start point of the matching process, as specified by the startoff-
7236       set argument of pcre2_match(). It differs from \A  when  the  value  of
7237       startoffset  is  non-zero. By calling pcre2_match() multiple times with
7238       appropriate arguments, you can mimic Perl's /g option,  and  it  is  in
7239       this kind of implementation that \G can be useful.
7240
7241       Note,  however,  that  PCRE2's  implementation of \G, being true at the
7242       starting character of the matching process, is  subtly  different  from
7243       Perl's,  which  defines it as true at the end of the previous match. In
7244       Perl, these can be different when the  previously  matched  string  was
7245       empty. Because PCRE2 does just one match at a time, it cannot reproduce
7246       this behaviour.
7247
7248       If all the alternatives of a pattern begin with \G, the  expression  is
7249       anchored to the starting match position, and the "anchored" flag is set
7250       in the compiled regular expression.
7251
7252
7253CIRCUMFLEX AND DOLLAR
7254
7255       The circumflex and dollar  metacharacters  are  zero-width  assertions.
7256       That  is,  they test for a particular condition being true without con-
7257       suming any characters from the subject string. These two metacharacters
7258       are  concerned  with matching the starts and ends of lines. If the new-
7259       line convention is set so that only the two-character sequence CRLF  is
7260       recognized  as  a newline, isolated CR and LF characters are treated as
7261       ordinary data characters, and are not recognized as newlines.
7262
7263       Outside a character class, in the default matching mode, the circumflex
7264       character  is  an  assertion  that is true only if the current matching
7265       point is at the start of the subject string. If the  startoffset  argu-
7266       ment  of  pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
7267       flex can never match if the PCRE2_MULTILINE option is unset.  Inside  a
7268       character  class, circumflex has an entirely different meaning (see be-
7269       low).
7270
7271       Circumflex need not be the first character of the pattern if  a  number
7272       of  alternatives are involved, but it should be the first thing in each
7273       alternative in which it appears if the pattern is ever  to  match  that
7274       branch.  If all possible alternatives start with a circumflex, that is,
7275       if the pattern is constrained to match only at the start  of  the  sub-
7276       ject,  it  is  said  to be an "anchored" pattern. (There are also other
7277       constructs that can cause a pattern to be anchored.)
7278
7279       The dollar character is an assertion that is true only if  the  current
7280       matching  point is at the end of the subject string, or immediately be-
7281       fore a newline at the end of the string (by default), unless  PCRE2_NO-
7282       TEOL  is  set.  Note, however, that it does not actually match the new-
7283       line. Dollar need not be the last character of the pattern if a  number
7284       of  alternatives  are  involved,  but it should be the last item in any
7285       branch in which it appears. Dollar has no special meaning in a  charac-
7286       ter class.
7287
7288       The  meaning  of  dollar  can be changed so that it matches only at the
7289       very end of the string, by setting the PCRE2_DOLLAR_ENDONLY  option  at
7290       compile time. This does not affect the \Z assertion.
7291
7292       The meanings of the circumflex and dollar metacharacters are changed if
7293       the PCRE2_MULTILINE option is set. When this  is  the  case,  a  dollar
7294       character  matches before any newlines in the string, as well as at the
7295       very end, and a circumflex matches immediately after internal  newlines
7296       as  well as at the start of the subject string. It does not match after
7297       a newline that ends the string, for compatibility with  Perl.  However,
7298       this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
7299
7300       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
7301       (where \n represents a newline) in multiline mode, but  not  otherwise.
7302       Consequently,  patterns  that  are anchored in single line mode because
7303       all branches start with ^ are not anchored in  multiline  mode,  and  a
7304       match  for  circumflex  is  possible  when  the startoffset argument of
7305       pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
7306       if PCRE2_MULTILINE is set.
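
       The rules for circumflex and dollar described in this section can be
       summarized by the following C model. It is a sketch under stated
       assumptions, not PCRE2 code: newline is a single LF, the flag names
       are hypothetical stand-ins for the real options, and PCRE2_NOTBOL,
       PCRE2_NOTEOL, PCRE2_ALT_CIRCUMFLEX, and a non-zero startoffset are
       ignored.

```c
#include <stddef.h>

/* Hypothetical flags for this model; these are not the PCRE2 constants. */
enum { OPT_MULTILINE = 1, OPT_DOLLAR_ENDONLY = 2 };

/* Would ^ be true at offset pos (0..len) of s? */
int circumflex_matches(const char *s, size_t len, size_t pos, unsigned opts)
{
    if (pos == 0) return 1;                   /* start of subject */
    if (!(opts & OPT_MULTILINE)) return 0;
    /* After an internal newline, but not after one that ends the string. */
    return pos < len && s[pos - 1] == '\n';
}

/* Would $ be true at offset pos (0..len) of s? */
int dollar_matches(const char *s, size_t len, size_t pos, unsigned opts)
{
    if (pos == len) return 1;                 /* very end of subject */
    if (opts & OPT_MULTILINE)                 /* ENDONLY is ignored here */
        return s[pos] == '\n';
    if (opts & OPT_DOLLAR_ENDONLY) return 0;
    return pos + 1 == len && s[pos] == '\n';  /* before a final newline */
}
```

       In this model, circumflex is true at offset 4 of "def\nabc" only when
       OPT_MULTILINE is given, which mirrors the /^abc$/ example above.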
7307
7308       When the newline convention (see "Newline conventions" above) recognizes
7309       the two-character sequence CRLF as a newline, this  is  preferred,
7310       even  if  the  single  characters CR and LF are also recognized as new-
7311       lines. For example, if the newline convention  is  "any",  a  multiline
7312       mode  circumflex matches before "xyz" in the string "abc\r\nxyz" rather
7313       than after CR, even though CR on its own is a valid newline.  (It  also
7314       matches at the very start of the string, of course.)
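
       The preference for CRLF over a lone CR can be modelled as follows.
       This sketch covers only CR, LF, and CRLF (the ANYCRLF subset of the
       "any" convention), assumes multiline mode, and uses a hypothetical
       function name.

```c
#include <stddef.h>

/* Would a multiline-mode ^ be true at offset pos of s when CR, LF, and
   CRLF are all recognized as newlines? CRLF is treated as one unit, so
   the position between CR and LF is not a line start. */
int line_start_anycrlf(const char *s, size_t len, size_t pos)
{
    if (pos == 0) return 1;            /* very start of the subject */
    if (pos >= len) return 0;          /* not after a newline that ends it */
    if (s[pos - 1] == '\n') return 1;  /* after LF, or after a whole CRLF */
    if (s[pos - 1] == '\r')            /* after CR, unless it begins CRLF */
        return s[pos] != '\n';
    return 0;
}
```

       For "abc\r\nxyz" the model reports a line start at offset 5 (before
       "xyz") but not at offset 4 (between CR and LF).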
7315
7316       Note  that  the sequences \A, \Z, and \z can be used to match the start
7317       and end of the subject in both modes, and if all branches of a  pattern
7318       start  with \A it is always anchored, whether or not PCRE2_MULTILINE is
7319       set.
7320
7321
7322FULL STOP (PERIOD, DOT) AND \N
7323
7324       Outside a character class, a dot in the pattern matches any one charac-
7325       ter  in  the subject string except (by default) a character that signi-
7326       fies the end of a line. One or more characters may be specified as line
7327       terminators (see "Newline conventions" above).
7328
7329       Dot  never matches a single line-ending character. When the two-charac-
7330       ter sequence CRLF is the only line ending, dot does not match CR if  it
7331       is  immediately followed by LF, but otherwise it matches all characters
7332       (including isolated CRs and LFs). When ANYCRLF  is  selected  for  line
7333       endings,  no  occurrences  of CR or LF match dot. When all Unicode line
7334       endings are being recognized, dot does not match CR or LF or any of the
7335       other line ending characters.
7336
7337       The  behaviour  of  dot  with regard to newlines can be changed. If the
7338       PCRE2_DOTALL option is set, a dot matches any  one  character,  without
7339       exception.   If  the two-character sequence CRLF is present in the sub-
7340       ject string, it takes two dots to match it.
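
       The behaviour of dot under two of the newline conventions mentioned
       above can be sketched in C. The enum and function names are
       hypothetical, only the CRLF and ANYCRLF conventions are modelled, and
       the subject is treated as single-byte characters.

```c
#include <stddef.h>

/* Hypothetical names for two newline conventions. */
typedef enum { NL_CRLF, NL_ANYCRLF } newline_mode;

/* Would a dot match the character at offset pos of s? */
int dot_matches(const char *s, size_t len, size_t pos,
                newline_mode nl, int dotall)
{
    char c;

    if (pos >= len) return 0;   /* dot must consume a character */
    if (dotall) return 1;       /* with DOTALL, no exceptions */

    c = s[pos];
    if (nl == NL_CRLF)          /* only a CR that starts CRLF is excluded */
        return !(c == '\r' && pos + 1 < len && s[pos + 1] == '\n');
    return c != '\r' && c != '\n';
}
```

       Under NL_CRLF an isolated CR is ordinary data and matches dot; under
       NL_ANYCRLF it does not.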
7341
7342       The handling of dot is entirely independent of the handling of  circum-
7343       flex  and  dollar,  the  only relationship being that they both involve
7344       newlines. Dot has no special meaning in a character class.
7345
7346       The escape sequence \N when not followed by an  opening  brace  behaves
7347       like  a dot, except that it is not affected by the PCRE2_DOTALL option.
7348       In other words, it matches any character except one that signifies  the
7349       end of a line.
7350
7351       When \N is followed by an opening brace it has a different meaning. See
7352       the section entitled "Non-printing characters" above for details.  Perl
7353       also  uses  \N{name}  to specify characters by Unicode name; PCRE2 does
7354       not support this.
7355
7356
7357MATCHING A SINGLE CODE UNIT
7358
7359       Outside a character class, the escape sequence \C matches any one  code
7360       unit,  whether or not a UTF mode is set. In the 8-bit library, one code
7361       unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
7362       32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
7363       line-ending characters. The feature is provided in  Perl  in  order  to
7364       match individual bytes in UTF-8 mode, but it is unclear how it can use-
7365       fully be used.
7366
7367       Because \C breaks up characters into individual  code  units,  matching
7368       one  unit  with  \C  in UTF-8 or UTF-16 mode means that the rest of the
7369       string may start with a malformed UTF character. This has undefined re-
7370       sults, because PCRE2 assumes that it is matching character by character
7371       in a valid UTF string (by default it checks the subject string's valid-
7372       ity  at  the  start  of  processing  unless  the  PCRE2_NO_UTF_CHECK or
7373       PCRE2_MATCH_INVALID_UTF option is used).
7374
7375       An  application  can  lock  out  the  use  of   \C   by   setting   the
7376       PCRE2_NEVER_BACKSLASH_C  option  when  compiling  a pattern. It is also
7377       possible to build PCRE2 with the use of \C permanently disabled.
7378
7379       PCRE2 does not allow \C to appear in lookbehind  assertions  (described
7380       below)  in UTF-8 or UTF-16 modes, because this would make it impossible
7381       to calculate the length of  the  lookbehind.  Neither  the  alternative
7382       matching function pcre2_dfa_match() nor the JIT optimizer support \C in
7383       these UTF modes.  The former gives a match-time error; the latter fails
7384       to optimize and so the match is always run using the interpreter.
7385
7386       In  the  32-bit  library, however, \C is always supported (when not ex-
7387       plicitly locked out) because it always  matches  a  single  code  unit,
7388       whether or not UTF-32 is specified.
7389
7390       In general, the \C escape sequence is best avoided. However, one way of
7391       using it that avoids the problem of malformed UTF-8 or  UTF-16  charac-
7392       ters  is  to use a lookahead to check the length of the next character,
7393       as in this pattern, which could be used with  a  UTF-8  string  (ignore
7394       white space and line breaks):
7395
7396         (?| (?=[\x00-\x7f])(\C) |
7397             (?=[\x80-\x{7ff}])(\C)(\C) |
7398             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
7399             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
7400
7401       In  this  example,  a  group  that starts with (?| resets the capturing
7402       parentheses numbers in each alternative (see "Duplicate Group  Numbers"
7403       below). The assertions at the start of each branch check the next UTF-8
7404       character for values whose encoding uses 1, 2, 3, or 4  bytes,  respec-
7405       tively.  The  character's individual bytes are then captured by the ap-
7406       propriate number of \C groups.
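
       The four lookahead ranges in the pattern above correspond to the
       number of bytes in a character's UTF-8 encoding. The same mapping is
       shown below as a C function; the function name is hypothetical and the
       code is only an illustration of the ranges.

```c
/* Returns the number of bytes used to encode cp in UTF-8, following the
   ranges tested by the lookaheads in the pattern, or 0 if out of range. */
int utf8_encoded_length(unsigned long cp)
{
    if (cp <= 0x7FUL)   return 1;
    if (cp <= 0x7FFUL)  return 2;
    if (cp <= 0xFFFFUL) return 3;
    /* The pattern's last range ends at 0x1fffff, the largest value a
       4-byte sequence can hold; Unicode itself stops at U+10FFFF. */
    if (cp <= 0x1FFFFFUL) return 4;
    return 0;
}
```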
7407
7408
7409SQUARE BRACKETS AND CHARACTER CLASSES
7410
7411       An opening square bracket introduces a character class, terminated by a
7412       closing square bracket. A closing square bracket on its own is not spe-
7413       cial by default.  If a closing square bracket is required as  a  member
7414       of the class, it should be the first data character in the class (after
7415       an initial circumflex, if present) or escaped with  a  backslash.  This
7416       means  that,  by default, an empty class cannot be defined. However, if
7417       the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket  at
7418       the start does end the (empty) class.
7419
7420       A  character class matches a single character in the subject. A matched
7421       character must be in the set of characters defined by the class, unless
7422       the  first  character in the class definition is a circumflex, in which
7423       case the subject character must not be in the set defined by the class.
7424       If  a  circumflex is actually required as a member of the class, ensure
7425       it is not the first character, or escape it with a backslash.
7426
7427       For example, the character class [aeiou] matches any lower case  vowel,
7428       while  [^aeiou]  matches  any character that is not a lower case vowel.
7429       Note that a circumflex is just a convenient notation for specifying the
7430       characters  that  are in the class by enumerating those that are not. A
7431       class that starts with a circumflex is not an assertion; it still  con-
7432       sumes  a  character  from the subject string, and therefore it fails if
7433       the current pointer is at the end of the string.
7434
7435       Characters in a class may be specified by their code points  using  \o,
7436       \x,  or \N{U+hh..} in the usual way. When caseless matching is set, any
7437       letters in a class represent both their upper case and lower case  ver-
7438       sions,  so  for example, a caseless [aeiou] matches "A" as well as "a",
7439       and a caseless [^aeiou] does not match "A", whereas a  caseful  version
7440       would.  Note that there are two ASCII characters, K and S, that, in ad-
7441       dition to their lower case ASCII equivalents, are case-equivalent  with
7442       Unicode  U+212A (Kelvin sign) and U+017F (long S) respectively when ei-
7443       ther PCRE2_UTF or PCRE2_UCP is set.
7444
7445       Characters that might indicate line breaks are  never  treated  in  any
7446       special  way  when matching character classes, whatever line-ending se-
7447       quence is  in  use,  and  whatever  setting  of  the  PCRE2_DOTALL  and
7448       PCRE2_MULTILINE  options  is  used. A class such as [^a] always matches
7449       one of these characters.
7450
7451       The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
7452       \S,  \v,  \V,  \w,  and \W may appear in a character class, and add the
7453       characters that they  match  to  the  class.  For  example,  [\dABCDEF]
7454       matches  any  hexadecimal digit. In UTF modes, the PCRE2_UCP option af-
7455       fects the meanings of \d, \s, \w and their upper case partners, just as
7456       it does when they appear outside a character class, as described in the
7457       section entitled "Generic character types" above. The  escape  sequence
7458       \b  has  a  different  meaning inside a character class; it matches the
7459       backspace character. The sequences \B, \R, and \X are not  special  in-
7460       side  a  character class. Like any other unrecognized escape sequences,
7461       they cause an error. The same is true for \N when not  followed  by  an
7462       opening brace.
7463
7464       The  minus (hyphen) character can be used to specify a range of charac-
7465       ters in a character class. For example, [d-m] matches  any  letter  be-
7466       tween  d and m, inclusive. If a minus character is required in a class,
7467       it must be escaped with a backslash or appear in a  position  where  it
7468       cannot  be interpreted as indicating a range, typically as the first or
7469       last character in the class, or immediately after a range. For example,
7470       [b-d-z] matches letters in the range b to d, a hyphen character, or z.
7471
7472       Perl treats a hyphen as a literal if it appears before or after a POSIX
7473       class (see below) or before or after a character type escape such  as
7474       \d  or  \H.   However,  unless  the hyphen is the last character in the
7475       class, Perl outputs a warning in its warning  mode,  as  this  is  most
7476       likely  a user error. As PCRE2 has no facility for warning, an error is
7477       given in these cases.
7478
7479       It is not possible to have the literal character "]" as the end charac-
7480       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
7481       two characters ("W" and "-") followed by a literal string "46]", so  it
7482       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
7483       backslash it is interpreted as the end of range, so [W-\]46] is  inter-
7484       preted  as a class containing a range followed by two other characters.
7485       The octal or hexadecimal representation of "]" can also be used to  end
7486       a range.
7487
7488       Ranges normally include all code points between the start and end char-
7489       acters, inclusive. They can also be used for code points specified  nu-
7490       merically,  for  example [\000-\037]. Ranges can include any characters
7491       that are valid for the current mode. In any  UTF  mode,  the  so-called
7492       "surrogate"  characters (those whose code points lie between 0xd800 and
7493       0xdfff inclusive) may not  be  specified  explicitly  by  default  (the
7494       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  option  disables this check). How-
7495       ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7496       are always permitted.
7497
7498       There  is  a  special  case in EBCDIC environments for ranges whose end
7499       points are both specified as literal letters in the same case. For com-
7500       patibility  with Perl, EBCDIC code points within the range that are not
7501       letters are omitted. For example, [h-k] matches only  four  characters,
7502       even though the codes for h and k are 0x88 and 0x92, a range of 11 code
7503       points. However, if the range is specified  numerically,  for  example,
7504       [\x88-\x92] or [h-\x92], all code points are included.
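
       The counts quoted above can be checked with a short C sketch. It
       relies on the standard EBCDIC layout of lower case letters (a-i at
       0x81-0x89, j-r at 0x91-0x99, s-z at 0xA2-0xA9); the function names are
       hypothetical and the code does not need an EBCDIC machine to run.

```c
/* Is c the EBCDIC code of a lower case letter? */
int ebcdic_is_lower(unsigned c)
{
    return (c >= 0x81 && c <= 0x89) ||   /* a-i */
           (c >= 0x91 && c <= 0x99) ||   /* j-r */
           (c >= 0xA2 && c <= 0xA9);     /* s-z */
}

/* Counts the code points in lo..hi. If letters_only is non-zero, the
   non-letters are skipped, as for a range written with literal letters. */
int count_range(unsigned lo, unsigned hi, int letters_only)
{
    unsigned c;
    int n = 0;

    for (c = lo; c <= hi; c++)
        if (!letters_only || ebcdic_is_lower(c))
            n++;
    return n;
}
```

       Here count_range(0x88, 0x92, 1) models [h-k] and gives 4, while
       count_range(0x88, 0x92, 0) models [\x88-\x92] and gives 11.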
7505
7506       If a range that includes letters is used when caseless matching is set,
7507       it matches the letters in either case. For example, [W-c] is equivalent
7508       to  [][\\^_`wxyzabc],  matched  caselessly,  and  in a non-UTF mode, if
7509       character tables for a French locale are in  use,  [\xc8-\xcb]  matches
7510       accented E characters in both cases.
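
       The stated equivalence between a caseless [W-c] and a caseless
       [][\\^_`wxyzabc] can be checked by brute force over the ASCII range.
       The sketch below assumes an ASCII execution character set and non-UTF
       matching; all function names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Does c fall in lo..hi when case is ignored? */
int in_caseless_range(int c, int lo, int hi)
{
    int l = tolower(c), u = toupper(c);
    return (c >= lo && c <= hi) ||
           (l >= lo && l <= hi) ||
           (u >= lo && u <= hi);
}

/* Is c a member of the explicit set when case is ignored? */
int in_caseless_set(int c, const char *set)
{
    if (c == 0) return 0;   /* strchr would find the terminator */
    return strchr(set, tolower(c)) != NULL ||
           strchr(set, toupper(c)) != NULL;
}

/* Returns 1 if both class descriptions agree for every ASCII code 1..127. */
int ranges_agree(void)
{
    int c;

    for (c = 1; c < 128; c++)
        if (in_caseless_range(c, 'W', 'c') !=
            in_caseless_set(c, "][\\^_`wxyzabc"))
            return 0;
    return 1;
}
```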
7511
7512       A  circumflex  can  conveniently  be used with the upper case character
7513       types to specify a more restricted set of characters than the  matching
7514       lower  case  type.  For example, the class [^\W_] matches any letter or
7515       digit, but not underscore, whereas [\w] includes underscore. A positive
7516       character class should be read as "something OR something OR ..." and a
7517       negative class as "NOT something AND NOT something AND NOT ...".
7518
7519       The only metacharacters that are recognized in  character  classes  are
7520       backslash,  hyphen  (only  where  it can be interpreted as specifying a
7521       range), circumflex (only at the start), opening  square  bracket  (only
7522       when  it can be interpreted as introducing a POSIX class name, or for a
7523       special compatibility feature - see the next  two  sections),  and  the
7524       terminating  closing  square  bracket.  However, escaping other non-al-
7525       phanumeric characters does no harm.
7526
7527
7528POSIX CHARACTER CLASSES
7529
7530       Perl supports the POSIX notation for character classes. This uses names
7531       enclosed  by [: and :] within the enclosing square brackets. PCRE2 also
7532       supports this notation. For example,
7533
7534         [01[:alpha:]%]
7535
7536       matches "0", "1", any alphabetic character, or "%". The supported class
7537       names are:
7538
7539         alnum    letters and digits
7540         alpha    letters
7541         ascii    character codes 0 - 127
7542         blank    space or tab only
7543         cntrl    control characters
7544         digit    decimal digits (same as \d)
7545         graph    printing characters, excluding space
7546         lower    lower case letters
7547         print    printing characters, including space
7548         punct    printing characters, excluding letters and digits and space
7549         space    white space (the same as \s from PCRE 8.34)
7550         upper    upper case letters
7551         word     "word" characters (same as \w)
7552         xdigit   hexadecimal digits
7553
7554       The  default  "space" characters are HT (9), LF (10), VT (11), FF (12),
7555       CR (13), and space (32). If locale-specific matching is  taking  place,
7556       the  list  of  space characters may be different; there may be fewer or
7557       more of them. "Space" and \s match the same set of characters.
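
       The default set of "space" characters listed above amounts to the
       following test on a character code. This is a sketch for the default
       case only; it does not model locale-specific tables or PCRE2_UCP, and
       the function name is hypothetical.

```c
/* Returns 1 for HT (9), LF (10), VT (11), FF (12), CR (13), and
   space (32), the default members of [:space:] and \s. */
int is_default_posix_space(int c)
{
    return (c >= 9 && c <= 13) || c == 32;
}
```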
7558
7559       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
7560       from  Perl  5.8. Another Perl extension is negation, which is indicated
7561       by a ^ character after the colon. For example,
7562
7563         [12[:^digit:]]
7564
7565       matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7566       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
7567       these are not supported, and an error is given if they are encountered.
7568
7569       By default, characters with values greater than 127 do not match any of
7570       the POSIX character classes, although this may be different for charac-
7571       ters in the range 128-255 when locale-specific matching  is  happening.
7572       However,  if the PCRE2_UCP option is passed to pcre2_compile(), some of
7573       the classes are changed so that Unicode character properties are  used.
7574       This  is  achieved  by  replacing  certain POSIX classes with other se-
7575       quences, as follows:
7576
7577         [:alnum:]  becomes  \p{Xan}
7578         [:alpha:]  becomes  \p{L}
7579         [:blank:]  becomes  \h
7580         [:cntrl:]  becomes  \p{Cc}
7581         [:digit:]  becomes  \p{Nd}
7582         [:lower:]  becomes  \p{Ll}
7583         [:space:]  becomes  \p{Xps}
7584         [:upper:]  becomes  \p{Lu}
7585         [:word:]   becomes  \p{Xwd}
7586
7587       Negated versions, such as [:^alpha:] use \P instead of \p. Three  other
7588       POSIX classes are handled specially in UCP mode:
7589
7590       [:graph:] This  matches  characters that have glyphs that mark the page
7591                 when printed. In Unicode property terms, it matches all char-
7592                 acters with the L, M, N, P, S, or Cf properties, except for:
7593
7594                   U+061C           Arabic Letter Mark
7595                   U+180E           Mongolian Vowel Separator
7596                   U+2066 - U+2069  Various "isolate"s
7597
7598
7599       [:print:] This  matches  the  same  characters  as [:graph:] plus space
7600                 characters that are not controls, that  is,  characters  with
7601                 the Zs property.
7602
7603       [:punct:] This matches all characters that have the Unicode P (punctua-
7604                 tion) property, plus those characters with code  points  less
7605                 than 256 that have the S (Symbol) property.
7606
7607       The  other  POSIX classes are unchanged, and match only characters with
7608       code points less than 256.
7609
7610
7611COMPATIBILITY FEATURE FOR WORD BOUNDARIES
7612
7613       In the POSIX.2 compliant library that was included in 4.4BSD Unix,  the
7614       ugly  syntax  [[:<:]]  and [[:>:]] is used for matching "start of word"
7615       and "end of word". PCRE2 treats these items as follows:
7616
7617         [[:<:]]  is converted to  \b(?=\w)
7618         [[:>:]]  is converted to  \b(?<=\w)
7619
7620       Only these exact character sequences are recognized. A sequence such as
7621       [a[:<:]b] provokes an error for an unrecognized POSIX class name.  This
7622       support is not compatible with Perl. It is provided to help  migrations
7623       from other environments, and is best not used in any new patterns. Note
7624       that \b matches at the start and the end of a word (see "Simple  asser-
7625       tions"  above),  and in a Perl-style pattern the preceding or following
7626       character normally shows which is wanted, without the need for the  as-
7627       sertions  that are used above in order to give exactly the POSIX behav-
7628       iour.
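
       The two conversions above can be tried in any engine that  supports
       \b and lookaround. The following is a minimal sketch using Python's
       standard re module (an assumption for illustration; it is not PCRE2
       and does not accept [[:<:]] itself). It lists the offsets at  which
       each zero-width item succeeds in a sample subject.

```python
import re

# PCRE2's documented replacements for the BSD word-boundary items.
START_OF_WORD = r"\b(?=\w)"   # what [[:<:]] is converted to
END_OF_WORD = r"\b(?<=\w)"    # what [[:>:]] is converted to

subject = "hi there"

# Offsets where each zero-width assertion succeeds.
word_starts = [m.start() for m in re.finditer(START_OF_WORD, subject)]
word_ends = [m.start() for m in re.finditer(END_OF_WORD, subject)]

# A bare \b succeeds at both kinds of position.
boundaries = [m.start() for m in re.finditer(r"\b", subject)]

print(word_starts, word_ends, boundaries)
```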
7629
7630
7631VERTICAL BAR
7632
7633       Vertical bar characters are used to separate alternative patterns.  For
7634       example, the pattern
7635
7636         gilbert|sullivan
7637
7638       matches  either "gilbert" or "sullivan". Any number of alternatives may
7639       appear, and an empty  alternative  is  permitted  (matching  the  empty
7640       string). The matching process tries each alternative in turn, from left
7641       to right, and the first one that succeeds is used. If the  alternatives
7642       are  within a group (defined below), "succeeds" means matching the rest
7643       of the main pattern as well as the alternative in the group.
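
       The left-to-right rule can be observed directly. This is a  minimal
       sketch  using  Python's  standard  re module as a stand-in (an as-
       sumption for illustration; its alternation behaves in the same way
       for these patterns). The variable names are hypothetical.

```python
import re

# Alternatives are tried left to right; the first that succeeds is used.
first = re.match(r"gilbert|sullivan", "sullivan").group()

# "a" is tried before "ab", so it wins even though "ab" is longer.
leftmost = re.match(r"a|ab", "ab").group()

# Inside a group, "succeeds" includes the rest of the pattern: the "a"
# branch is abandoned because "c" cannot follow it, and "ab" is used.
with_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty = re.fullmatch(r"cat(aract|erpillar|)", "cat").group(1)

print(first, leftmost, with_rest, repr(empty))
```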
7644
7645
7646INTERNAL OPTION SETTING
7647
7648       The settings  of  the  PCRE2_CASELESS,  PCRE2_MULTILINE,  PCRE2_DOTALL,
7649       PCRE2_EXTENDED,  PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options
7650       can be changed from within the pattern by a  sequence  of  letters  en-
7651       closed  between  "(?"   and ")". These options are Perl-compatible, and
7652       are described in detail in the pcre2api documentation. The option  let-
7653       ters are:
7654
7655         i  for PCRE2_CASELESS
7656         m  for PCRE2_MULTILINE
7657         n  for PCRE2_NO_AUTO_CAPTURE
7658         s  for PCRE2_DOTALL
7659         x  for PCRE2_EXTENDED
7660         xx for PCRE2_EXTENDED_MORE
7661
7662       For example, (?im) sets caseless, multiline matching. It is also possi-
7663       ble to unset these options by preceding the relevant letters with a hy-
7664       phen,  for  example (?-im). The two "extended" options are not indepen-
7665       dent; unsetting either one cancels the effects of both of them.
7666
7667       A  combined  setting  and  unsetting  such  as  (?im-sx),  which   sets
7668       PCRE2_CASELESS  and  PCRE2_MULTILINE  while  unsetting PCRE2_DOTALL and
7669       PCRE2_EXTENDED, is also permitted. Only one hyphen may  appear  in  the
7670       options  string.  If a letter appears both before and after the hyphen,
7671       the option is unset. An empty options setting "(?)" is  allowed.  Need-
7672       less to say, it has no effect.
7673
7674       If  the  first character following (? is a circumflex, it causes all of
7675       the above options to be unset. Thus, (?^) is equivalent  to  (?-imnsx).
7676       Letters  may  follow  the circumflex to cause some options to be re-in-
7677       stated, but a hyphen may not appear.
7678
7679       The PCRE2-specific options PCRE2_DUPNAMES  and  PCRE2_UNGREEDY  can  be
7680       changed  in  the  same  way as the Perl-compatible options by using the
7681       characters J and U respectively. However, these are not unset by (?^).
7682
7683       When one of these option changes occurs at top level (that is, not  in-
7684       side  group  parentheses),  the  change applies to the remainder of the
7685       pattern that follows. An option change within a group (see below for  a
7686       description of groups) affects only that part of the group that follows
7687       it, so
7688
7689         (a(?i)b)c
7690
7691       matches abc and aBc and no other strings  (assuming  PCRE2_CASELESS  is
7692       not  used).   By this means, options can be made to have different set-
7693       tings in different parts of the pattern. Any changes made in one alter-
7694       native  do carry on into subsequent branches within the same group. For
7695       example,
7696
7697         (a(?i)b|c)
7698
7699       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
7700       first  branch  is  abandoned before the option setting. This is because
7701       the effects of option settings happen at compile time. There  would  be
7702       some very weird behaviour otherwise.
7703
7704       As  a  convenient shorthand, if any option settings are required at the
7705       start of a non-capturing group (see the next section), the option  let-
7706       ters may appear between the "?" and the ":". Thus the two patterns
7707
7708         (?i:saturday|sunday)
7709         (?:(?i)saturday|sunday)
7710
7711       match exactly the same set of strings.
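
       As a rough illustration of the shorthand, here is a sketch in  Py-
       thon's  standard  re  module, which is not PCRE2 but accepts the same
       (?i:...) form. Recent Python versions reject an option  setting  such
       as  (?i)  in  the middle of a pattern, so only the shorthand is shown;
       that restriction is Python's, not PCRE2's.

```python
import re

shorthand = re.compile(r"(?i:saturday|sunday)")

# The option applies to every branch of the group it opens.
results = [bool(shorthand.fullmatch(s))
           for s in ("Saturday", "SUNDAY", "monday")]

# Outside the group, matching is case-sensitive again.
scoped = re.compile(r"a(?i:b)c")
scoped_results = [bool(scoped.fullmatch(s))
                  for s in ("abc", "aBc", "Abc", "abC")]

print(results, scoped_results)
```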
7712
7713       Note:  There  are  other  PCRE2-specific options, applying to the whole
7714       pattern, which can be set by the application when the  compiling  func-
7715       tion  is  called.  In addition, the pattern can contain special leading
7716       sequences such as (*CRLF) to override what the application has  set  or
7717       what  has  been  defaulted.   Details are given in the section entitled
7718       "Newline sequences" above. There are also the (*UTF) and (*UCP) leading
7719       sequences  that can be used to set UTF and Unicode property modes; they
7720       are equivalent to setting the PCRE2_UTF and PCRE2_UCP options,  respec-
7721       tively.  However,  the  application  can  set  the  PCRE2_NEVER_UTF and
7722       PCRE2_NEVER_UCP options, which lock out  the  use  of  the  (*UTF)  and
7723       (*UCP) sequences.
7724
7725
7726GROUPS
7727
7728       Groups  are  delimited  by  parentheses  (round brackets), which can be
7729       nested.  Turning part of a pattern into a group does two things:
7730
7731       1. It localizes a set of alternatives. For example, the pattern
7732
7733         cat(aract|erpillar|)
7734
7735       matches "cataract", "caterpillar", or "cat". Without  the  parentheses,
7736       it would match "cataract", "erpillar" or an empty string.
7737
7738       2.  It  creates a "capture group". This means that, when the whole pat-
7739       tern matches, the portion of the subject string that matched the  group
7740       is  passed back to the caller, separately from the portion that matched
7741       the whole pattern.  (This applies  only  to  the  traditional  matching
7742       function; the DFA matching function does not support capturing.)
7743
7744       Opening parentheses are counted from left to right (starting from 1) to
7745       obtain numbers for capture groups. For example, if the string "the  red
7746       king" is matched against the pattern
7747
7748         the ((red|white) (king|queen))
7749
7750       the captured substrings are "red king", "red", and "king", and are num-
7751       bered 1, 2, and 3, respectively.
7752
7753       The fact that plain parentheses fulfil  two  functions  is  not  always
7754       helpful.   There are often times when grouping is required without cap-
7755       turing. If an opening parenthesis is followed by a question mark and  a
7756       colon,  the  group  does  not do any capturing, and is not counted when
7757       computing the number of any subsequent capture groups. For example,  if
7758       the string "the white queen" is matched against the pattern
7759
7760         the ((?:red|white) (king|queen))
7761
7762       the captured substrings are "white queen" and "queen", and are numbered
7763       1 and 2. The maximum number of capture groups is 65535.
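
       The numbering in the two examples above can be checked with a short
       sketch.  Python's standard re module is used here as a stand-in (an
       assumption for illustration); it counts opening parentheses in  the
       same way for these patterns.

```python
import re

# Every plain parenthesis captures; groups are numbered by opening order.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
numbered = capturing.groups()

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.match(r"the ((?:red|white) (king|queen))",
                         "the white queen")
renumbered = non_capturing.groups()

print(numbered, renumbered)
```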
7764
7765       As a convenient shorthand, if any option settings are required  at  the
7766       start  of  a non-capturing group, the option letters may appear between
7767       the "?" and the ":". Thus the two patterns
7768
7769         (?i:saturday|sunday)
7770         (?:(?i)saturday|sunday)
7771
7772       match exactly the same set of strings. Because alternative branches are
7773       tried  from  left  to right, and options are not reset until the end of
7774       the group is reached, an option setting in one branch does affect  sub-
7775       sequent branches, so the above patterns match "SUNDAY" as well as "Sat-
7776       urday".
7777
7778
7779DUPLICATE GROUP NUMBERS
7780
7781       Perl 5.10 introduced a feature whereby each alternative in a group uses
7782       the  same  numbers  for  its capturing parentheses. Such a group starts
7783       with (?| and is itself a non-capturing  group.  For  example,  consider
7784       this pattern:
7785
7786         (?|(Sat)ur|(Sun))day
7787
7788       Because  the two alternatives are inside a (?| group, both sets of cap-
7789       turing parentheses are numbered one. Thus, when  the  pattern  matches,
7790       you  can  look  at captured substring number one, whichever alternative
7791       matched. This construct is useful when you want to  capture  part,  but
7792       not all, of one of a number of alternatives. Inside a (?| group, paren-
7793       theses are numbered as usual, but the number is reset at the  start  of
7794       each  branch.  The numbers of any capturing parentheses that follow the
7795       whole group start after the highest number used in any branch. The fol-
7796       lowing example is taken from the Perl documentation. The numbers under-
7797       neath show in which buffer the captured content will be stored.
7798
7799         # before  ---------------branch-reset----------- after
7800         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
7801         # 1            2         2  3        2     3     4
7802
7803       A backreference to a capture group uses the most recent value  that  is
7804       set for the group. The following pattern matches "abcabc" or "defdef":
7805
7806         /(?|(abc)|(def))\1/
7807
7808       In  contrast, a subroutine call to a capture group always refers to the
7809       first one in the pattern with the given number. The  following  pattern
7810       matches "abcabc" or "defabc":
7811
7812         /(?|(abc)|(def))(?1)/
7813
7814       A relative reference such as (?-1) is no different: it is just a conve-
7815       nient way of computing an absolute group number.
7816
7817       If a condition test for a group's having matched refers to a non-unique
7818       number, the test is true if any group with that number has matched.
7819
7820       An  alternative approach to using this "branch reset" feature is to use
7821       duplicate named groups, as described in the next section.
7822
7823
7824NAMED CAPTURE GROUPS
7825
7826       Identifying capture groups by number is simple, but it can be very hard
7827       to  keep  track of the numbers in complicated patterns. Furthermore, if
7828       an expression is modified, the numbers may change. To  help  with  this
7829       difficulty,  PCRE2  supports the naming of capture groups. This feature
7830       was not added to Perl until release 5.10. Python had the  feature  ear-
7831       lier,  and PCRE1 introduced it at release 4.0, using the Python syntax.
7832       PCRE2 supports both the Perl and the Python syntax.
7833
7834       In PCRE2,  a  capture  group  can  be  named  in  one  of  three  ways:
7835       (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
7836       Names may be up to 32 code units long. When PCRE2_UTF is not set,  they
7837       may  contain  only  ASCII  alphanumeric characters and underscores, but
7838       must start with a non-digit. When PCRE2_UTF is set, the syntax of group
7839       names is extended to allow any Unicode letter or Unicode decimal digit.
7840       In other words, group names must match one of these patterns:
7841
7842         ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
7843         ^[_\p{L}][_\p{L}\p{Nd}]*\z  when PCRE2_UTF is set
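
       An application that builds patterns from user input may want to check
       names  before  compiling.  The following is a minimal sketch of such a
       check for the non-UTF rule only, written in  Python.  The  helper  name
       and  constant  are  hypothetical,  and the 32 limit is simply the one
       quoted above; for ASCII names, characters and code units coincide.

```python
import re

MAX_NAME_LENGTH = 32  # limit quoted in the text above, in code units

# Non-UTF rule from the text; fullmatch() stands in for the ^...\z anchors.
NON_UTF_NAME = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")


def is_valid_group_name(name: str) -> bool:
    """Hypothetical helper: check a name against the non-UTF rule."""
    if len(name) > MAX_NAME_LENGTH:
        return False
    return NON_UTF_NAME.fullmatch(name) is not None


print(is_valid_group_name("year"), is_valid_group_name("1st"))
```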
7844
7845       References to capture groups from other parts of the pattern,  such  as
7846       backreferences,  recursion,  and conditions, can all be made by name as
7847       well as by number.
7848
7849       Named capture groups are allocated numbers as well as names, exactly as
7850       if  the  names were not present. In both PCRE2 and Perl, capture groups
7851       are primarily identified by numbers; any names  are  just  aliases  for
7852       these numbers. The PCRE2 API provides function calls for extracting the
7853       complete name-to-number translation table from a compiled  pattern,  as
7854       well  as  convenience  functions  for extracting captured substrings by
7855       name.
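
       The idea that names are aliases for numbers can be illustrated with a
       sketch in Python's standard re module, which shares the (?P<name>...)
       syntax. This is only an analogy: the groupindex attribute is Python's
       own  name-to-number  table,  not  the  PCRE2  API,  which is described
       in the pcre2api documentation.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Name-to-number table: names are aliases for ordinary group numbers.
name_table = dict(date.groupindex)

m = date.fullmatch("2024-05")
by_name = m.group("year")   # lookup through the alias
by_number = m.group(1)      # lookup through the underlying number

print(name_table, by_name, by_number)
```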
7856
7857       Warning: When more than one capture group has the same number,  as  de-
7858       scribed in the previous section, a name given to one of them applies to
7859       all of them. Perl allows identically numbered groups to have  different
7860       names.  Consider this pattern, where there are two capture groups, both
7861       numbered 1:
7862
7863         (?|(?<AA>aa)|(?<BB>bb))
7864
7865       Perl allows this, with both names AA and BB  as  aliases  of  group  1.
7866       Thus, after a successful match, both names yield the same value (either
7867       "aa" or "bb").
7868
7869       In an attempt to reduce confusion, PCRE2 does not allow the same  group
7870       number to be associated with more than one name. The example above pro-
7871       vokes a compile-time error. However, there is still  scope  for  confu-
7872       sion. Consider this pattern:
7873
7874         (?|(?<AA>aa)|(bb))
7875
7876       Although the second group number 1 is not explicitly named, the name AA
7877       is still an alias for any group 1. Whether the pattern matches "aa"  or
7878       "bb", a reference by name to group AA yields the matched string.
7879
7880       By  default, a name must be unique within a pattern, except that dupli-
7881       cate names are permitted for groups with the same number, for example:
7882
7883         (?|(?<AA>aa)|(?<AA>bb))
7884
7885       The duplicate name constraint can be disabled by setting the PCRE2_DUP-
7886       NAMES option at compile time, or by the use of (?J) within the pattern,
7887       as described in the section entitled "Internal Option Setting" above.
7888
7889       Duplicate names can be useful for patterns where only one  instance  of
7890       the  named  capture group can match. Suppose you want to match the name
7891       of a weekday, either as a 3-letter abbreviation or as  the  full  name,
7892       and  in  both  cases you want to extract the abbreviation. This pattern
7893       (ignoring the line breaks) does the job:
7894
7895         (?J)
7896         (?<DN>Mon|Fri|Sun)(?:day)?|
7897         (?<DN>Tue)(?:sday)?|
7898         (?<DN>Wed)(?:nesday)?|
7899         (?<DN>Thu)(?:rsday)?|
7900         (?<DN>Sat)(?:urday)?
7901
7902       There are five capture groups, but only one is ever set after a  match.
7903       The  convenience  functions  for extracting the data by name return the
7904       substring for the first (and in this example, the only) group  of  that
7905       name that matched. This saves searching to find which numbered group it
7906       was. (An alternative way of solving this problem is to  use  a  "branch
7907       reset" group, as described in the previous section.)
7908
7909       If  you make a backreference to a non-unique named group from elsewhere
7910       in the pattern, the groups to which the name refers are checked in  the
7911       order  in  which they appear in the overall pattern. The first one that
7912       is set is used for the reference. For  example,  this  pattern  matches
7913       both "foofoo" and "barbar" but not "foobar" or "barfoo":
7914
7915         (?J)(?:(?<n>foo)|(?<n>bar))\k<n>
7916
7917
7918       If you make a subroutine call to a non-unique named group, the one that
7919       corresponds to the first occurrence of the name is used. In the absence
7920       of duplicate numbers this is the one with the lowest number.
7921
7922       If you use a named reference in a condition test (see the section about
7923       conditions below), either to check whether a capture group has matched,
7924       or to check for recursion, all groups with the same name are tested. If
7925       the condition is true for any one of them,  the  overall  condition  is
7926       true.  This is the same behaviour as testing by number. For further de-
7927       tails of the interfaces for handling  named  capture  groups,  see  the
7928       pcre2api documentation.
7929
7930
7931REPETITION
7932
7933       Repetition  is  specified  by  quantifiers, which can follow any of the
7934       following items:
7935
7936         a literal data character
7937         the dot metacharacter
7938         the \C escape sequence
7939         the \R escape sequence
7940         the \X escape sequence
7941         an escape such as \d or \pL that matches a single character
7942         a character class
7943         a backreference
7944         a parenthesized group (including lookaround assertions)
7945         a subroutine call (recursive or otherwise)
7946
7947       The general repetition quantifier specifies a minimum and maximum  num-
7948       ber  of  permitted matches, by giving the two numbers in curly brackets
7949       (braces), separated by a comma. The numbers must be  less  than  65536,
7950       and the first must be less than or equal to the second. For example,
7951
7952         z{2,4}
7953
7954       matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
7955       special character. If the second number is omitted, but  the  comma  is
7956       present,  there  is  no upper limit; if the second number and the comma
7957       are both omitted, the quantifier specifies an exact number of  required
7958       matches. Thus
7959
7960         [aeiou]{3,}
7961
7962       matches at least 3 successive vowels, but may match many more, whereas
7963
7964         \d{8}
7965
7966       matches  exactly  8  digits. An opening curly bracket that appears in a
7967       position where a quantifier is not allowed, or one that does not  match
7968       the  syntax of a quantifier, is taken as a literal character. For exam-
7969       ple, {,6} is not a quantifier, but a literal string of four characters.
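
       The three quantifier forms above behave as follows in a short sketch.
       Python's  standard  re  module  is used as a stand-in (an assumption
       for illustration). The literal-brace case {,6} is left  out  because
       engines differ there: Python treats it as a quantifier.

```python
import re

# {2,4}: between two and four repetitions.
bounded = [bool(re.fullmatch(r"z{2,4}", s))
           for s in ("z", "zz", "zzzz", "zzzzz")]

# {3,}: at least three, with no upper limit.
vowels = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight.
exact = [bool(re.fullmatch(r"\d{8}", s))
         for s in ("20240131", "2024013")]

print(bounded, vowels, exact)
```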
7970
7971       In UTF modes, quantifiers apply to characters rather than to individual
7972       code  units. Thus, for example, \x{100}{2} matches two characters, each
7973       of which is represented by a two-byte sequence in a UTF-8 string. Simi-
7974       larly,  \X{3} matches three Unicode extended grapheme clusters, each of
7975       which may be several code units long (and  they  may  be  of  different
7976       lengths).
7977
7978       The quantifier {0} is permitted, causing the expression to behave as if
7979       the previous item and the quantifier were not present. This may be use-
7980       ful  for  capture  groups that are referenced as subroutines from else-
7981       where in the pattern (but see also the section entitled "Defining  cap-
7982       ture groups for use by reference only" below). Except for parenthesized
7983       groups, items that have a {0} quantifier are omitted from the  compiled
7984       pattern.
7985
7986       For  convenience, the three most common quantifiers have single-charac-
7987       ter abbreviations:
7988
7989         *    is equivalent to {0,}
7990         +    is equivalent to {1,}
7991         ?    is equivalent to {0,1}
7992
7993       It is possible to construct infinite loops by following  a  group  that
7994       can  match no characters with a quantifier that has no upper limit, for
7995       example:
7996
7997         (a?)*
7998
7999       Earlier versions of Perl and PCRE1 used to give  an  error  at  compile
8000       time for such patterns. However, because there are cases where this can
8001       be useful, such patterns are now accepted, but whenever an iteration of
8002       such  a group matches no characters, matching moves on to the next item
8003       in the pattern instead of repeatedly matching  an  empty  string.  This
8004       does  not  prevent  backtracking into any of the iterations if a subse-
8005       quent item fails to match.
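
       A quick check that such a pattern terminates, sketched  in  Python's
       standard  re  module  (an  assumption  for  illustration; it guards
       against empty iterations in a comparable way). Only the overall match
       is  inspected,  because  engines  may  differ in what the group holds
       afterwards.

```python
import re

# The inner group can match the empty string, yet the repeat terminates.
whole = re.match(r"(a?)*b", "aab").group()

# With nothing to consume, the loop is left at once and "b" is tried.
only_b = re.match(r"(a?)*b", "b").group()

print(whole, only_b)
```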
8006
8007       By default, quantifiers are "greedy", that is, they match  as  much  as
8008       possible (up to the maximum number of permitted times), without causing
8009       the rest of the pattern to fail. The  classic  example  of  where  this
8010       gives  problems is in trying to match comments in C programs. These ap-
8011       pear between /* and */ and within the comment, individual * and / char-
8012       acters  may appear. An attempt to match C comments by applying the pat-
8013       tern
8014
8015         /\*.*\*/
8016
8017       to the string
8018
8019         /* first comment */  not comment  /* second comment */
8020
8021       fails, because it matches the entire string owing to the greediness  of
8022       the  .*  item. However, if a quantifier is followed by a question mark,
8023       it ceases to be greedy, and instead matches the minimum number of times
8024       possible, so the pattern
8025
8026         /\*.*?\*/
8027
8028       does  the  right  thing with the C comments. The meaning of the various
8029       quantifiers is not otherwise changed,  just  the  preferred  number  of
8030       matches.   Do  not  confuse this use of question mark with its use as a
8031       quantifier in its own right. Because it has two uses, it can  sometimes
8032       appear doubled, as in
8033
8034         \d??\d
8035
8036       which matches one digit by preference, but can match two if that is the
8037       only way the rest of the pattern matches.
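
       The difference between the greedy and non-greedy forms can  be  seen
       by  running both comment patterns over the sample string. This is a
       minimal sketch in Python's standard re module, used  as  a  stand-in
       (an assumption for illustration); the variable names are hypothetical.

```python
import re

source = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the string.
greedy = re.search(r"/\*.*\*/", source).group()

# Non-greedy .*? stops at the first */ after each /*.
lazy = re.findall(r"/\*.*?\*/", source)

# \d?? prefers zero digits, so one digit is matched by preference...
one_digit = re.match(r"\d??\d", "12").group()
# ...but two are matched when the rest of the pattern needs them.
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy == source, lazy, one_digit, two_digits)
```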
8038
8039       If the PCRE2_UNGREEDY option is set (an option that is not available in
8040       Perl),  the  quantifiers are not greedy by default, but individual ones
8041       can be made greedy by following them with a  question  mark.  In  other
8042       words, it inverts the default behaviour.
8043
8044       When  a  parenthesized  group is quantified with a minimum repeat count
8045       that is greater than 1 or with a limited maximum, more  memory  is  re-
8046       quired for the compiled pattern, in proportion to the size of the mini-
8047       mum or maximum.
8048
8049       If a pattern starts with  .*  or  .{0,}  and  the  PCRE2_DOTALL  option
8050       (equivalent  to  Perl's /s) is set, thus allowing the dot to match new-
8051       lines, the pattern is implicitly  anchored,  because  whatever  follows
8052       will  be  tried against every character position in the subject string,
8053       so there is no point in retrying the overall match at any position  af-
8054       ter  the  first. PCRE2 normally treats such a pattern as though it were
8055       preceded by \A.
8056
8057       In cases where it is known that the subject  string  contains  no  new-
8058       lines,  it  is worth setting PCRE2_DOTALL in order to obtain this opti-
8059       mization, or alternatively, using ^ to indicate anchoring explicitly.
8060
8061       However, there are some cases where the optimization  cannot  be  used.
8062       When  .*   is  inside  capturing  parentheses that are the subject of a
8063       backreference elsewhere in the pattern, a match at the start  may  fail
8064       where a later one succeeds. Consider, for example:
8065
8066         (.*)abc\1
8067
8068       If  the subject is "xyz123abc123" the match point is the fourth charac-
8069       ter. For this reason, such a pattern is not implicitly anchored.
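
       The match point in this example can be confirmed with a sketch in Py-
       thon's  standard  re  module (an assumption for illustration; it does
       not perform PCRE2's implicit anchoring, so it shows only  where  the
       match is found). Offsets are zero-based, so the fourth character is
       offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

start_offset = m.start()   # zero-based offset of the match
matched = m.group()
captured = m.group(1)

print(start_offset, matched, captured)
```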
8070
8071       Another case where implicit anchoring is not applied is when the  lead-
8072       ing  .* is inside an atomic group. Once again, a match at the start may
8073       fail where a later one succeeds. Consider this pattern:
8074
8075         (?>.*?a)b
8076
8077       It matches "ab" in the subject "aab". The use of the backtracking  con-
8078       trol  verbs  (*PRUNE)  and  (*SKIP) also disables this optimization, and
8079       there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
8080
8081       When a capture group is repeated, the value captured is  the  substring
8082       that matched the final iteration. For example, after
8083
8084         (tweedle[dume]{3}\s*)+
8085
8086       has matched "tweedledum tweedledee" the value of the captured substring
8087       is "tweedledee". However, if there are nested capture groups, the  cor-
8088       responding  captured  values  may have been set in previous iterations.
8089       For example, after
8090
8091         (a|(b))+
8092
8093       matches "aba" the value of the second captured substring is "b".
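
       Both examples can be reproduced with a sketch in  Python's  standard
       re  module,  which  is not PCRE2 but keeps captured values in the same
       way for these patterns (some other engines  reset  nested  groups  on
       each iteration).

```python
import re

# A repeated group reports the substring from its final iteration.
outer = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = outer.group(1)

# Group 2 took no part in the final iteration, but keeps its earlier value.
nested = re.match(r"(a|(b))+", "aba")
groups = nested.groups()

print(last_iteration, groups)
```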
8094
8095
8096ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
8097
8098       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
8099       repetition,  failure  of what follows normally causes the repeated item
8100       to be re-evaluated to see if a different number of repeats  allows  the
8101       rest  of  the pattern to match. Sometimes it is useful to prevent this,
8102       either to change the nature of the match, or to cause it to fail earlier
8103       than  it otherwise might, when the author of the pattern knows there is
8104       no point in carrying on.
8105
8106       Consider, for example, the pattern \d+foo when applied to  the  subject
8107       line
8108
8109         123456bar
8110
8111       After matching all 6 digits and then failing to match "foo", the normal
8112       action of the matcher is to try again with only 5 digits  matching  the
8113       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
8114       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
8115       the means for specifying that once a group has matched, it is not to be
8116       re-evaluated in this way.
8117
8118       If we use atomic grouping for the previous example, the  matcher  gives
8119       up  immediately  on failing to match "foo" the first time. The notation
8120       is a kind of special parenthesis, starting with (?> as in this example:
8121
8122         (?>\d+)foo
8123
8124       Perl 5.28 introduced an experimental alphabetic form starting  with  (*
8125       which may be easier to remember:
8126
8127         (*atomic:\d+)foo
8128
8129       This  kind of parenthesized group "locks up" the part of the pattern it
8130       contains once it has matched, and a failure further into the pattern is
8131       prevented  from  backtracking into it. Backtracking past it to previous
8132       items, however, works as normal.
8133
8134       An alternative description is that a group of this type matches exactly
8135       the  string  of  characters  that an identical standalone pattern would
8136       match, if anchored at the current point in the subject string.
8137
8138       Atomic groups are not capture groups. Simple cases such  as  the  above
8139       example  can be thought of as a maximizing repeat that must swallow ev-
8140       erything it can.  So, while both \d+ and \d+? are  prepared  to  adjust
8141       the  number  of digits they match in order to make the rest of the pat-
8142       tern match, (?>\d+) can only match an entire sequence of digits.
8143
8144       Atomic groups in general can of course contain arbitrarily  complicated
8145       expressions, and can be nested. However, when the contents of an atomic
8146       group are just a single repeated item, as in the example above, a  sim-
8147       pler  notation, called a "possessive quantifier", can be used. This con-
8148       sists of an additional + character following a quantifier.  Using  this
8149       notation, the previous example can be rewritten as
8150
8151         \d++foo
8152
8153       Note that a possessive quantifier can be used with an entire group, for
8154       example:
8155
8156         (abc|xyz){2,3}+
8157
8158       Possessive quantifiers are always greedy; the setting of the  PCRE2_UN-
8159       GREEDY  option  is ignored. They are a convenient notation for the sim-
8160       pler forms of atomic group. However, there  is  no  difference  in  the
8161       meaning  of  a  possessive  quantifier and the equivalent atomic group,
8162       though there may be a performance  difference;  possessive  quantifiers
8163       should be slightly faster.
8164
8165       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
8166       tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
8167       edition of his book. Mike McCloskey liked it, so implemented it when he
8168       built Sun's Java package, and PCRE1 copied it from there. It found  its
8169       way into Perl at release 5.10.
8170
8171       PCRE2  has  an  optimization  that automatically "possessifies" certain
8172       simple pattern constructs. For example, the sequence A+B is treated  as
8173       A++B  because  there is no point in backtracking into a sequence of A's
8174       when B must follow.  This feature can be disabled by the PCRE2_NO_AUTO-
8175       POSSESS option, or starting the pattern with (*NO_AUTO_POSSESS).
8176
8177       When a pattern contains an unlimited repeat inside a group that can it-
8178       self be repeated an unlimited number of times, the  use  of  an  atomic
8179       group  is the only way to avoid some failing matches taking a very long
8180       time indeed. The pattern
8181
8182         (\D+|<\d+>)*[!?]
8183
8184       matches an unlimited number of substrings that either consist  of  non-
8185       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
8186       matches, it runs quickly. However, if it is applied to
8187
8188         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
8189
8190       it takes a long time before reporting  failure.  This  is  because  the
8191       string  can be divided between the internal \D+ repeat and the external
8192       * repeat in a large number of ways, and all have to be tried. (The  ex-
8193       ample uses [!?] rather than a single character at the end, because both
8194       PCRE2 and Perl have an optimization that allows for fast failure when a
8195       single  character is used. They remember the last single character that
8196       is required for a match, and fail early if it is  not  present  in  the
8197       string.)  If  the  pattern  is changed so that it uses an atomic group,
8198       like this:
8199
8200         ((?>\D+)|<\d+>)*[!?]
8201
8202       sequences of non-digits cannot be broken, and failure happens quickly.
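
       The fast failure can be sketched in Python. To keep the  example  us-
       able  on  Python  versions before 3.11, which lack (?>...), it does
       not use an atomic group directly. Instead it uses a well-known emula-
       tion,  a  lookahead  that captures followed by a backreference, which
       is an assumption of this sketch and not something PCRE2 requires.

```python
import re

# (?=(\D+))\1 emulates the atomic group (?>\D+): the lookahead captures
# the longest run of non-digits, and the backreference then consumes
# exactly that run, leaving nothing to backtrack into.
atomic_like = re.compile(r"(?:(?=(\D+))\1|<\d+>)*[!?]")

subject = "a" * 52

# The run of a's cannot be split up, so failure is reported quickly.
failed_fast = atomic_like.match(subject) is None

# A subject that still matches with the atomic form.
ok = atomic_like.match("<12>!").group()

print(failed_fast, ok)
```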
8203
8204
8205BACKREFERENCES
8206
8207       Outside a character class, a backslash followed by a digit greater than
8208       0  (and  possibly further digits) is a backreference to a capture group
8209       earlier (that is, to its left) in the pattern, provided there have been
8210       that many previous capture groups.
8211
8212       However,  if the decimal number following the backslash is less than 8,
8213       it is always taken as a backreference, and  causes  an  error  only  if
8214       there  are not that many capture groups in the entire pattern. In other
8215       words, the group that is referenced need not be to the left of the ref-
8216       erence  for numbers less than 8. A "forward backreference" of this type
8217       can make sense when a repetition is involved and the group to the right
8218       has participated in an earlier iteration.
8219
8220       It  is  not  possible  to have a numerical "forward backreference" to a
8221       group whose number is 8 or more using this syntax  because  a  sequence
8222       such  as  \50  is  interpreted as a character defined in octal. See the
8223       subsection entitled "Non-printing characters" above for further details
8224       of  the  handling of digits following a backslash. Other forms of back-
8225       referencing do not suffer from this restriction. In  particular,  there
8226       is no problem when named capture groups are used (see below).
8227
8228       Another  way  of  avoiding  the ambiguity inherent in the use of digits
8229       following a backslash is to use the \g  escape  sequence.  This  escape
8230       must be followed by a signed or unsigned number, optionally enclosed in
8231       braces. These examples are all identical:
8232
8233         (ring), \1
8234         (ring), \g1
8235         (ring), \g{1}
8236
8237       An unsigned number specifies an absolute reference without the  ambigu-
8238       ity that is present in the older syntax. It is also useful when literal
8239       digits follow the reference. A signed number is a  relative  reference.
8240       Consider this example:
8241
8242         (abc(def)ghi)\g{-1}
8243
8244       The sequence \g{-1} is a reference to the most recently started capture
8245       group before \g, that is, it is equivalent to \2 in this example. Simi-
8246       larly, \g{-2} would be equivalent to \1. The use of relative references
8247       can be helpful in long patterns, and also in patterns that are  created
8248       by  joining  together  fragments  that  contain references within them-
8249       selves.
8250
8251       The sequence \g{+1} is a reference to the next capture group. This kind
8252       of  forward  reference can be useful in patterns that repeat. Perl does
8253       not support the use of + in this way.
8254
8255       A backreference matches whatever actually  most  recently  matched  the
8256       capture  group  in  the current subject string, rather than anything at
8257       all that matches the group (see "Groups as subroutines" below for a way
8258       of doing that). So the pattern
8259
8260         (sens|respons)e and \1ibility
8261
8262       matches  "sense and sensibility" and "response and responsibility", but
8263       not "sense and responsibility". If caseful matching is in force at  the
8264       time  of  the backreference, the case of letters is relevant. For exam-
8265       ple,
8266
8267         ((?i)rah)\s+\1
8268
8269       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
8270       original capture group is matched caselessly.
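
       The "sense and sensibility" behaviour can be observed with any engine
       that implements Perl-style backreferences. The sketch below uses
       Python's standard re module purely as a stand-in (it is not PCRE2, and
       the helper name matches is an assumption for illustration); for this
       particular pattern the two engines are expected to agree.

```python
import re

PATTERN = re.compile(r"(sens|respons)e and \1ibility")


def matches(subject: str) -> bool:
    """True if the whole subject matches the pattern."""
    return PATTERN.fullmatch(subject) is not None


# \1 must repeat the text that group 1 actually captured in this subject.
for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject, "->", matches(subject))
```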
8271
8272       There  are  several  different  ways of writing backreferences to named
8273       capture groups. The .NET syntax \k{name} and the Perl  syntax  \k<name>
8274       or  \k'name'  are  supported,  as  is the Python syntax (?P=name). Perl
8275       5.10's unified backreference syntax, in which \g can be used  for  both
8276       numeric  and  named references, is also supported. We could rewrite the
8277       above example in any of the following ways:
8278
8279         (?<p1>(?i)rah)\s+\k<p1>
8280         (?'p1'(?i)rah)\s+\k{p1}
8281         (?P<p1>(?i)rah)\s+(?P=p1)
8282         (?<p1>(?i)rah)\s+\g{p1}
8283
8284       A capture group that is referenced by name may appear  in  the  pattern
8285       before or after the reference.
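
       Python's re module is where the (?P<name>...) and (?P=name) spellings
       come from, so the third form above can be tried there almost unchanged.
       One assumption to note: Python 3.11 and later reject (?i) in the middle
       of a pattern, so this sketch uses the scoped form (?i:rah) instead. That
       is Python behaviour, not PCRE2.

```python
import re

# (?i:rah) is Python's scoped equivalent of the in-group (?i) used above.
pattern = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

# The group is matched caselessly, but the backreference compares exactly.
for subject in ("rah rah", "RAH RAH", "RAH rah"):
    m = pattern.fullmatch(subject)
    print(repr(subject), "->", m.group("p1") if m else None)
```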
8286
8287       There  may be more than one backreference to the same group. If a group
8288       has not actually been used in a particular match, backreferences to  it
8289       always fail by default. For example, the pattern
8290
8291         (a|(bc))\2
8292
8293       always  fails  if  it starts to match "a" rather than "bc". However, if
8294       the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
8295       erence to an unset value matches an empty string.
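
       The default behaviour (a backreference to an unset group fails) is
       shared by Python's re module, which is used here only as a convenient
       stand-in. Python's re has no counterpart to PCRE2_MATCH_UNSET_BACKREF,
       so only the default case is sketched.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 captured "bc", so \2 has something to match.
print(pattern.match("bcbc"))

# The first alternative matched "a"; group 2 never took part, so \2 fails.
print(pattern.match("a"))
print(pattern.match("aa"))
```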
8296
8297       Because  there may be many capture groups in a pattern, all digits fol-
8298       lowing a backslash are taken as part of a potential backreference  num-
8299       ber.  If  the  pattern continues with a digit character, some delimiter
8300       must be used to terminate the backreference. If the  PCRE2_EXTENDED  or
8301       PCRE2_EXTENDED_MORE  option is set, this can be white space. Otherwise,
8302       the \g{} syntax or an empty comment (see "Comments" below) can be used.
8303
8304   Recursive backreferences
8305
8306       A backreference that occurs inside the group to which it  refers  fails
8307       when  the  group  is  first used, so, for example, (a\1) never matches.
8308       However, such references can be useful inside repeated groups. For  ex-
8309       ample, the pattern
8310
8311         (a|b\1)+
8312
8313       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
8314       ation of the group, the backreference matches the character string cor-
8315       responding  to  the  previous iteration. In order for this to work, the
8316       pattern must be such that the first iteration does not  need  to  match
8317       the  backreference. This can be done using alternation, as in the exam-
8318       ple above, or by a quantifier with a minimum of zero.
8319
8320       In versions of PCRE2 earlier than 10.25, backreferences of this type
8321       caused the group that they reference to be treated as an atomic
8322       group.  This restriction no longer applies, and backtracking into  such
8323       groups can occur as normal.
8324
8325
8326ASSERTIONS
8327
8328       An  assertion  is  a  test on the characters following or preceding the
8329       current matching point that does not consume any characters. The simple
8330       assertions  coded  as  \b,  \B,  \A,  \G, \Z, \z, ^ and $ are described
8331       above.
8332
8333       More complicated assertions are coded as  parenthesized  groups.  There
8334       are  two  kinds:  those  that look ahead of the current position in the
8335       subject string, and those that look behind it, and in each case an  as-
8336       sertion  may  be  positive (must match for the assertion to be true) or
8337       negative (must not match for the assertion to be  true).  An  assertion
8338       group is matched in the normal way, and if it is true, matching contin-
8339       ues after it, but with the matching position in the subject string  re-
8340       set to what it was before the assertion was processed.
8341
8342       The  Perl-compatible  lookaround assertions are atomic. If an assertion
8343       is true, but there is a subsequent matching failure, there is no  back-
8344       tracking  into  the assertion. However, there are some cases where non-
8345       atomic assertions can be useful. PCRE2 has some support for these,  de-
8346       scribed in the section entitled "Non-atomic assertions" below, but they
8347       are not Perl-compatible.
8348
8349       A lookaround assertion may appear as the  condition  in  a  conditional
8350       group  (see  below). In this case, the result of matching the assertion
8351       determines which branch of the condition is followed.
8352
8353       Assertion groups are not capture groups. If an assertion contains  cap-
8354       ture  groups within it, these are counted for the purposes of numbering
8355       the capture groups in the whole pattern. Within each branch of  an  as-
8356       sertion,  locally  captured  substrings  may be referenced in the usual
8357       way. For example, a sequence such as (.)\g{-1} can  be  used  to  check
8358       that two adjacent characters are the same.
8359
8360       When  a  branch within an assertion fails to match, any substrings that
8361       were captured are discarded (as happens with any  pattern  branch  that
8362       fails  to  match).  A  negative  assertion  is  true  only when all its
8363       branches fail to match; this means that no captured substrings are ever
8364       retained  after a successful negative assertion. When an assertion con-
8365       tains a matching branch, what happens depends on the type of assertion.
8366
8367       For a positive assertion, internally captured substrings  in  the  suc-
8368       cessful  branch are retained, and matching continues with the next pat-
8369       tern item after the assertion. For a  negative  assertion,  a  matching
8370       branch  means  that  the assertion is not true. If such an assertion is
8371       being used as a condition in a conditional group (see below),  captured
8372       substrings  are  retained,  because  matching  continues  with the "no"
8373       branch of the condition. For other failing negative assertions, control
8374       passes to the previous backtracking point, thus discarding any captured
8375       strings within the assertion.
8376
8377       Most assertion groups may be repeated; though it makes no sense to  as-
8378       sert the same thing several times, the side effect of capturing in pos-
8379       itive assertions may occasionally be useful. However, an assertion that
8380       forms  the  condition  for  a  conditional group may not be quantified.
8381       PCRE2 used to restrict the repetition of assertions, but  from  release
8382       10.35  the  only restriction is that an unlimited maximum repetition is
8383       changed to be one more than the minimum. For example, {3,}  is  treated
8384       as {3,4}.
8385
8386   Alphabetic assertion names
8387
8388       Traditionally,  symbolic  sequences such as (?= and (?<= have been used
8389       to specify lookaround assertions. Perl 5.28 introduced some  experimen-
8390       tal alphabetic alternatives which might be easier to remember. They all
8391       start with (* instead of (? and must be written using lower  case  let-
8392       ters. PCRE2 supports the following synonyms:
8393
8394         (*positive_lookahead:  or (*pla: is the same as (?=
8395         (*negative_lookahead:  or (*nla: is the same as (?!
8396         (*positive_lookbehind: or (*plb: is the same as (?<=
8397         (*negative_lookbehind: or (*nlb: is the same as (?<!
8398
8399       For  example,  (*pla:foo) is the same assertion as (?=foo). In the fol-
8400       lowing sections, the various assertions are described using the  origi-
8401       nal symbolic forms.
8402
8403   Lookahead assertions
8404
8405       Lookahead assertions start with (?= for positive assertions and (?! for
8406       negative assertions. For example,
8407
8408         \w+(?=;)
8409
8410       matches a word followed by a semicolon, but does not include the  semi-
8411       colon in the match, and
8412
8413         foo(?!bar)
8414
8415       matches  any  occurrence  of  "foo" that is not followed by "bar". Note
8416       that the apparently similar pattern
8417
8418         (?!foo)bar
8419
8420       does not find an occurrence of "bar"  that  is  preceded  by  something
8421       other  than "foo"; it finds any occurrence of "bar" whatsoever, because
8422       the assertion (?!foo) is always true when the next three characters are
8423       "bar". A lookbehind assertion is needed to achieve the other effect.
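
       The three lookahead patterns above, plus the lookbehind form that gives
       "the other effect", can be compared side by side. The sketch uses
       Python's re module as a stand-in for PCRE2 (an assumption: for these
       simple assertions the two engines behave alike); the subject strings
       are made up for illustration.

```python
import re

# A word followed by a semicolon; the semicolon is not part of the match.
print(re.search(r"\w+(?=;)", "x = total;").group())                      # total

# Only the "foo" at offset 7 is not followed by "bar".
print([m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")])  # [7]

# (?!foo) looks at the NEXT characters, which are "bar", so it is true.
print(re.search(r"(?!foo)bar", "foobar").span())                         # (3, 6)

# A lookbehind is what actually rejects "bar" preceded by "foo".
print(re.search(r"(?<!foo)bar", "foobar"))                               # None
```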
8424
8425       If you want to force a matching failure at some point in a pattern, the
8426       most convenient way to do it is with (?!) because an empty  string  al-
8427       ways  matches,  so  an assertion that requires there not to be an empty
8428       string must always fail.  The backtracking control verb (*FAIL) or (*F)
8429       is a synonym for (?!).
8430
8431   Lookbehind assertions
8432
8433       Lookbehind  assertions start with (?<= for positive assertions and (?<!
8434       for negative assertions. For example,
8435
8436         (?<!foo)bar
8437
8438       does find an occurrence of "bar" that is not  preceded  by  "foo".  The
8439       contents  of  a  lookbehind  assertion are restricted such that all the
8440       strings it matches must have a fixed length. However, if there are sev-
8441       eral  top-level  alternatives,  they  do  not all have to have the same
8442       fixed length. Thus
8443
8444         (?<=bullock|donkey)
8445
8446       is permitted, but
8447
8448         (?<!dogs?|cats?)
8449
8450       causes an error at compile time. Branches that match  different  length
8451       strings  are permitted only at the top level of a lookbehind assertion.
8452       This is an extension compared with Perl, which requires all branches to
8453       match the same length of string. An assertion such as
8454
8455         (?<=ab(c|de))
8456
8457       is  not  permitted,  because  its single top-level branch can match two
8458       different lengths, but it is acceptable to PCRE2 if  rewritten  to  use
8459       two top-level branches:
8460
8461         (?<=abc|abde)
8462
8463       In  some  cases, the escape sequence \K (see above) can be used instead
8464       of a lookbehind assertion to get round the fixed-length restriction.
8465
8466       The implementation of lookbehind assertions is, for  each  alternative,
8467       to  temporarily  move the current position back by the fixed length and
8468       then try to match. If there are insufficient characters before the cur-
8469       rent position, the assertion fails.
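
       The step-back procedure just described can be sketched for the simple
       case where every top-level alternative is a literal string. This is an
       illustration only, not PCRE2's code; the function name lookbehind and
       its parameters are hypothetical.

```python
from typing import Sequence


def lookbehind(subject: str, pos: int, alternatives: Sequence[str]) -> bool:
    """Positive lookbehind over literal alternatives, tested at offset pos."""
    for alt in alternatives:
        start = pos - len(alt)         # move back by this branch's fixed length
        if start < 0:
            continue                   # insufficient characters before pos
        if subject[start:pos] == alt:  # then try to match forwards from there
            return True
    return False


# Equivalent in spirit to (?<=bullock|donkey) tested at the end of each subject.
print(lookbehind("a donkey", 8, ("bullock", "donkey")))
print(lookbehind("key", 3, ("bullock", "donkey")))
```

       Because each alternative carries its own length, the branches do not
       need to agree on how far to step back, which is why different fixed
       lengths are acceptable at the top level.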
8470
8471       In  UTF-8  and  UTF-16 modes, PCRE2 does not allow the \C escape (which
8472       matches a single code unit even in a UTF mode) to appear in  lookbehind
8473       assertions,  because  it makes it impossible to calculate the length of
8474       the lookbehind. The \X and \R escapes, which can match  different  num-
8475       bers of code units, are never permitted in lookbehinds.
8476
8477       "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in
8478       lookbehinds, as long as the called capture group matches a fixed-length
8479       string.  However,  recursion, that is, a "subroutine" call into a group
8480       that is already active, is not supported.
8481
8482       Perl does not support backreferences in lookbehinds. PCRE2 does support
8483       them,  but  only  if  certain  conditions  are met. The PCRE2_MATCH_UN-
8484       SET_BACKREF option must not be set, there must be no use of (?| in  the
8485       pattern  (it creates duplicate group numbers), and if the backreference
8486       is by name, the name must be unique. Of course,  the  referenced  group
8487       must  itself  match  a  fixed  length  substring. The following pattern
8488       matches words containing at least two characters  that  begin  and  end
8489       with the same character:
8490
8491          \b(\w)\w++(?<=\1)
8492
8493       Possessive  quantifiers  can be used in conjunction with lookbehind as-
8494       sertions to specify efficient matching of fixed-length strings  at  the
8495       end of subject strings. Consider a simple pattern such as
8496
8497         abcd$
8498
8499       when  applied  to  a  long string that does not match. Because matching
8500       proceeds from left to right, PCRE2 will look for each "a" in  the  sub-
8501       ject  and  then see if what follows matches the rest of the pattern. If
8502       the pattern is specified as
8503
8504         ^.*abcd$
8505
8506       the initial .* matches the entire string at first, but when this  fails
8507       (because there is no following "a"), it backtracks to match all but the
8508       last character, then all but the last two characters, and so  on.  Once
8509       again  the search for "a" covers the entire string, from right to left,
8510       so we are no better off. However, if the pattern is written as
8511
8512         ^.*+(?<=abcd)
8513
8514       there can be no backtracking for the .*+ item because of the possessive
8515       quantifier; it can match only the entire string. The subsequent lookbe-
8516       hind assertion does a single test on the last four  characters.  If  it
8517       fails,  the  match  fails  immediately. For long strings, this approach
8518       makes a significant difference to the processing time.
8519
8520   Using multiple assertions
8521
8522       Several assertions (of any sort) may occur in succession. For example,
8523
8524         (?<=\d{3})(?<!999)foo
8525
8526       matches "foo" preceded by three digits that are not "999". Notice  that
8527       each  of  the  assertions is applied independently at the same point in
8528       the subject string. First there is a  check  that  the  previous  three
8529       characters  are  all  digits,  and  then there is a check that the same
8530       three characters are not "999". This pattern does not match "foo" pre-
8531       ceded by six characters, the first three of which are digits and the last
8532       three of which are not "999". For example, it  doesn't  match  "123abc-
8533       foo". A pattern to do that is
8534
8535         (?<=\d{3}...)(?<!999)foo
8536
8537       This  time  the  first assertion looks at the preceding six characters,
8538       checking that the first three are digits, and then the second assertion
8539       checks that the preceding three characters are not "999".
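
       The difference between the two patterns is easiest to see by running
       both over the same subjects. The sketch below uses Python's re module
       as a stand-in for PCRE2 (an assumption: both lookbehinds are fixed
       length, which both engines accept); the variable names and subject
       strings are chosen for illustration.

```python
import re

three_digits = re.compile(r"(?<=\d{3})(?<!999)foo")
six_chars = re.compile(r"(?<=\d{3}...)(?<!999)foo")

# Both assertions in each pattern are tested at the same point: just before "foo".
for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          bool(three_digits.search(subject)),
          bool(six_chars.search(subject)))
```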
8540
8541       Assertions can be nested in any combination. For example,
8542
8543         (?<=(?<!foo)bar)baz
8544
8545       matches  an occurrence of "baz" that is preceded by "bar" which in turn
8546       is not preceded by "foo", while
8547
8548         (?<=\d{3}(?!999)...)foo
8549
8550       is another pattern that matches "foo" preceded by three digits and  any
8551       three characters that are not "999".
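
       Nested assertions can be tried in the same way. This sketch again uses
       Python's re module as a stand-in for PCRE2, on the assumption that it
       treats these fixed-length nested lookarounds the same way; the subject
       strings are made up for illustration.

```python
import re

# "baz" preceded by "bar", where that "bar" is not itself preceded by "foo".
nested_behind = re.compile(r"(?<=(?<!foo)bar)baz")
print(bool(nested_behind.search("xxxbarbaz")))
print(bool(nested_behind.search("foobarbaz")))

# A lookahead nested inside a lookbehind: three digits, then three
# characters that are not "999", then "foo".
nested_ahead = re.compile(r"(?<=\d{3}(?!999)...)foo")
for subject in ("123abcfoo", "123999foo", "abc123foo"):
    print(subject, bool(nested_ahead.search(subject)))
```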
8552
8553
8554NON-ATOMIC ASSERTIONS
8555
8556       The  traditional Perl-compatible lookaround assertions are atomic. That
8557       is, if an assertion is true, but there is a subsequent  matching  fail-
8558       ure,  there  is  no backtracking into the assertion. However, there are
8559       some cases where non-atomic positive assertions can  be  useful.  PCRE2
8560       provides these using the following syntax:
8561
8562         (*non_atomic_positive_lookahead:  or (*napla: or (?*
8563         (*non_atomic_positive_lookbehind: or (*naplb: or (?<*
8564
8565       Consider  the  problem  of finding the right-most word in a string that
8566       also appears earlier in the string, that is, it must  appear  at  least
8567       twice  in  total.  This pattern returns the required result as captured
8568       substring 1:
8569
8570         ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
8571
8572       For a subject such as "word1 word2 word3 word2 word3 word4" the  result
8573       is  "word3".  How does it work? At the start, ^(?x) anchors the pattern
8574       and sets the "x" option, which causes white space (introduced for read-
8575       ability)  to  be  ignored. Inside the assertion, the greedy .* at first
8576       consumes the entire string, but then has to backtrack until the rest of
8577       the  assertion can match a word, which is captured by group 1. In other
8578       words, when the assertion first succeeds, it  captures  the  right-most
8579       word in the string.
8580
8581       The  current  matching point is then reset to the start of the subject,
8582       and the rest of the pattern match checks for  two  occurrences  of  the
8583       captured  word,  using  an  ungreedy .*? to scan from the left. If this
8584       succeeds, we are done, but if the last word in the string does not  oc-
8585       cur  twice,  this  part  of  the pattern fails. If a traditional atomic
8586       lookahead (?= or (*pla: had been used, the assertion could not be re-en-
8587       tered,  and  the whole match would fail. The pattern would succeed only
8588       if the very last word in the subject was found twice.
8589
8590       Using a non-atomic lookahead, however, means that when  the  last  word
8591       does  not  occur  twice  in the string, the lookahead can backtrack and
8592       find the second-last word, and so on, until either the match  succeeds,
8593       or all words have been tested.
8594
8595       Two conditions must be met for a non-atomic assertion to be useful: the
8596       contents of one or more capturing groups must change after a  backtrack
8597       into  the  assertion,  and  there  must be a backreference to a changed
8598       group later in the pattern. If this is not the case, the  rest  of  the
8599       pattern  match  fails exactly as before because nothing has changed, so
8600       using a non-atomic assertion just wastes resources.
8601
8602       There is one exception to backtracking into a non-atomic assertion.  If
8603       an  (*ACCEPT)  control verb is triggered, the assertion succeeds atomi-
8604       cally. That is, a subsequent match failure cannot  backtrack  into  the
8605       assertion.
8606
8607       Non-atomic  assertions  are  not  supported by the alternative matching
8608       function pcre2_dfa_match(). They are supported by JIT, but only if they
8609       do not contain any control verbs such as (*ACCEPT). (This may change in
8610       the future.) Note that assertions that appear as conditions for condi-
8611       tional groups (see below) must be atomic.
8612
8613
8614SCRIPT RUNS
8615
8616       In  concept, a script run is a sequence of characters that are all from
8617       the same Unicode script such as Latin or Greek. However,  because  some
8618       scripts  are  commonly  used together, and because some diacritical and
8619       other marks are used with multiple scripts,  it  is  not  that  simple.
8620       There is a full description of the rules that PCRE2 uses in the section
8621       entitled "Script Runs" in the pcre2unicode documentation.
8622
8623       If part of a pattern is enclosed between (*script_run: or (*sr:  and  a
8624       closing  parenthesis,  it  fails  if the sequence of characters that it
8625       matches is not a script run. After a failure, normal backtracking  oc-
8626       curs.  Script runs can be used to detect spoofing attacks using charac-
8627       ters that look the same, but are from  different  scripts.  The  string
8628       "paypal.com"  is an infamous example, where the letters could be a mix-
8629       ture of Latin and Cyrillic. This pattern ensures that the matched char-
8630       acters in a sequence of non-spaces that follow white space are a script
8631       run:
8632
8633         \s+(*sr:\S+)
8634
8635       To be sure that they are all from the Latin  script  (for  example),  a
8636       lookahead can be used:
8637
8638         \s+(?=\p{Latin})(*sr:\S+)
8639
8640       This works as long as the first character is expected to be a character
8641       in that script, and not (for example)  punctuation,  which  is  allowed
8642       with  any script. If this is not the case, a more creative lookahead is
8643       needed. For example, if digits, underscore, and dots are  permitted  at
8644       the start:
8645
8646         \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
8647
8648
8649       In  many  cases, backtracking into a script run pattern fragment is not
8650       desirable. The script run can employ an atomic group to  prevent  this.
8651       Because  this is a common requirement, a shorthand notation is provided
8652       by (*atomic_script_run: or (*asr:
8653
8654         (*asr:...) is the same as (*sr:(?>...))
8655
8656       Note that the atomic group is inside the script run. Putting it outside
8657       would not prevent backtracking into the script run pattern.
8658
8659       Support  for  script runs is not available if PCRE2 is compiled without
8660       Unicode support. A compile-time error is given if any of the above con-
8661       structs  is encountered. Script runs are not supported by the alternate
8662       matching function, pcre2_dfa_match(), because they use the same  mecha-
8663       nism as capturing parentheses.
8664
8665       Warning:  The  (*ACCEPT)  control  verb  (see below) should not be used
8666       within a script run group, because it causes an immediate exit from the
8667       group, bypassing the script run checking.
8668
8669
8670CONDITIONAL GROUPS
8671
8672       It is possible to cause the matching process to obey a pattern fragment
8673       conditionally or to choose between two alternative fragments, depending
8674       on  the result of an assertion, or whether a specific capture group has
8675       already been matched. The two possible forms of conditional group are:
8676
8677         (?(condition)yes-pattern)
8678         (?(condition)yes-pattern|no-pattern)
8679
8680       If the condition is satisfied, the yes-pattern is used;  otherwise  the
8681       no-pattern  (if present) is used. An absent no-pattern is equivalent to
8682       an empty string (it always matches). If there are more than two  alter-
8683       natives  in the group, a compile-time error occurs. Each of the two al-
8684       ternatives may itself contain nested groups of any form, including con-
8685       ditional  groups;  the  restriction to two alternatives applies only at
8686       the level of the condition itself. This pattern fragment is an  example
8687       where the alternatives are complex:
8688
8689         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
8690
8691
8692       There are five kinds of condition: references to capture groups, refer-
8693       ences to recursion, two pseudo-conditions called  DEFINE  and  VERSION,
8694       and assertions.
8695
8696   Checking for a used capture group by number
8697
8698       If  the  text between the parentheses consists of a sequence of digits,
8699       the condition is true if a capture group of that number has  previously
8700       matched.  If  there is more than one capture group with the same number
8701       (see the earlier section about duplicate group numbers), the  condition
8702       is true if any of them have matched. An alternative notation is to pre-
8703       cede the digits with a plus or minus sign. In this case, the group num-
8704       ber  is relative rather than absolute. The most recently opened capture
8705       group can be referenced by (?(-1), the next most recent by (?(-2),  and
8706       so  on.  Inside  loops  it  can  also make sense to refer to subsequent
8707       groups. The next capture group can be referenced as (?(+1), and so  on.
8708       (The  value  zero in any of these forms is not used; it provokes a com-
8709       pile-time error.)
8710
8711       Consider the following pattern, which  contains  non-significant  white
8712       space  to  make it more readable (assume the PCRE2_EXTENDED option) and
8713       to divide it into three parts for ease of discussion:
8714
8715         ( \( )?    [^()]+    (?(1) \) )
8716
8717       The first part matches an optional opening  parenthesis,  and  if  that
8718       character is present, sets it as the first captured substring. The sec-
8719       ond part matches one or more characters that are not  parentheses.  The
8720       third  part  is a conditional group that tests whether or not the first
8721       capture group matched. If it did, that is, if the subject started with an
8722       opening  parenthesis,  the condition is true, and so the yes-pattern is
8723       executed and a closing parenthesis is required.  Otherwise,  since  no-
8724       pattern is not present, the conditional group matches nothing. In other
8725       words, this pattern matches a sequence of  non-parentheses,  optionally
8726       enclosed in parentheses.
8727
8728       If  you  were  embedding  this pattern in a larger one, you could use a
8729       relative reference:
8730
8731         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
8732
8733       This makes the fragment independent of the parentheses  in  the  larger
8734       pattern.
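
       The numbered form of this conditional can be exercised with Python's re
       module, which accepts the same (?(1)...) syntax and has an equivalent
       of the extended option in re.VERBOSE. Python is only a stand-in for
       PCRE2 here, and the helper name whole is an assumption for illustration.

```python
import re

# Same three parts as in the text: optional "(", non-parentheses, conditional ")".
pattern = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def whole(subject: str) -> bool:
    """True if the entire subject matches the pattern."""
    return pattern.fullmatch(subject) is not None


for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(repr(subject), whole(subject))
```

       Requiring the whole subject to match matters for this demonstration: an
       unanchored search would still find "abc" inside "(abc".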
8735
8736   Checking for a used capture group by name
8737
8738       Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
8739       used capture group by name. For compatibility with earlier versions  of
8740       PCRE1,  which had this facility before Perl, the syntax (?(name)...) is
8741       also recognized.  Note, however, that undelimited names  consisting  of
8742       the  letter  R followed by digits are ambiguous (see the following sec-
8743       tion). Rewriting the above example to use a named group gives this:
8744
8745         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
8746
8747       If the name used in a condition of this kind is a duplicate,  the  test
8748       is  applied  to  all groups of the same name, and is true if any one of
8749       them has matched.
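
       A named condition can be tried with Python's re module as a stand-in
       for PCRE2. Two spelling differences are assumptions of this sketch:
       Python writes the group as (?P<OPEN>...) and accepts only the undelim-
       ited condition (?(OPEN)...), not the (?(<OPEN>)...) form.

```python
import re

pattern = re.compile(r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE)

# The closing parenthesis is required only when the OPEN group took part.
for subject in ("abc", "(abc)", "(abc", "abc)"):
    m = pattern.fullmatch(subject)
    print(repr(subject), m is not None)
```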
8750
8751   Checking for pattern recursion
8752
8753       "Recursion" in this sense refers to any subroutine-like call  from  one
8754       part  of  the  pattern to another, whether or not it is actually recur-
8755       sive. See the sections entitled "Recursive  patterns"  and  "Groups  as
8756       subroutines" below for details of recursion and subroutine calls.
8757
8758       If  a  condition  is the string (R), and there is no capture group with
8759       the name R, the condition is true if matching is currently in a  recur-
8760       sion  or  subroutine call to the whole pattern or any capture group. If
8761       digits follow the letter R, and there is no group with that  name,  the
8762       condition  is  true  if  the  most recent call is into a group with the
8763       given number, which must exist somewhere in the overall  pattern.  This
8764       is a contrived example that is equivalent to a+b:
8765
8766         ((?(R1)a+|(?1)b))
8767
8768       However,  in  both  cases,  if there is a capture group with a matching
8769       name, the condition tests for its being set, as described in  the  sec-
8770       tion  above,  instead of testing for recursion. For example, creating a
8771       group with the name R1 by adding (?<R1>)  to  the  above  pattern  com-
8772       pletely changes its meaning.
8773
8774       If a name preceded by ampersand follows the letter R, for example:
8775
8776         (?(R&name)...)
8777
8778       the  condition  is true if the most recent recursion is into a group of
8779       that name (which must exist within the pattern).
8780
8781       This condition does not check the entire recursion stack. It tests only
8782       the  current  level.  If the name used in a condition of this kind is a
8783       duplicate, the test is applied to all groups of the same name,  and  is
8784       true if any one of them is the most recent recursion.
8785
8786       At "top level", all these recursion test conditions are false.
8787
8788   Defining capture groups for use by reference only
8789
8790       If the condition is the string (DEFINE), the condition is always false,
8791       even if there is a group with the name DEFINE. In this case, there  may
8792       be only one alternative in the rest of the conditional group. It is al-
8793       ways skipped if control reaches this point in the pattern; the idea  of
8794       DEFINE  is that it can be used to define subroutines that can be refer-
8795       enced from elsewhere. (The use of subroutines is described below.)  For
8796       example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
8797       could be written like this (ignore white space and line breaks):
8798
8799         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
8800         \b (?&byte) (\.(?&byte)){3} \b
8801
8802       The first part of the pattern is a DEFINE group  inside  which  another
8803       group  named "byte" is defined. This matches an individual component of
8804       an IPv4 address (a number less than 256). When  matching  takes  place,
8805       this  part  of  the pattern is skipped because DEFINE acts like a false
8806       condition. The rest of the pattern uses references to the  named  group
8807       to  match the four dot-separated components of an IPv4 address, insist-
8808       ing on a word boundary at each end.
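
       As a rough illustration only, the reuse that DEFINE provides can be
       imitated in Python's standard re module, which has no (?(DEFINE)...)
       or (?&name) syntax. In this sketch the "byte" subpattern is written
       once and substituted into the main pattern as text. The names BYTE,
       IPV4 and is_ipv4 are invented for the example.

```python
import re

# Hypothetical names; this is Python's re module, not PCRE2.
# One IPv4 component (0-255), written once and reused, much as the
# DEFINE group defines "byte" once and (?&byte) calls it.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# \b BYTE (\.BYTE){3} \b, with re.ASCII so that \d means [0-9] only.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def is_ipv4(text: str) -> bool:
    """Return True if the whole of text is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None
```

       Unlike the PCRE2 pattern, which can find an address inside a longer
       subject, is_ipv4() checks a complete string.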
8809
8810   Checking the PCRE2 version
8811
8812       Programs that link with a PCRE2 library can check the version by  call-
8813       ing  pcre2_config()  with  appropriate arguments. Users of applications
8814       that do not have access to the underlying code cannot do this.  A  spe-
8815       cial  "condition" called VERSION exists to allow such users to discover
8816       which version of PCRE2 they are dealing with by using this condition to
8817       match  a string such as "yesno". VERSION must be followed either by "="
8818       or ">=" and a version number.  For example:
8819
8820         (?(VERSION>=10.4)yes|no)
8821
8822       This pattern matches "yes" if the PCRE2 version is greater than or equal
8823       to 10.4, or "no" otherwise. The fractional part of the version number
8824       may not contain more than two digits.
8825
8826   Assertion conditions
8827
8828       If the condition is not in any of the  above  formats,  it  must  be  a
8829       parenthesized  assertion.  This may be a positive or negative lookahead
8830       or lookbehind assertion. However, it must be a traditional  atomic  as-
8831       sertion, not one of the PCRE2-specific non-atomic assertions.
8832
8833       Consider  this  pattern,  again containing non-significant white space,
8834       and with the two alternatives on the second line:
8835
8836         (?(?=[^a-z]*[a-z])
8837         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
8838
8839       The condition is a positive lookahead assertion  that  matches  an  op-
8840       tional sequence of non-letters followed by a letter. In other words, it
8841       tests for the presence of at least one letter in the subject. If a let-
8842       ter  is  found,  the  subject is matched against the first alternative;
8843       otherwise it is  matched  against  the  second.  This  pattern  matches
8844       strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
8845       letters and dd are digits.
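
       Python's standard re module does not accept an assertion as a
       condition, so the sketch below spells out the two steps by hand:
       test the lookahead, then try exactly one alternative. It is an
       emulation under that limitation, not PCRE2 itself, and the helper
       name match_date is hypothetical.

```python
import re

# Hypothetical emulation in Python's re module, not PCRE2.
# The condition (?=[^a-z]*[a-z]): is there a lower-case letter ahead?
HAS_LETTER = re.compile(r"[^a-z]*[a-z]")
WITH_LETTERS = re.compile(r"\d{2}-[a-z]{3}-\d{2}", re.ASCII)   # dd-aaa-dd
DIGITS_ONLY = re.compile(r"\d{2}-\d{2}-\d{2}", re.ASCII)       # dd-dd-dd


def match_date(subject: str, pos: int = 0):
    """Emulate the conditional group at position pos.

    Returns the matched text, or None if the chosen alternative fails.
    """
    # Test the condition first; like a lookahead, it consumes nothing.
    if HAS_LETTER.match(subject, pos):
        chosen = WITH_LETTERS
    else:
        chosen = DIGITS_ONLY
    m = chosen.match(subject, pos)
    return m.group(0) if m else None
```

       When the condition is true and the first alternative fails, the
       second alternative is not tried. That is why a subject such as
       "12-34-56 x" gives no match at its first character: the letter later
       in the subject selects the dd-aaa-dd form.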
8846
8847       When an assertion that is a condition contains capture groups, any cap-
8848       turing  that  occurs  in  a matching branch is retained afterwards, for
8849       both positive and negative assertions, because matching always  contin-
8850       ues  after  the  assertion, whether it succeeds or fails. (Compare non-
8851       conditional assertions, for which captures are retained only for  posi-
8852       tive assertions that succeed.)
8853
8854
8855COMMENTS
8856
8857       There are two ways of including comments in patterns that are processed
8858       by PCRE2. In both cases, the start of the comment  must  not  be  in  a
8859       character  class,  nor  in  the middle of any other sequence of related
8860       characters such as (?: or a group name or number. The  characters  that
8861       make up a comment play no part in the pattern matching.
8862
8863       The  sequence (?# marks the start of a comment that continues up to the
8864       next closing parenthesis. Nested parentheses are not permitted. If  the
8865       PCRE2_EXTENDED  or  PCRE2_EXTENDED_MORE  option  is set, an unescaped #
8866       character also introduces a comment, which in this  case  continues  to
8867       immediately  after  the next newline character or character sequence in
8868       the pattern. Which characters are interpreted as newlines is controlled
8869       by  an option passed to the compiling function or by a special sequence
8870       at the start of the pattern, as described in the section entitled "New-
8871       line conventions" above. Note that the end of this type of comment is a
8872       literal newline sequence in the pattern; escape sequences  that  happen
8873       to represent a newline do not count. For example, consider this pattern
8874       when PCRE2_EXTENDED is set, and the default newline convention (a  sin-
8875       gle linefeed character) is in force:
8876
8877         abc #comment \n still comment
8878
8879       On  encountering  the # character, pcre2_compile() skips along, looking
8880       for a newline in the pattern. The sequence \n is still literal at  this
8881       stage,  so  it does not terminate the comment. Only an actual character
8882       with the code value 0x0a (the default newline) does so.
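
       The same distinction can be observed in Python's standard re module,
       whose re.VERBOSE mode treats # comments in a comparable way. The
       sketch below uses that module, not PCRE2, purely to show the
       difference between the two-character escape \n and a real linefeed
       in the pattern text. The variable names are invented.

```python
import re

# Python's re module in verbose mode, used here as a stand-in for
# PCRE2_EXTENDED; the behaviour shown is Python's, assumed analogous.

# Backslash followed by "n": two ordinary characters inside the comment,
# so the comment runs on to the end of the pattern.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# A real linefeed (0x0a) in the pattern text: this ends the comment, and
# "still" becomes part of the pattern.
literal = re.compile("abc #comment \n still", re.VERBOSE)
```

       Here escaped matches only "abc", whereas literal matches "abcstill".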
8883
8884
8885RECURSIVE PATTERNS
8886
8887       Consider the problem of matching a string in parentheses, allowing  for
8888       unlimited  nested  parentheses.  Without the use of recursion, the best
8889       that can be done is to use a pattern that  matches  up  to  some  fixed
8890       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
8891       depth.
8892
8893       For some time, Perl has provided a facility that allows regular expres-
8894       sions  to recurse (amongst other things). It does this by interpolating
8895       Perl code in the expression at run time, and the code can refer to  the
8896       expression itself. A Perl pattern using code interpolation to solve the
8897       parentheses problem can be created like this:
8898
8899         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
8900
8901       The (?p{...}) item interpolates Perl code at run time, and in this case
8902       refers recursively to the pattern in which it appears.
8903
8904       Obviously,  PCRE2  cannot  support  the interpolation of Perl code. In-
8905       stead, it supports special syntax for recursion of the entire  pattern,
8906       and also for individual capture group recursion. After its introduction
8907       in PCRE1 and Python, this kind of recursion was subsequently introduced
8908       into Perl at release 5.10.
8909
8910       A  special  item  that consists of (? followed by a number greater than
8911       zero and a closing parenthesis is a recursive subroutine  call  of  the
8912       capture  group of the given number, provided that it occurs inside that
8913       group. (If not, it is a non-recursive subroutine  call,  which  is  de-
8914       scribed in the next section.) The special item (?R) or (?0) is a recur-
8915       sive call of the entire regular expression.
8916
8917       This PCRE2 pattern solves the nested parentheses  problem  (assume  the
8918       PCRE2_EXTENDED option is set so that white space is ignored):
8919
8920         \( ( [^()]++ | (?R) )* \)
8921
8922       First  it matches an opening parenthesis. Then it matches any number of
8923       substrings which can either be a sequence of non-parentheses, or a  re-
8924       cursive match of the pattern itself (that is, a correctly parenthesized
8925       substring).  Finally there is a closing parenthesis. Note the use of  a
8926       possessive  quantifier  to  avoid  backtracking  into sequences of non-
8927       parentheses.
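
       To make the control flow concrete, here is a minimal Python sketch
       that mirrors the pattern as a recursive function. It is a hand-
       written emulation, not how PCRE2 is implemented, and the name
       match_parens is hypothetical.

```python
def match_parens(subject: str, pos: int = 0):
    r"""Emulate  \( ( [^()]++ | (?R) )* \)  anchored at pos.

    Hypothetical helper, not part of PCRE2. Returns the index just past
    the match, or None if there is no match at pos.
    """
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            # (?R): recursive match of the whole pattern.
            pos = match_parens(subject, pos)
            if pos is None:
                return None
        else:
            # [^()]++ : take the whole run of non-parentheses and never
            # give any of it back (possessive).
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    if pos < len(subject):        # subject[pos] is the closing ")"
        return pos + 1
    return None                   # input ended before the closing ")"
```

       For example, match_parens("(ab(cd)ef)") consumes the whole string,
       while match_parens("(a(b)") reports no match because the outer
       parenthesis is never closed.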
8928
8929       If this were part of a larger pattern, you would not  want  to  recurse
8930       the entire pattern, so instead you could use this:
8931
8932         ( \( ( [^()]++ | (?1) )* \) )
8933
8934       We  have  put the pattern into parentheses, and caused the recursion to
8935       refer to them instead of the whole pattern.
8936
8937       In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
8938       tricky.  This is made easier by the use of relative references. Instead
8939       of (?1) in the pattern above you can write (?-2) to refer to the second
8940       most  recently  opened  parentheses  preceding  the recursion. In other
8941       words, a negative number counts capturing  parentheses  leftwards  from
8942       the point at which it is encountered.
8943
8944       Be  aware  however, that if duplicate capture group numbers are in use,
8945       relative references refer to the earliest group  with  the  appropriate
8946       number. Consider, for example:
8947
8948         (?|(a)|(b)) (c) (?-2)
8949
8950       The first two capture groups (a) and (b) are both numbered 1, and group
8951       (c) is number 2. When the reference (?-2) is  encountered,  the  second
8952       most recently opened group has the number 1, but it is the first
8953       such group (the (a) group) to which the recursion refers. This would be
8954       the  same if an absolute reference (?1) was used. In other words, rela-
8955       tive references are just a shorthand for computing a group number.
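
       The arithmetic behind a relative reference can be sketched as below.
       This assumes that the reference is resolved from the highest group
       number opened to its left, which is consistent with the examples
       above but is an assumption about the details. The function name
       absolute_group is hypothetical.

```python
def absolute_group(highest_so_far: int, relative: int) -> int:
    """Convert a relative group reference to an absolute group number.

    Hypothetical helper, not part of PCRE2. highest_so_far is assumed to
    be the highest capture group number opened to the left of the
    reference; relative is the signed number from (?-n) or (?+n).
    """
    if relative == 0:
        raise ValueError("a relative reference needs a non-zero number")
    if relative < 0:
        number = highest_so_far + relative + 1    # count leftwards
    else:
        number = highest_so_far + relative        # count rightwards
    if number < 1:
        raise ValueError("reference to a non-existent group")
    return number
```

       In both ( \( ( [^()]++ | (?-2) )* \) ) and (?|(a)|(b)) (c) (?-2) the
       highest group number to the left of the reference is 2, so (?-2)
       resolves to group 1.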
8956
8957       It is also possible to refer to subsequent capture groups,  by  writing
8958       references  such  as  (?+2). However, these cannot be recursive because
8959       the reference is not inside the parentheses that are  referenced.  They
8960       are  always  non-recursive  subroutine  calls, as described in the next
8961       section.
8962
8963       An alternative approach is to use named parentheses.  The  Perl  syntax
8964       for  this  is  (?&name);  PCRE1's earlier syntax (?P>name) is also sup-
8965       ported. We could rewrite the above example as follows:
8966
8967         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
8968
8969       If there is more than one group with the same name, the earliest one is
8970       used.
8971
8972       The example pattern that we have been looking at contains nested unlim-
8973       ited repeats, and so the use of a possessive  quantifier  for  matching
8974       strings  of  non-parentheses  is important when applying the pattern to
8975       strings that do not match. For example, when this pattern is applied to
8976
8977         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
8978
8979       it yields "no match" quickly. However, if a  possessive  quantifier  is
8980       not  used, the match runs for a very long time indeed because there are
8981       so many different ways the + and * repeats can carve  up  the  subject,
8982       and all have to be tested before failure can be reported.
8983
8984       At  the  end  of a match, the values of capturing parentheses are those
8985       from the outermost level. If you want to obtain intermediate values,  a
8986       callout function can be used (see below and the pcre2callout documenta-
8987       tion). If the pattern above is matched against
8988
8989         (ab(cd)ef)
8990
8991       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
8992       which  is  the last value taken on at the top level. If a capture group
8993       is not matched at the top level, its final  captured  value  is  unset,
8994       even  if it was (temporarily) set at a deeper level during the matching
8995       process.
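
       The following Python sketch emulates the numbered version of the
       pattern, ( \( ( [^()]++ | (?1) )* \) ), and tracks group 2 to show
       why only top-level values survive. It is an illustration written by
       hand, and the name match_nested is hypothetical.

```python
def match_nested(subject: str, pos: int = 0):
    r"""Emulate  ( \( ( [^()]++ | (?1) )* \) )  anchored at pos.

    Hypothetical helper, not part of PCRE2. Returns (end, group2), where
    group2 is the text matched by the inner group on its last iteration
    at this level (None if it never matched), or None if no match.
    """
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    group2 = None
    while pos < len(subject) and subject[pos] != ")":
        start = pos
        if subject[pos] == "(":
            # (?1): the value captured inside the call is discarded
            # when the call returns; only its end position is kept.
            inner = match_nested(subject, pos)
            if inner is None:
                return None
            pos = inner[0]
        else:
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
        group2 = subject[start:pos]    # each iteration overwrites the last
    if pos >= len(subject):
        return None
    return pos + 1, group2
```

       With "(ab(cd)ef)" the inner group takes the values "ab", "(cd)" and
       "ef" in turn at the top level, so "ef" is what remains. The value
       "cd" exists only inside the recursive call.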
8996
8997       Do not confuse the (?R) item with the condition (R),  which  tests  for
8998       recursion.   Consider  this pattern, which matches text in angle brack-
8999       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
9000       brackets  (that is, when recursing), whereas any characters are permit-
9001       ted at the outer level.
9002
9003         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
9004
9005       In this pattern, (?(R) is the start of a conditional  group,  with  two
9006       different  alternatives  for the recursive and non-recursive cases. The
9007       (?R) item is the actual recursive call.
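
       A hand-written Python emulation may help to separate the two roles.
       In this sketch the flag recursing plays the part of the (R)
       condition, and the recursive function call plays the part of (?R).
       The names match_angle and DIGITS are hypothetical.

```python
DIGITS = "0123456789"


def match_angle(subject: str, pos: int = 0, recursing: bool = False):
    r"""Emulate  < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >  at pos.

    Hypothetical helper, not part of PCRE2. Returns the index just past
    the match, or None if there is no match at pos.
    """
    if pos >= len(subject) or subject[pos] != "<":
        return None
    pos += 1
    while pos < len(subject) and subject[pos] != ">":
        if subject[pos] == "<":
            # (?R): the actual recursive call.
            pos = match_angle(subject, pos, recursing=True)
            if pos is None:
                return None
        elif recursing:
            # The (R) condition is true: only digits are allowed (\d++).
            if subject[pos] not in DIGITS:
                return None
            while pos < len(subject) and subject[pos] in DIGITS:
                pos += 1
        else:
            # The (R) condition is false: any non-brackets ([^<>]*+).
            while pos < len(subject) and subject[pos] not in "<>":
                pos += 1
    return pos + 1 if pos < len(subject) else None
```

       The subject "<abc<123>def>" is accepted, but "<abc<12a>def>" is
       rejected at its first character because the letter "a" appears
       inside a nested pair of brackets.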
9008
9009   Differences in recursion processing between PCRE2 and Perl
9010
9011       Some former differences between PCRE2 and Perl no longer exist.
9012
9013       Before release 10.30, recursion processing in PCRE2 differed from  Perl
9014       in  that  a  recursive  subroutine call was always treated as an atomic
9015       group. That is, once it had matched some of the subject string, it  was
9016       never  re-entered,  even if it contained untried alternatives and there
9017       was a subsequent matching failure. (Historical note:  PCRE  implemented
9018       recursion before Perl did.)
9019
9020       Starting  with  release 10.30, recursive subroutine calls are no longer
9021       treated as atomic. That is, they can be re-entered to try unused alter-
9022       natives  if  there  is a matching failure later in the pattern. This is
9023       now compatible with the way Perl works. If you want a  subroutine  call
9024       to be atomic, you must explicitly enclose it in an atomic group.
9025
9026       Supporting backtracking into recursions simplifies certain types of re-
9027       cursive pattern. For example, this pattern matches palindromic strings:
9028
9029         ^((.)(?1)\2|.?)$
9030
9031       The second branch in the group matches a single  central  character  in
9032       the  palindrome  when there are an odd number of characters, or nothing
9033       when there are an even number of characters, but in order  to  work  it
9034       has  to  be  able  to  try the second case when the rest of the pattern
9035       match fails. If you want to match typical palindromic phrases, the pat-
9036       tern  has  to  ignore  all  non-word characters, which can be done like
9037       this:
9038
9039         ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
9040
9041       If run with the PCRE2_CASELESS option,  this  pattern  matches  phrases
9042       such  as "A man, a plan, a canal: Panama!". Note the use of the posses-
9043       sive quantifier *+ to avoid backtracking  into  sequences  of  non-word
9044       characters. Without this, PCRE2 takes a great deal longer (ten times or
9045       more) to match typical phrases, and Perl takes so long that  you  think
9046       it has gone into a loop.
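
       Backtracking into a recursion can be imitated in Python with a
       generator: each recursive call offers its possible end points one at
       a time, and the caller asks for the next one when the rest of the
       match fails. This sketch covers only the simple pattern
       ^((.)(?1)\2|.?)$ for subjects without newlines, and the function
       names are hypothetical.

```python
def group1_ends(subject: str, pos: int):
    r"""Yield every position where group 1 of ^((.)(?1)\2|.?)$ can end.

    Hypothetical helper, not part of PCRE2. Positions are yielded in the
    order a backtracking matcher would try them, so the caller
    "backtracks" simply by asking for the next one.
    """
    # First branch: (.)(?1)\2
    if pos < len(subject):
        first = subject[pos]                              # (.) is \2
        for inner_end in group1_ends(subject, pos + 1):   # (?1)
            if subject.startswith(first, inner_end):      # \2
                yield inner_end + 1
    # Second branch: .?  (greedy: one character first, then none)
    if pos < len(subject):
        yield pos + 1
    yield pos


def is_palindrome(subject: str) -> bool:
    r"""Return True if ^((.)(?1)\2|.?)$ would match subject."""
    return any(end == len(subject) for end in group1_ends(subject, 0))
```

       If each call could offer only its first end point, which is how an
       atomic recursion behaves, a subject such as "abba" would fail: the
       innermost calls first consume a character with .? and only succeed
       after being asked to try again.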
9047
9048       Another  way  in which PCRE2 and Perl used to differ in their recursion
9049       processing is in the handling of captured  values.  Formerly  in  Perl,
9050       when  a  group  was called recursively or as a subroutine (see the next
9051       section), it had no access to any values that were captured outside the
9052       recursion,  whereas  in  PCRE2 these values can be referenced. Consider
9053       this pattern:
9054
9055         ^(.)(\1|a(?2))
9056
9057       This pattern matches "bab". The first capturing parentheses match  "b",
9058       then in the second group, when the backreference \1 fails to match "b",
9059       the second alternative matches "a" and then recurses. In the recursion,
9060       \1 does now match "b" and so the whole match succeeds. This match used
9061       to fail in Perl, but it works in later versions (tested with 5.024).
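
       In the sketch below, which is a hand-written Python emulation with
       hypothetical function names, the value captured by the first group
       is passed down to every level of the recursion. This corresponds to
       the PCRE2 behaviour in which \1 remains visible inside (?2).

```python
def group2_end(subject: str, pos: int, group1: str):
    r"""Emulate group 2 of ^(.)(\1|a(?2)) at pos; return its end or None.

    Hypothetical helper, not part of PCRE2. group1 is the text captured
    by (.) outside the recursion; it stays visible at every depth.
    """
    if subject.startswith(group1, pos):         # \1
        return pos + len(group1)
    if subject.startswith("a", pos):            # a(?2)
        return group2_end(subject, pos + 1, group1)
    return None


def match_end(subject: str):
    r"""Return the end of the match of ^(.)(\1|a(?2)), or None."""
    if not subject or subject[0] == "\n":       # (.) needs a non-newline
        return None
    return group2_end(subject, 1, subject[0])
```

       For "bab", the first call finds "a" where \1 requires "b", takes the
       second alternative, and the recursive call then matches "b" using
       the outer capture.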
9062
9063
9064GROUPS AS SUBROUTINES
9065
9066       If the syntax for a recursive group call (either by number or by  name)
9067       is  used  outside the parentheses to which it refers, it operates a bit
9068       like a subroutine in a programming  language.  More  accurately,  PCRE2
9069       treats the referenced group as an independent subpattern which it tries
9070       to match at the current matching position. The called group may be  de-
9071       fined  before or after the reference. A numbered reference can be abso-
9072       lute or relative, as in these examples:
9073
9074         (...(absolute)...)...(?2)...
9075         (...(relative)...)...(?-1)...
9076         (...(?+1)...(relative)...
9077
9078       An earlier example pointed out that the pattern
9079
9080         (sens|respons)e and \1ibility
9081
9082       matches "sense and sensibility" and "response and responsibility",  but
9083       not "sense and responsibility". If instead the pattern
9084
9085         (sens|respons)e and (?1)ibility
9086
9087       is  used, it does match "sense and responsibility" as well as the other
9088       two strings. Another example is  given  in  the  discussion  of  DEFINE
9089       above.
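
       The difference between repeating matched text and re-running a
       group's pattern can be shown with Python's standard re module. That
       module supports backreferences but has no subroutine calls, so in
       this sketch the effect of (?1) is imitated by writing the group's
       pattern out a second time. The variable names are invented.

```python
import re

# Python's re module, not PCRE2; hypothetical names.

# \1 must repeat the text that group 1 actually matched.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# (?1) re-runs the group's pattern. Python's re has no equivalent, so
# the alternatives are simply repeated in a non-capturing group.
SUBROUTINE = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")
```

       BACKREF rejects "sense and responsibility" because the second part
       must be the same text as the first. SUBROUTINE accepts it because
       either alternative may match again.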
9090
9091       Like  recursions,  subroutine  calls  used to be treated as atomic, but
9092       this changed at PCRE2 release 10.30, so  backtracking  into  subroutine
9093       calls  can  now  occur. However, any capturing parentheses that are set
9094       during the subroutine call revert to their previous values afterwards.
9095
9096       Processing options such as case-independence are fixed when a group  is
9097       defined,  so  if  it  is  used  as a subroutine, such options cannot be
9098       changed for different calls. For example, consider this pattern:
9099
9100         (abc)(?i:(?-1))
9101
9102       It matches "abcabc". It does not match "abcABC" because the  change  of
9103       processing option does not affect the called group.
9104
9105       The  behaviour  of  backtracking control verbs in groups when called as
9106       subroutines is described in the section entitled "Backtracking verbs in
9107       subroutines" below.
9108
9109
9110ONIGURUMA SUBROUTINE SYNTAX
9111
9112       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
9113       name or a number enclosed either in angle brackets or single quotes, is
9114       an alternative syntax for calling a group as a subroutine, possibly re-
9115       cursively. Here are two of the examples  used  above,  rewritten  using
9116       this syntax:
9117
9118         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
9119         (sens|respons)e and \g'1'ibility
9120
9121       PCRE2  supports an extension to Oniguruma: if a number is preceded by a
9122       plus or a minus sign it is taken as a relative reference. For example:
9123
9124         (abc)(?i:\g<-1>)
9125
9126       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
9127       synonymous.  The  former is a backreference; the latter is a subroutine
9128       call.
9129
9130
9131CALLOUTS
9132
9133       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
9134       Perl  code to be obeyed in the middle of matching a regular expression.
9135       This makes it possible, amongst other things, to extract different sub-
9136       strings that match the same pair of parentheses when there is a repeti-
9137       tion.
9138
9139       PCRE2 provides a similar feature, but of course it  cannot  obey  arbi-
9140       trary  Perl  code. The feature is called "callout". The caller of PCRE2
9141       provides an external function by putting its entry  point  in  a  match
9142       context  using  the function pcre2_set_callout(), and then passing that
9143       context to pcre2_match() or pcre2_dfa_match(). If no match  context  is
9144       passed, or if the callout entry point is set to NULL, callouts are dis-
9145       abled.
9146
9147       Within a regular expression, (?C<arg>) indicates a point at  which  the
9148       external  function  is  to  be  called. There are two kinds of callout:
9149       those with a numerical argument and those with a string argument.  (?C)
9150       on  its  own with no argument is treated as (?C0). A numerical argument
9151       allows the  application  to  distinguish  between  different  callouts.
9152       String  arguments  were added for release 10.20 to make it possible for
9153       script languages that use PCRE2 to embed short scripts within  patterns
9154       in a similar way to Perl.
9155
9156       During matching, when PCRE2 reaches a callout point, the external func-
9157       tion is called. It is provided with the number or  string  argument  of
9158       the  callout, the position in the pattern, and one item of data that is
9159       also set in the match block. The callout function may cause matching to
9160       proceed, to backtrack, or to fail.
9161
9162       By  default,  PCRE2  implements  a  number of optimizations at matching
9163       time, and one side-effect is that sometimes callouts  are  skipped.  If
9164       you  need all possible callouts to happen, you need to set options that
9165       disable the relevant optimizations. More details, including a  complete
9166       description  of  the programming interface to the callout function, are
9167       given in the pcre2callout documentation.
9168
9169   Callouts with numerical arguments
9170
9171       If you just want to have  a  means  of  identifying  different  callout
9172       points,  put  a  number  less than 256 after the letter C. For example,
9173       this pattern has two callout points:
9174
9175         (?C1)abc(?C2)def
9176
9177       If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(),  numerical
9178       callouts  are  automatically installed before each item in the pattern.
9179       They are all numbered 255. If there is a conditional group in the  pat-
9180       tern whose condition is an assertion, an additional callout is inserted
9181       just before the condition. An explicit callout may also be set at  this
9182       position, as in this example:
9183
9184         (?(?C9)(?=a)abc|def)
9185
9186       Note that this applies only to assertion conditions, not to other types
9187       of condition.
9188
9189   Callouts with string arguments
9190
9191       A delimited string may be used instead of a number as a  callout  argu-
9192       ment.  The  starting  delimiter  must be one of ` ' " ^ % # $ { and the
9193       ending delimiter is the same as the start, except for {, where the end-
9194       ing  delimiter  is  }.  If  the  ending  delimiter is needed within the
9195       string, it must be doubled. For example:
9196
9197         (?C'ab ''c'' d')xyz(?C{any text})pqr
9198
9199       The doubling is removed before the string  is  passed  to  the  callout
9200       function.
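
       The delimiter rules can be sketched as a small Python function. It
       is an illustration of the rules stated above, not PCRE2's own
       parser, and the names DELIMITERS and read_callout_string are
       hypothetical.

```python
# Hypothetical helper, not part of PCRE2.
# Each permitted starting delimiter and the ending delimiter it needs.
DELIMITERS = {c: c for c in "`'\"^%#$"}
DELIMITERS["{"] = "}"


def read_callout_string(text: str, pos: int = 0):
    """Read one delimited callout string starting at text[pos].

    Returns (value, next_pos), with doubled ending delimiters reduced to
    single ones in value.
    """
    if pos >= len(text) or text[pos] not in DELIMITERS:
        raise ValueError("missing or unsupported starting delimiter")
    closer = DELIMITERS[text[pos]]
    pos += 1
    value = []
    while pos < len(text):
        if text[pos] == closer:
            if text.startswith(closer, pos + 1):    # doubled: keep one
                value.append(closer)
                pos += 2
                continue
            return "".join(value), pos + 1          # lone closer: done
        value.append(text[pos])
        pos += 1
    raise ValueError("unterminated callout string")
```

       Applied to the two arguments in the example above, this yields the
       strings "ab 'c' d" and "any text".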
9201
9202
9203BACKTRACKING CONTROL
9204
9205       There  are  a  number  of  special "Backtracking Control Verbs" (to use
9206       Perl's terminology) that modify the behaviour  of  backtracking  during
9207       matching.  They are generally of the form (*VERB) or (*VERB:NAME). Some
9208       verbs take either form, and may behave differently depending on whether
9209       or  not  a  name  argument is present. The names are not required to be
9210       unique within the pattern.
9211
9212       By default, for compatibility with Perl, a  name  is  any  sequence  of
9213       characters that does not include a closing parenthesis. The name is not
9214       processed in any way, and it is  not  possible  to  include  a  closing
9215       parenthesis   in  the  name.   This  can  be  changed  by  setting  the
9216       PCRE2_ALT_VERBNAMES option, but the result is no  longer  Perl-compati-
9217       ble.
9218
9219       When  PCRE2_ALT_VERBNAMES  is  set,  backslash processing is applied to
9220       verb names and only an unescaped  closing  parenthesis  terminates  the
9221       name.  However, the only backslash items that are permitted are \Q, \E,
9222       and sequences such as \x{100} that define character code points.  Char-
9223       acter type escapes such as \d are faulted.
9224
9225       A closing parenthesis can be included in a name either as \) or between
9226       \Q and \E. In addition to backslash processing, if  the  PCRE2_EXTENDED
9227       or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
9228       names is skipped, and #-comments are recognized, exactly as in the rest
9229       of  the  pattern.  PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
9230       verb names unless PCRE2_ALT_VERBNAMES is also set.
9231
9232       The maximum length of a name is 255 in the 8-bit library and  65535  in
9233       the  16-bit and 32-bit libraries. If the name is empty, that is, if the
9234       closing parenthesis immediately follows the colon, the effect is as  if
9235       the colon were not there. Any number of these verbs may occur in a pat-
9236       tern. Except for (*ACCEPT), they may not be quantified.
9237
9238       Since these verbs are specifically related  to  backtracking,  most  of
9239       them  can be used only when the pattern is to be matched using the tra-
9240       ditional matching function, because that uses a backtracking algorithm.
9241       With  the  exception  of (*FAIL), which behaves like a failing negative
9242       assertion, the backtracking control verbs cause an error if encountered
9243       by the DFA matching function.
9244
9245       The  behaviour  of  these  verbs in repeated groups, assertions, and in
9246       capture groups called as subroutines (whether or  not  recursively)  is
9247       documented below.
9248
9249   Optimizations that affect backtracking verbs
9250
9251       PCRE2 contains some optimizations that are used to speed up matching by
9252       running some checks at the start of each match attempt. For example, it
9253       may  know  the minimum length of matching subject, or that a particular
9254       character must be present. When one of these optimizations bypasses the
9255       running  of  a  match,  any  included  backtracking  verbs will not, of
9256       course, be processed. You can suppress the start-of-match optimizations
9257       by  setting  the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
9258       pile(), or by starting the pattern with (*NO_START_OPT). There is  more
9259       discussion of this option in the section entitled "Compiling a pattern"
9260       in the pcre2api documentation.
9261
9262       Experiments with Perl suggest that it too  has  similar  optimizations,
9263       and like PCRE2, turning them off can change the result of a match.
9264
9265   Verbs that act immediately
9266
9267       The following verbs act as soon as they are encountered.
9268
9269          (*ACCEPT) or (*ACCEPT:NAME)
9270
9271       This  verb causes the match to end successfully, skipping the remainder
9272       of the pattern. However, when it is inside  a  capture  group  that  is
9273       called as a subroutine, only that group is ended successfully. Matching
9274       then continues at the outer level. If (*ACCEPT) is triggered in a
9275       positive assertion, the assertion succeeds; in a negative assertion,
9276       the assertion fails.
9277
9278       If (*ACCEPT) is inside capturing parentheses, the data so far  is  cap-
9279       tured. For example:
9280
9281         A((?:A|B(*ACCEPT)|C)D)
9282
9283       This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
9284       tured by the outer parentheses.
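
       For this one pattern, the effect of (*ACCEPT) can be reproduced
       without the verb by moving the B branch out so that it is not
       followed by D. The sketch below does this in Python's standard re
       module, which has no backtracking verbs. It is an equivalent
       rewrite of this example only, and the name ACCEPT_LIKE is invented.

```python
import re

# Python's re module, not PCRE2; a rewrite of this one example only.
# A((?:A|B(*ACCEPT)|C)D) ends the match right after B, so the B branch
# is written as an alternative that is not followed by D.
ACCEPT_LIKE = re.compile(r"A(B|(?:A|C)D)")
```

       ACCEPT_LIKE matches "AB", "AAD" and "ACD", capturing "B", "AD" and
       "CD" respectively in group 1.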
9285
9286       (*ACCEPT) is the only backtracking verb that is allowed to  be  quanti-
9287       fied  because  an  ungreedy  quantification with a minimum of zero acts
9288       only when a backtrack happens. Consider, for example,
9289
9290         (A(*ACCEPT)??B)C
9291
9292       where A, B, and C may be complex expressions. After matching  "A",  the
9293       matcher  processes  "BC"; if that fails, causing a backtrack, (*ACCEPT)
9294       is triggered and the match succeeds. In both cases, all but C  is  cap-
9295       tured.  Whereas  (*COMMIT) (see below) means "fail on backtrack", a re-
9296       peated (*ACCEPT) of this type means "succeed on backtrack".
9297
9298       Warning: (*ACCEPT) should not be used within a script  run  group,  be-
9299       cause  it causes an immediate exit from the group, bypassing the script
9300       run checking.
9301
9302         (*FAIL) or (*FAIL:NAME)
9303
9304       This verb causes a matching failure, forcing backtracking to occur.  It
9305       may  be  abbreviated  to  (*F).  It is equivalent to (?!) but easier to
9306       read. The Perl documentation notes that it is probably useful only when
9307       combined with (?{}) or (??{}). Those are, of course, Perl features that
9308       are not present in PCRE2. The nearest equivalent is  the  callout  fea-
9309       ture, as for example in this pattern:
9310
9311         a+(?C)(*FAIL)
9312
9313       A  match  with the string "aaaa" always fails, but the callout is taken
9314       before each backtrack happens (in this example, 10 times).
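
       The figure of 10 can be reproduced by counting, as in this Python
       sketch. It assumes that a match is attempted at every starting
       position and that no optimization skips a callout. The function
       name count_callouts is hypothetical.

```python
def count_callouts(subject: str) -> int:
    """Count how often (?C) is reached for a+(?C)(*FAIL) on subject.

    Hypothetical helper, not part of PCRE2. Assumes every starting
    position is tried and nothing is optimized away.
    """
    calls = 0
    for start in range(len(subject)):
        # a+ first takes the longest run of "a" available at start ...
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # ... then (*FAIL) forces it to give characters back one at a
        # time; the callout is reached once for each length tried.
        calls += run
    return calls
```

       For "aaaa" the runs at the four starting positions have lengths 4,
       3, 2 and 1, giving 4 + 3 + 2 + 1 = 10 callouts.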
9315
9316       (*ACCEPT:NAME) and (*FAIL:NAME) behave the  same  as  (*MARK:NAME)(*AC-
9317       CEPT)  and  (*MARK:NAME)(*FAIL),  respectively,  that  is, a (*MARK) is
9318       recorded just before the verb acts.
9319
9320   Recording which path was taken
9321
9322       There is one verb whose main purpose is to track how a  match  was  ar-
9323       rived  at,  though  it also has a secondary use in conjunction with ad-
9324       vancing the match starting point (see (*SKIP) below).
9325
9326         (*MARK:NAME) or (*:NAME)
9327
9328       A name is always required with this verb. For all the other  backtrack-
9329       ing control verbs, a NAME argument is optional.
9330
9331       When a match succeeds, the last-encountered mark name on
9332       the matching path is passed back to the caller as described in the sec-
9333       tion entitled "Other information about the match" in the pcre2api docu-
9334       mentation. This applies to all instances of (*MARK)  and  other  verbs,
9335       including those inside assertions and atomic groups. However, there are
9336       differences in those cases when (*MARK) is  used  in  conjunction  with
9337       (*SKIP) as described below.
9338
9339       The  mark name that was last encountered on the matching path is passed
9340       back. A verb without a NAME argument is ignored for this purpose.  Here
9341       is  an  example of pcre2test output, where the "mark" modifier requests
9342       the retrieval and outputting of (*MARK) data:
9343
9344           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
9345         data> XY
9346          0: XY
9347         MK: A
9348         XZ
9349          0: XZ
9350         MK: B
9351
9352       The (*MARK) name is tagged with "MK:" in this output, and in this exam-
9353       ple  it indicates which of the two alternatives matched. This is a more
9354       efficient way of obtaining this information than putting each  alterna-
9355       tive in its own capturing parentheses.
9356
9357       If  a  verb  with a name is encountered in a positive assertion that is
9358       true, the name is recorded and passed back if it  is  the  last-encoun-
9359       tered. This does not happen for negative assertions or failing positive
9360       assertions.
9361
9362       After a partial match or a failed match, the last encountered  name  in
9363       the entire match process is returned. For example:
9364
9365           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
9366         data> XP
9367         No match, mark = B
9368
9369       Note  that  in  this  unanchored  example the mark is retained from the
9370       match attempt that started at the letter "X" in the subject. Subsequent
9371       match attempts starting at "P" and then with an empty string do not get
9372       as far as the (*MARK) item, but nevertheless do not reset it.
9373
9374       If you are interested in  (*MARK)  values  after  failed  matches,  you
9375       should  probably  set the PCRE2_NO_START_OPTIMIZE option (see above) to
9376       ensure that the match is always attempted.
9377
9378   Verbs that act after backtracking
9379
9380       The following verbs do nothing when they are encountered. Matching con-
9381       tinues  with  what follows, but if there is a subsequent match failure,
9382       causing a backtrack to the verb, a failure is forced.  That  is,  back-
9383       tracking  cannot  pass  to  the  left of the verb. However, when one of
9384       these verbs appears inside an atomic group or in a lookaround assertion
9385       that  is  true,  its effect is confined to that group, because once the
9386       group has been matched, there is never any backtracking into it.  Back-
9387       tracking from beyond an assertion or an atomic group ignores the entire
9388       group, and seeks a preceding backtracking point.
9389
9390       These verbs differ in exactly what kind of failure  occurs  when  back-
9391       tracking  reaches  them.  The behaviour described below is what happens
9392       when the verb is not in a subroutine or an assertion.  Subsequent  sec-
9393       tions cover these special cases.
9394
9395         (*COMMIT) or (*COMMIT:NAME)
9396
9397       This  verb  causes the whole match to fail outright if there is a later
9398       matching failure that causes backtracking to reach it. Even if the pat-
9399       tern  is  unanchored,  no further attempts to find a match by advancing
9400       the starting point take place. If (*COMMIT) is  the  only  backtracking
9401       verb that is encountered, once it has been passed pcre2_match() is com-
9402       mitted to finding a match at the current starting point, or not at all.
9403       For example:
9404
9405         a+(*COMMIT)b
9406
9407       This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
9408       of dynamic anchor, or "I've started, so I must finish."
9409
9410       The behaviour of (*COMMIT:NAME) is not the same  as  (*MARK:NAME)(*COM-
9411       MIT).  It is like (*MARK:NAME) in that the name is remembered for pass-
9412       ing back to the caller. However, (*SKIP:NAME) searches only  for  names
9413       that are set with (*MARK), ignoring those set by any of the other back-
9414       tracking verbs.
9415
9416       If there is more than one backtracking verb in a pattern,  a  different
9417       one  that  follows  (*COMMIT) may be triggered first, so merely passing
9418       (*COMMIT) during a match does not always guarantee that a match must be
9419       at this starting point.
9420
9421       Note that (*COMMIT) at the start of a pattern is not the same as an an-
9422       chor, unless PCRE2's start-of-match optimizations are  turned  off,  as
9423       shown in this output from pcre2test:
9424
9425           re> /(*COMMIT)abc/
9426         data> xyzabc
9427          0: abc
9428         data>
9429           re> /(*COMMIT)abc/no_start_optimize
9430         data> xyzabc
9431         No match
9432
9433       For  the first pattern, PCRE2 knows that any match must start with "a",
9434       so the optimization skips along the subject to "a" before applying  the
9435       pattern  to the first set of data. The match attempt then succeeds. The
9436       second pattern disables the optimization that skips along to the  first
9437       character.  The  pattern  is  now  applied  starting at "x", and so the
9438       (*COMMIT) causes the match to fail without trying  any  other  starting
9439       points.
9440
9441         (*PRUNE) or (*PRUNE:NAME)
9442
9443       This  verb causes the match to fail at the current starting position in
9444       the subject if there is a later matching failure that causes backtrack-
9445       ing  to  reach it. If the pattern is unanchored, the normal "bumpalong"
9446       advance to the next starting character then happens.  Backtracking  can
9447       occur  as  usual to the left of (*PRUNE), before it is reached, or when
9448       matching to the right of (*PRUNE), but if there  is  no  match  to  the
9449       right,  backtracking cannot cross (*PRUNE). In simple cases, the use of
9450       (*PRUNE) is just an alternative to an atomic group or possessive  quan-
9451       tifier, but there are some uses of (*PRUNE) that cannot be expressed in
9452       any other way. In an anchored pattern (*PRUNE) has the same  effect  as
9453       (*COMMIT).
9454
9455       The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
9456       It is like (*MARK:NAME) in that the name is remembered for passing back
9457       to  the  caller. However, (*SKIP:NAME) searches only for names set with
9458       (*MARK), ignoring those set by other backtracking verbs.
9459
9460         (*SKIP)
9461
9462       This verb, when given without a name, is like (*PRUNE), except that  if
9463       the  pattern  is unanchored, the "bumpalong" advance is not to the next
9464       character, but to the position in the subject where (*SKIP) was encoun-
9465       tered.  (*SKIP)  signifies that whatever text was matched leading up to
9466       it cannot be part of a successful match if there is a  later  mismatch.
9467       Consider:
9468
9469         a+(*SKIP)b
9470
9471       If  the  subject  is  "aaaac...",  after  the first match attempt fails
9472       (starting at the first character in the  string),  the  starting  point
9473       skips on to start the next attempt at "c". Note that a possessive quan-
9474       tifier does not have the same effect as this example; although it would
9475       suppress  backtracking  during  the first match attempt, the second at-
9476       tempt would start at the second character instead  of  skipping  on  to
9477       "c".
9478
9479       If  (*SKIP) is used to specify a new starting position that is the same
9480       as the starting position of the current match, or (by  being  inside  a
9481       lookbehind)  earlier, the position specified by (*SKIP) is ignored, and
9482       instead the normal "bumpalong" occurs.
9483
9484         (*SKIP:NAME)
9485
9486       When (*SKIP) has an associated name, its behaviour  is  modified.  When
9487       such  a  (*SKIP) is triggered, the previous path through the pattern is
9488       searched for the most recent (*MARK) that has the same name. If one  is
9489       found,  the  "bumpalong" advance is to the subject position that corre-
9490       sponds to that (*MARK) instead of to where (*SKIP) was encountered.  If
9491       no (*MARK) with a matching name is found, the (*SKIP) is ignored.
9492
9493       The  search  for a (*MARK) name uses the normal backtracking mechanism,
9494       which means that it does not  see  (*MARK)  settings  that  are  inside
9495       atomic groups or assertions, because they are never re-entered by back-
9496       tracking. Compare the following pcre2test examples:
9497
9498           re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
9499         data> abc
9500          0: a
9501          1: a
9502         data>
9503           re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
9504         data> abc
9505          0: b
9506          1: b
9507
9508       In the first example, the (*MARK) setting is in an atomic group, so  it
9509       is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
9510       This allows the second branch of the pattern to be tried at  the  first
9511       character  position.  In the second example, the (*MARK) setting is not
9512       in an atomic group. This allows (*SKIP:X) to find the (*MARK)  when  it
9513       backtracks, and this causes a new matching attempt to start at the sec-
9514       ond character. This time, the (*MARK) is never seen  because  "a"  does
9515       not match "b", so the matcher immediately jumps to the second branch of
9516       the pattern.
9517
9518       Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME).  It
9519       ignores names that are set by other backtracking verbs.
9520
9521         (*THEN) or (*THEN:NAME)
9522
9523       This  verb  causes  a skip to the next innermost alternative when back-
9524       tracking reaches it. That  is,  it  cancels  any  further  backtracking
9525       within  the  current  alternative.  Its name comes from the observation
9526       that it can be used for a pattern-based if-then-else block:
9527
9528         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
9529
9530       If the COND1 pattern matches, FOO is tried (and possibly further  items
9531       after  the  end  of the group if FOO succeeds); on failure, the matcher
9532       skips to the second alternative and tries COND2,  without  backtracking
9533       into  COND1.  If that succeeds and BAR fails, COND3 is tried. If subse-
9534       quently BAZ fails, there are no more alternatives, so there is a  back-
9535       track  to  whatever came before the entire group. If (*THEN) is not in-
9536       side an alternation, it acts like (*PRUNE).
9537
9538       The behaviour of (*THEN:NAME) is not the same  as  (*MARK:NAME)(*THEN).
9539       It is like (*MARK:NAME) in that the name is remembered for passing back
9540       to the caller. However, (*SKIP:NAME) searches only for names  set  with
9541       (*MARK), ignoring those set by other backtracking verbs.
9542
9543       A  group  that does not contain a | character is just a part of the en-
9544       closing alternative; it is not a nested alternation with only  one  al-
9545       ternative. The effect of (*THEN) extends beyond such a group to the en-
9546       closing alternative.  Consider this pattern, where A, B, etc. are  com-
9547       plex  pattern  fragments  that  do not contain any | characters at this
9548       level:
9549
9550         A (B(*THEN)C) | D
9551
9552       If A and B are matched, but there is a failure in C, matching does  not
9553       backtrack into A; instead it moves to the next alternative, that is, D.
9554       However, if the group containing (*THEN) is given  an  alternative,  it
9555       behaves differently:
9556
9557         A (B(*THEN)C | (*FAIL)) | D
9558
9559       The effect of (*THEN) is now confined to the inner group. After a fail-
9560       ure in C, matching moves to (*FAIL), which causes the  whole  group  to
9561       fail  because  there  are  no  more  alternatives to try. In this case,
9562       matching does backtrack into A.
9563
9564       Note that a conditional group is not considered as having two  alterna-
9565       tives,  because  only one is ever used. In other words, the | character
9566       in a conditional group has a different meaning. Ignoring  white  space,
9567       consider:
9568
9569         ^.*? (?(?=a) a | b(*THEN)c )
9570
9571       If the subject is "ba", this pattern does not match. Because .*? is un-
9572       greedy, it initially matches zero characters. The condition (?=a)  then
9573       fails,  the  character  "b"  is matched, but "c" is not. At this point,
9574       matching does not backtrack to .*? as might perhaps  be  expected  from
9575       the  presence  of the | character. The conditional group is part of the
9576       single alternative that comprises the whole pattern, and so  the  match
9577       fails.  (If  there  was a backtrack into .*?, allowing it to match "b",
9578       the match would succeed.)
9579
9580       The verbs just described provide four different "strengths" of  control
9581       when subsequent matching fails. (*THEN) is the weakest, carrying on the
9582       match at the next alternative. (*PRUNE) comes next, failing  the  match
9583       at  the  current starting position, but allowing an advance to the next
9584       character (for an unanchored pattern). (*SKIP) is similar, except  that
9585       the advance may be more than one character. (*COMMIT) is the strongest,
9586       causing the entire match to fail.
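
       The difference between (*PRUNE), (*SKIP), and (*COMMIT) lies in where
       the next match attempt starts. The following C sketch is a toy model
       of that difference. It is not PCRE2 code and does not use the  PCRE2
       API: it hard-codes the pattern a+<verb>b and models only the choice
       of the next starting point after a failure to the right of the verb.
       The names toy_search and verb_t are hypothetical, and (*THEN) is not
       modelled because the pattern has no alternatives.

```c
#include <stddef.h>
#include <string.h>

typedef enum { VERB_PRUNE, VERB_SKIP, VERB_COMMIT } verb_t;

/* Toy search for the fixed pattern  a+ <verb> b  in subject.
   Returns the offset at which a match starts, or -1 if there is none.
   If attempts is not NULL, it receives the number of starting points tried. */
long toy_search(const char *subject, verb_t verb, int *attempts)
{
    size_t len = strlen(subject);
    size_t start = 0;
    int tried = 0;
    long result = -1;

    while (start < len) {
        size_t p = start;
        tried++;

        /* a+ : consume a run of 'a' characters. */
        while (p < len && subject[p] == 'a') p++;

        /* a+ failed before the verb was reached: ordinary bumpalong. */
        if (p == start) { start++; continue; }

        /* The verb has now been passed at offset p; try to match 'b'. */
        if (p < len && subject[p] == 'b') { result = (long)start; break; }

        /* Failure to the right of the verb: the verb is triggered. */
        if (verb == VERB_COMMIT) break;               /* no further attempts */
        start = (verb == VERB_SKIP) ? p : start + 1;  /* skip ahead or bump  */
    }

    if (attempts != NULL) *attempts = tried;
    return result;
}
```

       In this model, searching "aaaacab" finds the match at offset 5 with
       either verb, but VERB_SKIP tries three starting points (0, 4, and 5)
       whereas VERB_PRUNE tries six. With VERB_COMMIT, "xxaab" matches  at
       offset 2 and "aacaab" does not match at all.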
9587
9588   More than one backtracking verb
9589
9590       If more than one backtracking verb is present in  a  pattern,  the  one
9591       that  is  backtracked  onto first acts. For example, consider this pat-
9592       tern, where A, B, etc. are complex pattern fragments:
9593
9594         (A(*COMMIT)B(*THEN)C|ABD)
9595
9596       If A matches but B fails, the backtrack to (*COMMIT) causes the  entire
9597       match to fail. However, if A and B match, but C fails, the backtrack to
9598       (*THEN) causes the next alternative (ABD) to be tried.  This  behaviour
9599       is  consistent,  but is not always the same as Perl's. It means that if
9600       two or more backtracking verbs appear in succession, all but  the  last
9601       of them have no effect. Consider this example:
9602
9603         ...(*COMMIT)(*PRUNE)...
9604
9605       If there is a matching failure to the right, backtracking onto (*PRUNE)
9606       causes it to be triggered, and its action is taken. There can never  be
9607       a backtrack onto (*COMMIT).
9608
9609   Backtracking verbs in repeated groups
9610
9611       PCRE2 sometimes differs from Perl in its handling of backtracking verbs
9612       in repeated groups. For example, consider:
9613
9614         /(a(*COMMIT)b)+ac/
9615
9616       If the subject is "abac", Perl matches  unless  its  optimizations  are
9617       disabled,  but  PCRE2  always fails because the (*COMMIT) in the second
9618       repeat of the group acts.
9619
9620   Backtracking verbs in assertions
9621
9622       (*FAIL) in any assertion has its normal effect: it forces an  immediate
9623       backtrack.  The  behaviour  of  the other backtracking verbs depends on
9624       whether the assertion is standalone or is acting as the condition in  a
9625       conditional group.
9626
9627       (*ACCEPT)  in  a  standalone positive assertion causes the assertion to
9628       succeed without any further processing; captured  strings  and  a  mark
9629       name  (if  set) are retained. In a standalone negative assertion, (*AC-
9630       CEPT) causes the assertion to fail without any further processing; cap-
9631       tured substrings and any mark name are discarded.
9632
9633       If  the  assertion is a condition, (*ACCEPT) causes the condition to be
9634       true for a positive assertion and false for a  negative  one;  captured
9635       substrings are retained in both cases.
9636
9637       The remaining verbs act only when a later failure causes a backtrack to
9638       reach them. This means that, for the Perl-compatible assertions,  their
9639       effect is confined to the assertion, because Perl lookaround assertions
9640       are atomic. A backtrack that occurs after such an assertion is complete
9641       does  not  jump  back  into  the  assertion.  Note in particular that a
9642       (*MARK) name that is set in an assertion is not "seen" by  an  instance
9643       of (*SKIP:NAME) later in the pattern.
9644
9645       PCRE2  now supports non-atomic positive assertions, as described in the
9646       section entitled "Non-atomic assertions" above. These  assertions  must
9647       be  standalone  (not used as conditions). They are not Perl-compatible.
9648       For these assertions, a later backtrack does jump back into the  asser-
9649       tion,  and  therefore verbs such as (*COMMIT) can be triggered by back-
9650       tracks from later in the pattern.
9651
9652       The effect of (*THEN) is not allowed to escape beyond an assertion.  If
9653       there  are no more branches to try, (*THEN) causes a positive assertion
9654       to be false, and a negative assertion to be true.
9655
9656       The other backtracking verbs are not treated specially if  they  appear
9657       in  a  standalone  positive assertion. In a conditional positive asser-
9658       tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
9659       or  (*PRUNE) causes the condition to be false. However, for both stand-
9660       alone and conditional negative assertions, backtracking into (*COMMIT),
9661       (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9662       ing any further alternative branches.
9663
9664   Backtracking verbs in subroutines
9665
9666       These behaviours occur whether or not the group is called recursively.
9667
9668       (*ACCEPT) in a group called as a subroutine causes the subroutine match
9669       to  succeed without any further processing. Matching then continues af-
9670       ter the subroutine call. Perl documents this behaviour.  Perl's  treat-
9671       ment of the other verbs in subroutines is different in some cases.
9672
9673       (*FAIL)  in  a  group  called as a subroutine has its normal effect: it
9674       forces an immediate backtrack.
9675
9676       (*COMMIT), (*SKIP), and (*PRUNE) cause the  subroutine  match  to  fail
9677       when  triggered  by being backtracked to in a group called as a subrou-
9678       tine. There is then a backtrack at the outer level.
9679
9680       (*THEN), when triggered, skips to the next alternative in the innermost
9681       enclosing  group that has alternatives (its normal behaviour). However,
9682       if there is no such group within the subroutine's group, the subroutine
9683       match fails and there is a backtrack at the outer level.
9684
9685
9686SEE ALSO
9687
9688       pcre2api(3),    pcre2callout(3),    pcre2matching(3),   pcre2syntax(3),
9689       pcre2(3).
9690
9691
9692AUTHOR
9693
9694       Philip Hazel
9695       Retired from University Computing Service
9696       Cambridge, England.
9697
9698
9699REVISION
9700
9701       Last updated: 12 January 2022
9702       Copyright (c) 1997-2022 University of Cambridge.
9703------------------------------------------------------------------------------
9704
9705
9706PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)
9707
9708
9709
9710NAME
9711       PCRE2 - Perl-compatible regular expressions (revised API)
9712
9713PCRE2 PERFORMANCE
9714
9715       Two  aspects  of performance are discussed below: memory usage and pro-
9716       cessing time. The way you express your pattern as a regular  expression
9717       can affect both of them.
9718
9719
9720COMPILED PATTERN MEMORY USAGE
9721
9722       Patterns are compiled by PCRE2 into a reasonably efficient interpretive
9723       code, so that most simple patterns do not use much memory  for  storing
9724       the compiled version. However, there is one case where the memory usage
9725       of a compiled pattern can be unexpectedly  large.  If  a  parenthesized
9726       group  has  a quantifier with a minimum greater than 1 and/or a limited
9727       maximum, the whole group is repeated in the compiled code. For example,
9728       the pattern
9729
9730         (abc|def){2,4}
9731
9732       is compiled as if it were
9733
9734         (abc|def)(abc|def)((abc|def)(abc|def)?)?
9735
9736       (Technical  aside:  It is done this way so that backtrack points within
9737       each of the repetitions can be independently maintained.)
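
       To get a feel for how quickly the repeated form grows, the following
       C sketch builds the textual equivalent of a quantified group in the
       style shown above. It is only an illustration of  the  "as  if  it
       were" form; it does not reproduce PCRE2's internal compiled code, and
       the function names are hypothetical.

```c
#include <string.h>

/* Append src to the NUL-terminated string in dst, which has room for cap
   bytes. Returns 0 on success, -1 if the result would not fit. */
static int append(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst), need = strlen(src);
    if (used + need + 1 > cap) return -1;
    memcpy(dst + used, src, need + 1);
    return 0;
}

/* Write the expansion of group{min,max} into out: min mandatory copies
   followed by (max - min) nested optional copies.
   Returns 0 on success, -1 on bad arguments or insufficient space. */
int expand_repeat(const char *group, unsigned min, unsigned max,
                  char *out, size_t cap)
{
    unsigned i, optional;

    if (cap == 0 || max < min) return -1;
    out[0] = '\0';

    /* Mandatory copies. */
    for (i = 0; i < min; i++)
        if (append(out, cap, group) != 0) return -1;

    /* Every optional copy except the innermost opens a wrapper group. */
    optional = max - min;
    for (i = 1; i < optional; i++)
        if (append(out, cap, "(") != 0 || append(out, cap, group) != 0)
            return -1;

    /* The innermost optional copy. */
    if (optional > 0)
        if (append(out, cap, group) != 0 || append(out, cap, "?") != 0)
            return -1;

    /* Close the wrapper groups, each of which is itself optional. */
    for (i = 1; i < optional; i++)
        if (append(out, cap, ")?") != 0) return -1;

    return 0;
}
```

       Calling expand_repeat("(abc|def)", 2, 4, buf, sizeof buf) produces
       the string (abc|def)(abc|def)((abc|def)(abc|def)?)? shown above. The
       output length grows linearly with the maximum, and nesting  such  re-
       peats multiplies the sizes.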
9738
9739       For regular expressions whose quantifiers use only small numbers,  this
9740       is  not  usually a problem. However, if the numbers are large, and par-
9741       ticularly if such repetitions are nested, the memory usage  can  become
9742       an embarrassment. For example, the very simple pattern
9743
9744         ((ab){1,1000}c){1,3}
9745
9746       uses  over  50KiB  when compiled using the 8-bit library. When PCRE2 is
9747       compiled with its default internal pointer size of two bytes, the  size
9748       limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
9749       libraries, and this is reached with the above pattern if the outer rep-
9750       etition  is  increased from 3 to 4. PCRE2 can be compiled to use larger
9751       internal pointers and thus handle larger compiled patterns, but  it  is
9752       better to try to rewrite your pattern to use less memory if you can.
9753
9754       One  way  of reducing the memory usage for such patterns is to make use
9755       of PCRE2's "subroutine" facility. Re-writing the above pattern as
9756
9757         ((ab)(?2){0,999}c)(?1){0,2}
9758
9759       reduces the memory requirements to around 16KiB, and indeed it  remains
9760       under  20KiB  even with the outer repetition increased to 100. However,
9761       this kind of pattern is not always exactly equivalent, because any cap-
9762       tures  within  subroutine calls are lost when the subroutine completes.
9763       If this is not a problem, this kind of  rewriting  will  allow  you  to
9764       process  patterns that PCRE2 cannot otherwise handle. The matching per-
9765       formance of the two different versions of the pattern  is  roughly  the
9766       same.  (This applies from release 10.30 - things were different in ear-
9767       lier releases.)
9768
9769
9770STACK AND HEAP USAGE AT RUN TIME
9771
9772       From release 10.30, the interpretive (non-JIT) version of pcre2_match()
9773       uses  very  little system stack at run time. In earlier releases recur-
9774       sive function calls could use a great deal of  stack,  and  this  could
9775       cause  problems, but this usage has been eliminated. Backtracking posi-
9776       tions are now explicitly remembered in memory frames controlled by  the
9777       code.
9778
9779       The size of each frame depends on the size of pointer variables and the
9780       number of capturing parenthesized groups in the pattern being  matched.
9781       On a 64-bit system the frame size for a pattern with no captures is 128
9782       bytes. For each capturing group the size increases by 16 bytes.
9783
9784       Until release 10.41, an initial 20KiB frames vector  was  allocated  on
9785       the  system  stack,  but this still caused some issues for multi-thread
9786       applications where each thread has a very  small  stack.  From  release
9787       10.41  backtracking  memory  frames  are always held in heap memory. An
9788       initial heap allocation is obtained the first time any match data block
9789       is  passed  to  pcre2_match().  This  is remembered with the match data
9790       block and re-used if that block is used for another match. It is  freed
9791       when the match data block itself is freed.
9792
9793       The  size  of the initial block is the larger of 20KiB or ten times the
9794       pattern's frame size, unless the heap limit is less than this, in which
9795       case  the  heap  limit  is  used. If the initial block proves to be too
9796       small during matching, it is replaced by a larger block, subject to the
9797       heap  limit.  The  heap limit is checked only when a new block is to be
9798       allocated. Reducing the heap limit between calls to pcre2_match()  with
9799       the same match data block does not affect the saved block.
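
       The sizing rules above can be written out as arithmetic. The follow-
       ing C sketch uses only the figures quoted in this section (128 bytes
       plus 16 bytes per capturing group on a 64-bit system, and the larger
       of 20KiB or ten frames for the initial block). The function names are
       hypothetical, and the heap limit is taken in bytes purely for sim-
       plicity; this is an assumption of the sketch, not a description of
       the library's internals.

```c
#include <stddef.h>

#define INITIAL_BLOCK_MIN (20u * 1024u)   /* 20KiB, as quoted in the text */

/* Frame size on a 64-bit system, using the figures given in the text. */
size_t frame_size_64bit(size_t capture_groups)
{
    return 128 + 16 * capture_groups;
}

/* Size of the initial heap block: the larger of 20KiB or ten frames,
   capped by the heap limit (expressed in bytes in this sketch). */
size_t initial_block_size(size_t frame_size, size_t heap_limit_bytes)
{
    size_t size = 10 * frame_size;
    if (size < INITIAL_BLOCK_MIN) size = INITIAL_BLOCK_MIN;
    if (heap_limit_bytes < size) size = heap_limit_bytes;
    return size;
}
```

       For example, a pattern with three capturing groups has a frame size
       of 128 + 3*16 = 176 bytes under these figures, so ten frames (1760
       bytes) is well below 20KiB and the initial block is 20480 bytes, pro-
       vided that the heap limit is not smaller than that.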
9800
9801       In  contrast  to  pcre2_match(),  pcre2_dfa_match()  does use recursive
9802       function calls, but only for processing atomic groups,  lookaround  as-
9803       sertions, and recursion within the pattern. The original version of the
9804       code used to allocate quite large internal  workspace  vectors  on  the
9805       stack,  which  caused  some  problems for some patterns in environments
9806       with small stacks. From release 10.32 the  code  for  pcre2_dfa_match()
9807       has  been  re-factored  to  use heap memory when necessary for internal
9808       workspace when recursing, though recursive  function  calls  are  still
9809       used.
9810
9811       The  "match depth" parameter can be used to limit the depth of function
9812       recursion, and the "match heap"  parameter  to  limit  heap  memory  in
9813       pcre2_dfa_match().
9814
9815
9816PROCESSING TIME
9817
9818       Certain  items  in regular expression patterns are processed more effi-
9819       ciently than others. It is more efficient to use a character class like
9820       [aeiou]   than   a   set   of  single-character  alternatives  such  as
9821       (a|e|i|o|u). In general, the simplest construction  that  provides  the
9822       required behaviour is usually the most efficient. Jeffrey Friedl's book
9823       contains a lot of useful general discussion  about  optimizing  regular
9824       expressions for efficient performance. This document contains a few ob-
9825       servations about PCRE2.
9826
9827       Using Unicode character properties (the \p,  \P,  and  \X  escapes)  is
9828       slow,  because  PCRE2 has to use a multi-stage table lookup whenever it
9829       needs a character's property. If you can find  an  alternative  pattern
9830       that does not use character properties, it will probably be faster.
9831
9832       By  default,  the  escape  sequences  \b, \d, \s, and \w, and the POSIX
9833       character classes such as [:alpha:]  do  not  use  Unicode  properties,
9834       partly for backwards compatibility, and partly for performance reasons.
9835       However, you can set the PCRE2_UCP option or  start  the  pattern  with
9836       (*UCP)  if  you  want Unicode character properties to be used. This can
9837       double the matching time for  items  such  as  \d,  when  matched  with
9838       pcre2_match();  the  performance loss is less with a DFA matching func-
9839       tion, and in both cases there is not much difference for \b.
9840
9841       When a pattern begins with .* not in atomic parentheses, nor in  paren-
9842       theses  that  are  the subject of a backreference, and the PCRE2_DOTALL
9843       option is set, the pattern is implicitly anchored by  PCRE2,  since  it
9844       can  match  only  at  the start of a subject string. If the pattern has
9845       multiple top-level branches, they must all be anchorable. The optimiza-
9846       tion  can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au-
9847       tomatically disabled if the pattern contains (*PRUNE) or (*SKIP).
9848
9849       If PCRE2_DOTALL is not set, PCRE2 cannot make  this  optimization,  be-
9850       cause  the  dot metacharacter does not then match a newline, and if the
9851       subject string contains newlines, the pattern may match from the  char-
9852       acter immediately following one of them instead of from the very start.
9853       For example, the pattern
9854
9855         .*second
9856
9857       matches the subject "first\nand second" (where \n stands for a  newline
9858       character),  with the match starting at the seventh character. In order
9859       to do this, PCRE2 has to retry the match starting after  every  newline
9860       in the subject.
9861
9862       If  you  are using such a pattern with subject strings that do not con-
9863       tain  newlines,  the  best   performance   is   obtained   by   setting
9864       PCRE2_DOTALL,  or starting the pattern with ^.* or ^.*? to indicate ex-
9865       plicit anchoring. That saves PCRE2 from having to scan along  the  sub-
9866       ject looking for a newline to restart at.
9867
9868       Beware  of  patterns  that contain nested indefinite repeats. These can
9869       take a long time to run when applied to a string that does  not  match.
9870       Consider the pattern fragment
9871
9872         ^(a+)*
9873
9874       This  can  match "aaaa" in 16 different ways, and this number increases
9875       very rapidly as the string gets longer. (The * repeat can match  0,  1,
9876       2,  3, or 4 times, and for each of those cases other than 0 or 4, the +
9877       repeats can match different numbers of times.) When  the  remainder  of
9878       the  pattern  is such that the entire match is going to fail, PCRE2 has
9879       in principle to try every possible variation, and this can take an  ex-
9880       tremely long time, even for relatively short strings.
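
       The figure of 16 can be checked by counting. The following C sketch
       counts the distinct ways in which ^(a+)* can match at the start of a
       run of n "a" characters: the group may stop repeating, or it may run
       one more iteration that consumes between 1 and n characters. The
       function name is hypothetical, and the code counts paths; it does not
       run a regular expression.

```c
/* Number of distinct ways ^(a+)* can match at the start of n 'a' characters. */
unsigned long long count_ways(unsigned n)
{
    unsigned long long total = 1;   /* the group stops repeating here */
    unsigned len;

    /* Otherwise the next a+ iteration consumes len characters, and the
       remaining n - len characters are handled in the same way. */
    for (len = 1; len <= n; len++)
        total += count_ways(n - len);

    return total;
}
```

       count_ways(4) returns 16, and each additional character doubles the
       count (count_ways(10) is 1024), which is why a failing match can take
       so long on even moderately long strings.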
9881
9882       An optimization catches some of the more simple cases such as
9883
9884         (a+)*b
9885
9886       where  a  literal  character  follows. Before embarking on the standard
9887       matching procedure, PCRE2 checks that there is a "b" later in the  sub-
9888       ject  string, and if there is not, it fails the match immediately. How-
9889       ever, when there is no following literal this  optimization  cannot  be
9890       used. You can see the difference by comparing the behaviour of
9891
9892         (a+)*\d
9893
9894       with the pattern above. The pattern ending in "b" fails almost instantly
9895       on a whole line of "a" characters, whereas the one ending in \d takes an
9896       appreciable time with strings longer than about 20 characters.
9897
9898       In many cases, the solution to this kind of performance issue is to use
9899       an atomic group or a possessive quantifier. This can often reduce  mem-
9900       ory requirements as well. As another example, consider this pattern:
9901
9902         ([^<]|<(?!inet))+
9903
9904       It  matches  from wherever it starts until it encounters "<inet" or the
9905       end of the data, and is the kind of pattern that  might  be  used  when
9906       processing an XML file. Each iteration of the outer parentheses matches
9907       either one character that is not "<" or a "<" that is not  followed  by
9908       "inet".  However,  each time a parenthesis is processed, a backtracking
9909       position is passed, so this formulation uses a memory  frame  for  each
9910       matched character. For a long string, a lot of memory is required. Con-
9911       sider now this  rewritten  pattern,  which  matches  exactly  the  same
9912       strings:
9913
9914         ([^<]++|<(?!inet))+
9915
9916       This runs much faster, because sequences of characters that do not con-
9917       tain "<" are "swallowed" in one item inside the parentheses, and a pos-
9918       sessive  quantifier  is  used to stop any backtracking into the runs of
9919       non-"<" characters. This version also uses a lot  less  memory  because
9920       entry  to  a  new  set of parentheses happens only when a "<" character
9921       that is not followed by "inet" is encountered (and we  assume  this  is
9922       relatively rare).
9923
9924       This example shows that one way of optimizing performance when matching
9925       long subject strings is to write repeated parenthesized subpatterns  to
9926       match more than one character whenever possible.
9927
9928   SETTING RESOURCE LIMITS
9929
9930       You  can  set  limits on the amount of processing that takes place when
9931       matching, and on the amount of heap memory that is  used.  The  default
9932       values of the limits are very large, and unlikely ever to operate. They
9933       can be changed when PCRE2 is built, and  they  can  also  be  set  when
9934       pcre2_match()  or pcre2_dfa_match() is called. For details of these in-
9935       terfaces, see the pcre2build documentation  and  the  section  entitled
9936       "The match context" in the pcre2api documentation.
9937
9938       The  pcre2test  test program has a modifier called "find_limits" which,
9939       if applied to a subject line, causes it to  find  the  smallest  limits
9940       that allow a pattern to match. This is done by repeatedly matching with
9941       different limits.
9942
9943
9944AUTHOR
9945
9946       Philip Hazel
9947       Retired from University Computing Service
9948       Cambridge, England.
9949
9950
9951REVISION
9952
9953       Last updated: 27 July 2022
9954       Copyright (c) 1997-2022 University of Cambridge.
9955------------------------------------------------------------------------------
9956
9957
9958PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)
9959
9960
9961
9962NAME
9963       PCRE2 - Perl-compatible regular expressions (revised API)
9964
9965SYNOPSIS
9966
9967       #include <pcre2posix.h>
9968
9969       int pcre2_regcomp(regex_t *preg, const char *pattern,
9970            int cflags);
9971
9972       int pcre2_regexec(const regex_t *preg, const char *string,
9973            size_t nmatch, regmatch_t pmatch[], int eflags);
9974
9975       size_t pcre2_regerror(int errcode, const regex_t *preg,
9976            char *errbuf, size_t errbuf_size);
9977
9978       void pcre2_regfree(regex_t *preg);
9979
9980
9981DESCRIPTION
9982
9983       This  set of functions provides a POSIX-style API for the PCRE2 regular
9984       expression 8-bit library. There are no POSIX-style wrappers for PCRE2's
9985       16-bit  and  32-bit libraries. See the pcre2api documentation for a de-
9986       scription of PCRE2's native API, which contains much  additional  func-
9987       tionality.
9988
9989       The functions described here are wrapper functions that ultimately call
9990       the PCRE2 native API. Their prototypes are defined in the  pcre2posix.h
       header file, and they all have unique names starting with pcre2_.
       However, the pcre2posix.h header also contains macro definitions that
       convert the standard POSIX names, such as regcomp(), into
       pcre2_regcomp() etc.
9994       This means that a program can use the usual POSIX names without running
9995       the  risk of accidentally linking with POSIX functions from a different
9996       library.
9997
9998       On Unix-like systems the PCRE2 POSIX library is called  libpcre2-posix,
9999       so  can  be accessed by adding -lpcre2-posix to the command for linking
10000       an application. Because the POSIX functions call the native ones, it is
10001       also necessary to add -lpcre2-8.
10002
       Although they were not defined as prototypes in pcre2posix.h, releases
10004       10.33 to 10.36 of the library contained functions with the POSIX  names
10005       regcomp()  etc.  These simply passed their arguments to the PCRE2 func-
10006       tions. These functions were provided for backwards  compatibility  with
10007       earlier  versions  of  PCRE2, which had only POSIX names. However, this
10008       has proved troublesome in situations where a program links with several
10009       libraries,  some  of which use PCRE2's POSIX interface while others use
10010       the real POSIX functions.  For this reason, the POSIX names  have  been
10011       removed since release 10.37.
10012
10013       Calling  the  header  file  pcre2posix.h avoids any conflict with other
10014       POSIX libraries. It can, of course, be renamed or aliased  as  regex.h,
10015       which  is  the  "correct"  name,  if there is no clash. It provides two
10016       structure types, regex_t for compiled internal  forms,  and  regmatch_t
10017       for returning captured substrings. It also defines some constants whose
10018       names start with "REG_"; these are used for setting options and identi-
10019       fying error codes.
10020
10021
10022USING THE POSIX FUNCTIONS
10023
10024       Those  POSIX  option bits that can reasonably be mapped to PCRE2 native
10025       options have been implemented. In addition, the option REG_EXTENDED  is
10026       defined  with  the  value  zero. This has no effect, but since programs
10027       that are written to the POSIX interface often use  it,  this  makes  it
10028       easier  to  slot in PCRE2 as a replacement library. Other POSIX options
10029       are not even defined.
10030
10031       There are also some options that are not defined by POSIX.  These  have
10032       been  added  at  the  request  of users who want to make use of certain
10033       PCRE2-specific features via the POSIX calling interface or to  add  BSD
10034       or GNU functionality.
10035
10036       When  PCRE2  is  called via these functions, it is only the API that is
10037       POSIX-like in style. The syntax and semantics of  the  regular  expres-
10038       sions  themselves  are  still  those of Perl, subject to the setting of
10039       various PCRE2 options, as described below. "POSIX-like in style"  means
10040       that  the  API  approximates  to  the POSIX definition; it is not fully
10041       POSIX-compatible, and in multi-unit encoding  domains  it  is  probably
10042       even less compatible.
10043
10044       The  descriptions  below use the actual names of the functions, but, as
10045       described above, the standard POSIX names (without the  pcre2_  prefix)
10046       may also be used.
10047
10048
10049COMPILING A PATTERN
10050
10051       The function pcre2_regcomp() is called to compile a pattern into an in-
10052       ternal form. By default, the pattern is a C string terminated by a  bi-
10053       nary zero (but see REG_PEND below). The preg argument is a pointer to a
10054       regex_t structure that is used as a base for storing information  about
10055       the  compiled  regular  expression.  (It  is  also  used for input when
10056       REG_PEND is set.)
10057
10058       The argument cflags is either zero, or contains one or more of the bits
10059       defined by the following macros:
10060
10061         REG_DOTALL
10062
10063       The  PCRE2_DOTALL  option  is set when the regular expression is passed
10064       for compilation to the native function. Note  that  REG_DOTALL  is  not
10065       part of the POSIX standard.
10066
10067         REG_ICASE
10068
10069       The  PCRE2_CASELESS option is set when the regular expression is passed
10070       for compilation to the native function.
10071
10072         REG_NEWLINE
10073
10074       The PCRE2_MULTILINE option is set when the regular expression is passed
10075       for  compilation  to the native function. Note that this does not mimic
10076       the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
10077       tion).
10078
10079         REG_NOSPEC
10080
10081       The  PCRE2_LITERAL  option is set when the regular expression is passed
10082       for compilation to the native function. This disables all meta  charac-
10083       ters  in the pattern, causing it to be treated as a literal string. The
10084       only other options that are  allowed  with  REG_NOSPEC  are  REG_ICASE,
10085       REG_NOSUB,  REG_PEND,  and REG_UTF. Note that REG_NOSPEC is not part of
10086       the POSIX standard.
10087
10088         REG_NOSUB
10089
       When a pattern that is compiled with this flag is passed to
       pcre2_regexec() for matching, the nmatch and pmatch arguments are
       ignored, and no captured strings are returned. Versions of the PCRE2
       library prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile
       option, but this no longer happens because it disables the use of
       backreferences.
10096
10097         REG_PEND
10098
       If this option is set, the re_endp field in the preg structure (which
10100       has the type const char *) must be set to point to the character beyond
10101       the  end of the pattern before calling pcre2_regcomp(). The pattern it-
10102       self may now contain binary zeros, which are treated  as  data  charac-
10103       ters.  Without  REG_PEND,  a binary zero terminates the pattern and the
10104       re_endp field is ignored. This is a GNU extension to the POSIX standard
10105       and  should be used with caution in software intended to be portable to
10106       other systems.
10107
10108         REG_UCP
10109
10110       The PCRE2_UCP option is set when the regular expression is  passed  for
10111       compilation  to  the  native function. This causes PCRE2 to use Unicode
       properties when matching \d, \w, etc., instead of just recognizing
10113       ASCII values. Note that REG_UCP is not part of the POSIX standard.
10114
10115         REG_UNGREEDY
10116
10117       The  PCRE2_UNGREEDY option is set when the regular expression is passed
10118       for compilation to the native function. Note that REG_UNGREEDY  is  not
10119       part of the POSIX standard.
10120
10121         REG_UTF
10122
10123       The  PCRE2_UTF  option is set when the regular expression is passed for
10124       compilation to the native function. This causes the pattern itself  and
10125       all  data  strings used for matching it to be treated as UTF-8 strings.
10126       Note that REG_UTF is not part of the POSIX standard.
10127
10128       In the absence of these flags, no options  are  passed  to  the  native
       function. This means that the regex is compiled with PCRE2 default
       semantics. In particular, the way it handles newline characters in the
10131       subject  string  is  the Perl way, not the POSIX way. Note that setting
10132       PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
10133       It  does not affect the way newlines are matched by the dot metacharac-
10134       ter (they are not) or by a negative class such as [^a] (they are).
10135
10136       The yield of pcre2_regcomp() is zero on success,  and  non-zero  other-
10137       wise.  The preg structure is filled in on success, and one other member
10138       of the structure (as well as re_endp) is public: re_nsub  contains  the
10139       number  of capturing subpatterns in the regular expression. Various er-
10140       ror codes are defined in the header file.
10141
10142       NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt
10143       to use the contents of the preg structure. If, for example, you pass it
10144       to pcre2_regexec(), the result is undefined and your program is  likely
10145       to crash.
10146
10147
10148MATCHING NEWLINE CHARACTERS
10149
10150       This area is not simple, because POSIX and Perl take different views of
10151       things.  It is not possible to get PCRE2 to obey POSIX  semantics,  but
10152       then PCRE2 was never intended to be a POSIX engine. The following table
10153       lists the different possibilities for matching  newline  characters  in
10154       Perl and PCRE2:
10155
10156                                 Default   Change with
10157
10158         . matches newline          no     PCRE2_DOTALL
10159         newline matches [^a]       yes    not changeable
10160         $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
10161         $ matches \n in middle     no     PCRE2_MULTILINE
10162         ^ matches \n in middle     no     PCRE2_MULTILINE
10163
10164       This is the equivalent table for a POSIX-compatible pattern matcher:
10165
10166                                 Default   Change with
10167
10168         . matches newline          yes    REG_NEWLINE
10169         newline matches [^a]       yes    REG_NEWLINE
10170         $ matches \n at end        no     REG_NEWLINE
10171         $ matches \n in middle     no     REG_NEWLINE
10172         ^ matches \n in middle     no     REG_NEWLINE
10173
10174       This  behaviour  is not what happens when PCRE2 is called via its POSIX
10175       API. By default, PCRE2's behaviour is the same as Perl's,  except  that
10176       there  is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
10177       and Perl, there is no way to stop newline from matching [^a].
10178
10179       Default POSIX newline handling can be obtained by setting  PCRE2_DOTALL
10180       and  PCRE2_DOLLAR_ENDONLY  when  calling  pcre2_compile() directly, but
10181       there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac-
10182       tion.  When  using  the  POSIX  API,  passing  REG_NEWLINE  to  PCRE2's
10183       pcre2_regcomp()  function  causes  PCRE2_MULTILINE  to  be  passed   to
10184       pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way to
10185       pass PCRE2_DOLLAR_ENDONLY.
10186
10187
10188MATCHING A PATTERN
10189
10190       The function pcre2_regexec() is called to match a compiled pattern preg
10191       against  a  given string, which is by default terminated by a zero byte
10192       (but see REG_STARTEND below), subject to the options in eflags.   These
10193       can be:
10194
10195         REG_NOTBOL
10196
10197       The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
10198       ing function.
10199
10200         REG_NOTEMPTY
10201
10202       The PCRE2_NOTEMPTY option is set  when  calling  the  underlying  PCRE2
10203       matching  function.  Note  that  REG_NOTEMPTY  is not part of the POSIX
10204       standard. However, setting this option can give more POSIX-like  behav-
10205       iour in some situations.
10206
10207         REG_NOTEOL
10208
10209       The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
10210       ing function.
10211
10212         REG_STARTEND
10213
10214       When this option  is  set,  the  subject  string  starts  at  string  +
10215       pmatch[0].rm_so  and  ends  at  string  + pmatch[0].rm_eo, which should
10216       point to the first character beyond the string. There may be binary ze-
10217       ros  within  the  subject string, and indeed, using REG_STARTEND is the
10218       only way to pass a subject string that contains a binary zero.
10219
10220       Whatever the value of  pmatch[0].rm_so,  the  offsets  of  the  matched
10221       string  and  any  captured  substrings  are still given relative to the
10222       start of string itself. (Before PCRE2 release 10.30  these  were  given
10223       relative  to  string + pmatch[0].rm_so, but this differs from other im-
10224       plementations.)
10225
10226       This is a BSD extension, compatible with  but  not  specified  by  IEEE
10227       Standard  1003.2 (POSIX.2), and should be used with caution in software
10228       intended to be portable to other systems. Note that  a  non-zero  rm_so
10229       does  not  imply REG_NOTBOL; REG_STARTEND affects only the location and
       length of the string, not how it is matched. Setting REG_STARTEND and
       passing pmatch as NULL are mutually exclusive; if both are done, the
       error REG_INVARG is returned.
10233
10234       If the pattern was compiled with the REG_NOSUB flag, no data about  any
10235       matched  strings  is  returned.  The  nmatch  and  pmatch  arguments of
10236       pcre2_regexec() are ignored (except possibly  as  input  for  REG_STAR-
10237       TEND).
10238
       The value of nmatch may be zero, and the value of pmatch may be NULL
       (unless REG_STARTEND is set); in both these cases no data about any
10241       matched strings is returned.
10242
10243       Otherwise,  the  portion  of  the string that was matched, and also any
10244       captured substrings, are returned via the pmatch argument, which points
10245       to  an  array  of  nmatch structures of type regmatch_t, containing the
10246       members rm_so and rm_eo. These contain the byte  offset  to  the  first
10247       character of each substring and the offset to the first character after
10248       the end of each substring, respectively. The 0th element of the  vector
10249       relates  to  the  entire portion of string that was matched; subsequent
10250       elements relate to the capturing subpatterns of the regular expression.
10251       Unused entries in the array have both structure members set to -1.
10252
10253       A  successful  match  yields a zero return; various error codes are de-
10254       fined in the header file, of which REG_NOMATCH is the "expected"  fail-
10255       ure code.
10256
10257
10258ERROR MESSAGES
10259
10260       The  pcre2_regerror()  function  maps  a non-zero errorcode from either
10261       pcre2_regcomp() or pcre2_regexec() to a printable message. If  preg  is
10262       not  NULL, the error should have arisen from the use of that structure.
       A message terminated by a binary zero is placed in errbuf. If the
       buffer is too short, only the first errbuf_size - 1 characters of the
       error message are used. The yield of the function is the buffer size
10266       needed  to hold the whole message, including the terminating zero. This
10267       value is greater than errbuf_size if the message was truncated.
10268
10269
10270MEMORY USAGE
10271
10272       Compiling a regular expression causes memory to be allocated and  asso-
10273       ciated  with the preg structure. The function pcre2_regfree() frees all
10274       such memory, after which preg may no longer be used as a  compiled  ex-
10275       pression.
10276
10277
10278AUTHOR
10279
10280       Philip Hazel
10281       University Computing Service
10282       Cambridge, England.
10283
10284
10285REVISION
10286
10287       Last updated: 26 April 2021
10288       Copyright (c) 1997-2021 University of Cambridge.
10289------------------------------------------------------------------------------
10290
10291
10292PCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3)
10293
10294
10295
10296NAME
10297       PCRE2 - Perl-compatible regular expressions (revised API)
10298
10299PCRE2 SAMPLE PROGRAM
10300
10301       A  simple, complete demonstration program to get you started with using
10302       PCRE2 is supplied in the file pcre2demo.c in the src directory  in  the
10303       PCRE2 distribution. A listing of this program is given in the pcre2demo
10304       documentation. If you do not have a copy of the PCRE2 distribution, you
10305       can save this listing to re-create the contents of pcre2demo.c.
10306
10307       The  demonstration  program compiles the regular expression that is its
10308       first argument, and matches it against the subject string in its second
10309       argument.  No  PCRE2  options are set, and default character tables are
10310       used. If matching succeeds, the program outputs the portion of the sub-
10311       ject  that  matched,  together  with  the contents of any captured sub-
10312       strings.
10313
10314       If the -g option is given on the command line, the program then goes on
10315       to check for further matches of the same regular expression in the same
10316       subject string. The logic is a little bit tricky because of the  possi-
10317       bility  of  matching an empty string. Comments in the code explain what
10318       is going on.
10319
10320       The code in pcre2demo.c is an 8-bit program that uses the  PCRE2  8-bit
10321       library.  It  handles  strings  and characters that are stored in 8-bit
10322       code units.  By default, one character corresponds to  one  code  unit,
10323       but  if  the  pattern starts with "(*UTF)", both it and the subject are
10324       treated as UTF-8 strings, where characters  may  occupy  multiple  code
10325       units.
10326
10327       If  PCRE2  is installed in the standard include and library directories
10328       for your operating system, you should be able to compile the demonstra-
10329       tion program using a command like this:
10330
10331         cc -o pcre2demo pcre2demo.c -lpcre2-8
10332
10333       If PCRE2 is installed elsewhere, you may need to add additional options
10334       to the command line. For example, on a Unix-like system that has  PCRE2
10335       installed  in /usr/local, you can compile the demonstration program us-
10336       ing a command like this:
10337
10338         cc -o pcre2demo -I/usr/local/include pcre2demo.c \
10339            -L/usr/local/lib -lpcre2-8
10340
10341       Once you have built the demonstration program, you can run simple tests
10342       like this:
10343
10344         ./pcre2demo 'cat|dog' 'the cat sat on the mat'
10345         ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
10346
10347       Note  that  there  is  a  much  more comprehensive test program, called
10348       pcre2test, which supports many more facilities for testing regular  ex-
10349       pressions  using  all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
10350       though not all three need be installed). The pcre2demo program is  pro-
10351       vided as a relatively simple coding example.
10352
10353       If you try to run pcre2demo when PCRE2 is not installed in the standard
10354       library directory, you may get an error like  this  on  some  operating
10355       systems (e.g. Solaris):
10356
         ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file or directory
10359
10360       This is caused by the way shared library support works  on  those  sys-
10361       tems. You need to add
10362
10363         -R/usr/local/lib
10364
10365       (for example) to the compile command to get round this problem.
10366
10367
10368AUTHOR
10369
10370       Philip Hazel
10371       University Computing Service
10372       Cambridge, England.
10373
10374
10375REVISION
10376
10377       Last updated: 02 February 2016
10378       Copyright (c) 1997-2016 University of Cambridge.
10379------------------------------------------------------------------------------
10380PCRE2SERIALIZE(3)          Library Functions Manual          PCRE2SERIALIZE(3)
10381
10382
10383
10384NAME
10385       PCRE2 - Perl-compatible regular expressions (revised API)
10386
10387SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
10388
10389       int32_t pcre2_serialize_decode(pcre2_code **codes,
10390         int32_t number_of_codes, const uint8_t *bytes,
10391         pcre2_general_context *gcontext);
10392
10393       int32_t pcre2_serialize_encode(const pcre2_code **codes,
10394         int32_t number_of_codes, uint8_t **serialized_bytes,
10395         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
10396
10397       void pcre2_serialize_free(uint8_t *bytes);
10398
10399       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
10400
10401       If  you  are running an application that uses a large number of regular
10402       expression patterns, it may be useful to store them  in  a  precompiled
10403       form  instead  of  having to compile them every time the application is
10404       run. However, if you are using the just-in-time  optimization  feature,
10405       it is not possible to save and reload the JIT data, because it is posi-
10406       tion-dependent. The host on which the patterns  are  reloaded  must  be
10407       running  the  same version of PCRE2, with the same code unit width, and
10408       must also have the same endianness, pointer width and PCRE2_SIZE  type.
10409       For  example, patterns compiled on a 32-bit system using PCRE2's 16-bit
10410       library cannot be reloaded on a 64-bit system, nor can they be reloaded
10411       using the 8-bit library.
10412
10413       Note  that  "serialization" in PCRE2 does not convert compiled patterns
10414       to an abstract format like Java or .NET serialization.  The  serialized
10415       output  is  really  just  a  bytecode dump, which is why it can only be
10416       reloaded in the same environment as the one that created it. Hence  the
10417       restrictions  mentioned  above.   Applications  that are not statically
10418       linked with a fixed version of PCRE2 must be prepared to recompile pat-
10419       terns from their sources, in order to be immune to PCRE2 upgrades.
10420
10421
10422SECURITY CONCERNS
10423
10424       The facility for saving and restoring compiled patterns is intended for
10425       use within individual applications.  As  such,  the  data  supplied  to
10426       pcre2_serialize_decode()  is expected to be trusted data, not data from
10427       arbitrary external sources.  There  is  only  some  simple  consistency
10428       checking, not complete validation of what is being re-loaded. Corrupted
10429       data may cause undefined results. For example, if the length field of a
10430       pattern in the serialized data is corrupted, the deserializing code may
10431       read beyond the end of the byte stream that is passed to it.
10432
10433
10434SAVING COMPILED PATTERNS
10435
10436       Before compiled patterns can be saved they must be serialized, which in
10437       PCRE2  means converting the pattern to a stream of bytes. A single byte
10438       stream may contain any number of compiled patterns, but they  must  all
10439       use  the same character tables. A single copy of the tables is included
10440       in the byte stream (its size is 1088 bytes). For more details of  char-
10441       acter  tables,  see the section on locale support in the pcre2api docu-
10442       mentation.
10443
10444       The function pcre2_serialize_encode() creates a serialized byte  stream
10445       from  a  list of compiled patterns. Its first two arguments specify the
10446       list, being a pointer to a vector of pointers to compiled patterns, and
10447       the length of the vector. The third and fourth arguments point to vari-
10448       ables which are set to point to the created byte stream and its length,
10449       respectively.  The  final  argument  is a pointer to a general context,
       which can be used to specify custom memory management functions. If
10451       this  argument  is NULL, malloc() is used to obtain memory for the byte
10452       stream. The yield of the function is the number of serialized patterns,
10453       or one of the following negative error codes:
10454
10455         PCRE2_ERROR_BADDATA      the number of patterns is zero or less
10456         PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
10457         PCRE2_ERROR_NOMEMORY     memory allocation failed
10458         PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
10459         PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL
10460
10461       PCRE2_ERROR_BADMAGIC  means  either that a pattern's code has been cor-
10462       rupted, or that a slot in the vector does not point to a compiled  pat-
10463       tern.
10464
10465       Once a set of patterns has been serialized you can save the data in any
10466       appropriate manner. Here is sample code that compiles two patterns  and
10467       writes them to a file. It assumes that the variable fd refers to a file
10468       that is open for output. The error checking that should be present in a
10469       real application has been omitted for simplicity.
10470
         /* Assumes that PCRE2_CODE_UNIT_WIDTH was defined as 8 before
            pcre2.h was included, and that fd is a FILE * that is open
            for binary output. */
         int errorcode;
         int32_t count;
         size_t written;
         uint8_t *bytes;
         PCRE2_SIZE erroroffset;
         PCRE2_SIZE bytescount;
         pcre2_code *list_of_codes[2];
         list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         count = pcre2_serialize_encode((const pcre2_code **)list_of_codes,
           2, &bytes, &bytescount, NULL);
         written = fwrite(bytes, 1, bytescount, fd);
10483
10484       Note  that  the  serialized data is binary data that may contain any of
10485       the 256 possible byte values. On systems that make  a  distinction  be-
10486       tween  binary  and non-binary data, be sure that the file is opened for
10487       binary output.
10488
10489       Serializing a set of patterns leaves the original  data  untouched,  so
10490       they  can  still  be used for matching. Their memory must eventually be
10491       freed in the usual way by calling pcre2_code_free(). When you have fin-
10492       ished with the byte stream, it too must be freed by calling pcre2_seri-
10493       alize_free(). If this function is called with a NULL argument,  it  re-
10494       turns immediately without doing anything.
10495
10496
10497RE-USING PRECOMPILED PATTERNS
10498
10499       In  order to re-use a set of saved patterns you must first make the se-
10500       rialized byte stream available in main memory (for example, by  reading
10501       from a file). The management of this memory block is up to the applica-
10502       tion. You can use the pcre2_serialize_get_number_of_codes() function to
10503       find  out how many compiled patterns are in the serialized data without
10504       actually decoding the patterns:
10505
10506         uint8_t *bytes = <serialized data>;
10507         int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
10508
10509       The pcre2_serialize_decode() function reads a byte stream and recreates
10510       the compiled patterns in new memory blocks, setting pointers to them in
10511       a vector. The first two arguments are a pointer to  a  suitable  vector
10512       and its length, and the third argument points to a byte stream. The fi-
10513       nal argument is a pointer to a general context, which can  be  used  to
       specify custom memory management functions for the decoded patterns.
10515       If this argument is NULL, malloc() and free() are used. After deserial-
10516       ization, the byte stream is no longer needed and can be discarded.
10517
10518         pcre2_code *list_of_codes[2];
10519         uint8_t *bytes = <serialized data>;
10520         int32_t number_of_codes =
10521           pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
10522
10523       If  the  vector  is  not  large enough for all the patterns in the byte
10524       stream, it is filled with those that fit, and  the  remainder  are  ig-
10525       nored.  The yield of the function is the number of decoded patterns, or
10526       one of the following negative error codes:
10527
         PCRE2_ERROR_BADDATA            second argument is zero or less
         PCRE2_ERROR_BADMAGIC           mismatch of id bytes in the data
         PCRE2_ERROR_BADMODE            mismatch of code unit size or PCRE2 version
         PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
         PCRE2_ERROR_NOMEMORY           memory allocation failed
         PCRE2_ERROR_NULL               first or third argument is NULL
10534
10535       PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it  was
10536       compiled on a system with different endianness.
10537
10538       Decoded patterns can be used for matching in the usual way, and must be
10539       freed by calling pcre2_code_free(). However, be aware that there  is  a
10540       potential  race  issue if you are using multiple patterns that were de-
10541       coded from a single byte stream in a multithreaded application. A  sin-
10542       gle  copy  of  the character tables is used by all the decoded patterns
10543       and a reference count is used to arrange for its memory to be automati-
10544       cally  freed when the last pattern is freed, but there is no locking on
10545       this reference count. Therefore, if you want to call  pcre2_code_free()
10546       for  these  patterns  in  different  threads, you must arrange your own
10547       locking, and ensure that pcre2_code_free()  cannot  be  called  by  two
10548       threads at the same time.
10549
10550       If  a pattern was processed by pcre2_jit_compile() before being serial-
10551       ized, the JIT data is discarded and so is no longer available  after  a
10552       save/restore  cycle.  You can, however, process a restored pattern with
10553       pcre2_jit_compile() if you wish.
10554
10555
10556AUTHOR
10557
10558       Philip Hazel
10559       University Computing Service
10560       Cambridge, England.
10561
10562
10563REVISION
10564
10565       Last updated: 27 June 2018
10566       Copyright (c) 1997-2018 University of Cambridge.
10567------------------------------------------------------------------------------
10568
10569
10570PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)
10571
10572
10573
10574NAME
10575       PCRE2 - Perl-compatible regular expressions (revised API)
10576
10577PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY
10578
10579       The  full syntax and semantics of the regular expressions that are sup-
10580       ported by PCRE2 are described in the pcre2pattern  documentation.  This
10581       document contains a quick-reference summary of the syntax.
10582
10583
10584QUOTING
10585
10586         \x         where x is non-alphanumeric is a literal x
10587         \Q...\E    treat enclosed characters as literal
10588
10589
10590ESCAPED CHARACTERS
10591
10592       This  table  applies to ASCII and Unicode environments. An unrecognized
10593       escape sequence causes an error.
10594
10595         \a         alarm, that is, the BEL character (hex 07)
10596         \cx        "control-x", where x is any ASCII printing character
10597         \e         escape (hex 1B)
10598         \f         form feed (hex 0C)
10599         \n         newline (hex 0A)
10600         \r         carriage return (hex 0D)
10601         \t         tab (hex 09)
10602         \0dd       character with octal code 0dd
10603         \ddd       character with octal code ddd, or backreference
10604         \o{ddd..}  character with octal code ddd..
10605         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
10606         \xhh       character with hex code hh
10607         \x{hh..}   character with hex code hh..
10608
10609       If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
10610       following are also recognized:
10611
10612         \U         the character "U"
10613         \uhhhh     character with hex code hhhh
10614         \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX
10615
10616       When  \x  is not followed by {, from zero to two hexadecimal digits are
10617       read, but in ALT_BSUX mode \x must be followed by two hexadecimal  dig-
10618       its  to  be  recognized as a hexadecimal escape; otherwise it matches a
10619       literal "x".  Likewise, if \u (in ALT_BSUX mode)  is  not  followed  by
10620       four  hexadecimal  digits or (in EXTRA_ALT_BSUX mode) a sequence of hex
10621       digits in curly brackets, it matches a literal "u".
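
       The rule for unbraced \x can be sketched as follows. This is an
       illustration in Python, not PCRE2's implementation: the function name
       parse_backslash_x and its return value are hypothetical, and treating
       \x with no digits as code point 0 in the default mode is an assumption,
       because this summary says only that zero to two digits are read.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def parse_backslash_x(rest, alt_bsux=False):
    # rest is the pattern text that immediately follows an unbraced
    # backslash-x. Returns (code_point, digits_consumed).
    digits = ""
    for ch in rest[:2]:
        if ch not in HEX_DIGITS:
            break
        digits += ch
    if alt_bsux:
        # ALT_BSUX mode: exactly two hex digits, otherwise a literal "x".
        if len(digits) == 2:
            return int(digits, 16), 2
        return ord("x"), 0
    # Default mode: zero, one, or two digits (zero digits assumed to give 0).
    return (int(digits, 16) if digits else 0), len(digits)

print(parse_backslash_x("41z"))                 # default mode, two digits
print(parse_backslash_x("4z"))                  # default mode, one digit
print(parse_backslash_x("4z", alt_bsux=True))   # falls back to literal "x"
```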
10622
10623       Note that \0dd is always an octal code. The treatment of backslash fol-
10624       lowed  by  a non-zero digit is complicated; for details see the section
10625       "Non-printing characters" in the pcre2pattern documentation, where  de-
10626       tails  of  escape  processing  in  EBCDIC  environments are also given.
10627       \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
10628       EBCDIC  environments.  Note  that  \N  not followed by an opening curly
10629       bracket has a different meaning (see below).
10630
10631
10632CHARACTER TYPES
10633
10634         .          any character except newline;
10635                      in dotall mode, any character whatsoever
10636         \C         one code unit, even in UTF mode (best avoided)
10637         \d         a decimal digit
10638         \D         a character that is not a decimal digit
10639         \h         a horizontal white space character
10640         \H         a character that is not a horizontal white space character
10641         \N         a character that is not a newline
10642         \p{xx}     a character with the xx property
10643         \P{xx}     a character without the xx property
10644         \R         a newline sequence
10645         \s         a white space character
10646         \S         a character that is not a white space character
10647         \v         a vertical white space character
10648         \V         a character that is not a vertical white space character
10649         \w         a "word" character
10650         \W         a "non-word" character
10651         \X         a Unicode extended grapheme cluster
10652
10653       \C is dangerous because it may leave the current matching point in  the
10654       middle of a UTF-8 or UTF-16 character. The application can lock out the
10655       use of \C by setting the PCRE2_NEVER_BACKSLASH_C  option.  It  is  also
10656       possible to build PCRE2 with the use of \C permanently disabled.
10657
10658       By  default,  \d, \s, and \w match only ASCII characters, even in UTF-8
10659       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
10660       matching  is  happening,  \s and \w may also match characters with code
10661       points in the range 128-255. If the PCRE2_UCP option is set, the behav-
10662       iour of these escape sequences is changed to use Unicode properties and
10663       they match many more characters.
10664
10665       Property descriptions in \p and \P are matched caselessly; hyphens, un-
10666       derscores,  and  white  space are ignored, in accordance with Unicode's
10667       "loose matching" rules.
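
       The loose matching rule amounts to comparing property names after a
       simple normalization. The following Python sketch shows the idea; the
       helper name loose_key is hypothetical and the code is not taken from
       PCRE2.

```python
def loose_key(name):
    # Fold case, then drop hyphens, underscores, and white space.
    return "".join(
        ch for ch in name.casefold()
        if ch not in "-_" and not ch.isspace()
    )

# Under this normalization, different spellings compare equal.
print(loose_key("Bidi_Class") == loose_key("bidi class"))
```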
10668
10669
10670GENERAL CATEGORY PROPERTIES FOR \p and \P
10671
10672         C          Other
10673         Cc         Control
10674         Cf         Format
10675         Cn         Unassigned
10676         Co         Private use
10677         Cs         Surrogate
10678
10679         L          Letter
10680         Ll         Lower case letter
10681         Lm         Modifier letter
10682         Lo         Other letter
10683         Lt         Title case letter
10684         Lu         Upper case letter
10685         Lc         Ll, Lu, or Lt
10686         L&         Ll, Lu, or Lt
10687
10688         M          Mark
10689         Mc         Spacing mark
10690         Me         Enclosing mark
10691         Mn         Non-spacing mark
10692
10693         N          Number
10694         Nd         Decimal number
10695         Nl         Letter number
10696         No         Other number
10697
10698         P          Punctuation
10699         Pc         Connector punctuation
10700         Pd         Dash punctuation
10701         Pe         Close punctuation
10702         Pf         Final punctuation
10703         Pi         Initial punctuation
10704         Po         Other punctuation
10705         Ps         Open punctuation
10706
10707         S          Symbol
10708         Sc         Currency symbol
10709         Sk         Modifier symbol
10710         Sm         Mathematical symbol
10711         So         Other symbol
10712
10713         Z          Separator
10714         Zl         Line separator
10715         Zp         Paragraph separator
10716         Zs         Space separator
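
       To see which characters carry a given general category, the two-letter
       codes above can be compared with the values reported by Python's
       standard unicodedata module. This is only a sketch: the helper name
       has_general_category is hypothetical, and the Unicode version bundled
       with Python may differ from the one PCRE2 was built with.

```python
import unicodedata

def has_general_category(ch, prop):
    cat = unicodedata.category(ch)      # a two-letter code such as "Lu"
    if prop.upper() in ("LC", "L&"):
        return cat in ("Ll", "Lu", "Lt")
    if len(prop) == 1:
        # A one-letter property covers every category starting with it.
        return cat[0] == prop.upper()
    return cat == prop[0].upper() + prop[1].lower()

print(has_general_category("A", "Lu"))
print(has_general_category("5", "N"))
print(has_general_category("_", "Pc"))
```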
10717
10718
10719PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
10720
10721         Xan        Alphanumeric: union of properties L and N
10722         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
10723         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
10724         Xuc        Universally-named character: one that can be
10725                      represented by a Universal Character Name
10726         Xwd        Perl word: property Xan or underscore
10727
10728       Perl and POSIX space are now the same. Perl added VT to its space char-
10729       acter set at release 5.18.
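
       The definitions of Xan, Xwd, and Xsp in the table can be written out
       directly. The sketch below uses Python's unicodedata module; the
       function names are hypothetical, and Python's Unicode data may not be
       the same version that PCRE2 uses. Xps is omitted because it is defined
       identically to Xsp.

```python
import unicodedata

def is_xan(ch):
    # Alphanumeric: general category L (letter) or N (number).
    return unicodedata.category(ch)[0] in "LN"

def is_xwd(ch):
    # Perl word: Xan or underscore.
    return is_xan(ch) or ch == "_"

def is_xsp(ch):
    # Perl space: general category Z, or tab, NL, VT, FF, CR.
    return unicodedata.category(ch)[0] == "Z" or ch in "\t\n\v\f\r"

print(is_xan("_"), is_xwd("_"), is_xsp("\v"))
```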
10730
10731
10732BINARY PROPERTIES FOR \p AND \P
10733
10734       Unicode  defines  a  number  of  binary properties, that is, properties
10735       whose only values are true or false. You can obtain  a  list  of  those
10736       that  are  recognized  by \p and \P, along with their abbreviations, by
10737       running this command:
10738
10739         pcre2test -LP
10740
10741
10742SCRIPT MATCHING WITH \p AND \P
10743
10744       Many script names and their 4-letter abbreviations  are  recognized  in
10745       \p{sc:...}  or  \p{scx:...} items, or on their own with \p (and also \P
10746       of course). You can obtain a list of these scripts by running this com-
10747       mand:
10748
10749         pcre2test -LS
10750
10751
10752THE BIDI_CLASS PROPERTY FOR \p AND \P
10753
10754         \p{Bidi_Class:<class>}   matches a character with the given class
10755         \p{BC:<class>}           matches a character with the given class
10756
10757       The recognized classes are:
10758
10759         AL          Arabic letter
10760         AN          Arabic number
10761         B           paragraph separator
10762         BN          boundary neutral
10763         CS          common separator
10764         EN          European number
10765         ES          European separator
10766         ET          European terminator
10767         FSI         first strong isolate
10768         L           left-to-right
10769         LRE         left-to-right embedding
10770         LRI         left-to-right isolate
10771         LRO         left-to-right override
10772         NSM         non-spacing mark
10773         ON          other neutral
10774         PDF         pop directional format
10775         PDI         pop directional isolate
10776         R           right-to-left
10777         RLE         right-to-left embedding
10778         RLI         right-to-left isolate
10779         RLO         right-to-left override
10780         S           segment separator
10781         WS          white space
10782
10783
10784CHARACTER CLASSES
10785
10786         [...]       positive character class
10787         [^...]      negative character class
10788         [x-y]       range (can be used for hex characters)
10789         [[:xxx:]]   positive POSIX named set
10790         [[:^xxx:]]  negative POSIX named set
10791
10792         alnum       alphanumeric
10793         alpha       alphabetic
10794         ascii       0-127
10795         blank       space or tab
10796         cntrl       control character
10797         digit       decimal digit
10798         graph       printing, excluding space
10799         lower       lower case letter
10800         print       printing, including space
10801         punct       printing, excluding alphanumeric
10802         space       white space
10803         upper       upper case letter
10804         word        same as \w
10805         xdigit      hexadecimal digit
10806
10807       In  PCRE2, POSIX character set names recognize only ASCII characters by
10808       default, but some of them use Unicode properties if PCRE2_UCP  is  set.
10809       You can use \Q...\E inside a character class.
10810
10811
10812QUANTIFIERS
10813
10814         ?           0 or 1, greedy
10815         ?+          0 or 1, possessive
10816         ??          0 or 1, lazy
10817         *           0 or more, greedy
10818         *+          0 or more, possessive
10819         *?          0 or more, lazy
10820         +           1 or more, greedy
10821         ++          1 or more, possessive
10822         +?          1 or more, lazy
10823         {n}         exactly n
10824         {n,m}       at least n, no more than m, greedy
10825         {n,m}+      at least n, no more than m, possessive
10826         {n,m}?      at least n, no more than m, lazy
10827         {n,}        n or more, greedy
10828         {n,}+       n or more, possessive
10829         {n,}?       n or more, lazy
10830
10831
10832ANCHORS AND SIMPLE ASSERTIONS
10833
10834         \b          word boundary
10835         \B          not a word boundary
10836         ^           start of subject
10837                       also after an internal newline in multiline mode
10838                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
10839         \A          start of subject
10840         $           end of subject
10841                       also before newline at end of subject
10842                       also before internal newline in multiline mode
10843         \Z          end of subject
10844                       also before newline at end of subject
10845         \z          end of subject
10846         \G          first matching position in subject
10847
10848
10849REPORTED MATCH POINT SETTING
10850
10851         \K          set reported start of match
10852
10853       From  release 10.38 \K is not permitted by default in lookaround asser-
10854       tions, for compatibility with Perl.  However,  if  the  PCRE2_EXTRA_AL-
10855       LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled.
10856       When this option is set, \K is honoured in positive assertions, but ig-
10857       nored in negative ones.
10858
10859
10860ALTERNATION
10861
10862         expr|expr|expr...
10863
10864
10865CAPTURING
10866
10867         (...)           capture group
10868         (?<name>...)    named capture group (Perl)
10869         (?'name'...)    named capture group (Perl)
10870         (?P<name>...)   named capture group (Python)
10871         (?:...)         non-capture group
10872         (?|...)         non-capture group; reset group numbers for
10873                          capture groups in each alternative
10874
10875       In  non-UTF  modes, names may contain underscores and ASCII letters and
10876       digits; in UTF modes, any Unicode letters and  Unicode  decimal  digits
10877       are permitted. In both cases, a name must not start with a digit.
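
       The non-UTF naming rule can be expressed as a short check. This Python
       sketch is an assumption-laden illustration: the helper name
       is_valid_group_name is hypothetical, and any other restriction the
       library may impose on names is not modelled.

```python
import re

# Non-UTF mode: underscores and ASCII letters and digits, not starting
# with a digit.
_NON_UTF_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_valid_group_name(name):
    return _NON_UTF_NAME.fullmatch(name) is not None

print(is_valid_group_name("year_1"))
print(is_valid_group_name("1st"))
```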
10878
10879
10880ATOMIC GROUPS
10881
10882         (?>...)         atomic non-capture group
10883         (*atomic:...)   atomic non-capture group
10884
10885
10886COMMENT
10887
10888         (?#....)        comment (not nestable)
10889
10890
10891OPTION SETTING
10892       Changes  of these options within a group are automatically cancelled at
10893       the end of the group.
10894
10895         (?i)            caseless
10896         (?J)            allow duplicate named groups
10897         (?m)            multiline
10898         (?n)            no auto capture
10899         (?s)            single line (dotall)
10900         (?U)            default ungreedy (lazy)
10901         (?x)            extended: ignore white space except in classes
10902         (?xx)           as (?x) but also ignore space and tab in classes
10903         (?-...)         unset option(s)
10904         (?^)            unset imnsx options
10905
10906       Unsetting x or xx unsets both. Several options may be set at once,  and
10907       a mixture of setting and unsetting such as (?i-x) is allowed, but there
10908       may be only one hyphen. Setting (but not unsetting) is allowed after (?^,
10909       for example (?^in). An option setting may appear at the start of a non-
10910       capture group, for example (?i:...).
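
       An option that is set at the start of a non-capture group applies only
       inside that group. The sketch below demonstrates this with Python's re
       module, which is a different engine but accepts the same (?i:...)
       form; it does not exercise PCRE2 itself.

```python
import re

# Only the "b" is matched caselessly; "a" and "c" stay case-sensitive.
scoped = re.compile(r"a(?i:b)c")

print(scoped.fullmatch("aBc") is not None)
print(scoped.fullmatch("ABc") is not None)
```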
10911
10912       The following are recognized only at the very start of a pattern or af-
10913       ter one of the newline or \R options with similar syntax. More than one
10914       of them may appear. For the first three, d is a decimal number.
10915
10916         (*LIMIT_DEPTH=d) set the nested backtracking depth limit to d
10917         (*LIMIT_HEAP=d)  set the heap size limit to d * 1024 bytes
10918         (*LIMIT_MATCH=d) set the match limit to d
10919         (*NOTEMPTY)      set PCRE2_NOTEMPTY when matching
10920         (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
10921         (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
10922         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
10923         (*NO_JIT)       disable JIT optimization
10924         (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
10925         (*UTF)          set appropriate UTF mode for the library in use
10926         (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)
10927
10928       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce  the
10929       value   of   the   limits   set  by  the  caller  of  pcre2_match()  or
10930       pcre2_dfa_match(), not increase them. LIMIT_RECURSION  is  an  obsolete
10931       synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
10932       and (*UCP) by setting the PCRE2_NEVER_UTF or  PCRE2_NEVER_UCP  options,
10933       respectively, at compile time.
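
       Because a pattern can only lower a limit, the effective value is the
       smaller of the caller's limit and the one in the pattern. The Python
       sketch below models just that rule. The function name effective_limits
       and the caller values are hypothetical, only leading LIMIT items are
       parsed, and repeated settings keep the smallest value in this sketch.

```python
import re

_LIMIT = re.compile(r"\(\*LIMIT_(DEPTH|HEAP|MATCH)=(\d+)\)")

def effective_limits(pattern, caller_limits):
    limits = dict(caller_limits)
    pos = 0
    while True:
        m = _LIMIT.match(pattern, pos)
        if m is None:
            break
        key = m.group(1)
        # A pattern may reduce a limit but never raise it.
        limits[key] = min(limits[key], int(m.group(2)))
        pos = m.end()
    return limits

caller = {"DEPTH": 1000, "HEAP": 20000, "MATCH": 100000}
print(effective_limits("(*LIMIT_MATCH=50)(*LIMIT_DEPTH=5000)abc", caller))
```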
10934
10935
10936NEWLINE CONVENTION
10937
10938       These are recognized only at the very start of the pattern or after op-
10939       tion settings with a similar syntax.
10940
10941         (*CR)           carriage return only
10942         (*LF)           linefeed only
10943         (*CRLF)         carriage return followed by linefeed
10944         (*ANYCRLF)      all three of the above
10945         (*ANY)          any Unicode newline sequence
10946         (*NUL)          the NUL character (binary zero)
10947
10948
10949WHAT \R MATCHES
10950
10951       These are recognized only at the very start of the pattern or after op-
10952       tion settings with a similar syntax.
10953
10954         (*BSR_ANYCRLF)  CR, LF, or CRLF
10955         (*BSR_UNICODE)  any Unicode newline sequence
10956
10957
10958LOOKAHEAD AND LOOKBEHIND ASSERTIONS
10959
10960         (?=...)                     )
10961         (*pla:...)                  ) positive lookahead
10962         (*positive_lookahead:...)   )
10963
10964         (?!...)                     )
10965         (*nla:...)                  ) negative lookahead
10966         (*negative_lookahead:...)   )
10967
10968         (?<=...)                    )
10969         (*plb:...)                  ) positive lookbehind
10970         (*positive_lookbehind:...)  )
10971
10972         (?<!...)                    )
10973         (*nlb:...)                  ) negative lookbehind
10974         (*negative_lookbehind:...)  )
10975
10976       Each top-level branch of a lookbehind must be of a fixed length.
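
       Lookaround assertions test the text around the current position
       without consuming it. The sketch below shows the (?!...) and (?<=...)
       forms using Python's re module, which shares this part of the syntax
       but is a separate engine; the alphabetic synonyms such as (*pla:...)
       are specific to PCRE2 and are not shown.

```python
import re

# Negative lookahead: "foo" not followed by "bar".
m = re.search(r"foo(?!bar)", "foobar foobaz")

# Positive lookbehind: digits preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost $42, qty 17")

print(m.start(), m.group())
print(prices)
```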
10977
10978
10979NON-ATOMIC LOOKAROUND ASSERTIONS
10980
10981       These assertions are specific to PCRE2 and are not Perl-compatible.
10982
10983         (?*...)                                )
10984         (*napla:...)                           ) synonyms
10985         (*non_atomic_positive_lookahead:...)   )
10986
10987         (?<*...)                               )
10988         (*naplb:...)                           ) synonyms
10989         (*non_atomic_positive_lookbehind:...)  )
10990
10991
10992SCRIPT RUNS
10993
10994         (*script_run:...)           ) script run, can be backtracked into
10995         (*sr:...)                   )
10996
10997         (*atomic_script_run:...)    ) atomic script run
10998         (*asr:...)                  )
10999
11000
11001BACKREFERENCES
11002
11003         \n              reference by number (can be ambiguous)
11004         \gn             reference by number
11005         \g{n}           reference by number
11006         \g+n            relative reference by number (PCRE2 extension)
11007         \g-n            relative reference by number
11008         \g{+n}          relative reference by number (PCRE2 extension)
11009         \g{-n}          relative reference by number
11010         \k<name>        reference by name (Perl)
11011         \k'name'        reference by name (Perl)
11012         \g{name}        reference by name (Perl)
11013         \k{name}        reference by name (.NET)
11014         (?P=name)       reference by name (Python)
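
       A backreference matches the same text that the referenced group
       matched, not the group's pattern. The sketch below illustrates the \n
       and (?P=name) forms with Python's re module, which accepts these two
       forms but is not PCRE2; the \g and \k forms in the table are not
       exercised here.

```python
import re

# Reference by number: find a doubled word.
doubled = re.compile(r"\b(\w+) \1\b")
word = doubled.search("it is is fine").group(1)

# Reference by name (Python syntax): closing quote must equal opening quote.
quoted = re.compile(r"(?P<q>['\"]).*?(?P=q)")
text = quoted.search("say 'hi' now").group()

print(word)
print(text)
```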
11015
11016
11017SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
11018
11019         (?R)            recurse whole pattern
11020         (?n)            call subroutine by absolute number
11021         (?+n)           call subroutine by relative number
11022         (?-n)           call subroutine by relative number
11023         (?&name)        call subroutine by name (Perl)
11024         (?P>name)       call subroutine by name (Python)
11025         \g<name>        call subroutine by name (Oniguruma)
11026         \g'name'        call subroutine by name (Oniguruma)
11027         \g<n>           call subroutine by absolute number (Oniguruma)
11028         \g'n'           call subroutine by absolute number (Oniguruma)
11029         \g<+n>          call subroutine by relative number (PCRE2 extension)
11030         \g'+n'          call subroutine by relative number (PCRE2 extension)
11031         \g<-n>          call subroutine by relative number (PCRE2 extension)
11032         \g'-n'          call subroutine by relative number (PCRE2 extension)
11033
11034
11035CONDITIONAL PATTERNS
11036
11037         (?(condition)yes-pattern)
11038         (?(condition)yes-pattern|no-pattern)
11039
11040         (?(n)               absolute reference condition
11041         (?(+n)              relative reference condition
11042         (?(-n)              relative reference condition
11043         (?(<name>)          named reference condition (Perl)
11044         (?('name')          named reference condition (Perl)
11045         (?(name)            named reference condition (PCRE2, deprecated)
11046         (?(R)               overall recursion condition
11047         (?(Rn)              specific numbered group recursion condition
11048         (?(R&name)          specific named group recursion condition
11049         (?(DEFINE)          define groups for reference
11050         (?(VERSION[>]=n.m)  test PCRE2 version
11051         (?(assert)          assertion condition
11052
11053       Note  the  ambiguity of (?(R) and (?(Rn) which might be named reference
11054       conditions or recursion tests. Such a condition  is  interpreted  as  a
11055       reference condition if the relevant named group exists.
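
       An absolute reference condition chooses a branch according to whether
       the numbered group has matched. The sketch below uses Python's re
       module, which supports the (?(n)yes-pattern) form; it is a stand-in
       for illustration and says nothing about the other PCRE2 conditions
       listed above.

```python
import re

# If group 1 matched an opening "<", a closing ">" is required.
tag = re.compile(r"(<)?\w+(?(1)>)")

for subject in ("<a>", "a", "<a", "a>"):
    print(subject, tag.fullmatch(subject) is not None)
```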
11056
11057
11058BACKTRACKING CONTROL
11059
11060       All  backtracking  control  verbs  may be in the form (*VERB:NAME). For
11061       (*MARK) the name is mandatory, for the others it is  optional.  (*SKIP)
11062       changes  its  behaviour if :NAME is present. The others just set a name
11063       for passing back to the caller, but this is not a name that (*SKIP) can
11064       see. The following act as soon as they are reached:
11065
11066         (*ACCEPT)       force successful match
11067         (*FAIL)         force backtrack; synonym (*F)
11068         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
11069
11070       The  following  act only when a subsequent match failure causes a back-
11071       track to reach them. They all force a match failure, but they differ in
11072       what happens afterwards. Those that advance the start-of-match point do
11073       so only if the pattern is not anchored.
11074
11075         (*COMMIT)       overall failure, no advance of starting point
11076         (*PRUNE)        advance to next starting character
11077         (*SKIP)         advance to current matching position
11078         (*SKIP:NAME)    advance to position corresponding to an earlier
11079                         (*MARK:NAME); if not found, the (*SKIP) is ignored
11080         (*THEN)         local failure, backtrack to next alternation
11081
11082       The effect of one of these verbs in a group called as a  subroutine  is
11083       confined to the subroutine call.
11084
11085
11086CALLOUTS
11087
11088         (?C)            callout (assumed number 0)
11089         (?Cn)           callout with numerical data n
11090         (?C"text")      callout with string data
11091
11092       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
11093       the start and the end), and the starting delimiter { matched  with  the
11094       ending  delimiter  }. To encode the ending delimiter within the string,
11095       double it.
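
       The delimiter rule for callout strings can be sketched as a small
       decoder. This Python code is an illustration of the rule as stated
       above, not PCRE2's parser; the function name parse_callout_string and
       its return value are hypothetical.

```python
# Each starting delimiter maps to its ending delimiter.
CLOSERS = {c: c for c in "`'\"^%#$"}
CLOSERS["{"] = "}"

def parse_callout_string(text):
    # text starts at the opening delimiter. Returns the decoded string
    # and the index just past the closing delimiter.
    if not text or text[0] not in CLOSERS:
        raise ValueError("not a callout string delimiter")
    close = CLOSERS[text[0]]
    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == close:
            if text[i + 1:i + 2] == close:
                out.append(close)       # doubled: one literal delimiter
                i += 2
                continue
            return "".join(out), i + 1
        out.append(ch)
        i += 1
    raise ValueError("unterminated callout string")

print(parse_callout_string('"say ""hi""")'))
print(parse_callout_string("{a}}b}"))
```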
11096
11097
11098SEE ALSO
11099
11100       pcre2pattern(3),   pcre2api(3),   pcre2callout(3),    pcre2matching(3),
11101       pcre2(3).
11102
11103
11104AUTHOR
11105
11106       Philip Hazel
11107       Retired from University Computing Service
11108       Cambridge, England.
11109
11110
11111REVISION
11112
11113       Last updated: 12 January 2022
11114       Copyright (c) 1997-2022 University of Cambridge.
11115------------------------------------------------------------------------------
11116
11117
11118PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)
11119
11120
11121
11122NAME
11123       PCRE2 - Perl-compatible regular expressions (revised API)
11124
11125UNICODE AND UTF SUPPORT
11126
11127       PCRE2 is normally built with Unicode support, though if you do not need
11128       it, you can build it  without,  in  which  case  the  library  will  be
11129       smaller. With Unicode support, PCRE2 has knowledge of Unicode character
11130       properties and can process strings of text in UTF-8, UTF-16, and UTF-32
11131       format (depending on the code unit width), but this is not the default.
11132       Unless specifically requested, PCRE2 treats each code unit in a  string
11133       as one character.
11134
11135       There  are two ways of telling PCRE2 to switch to UTF mode, where char-
11136       acters may consist of more than one code unit and the range  of  values
11137       is constrained. The program can call pcre2_compile() with the PCRE2_UTF
11138       option, or the pattern may start with the  sequence  (*UTF).   However,
11139       the  latter  facility  can be locked out by the PCRE2_NEVER_UTF option.
11140       That is, the programmer can prevent the supplier of  the  pattern  from
11141       switching to UTF mode.
11142
11143       Note   that  the  PCRE2_MATCH_INVALID_UTF  option  (see  below)  forces
11144       PCRE2_UTF to be set.
11145
11146       In UTF mode, both the pattern and any subject strings that are  matched
11147       against  it are treated as UTF strings instead of strings of individual
11148       one-code-unit characters. There are also some other changes to the  way
11149       characters are handled, as documented below.
11150
11151
11152UNICODE PROPERTY SUPPORT
11153
11154       When  PCRE2 is built with Unicode support, the escape sequences \p{..},
11155       \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set-
11156       ting.   The Unicode properties that can be tested are a subset of those
11157       that Perl supports. Currently they are limited to the general  category
11158       properties such as Lu for an upper case letter or Nd for a decimal num-
11159       ber, the Unicode script  names  such  as  Arabic  or  Han,  Bidi_Class,
11160       Bidi_Control,  and the derived properties Any and LC (synonym L&). Full
11161       lists are given in the pcre2pattern and pcre2syntax  documentation.  In
11162       general,  only the short names for properties are supported.  For exam-
11163       ple, \p{L} matches a letter. Its longer  synonym,  \p{Letter},  is  not
11164       supported. Furthermore, in Perl, many properties may optionally be pre-
11165       fixed by "Is", for compatibility with Perl 5.6. PCRE2 does not  support
11166       this.
11167
11168
11169WIDE CHARACTERS AND UTF MODES
11170
11171       Code points less than 256 can be specified in patterns by either braced
11172       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
11173       Larger  values have to use braced sequences. Unbraced octal code points
11174       up to \777 are also recognized; larger ones can be coded using \o{...}.
11175
11176       The escape sequence \N{U+<hex digits>} is recognized as another way  of
11177       specifying  a  Unicode character by code point in a UTF mode. It is not
11178       allowed in non-UTF mode.
11179
11180       In UTF mode, repeat quantifiers apply to complete UTF  characters,  not
11181       to individual code units.
11182
11183       In UTF mode, the dot metacharacter matches one UTF character instead of
11184       a single code unit.
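
       The distinction between a code unit and a character can be observed
       with Python's re module, used here purely as an analogy: its bytes
       patterns treat every byte as one character, much like an 8-bit non-UTF
       pattern, while its str patterns work on whole characters. It is a
       separate engine, and the sketch does not call PCRE2.

```python
import re

subject = "\u00e9"                      # one character, U+00E9
code_units = subject.encode("utf-8")    # two 8-bit code units: C3 A9

# Byte-oriented matching: the dot matches a single code unit.
print(re.fullmatch(rb".", code_units) is not None)
print(re.fullmatch(rb"..", code_units) is not None)

# Character-oriented matching: the dot matches the whole character.
print(re.fullmatch(r".", subject) is not None)
```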
11185
11186       In UTF mode, capture group names are not restricted to ASCII,  and  may
11187       contain any Unicode letters and decimal digits, as well as underscore.
11188
11189       The  escape  sequence \C can be used to match a single code unit in UTF
11190       mode, but its use can lead to some strange effects because it breaks up
11191       multi-unit  characters  (see  the description of \C in the pcre2pattern
11192       documentation). For this reason, there is a build-time option that dis-
11193       ables  support  for  \C completely. There is also a less draconian com-
11194       pile-time option for locking out the use of \C when a pattern  is  com-
11195       piled.
11196
11197       The  use  of  \C  is not supported by the alternative matching function
11198       pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
11199       ter  may  consist  of  more  than one code unit. The use of \C in these
11200       modes provokes a match-time error. Also, the JIT optimization does  not
11201       support \C in these modes. If JIT optimization is requested for a UTF-8
11202       or UTF-16 pattern that contains \C, it will not succeed,  and  so  when
11203       pcre2_match() is called, the matching will be carried out by the inter-
11204       pretive function.
11205
11206       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
11207       characters  of  any  code  value,  but, by default, the characters that
11208       PCRE2 recognizes as digits, spaces, or word characters remain the  same
11209       set  as  in  non-UTF mode, all with code points less than 256. This re-
11210       mains true even when PCRE2 is built to include Unicode support, because
11211       to  do  otherwise  would  slow down matching in many common cases. Note
11212       that this also applies to \b and \B, because they are defined in  terms
11213       of  \w  and \W. If you want to test for a wider sense of, say, "digit",
11214       you can use explicit Unicode property tests such  as  \p{Nd}.  Alterna-
11215       tively, if you set the PCRE2_UCP option, the way that the character es-
11216       capes work is changed so that Unicode properties are used to  determine
11217       which  characters  match.  There  are  more  details  in the section on
11218       generic character types in the pcre2pattern documentation.
11219
11220       Similarly, characters that match the POSIX named character classes  are
11221       all low-valued characters, unless the PCRE2_UCP option is set.
11222
11223       However,  the  special horizontal and vertical white space matching es-
11224       capes (\h, \H, \v, and \V) do match all the appropriate Unicode charac-
11225       ters, whether or not PCRE2_UCP is set.
11226
11227
11228UNICODE CASE-EQUIVALENCE
11229
11230       If  either  PCRE2_UTF  or PCRE2_UCP is set, upper/lower case processing
11231       makes use of Unicode properties except for characters whose code points
11232       are less than 128 and that have at most two case-equivalent values. For
11233       these, a direct table lookup is used for speed. A few  Unicode  charac-
11234       ters  such as Greek sigma have more than two code points that are case-
11235       equivalent, and these are treated specially. Setting PCRE2_UCP  without
11236       PCRE2_UTF  allows  Unicode-style  case processing for non-UTF character
11237       encodings such as UCS-2.
11238
11239
11240SCRIPT RUNS
11241
11242       The pattern constructs (*script_run:...) and  (*atomic_script_run:...),
11243       with  synonyms (*sr:...) and (*asr:...), verify that the string matched
11244       within the parentheses is a script run. In concept, a script run  is  a
11245       sequence  of characters that are all from the same Unicode script. How-
11246       ever, because some scripts are commonly used together, and because some
11247       diacritical  and  other marks are used with multiple scripts, it is not
11248       that simple.
11249
11250       Every Unicode character has a Script property, mostly with a value cor-
11251       responding  to the name of a script, such as Latin, Greek, or Cyrillic.
11252       There are also three special values:
11253
11254       "Unknown" is used for code points that have not been assigned, and also
11255       for  the surrogate code points. In the PCRE2 32-bit library, characters
11256       whose code points are greater  than  the  Unicode  maximum  (U+10FFFF),
11257       which  are  accessible  only  in non-UTF mode, are assigned the Unknown
11258       script.
11259
11260       "Common" is used for characters that are used with many scripts.  These
11261       include  punctuation,  emoji,  mathematical, musical, and currency sym-
11262       bols, and the ASCII digits 0 to 9.
11263
11264       "Inherited" is used for characters such as diacritical marks that  mod-
11265       ify a previous character. These are considered to take on the script of
11266       the character that they modify.
11267
11268       Some Inherited characters are used with many scripts, but many of  them
11269       are  only  normally  used  with a small number of scripts. For example,
11270       U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop-
11271       tic.  In  order  to  make it possible to check this, a Unicode property
11272       called Script Extension exists. Its value is a list of scripts that ap-
11273       ply to the character. For the majority of characters, the list contains
11274       just one script, the same one as  the  Script  property.  However,  for
11275       characters  such  as  U+102E0 more than one Script is listed. There are
11276       also some Common characters that have a single,  non-Common  script  in
11277       their Script Extension list.
11278
11279       The next section describes the basic rules for deciding whether a given
11280       string of characters is a script run. Note,  however,  that  there  are
11281       some  special cases involving the Chinese Han script, and an additional
11282       constraint for decimal digits. These are  covered  in  subsequent  sec-
11283       tions.
11284
11285   Basic script run rules
11286
11287       A string that is less than two characters long is a script run. This is
11288       the only case in which an Unknown character can be  part  of  a  script
11289       run.  Longer strings are checked using only the Script Extensions prop-
11290       erty, not the basic Script property.
11291
11292       If a character's Script Extension property is the single value  "Inher-
11293       ited", it is always accepted as part of a script run. This is also true
11294       for the property "Common", subject to the checking  of  decimal  digits
11295       described below. All the remaining characters in a script run must have
11296       at least one script in common in their Script Extension lists. In  set-
11297       theoretic terminology, the intersection of all the sets of scripts must
11298       not be empty.
11299
11300       A simple example is an Internet name such as "google.com". The  letters
11301       are all in the Latin script, and the dot is Common, so this string is a
11302       script run.  However, the Cyrillic letter "o" looks exactly the same as
11303       the  Latin "o"; a string that looks the same, but with Cyrillic "o"s is
11304       not a script run.
11305
11306       More interesting examples involve characters with more than one  script
11307       in their Script Extension. Consider the following characters:
11308
11309         U+060C  Arabic comma
11310         U+06D4  Arabic full stop
11311
11312       The  first  has the Script Extension list Arabic, Hanifi Rohingya, Syr-
11313       iac, and Thaana; the second has just Arabic and Hanifi  Rohingya.  Both
11314       of  them  could  appear  in  script runs of either Arabic or Hanifi Ro-
11315       hingya. The first could also appear in Syriac or  Thaana  script  runs,
11316       but the second could not.
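
       The intersection rule can be sketched in C as follows. This is only  an
       illustration:  the  script  identifiers  and  the function name are hy-
       pothetical, the caller is assumed to supply each character's Script Ex-
       tension  list as a bit mask, and the Han special cases and the decimal
       digit check described in the following sections are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical script identifiers, one bit each (not PCRE2's own values). */
enum {
  SC_COMMON    = 1u << 0,
  SC_INHERITED = 1u << 1,
  SC_UNKNOWN   = 1u << 2,
  SC_LATIN     = 1u << 3,
  SC_CYRILLIC  = 1u << 4,
  SC_ARABIC    = 1u << 5,
  SC_SYRIAC    = 1u << 6,
  SC_THAANA    = 1u << 7,
  SC_ROHINGYA  = 1u << 8
};

/* scx[i] is the Script Extension list of character i, as a bit mask. */
bool is_basic_script_run(const unsigned *scx, size_t n)
{
  unsigned common = ~0u;              /* running intersection of all lists */

  assert(scx != NULL || n == 0);
  if (n < 2) return true;             /* short strings are always accepted */

  for (size_t i = 0; i < n; i++) {
    unsigned set = scx[i];
    if (set == SC_INHERITED || set == SC_COMMON) continue;  /* always OK */
    if (set & SC_UNKNOWN) return false;    /* Unknown only in short strings */
    common &= set;
    if (common == 0) return false;         /* no script shared by all */
  }
  return true;
}
```

       With these assumptions, a Syriac letter followed by U+060C  (mask  SC_-
       ARABIC | SC_ROHINGYA | SC_SYRIAC | SC_THAANA) is accepted, whereas a Syr-
       iac letter followed by U+06D4 (mask SC_ARABIC | SC_ROHINGYA) is not.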
11317
11318   The Chinese Han script
11319
11320       The  Chinese  Han  script  is  commonly  used in conjunction with other
11321       scripts for writing certain languages. Japanese uses the  Hiragana  and
11322       Katakana  scripts  together  with Han; Korean uses Hangul and Han; Tai-
11323       wanese Mandarin uses Bopomofo and Han.  These  three  combinations  are
11324       treated  as special cases when checking script runs and are, in effect,
11325       "virtual scripts". Thus, a script run may contain a  mixture  of  Hira-
11326       gana,  Katakana,  and Han, or a mixture of Hangul and Han, or a mixture
11327       of Bopomofo and Han, but not, for example,  a  mixture  of  Hangul  and
11328       Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical Standard
11329       39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/) in
11330       allowing such mixtures.
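
       One way to picture the three "virtual scripts" is as  fixed  sets  that
       the  scripts  seen  in  a  string must fit into. The sketch below is a
       simplification under stated assumptions: every character is taken  to
       have  exactly  one  script,  Script  Extension lists are ignored, and
       all identifiers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-bit-per-script identifiers for this sketch only. */
enum {
  HS_HAN      = 1u << 0,
  HS_HIRAGANA = 1u << 1,
  HS_KATAKANA = 1u << 2,
  HS_HANGUL   = 1u << 3,
  HS_BOPOMOFO = 1u << 4,
  HS_LATIN    = 1u << 5
};

/* scripts[i] is the single script of character i. */
bool is_han_compatible_run(const unsigned *scripts, size_t n)
{
  static const unsigned virtual_scripts[] = {
    HS_HAN | HS_HIRAGANA | HS_KATAKANA,   /* Japanese */
    HS_HAN | HS_HANGUL,                   /* Korean */
    HS_HAN | HS_BOPOMOFO                  /* Taiwanese Mandarin */
  };
  unsigned seen = 0;                      /* union of all scripts seen */

  for (size_t i = 0; i < n; i++) {
    /* Precondition: exactly one script bit per character. */
    assert(scripts[i] != 0 && (scripts[i] & (scripts[i] - 1)) == 0);
    seen |= scripts[i];
  }

  if ((seen & (seen - 1)) == 0) return true;   /* zero or one script */

  for (size_t v = 0; v < sizeof virtual_scripts / sizeof *virtual_scripts; v++)
    if ((seen & ~virtual_scripts[v]) == 0) return true;  /* fits one set */

  return false;
}
```

       A string whose characters are Hangul, Bopomofo, and Han is rejected be-
       cause no single virtual script contains all three.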
11331
11332   Decimal digits
11333
11334       Unicode  contains  many sets of 10 decimal digits in different scripts,
11335       and some scripts (including the Common script) contain  more  than  one
11336       set.  Some of these decimal digits are visually indistinguishable
11337       from the common ASCII digits. In addition to the  script  checking  de-
11338       scribed  above,  if a script run contains any decimal digits, they must
11339       all come from the same set of 10 adjacent characters.
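
       The digit constraint can be illustrated with a small check that  every
       decimal  digit in a string belongs to the same block of 10 adjacent code
       points. The table below is a deliberately tiny, illustrative subset  of
       Unicode's  digit  sets, and the function names are assumptions of this
       sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First code point of a few decimal digit sets (illustrative subset only). */
static const uint32_t digit_set_starts[] = {
  0x0030,   /* ASCII digits */
  0x0660,   /* Arabic-Indic digits */
  0x06F0,   /* Extended Arabic-Indic digits */
  0x0966,   /* Devanagari digits */
  0xFF10    /* Fullwidth digits */
};

/* Returns the start of the digit set containing cp, or 0 if cp is not a
   decimal digit known to this sketch. */
static uint32_t digit_set_of(uint32_t cp)
{
  size_t count = sizeof digit_set_starts / sizeof *digit_set_starts;
  for (size_t i = 0; i < count; i++)
    if (cp >= digit_set_starts[i] && cp <= digit_set_starts[i] + 9)
      return digit_set_starts[i];
  return 0;
}

/* True if all decimal digits in cps[0..n-1] come from one set of 10. */
bool digits_from_one_set(const uint32_t *cps, size_t n)
{
  uint32_t first_set = 0;

  assert(cps != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    uint32_t set = digit_set_of(cps[i]);
    if (set == 0) continue;                  /* not a digit: no constraint */
    if (first_set == 0) first_set = set;     /* remember the first set */
    else if (set != first_set) return false; /* digits from two sets */
  }
  return true;
}
```

       For example, the ASCII digit U+0031 followed by the Arabic-Indic  digit
       U+0662 fails this check, while U+0661 followed by U+0662 passes.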
11340
11341
11342VALIDITY OF UTF STRINGS
11343
11344       When the PCRE2_UTF option is set, the strings passed  as  patterns  and
11345       subjects are (by default) checked for validity on entry to the relevant
11346       functions. If an invalid UTF string is passed, a negative error code is
11347       returned.  The  code  unit offset to the offending character can be ex-
11348       tracted from the match data  block  by  calling  pcre2_get_startchar(),
11349       which is used for this purpose after a UTF error.
11350
11351       In  some  situations, you may already know that your strings are valid,
11352       and therefore want to skip these checks in  order  to  improve  perfor-
11353       mance,  for  example in the case of a long subject string that is being
11354       scanned repeatedly.  If you set the PCRE2_NO_UTF_CHECK option  at  com-
11355       pile  time  or at match time, PCRE2 assumes that the pattern or subject
11356       it is given (respectively) contains only valid UTF code unit sequences.
11357
11358       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is  set,  the
11359       result  is undefined and your program may crash or loop indefinitely or
11360       give incorrect results. There is, however, one mode  of  matching  that
11361       can  handle  invalid  UTF  subject  strings. This is enabled by passing
11362       PCRE2_MATCH_INVALID_UTF to pcre2_compile() and is discussed in the next
11363       section.  The  rest  of  this  section  covers  the  case  when
11364       PCRE2_MATCH_INVALID_UTF is not set.
11365
11366       Passing PCRE2_NO_UTF_CHECK to pcre2_compile()  just  disables  the  UTF
11367       check  for  the  pattern; it does not also apply to subject strings. If
11368       you want to disable the check for a subject string you must  pass  this
11369       same option to pcre2_match() or pcre2_dfa_match().
11370
11371       UTF-16 and UTF-32 strings can indicate their endianness by a special
11372       code known as a byte-order mark (BOM). The PCRE2 functions do not handle
11373       this; they expect strings to be in host byte order.
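
       Because the library does not interpret a BOM, an application  that  re-
       ceives  UTF-16  text with one has to deal with it before calling PCRE2.
       The following is a minimal sketch of one way to do  that;  the  function
       name  is hypothetical, and it assumes that a leading 0xFFFE code unit is
       a byte-swapped BOM rather than data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts a UTF-16 buffer to host byte order in place, guided by a leading
   BOM. Returns the number of leading code units (0 or 1) that the caller
   should skip so that the BOM itself is not passed on as subject text. */
size_t utf16_to_host_order(uint16_t *units, size_t n)
{
  assert(units != NULL || n == 0);
  if (n == 0) return 0;
  if (units[0] == 0xFEFF) return 1;   /* BOM reads correctly: host order */
  if (units[0] != 0xFFFE) return 0;   /* no BOM: assume host order */

  /* The BOM reads byte-swapped, so swap every code unit. */
  for (size_t i = 0; i < n; i++)
    units[i] = (uint16_t)((units[i] << 8) | (units[i] >> 8));
  return 1;
}
```

       The caller would then pass the buffer starting at the  returned  index,
       with the length reduced accordingly, to the 16-bit library functions.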
11374
11375       Unless  PCRE2_NO_UTF_CHECK  is  set, a UTF string is checked before any
11376       other  processing  takes  place.  In  the  case  of  pcre2_match()  and
11377       pcre2_dfa_match()  calls  with a non-zero starting offset, the check is
11378       applied only to that part of the subject that could be inspected during
11379       matching,  and  there is a check that the starting offset points to the
11380       first code unit of a character or to the end of the subject.  If  there
11381       are  no  lookbehind  assertions in the pattern, the check starts at the
11382       starting offset. Otherwise, it starts as many characters before the
11383       starting offset as the length of the longest lookbehind, or at the start
11384       of the subject if there are not that many characters before it. Note
11385       that the sequences \b and \B are one-character lookbehinds.
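
       The rule for where the check begins can be pictured, for a UTF-8  sub-
       ject,  as  stepping  back  a  number of characters from the starting off-
       set. This is a sketch of the rule as stated above, not  of  the  li-
       brary's  code;  the  function  name  and  the max_lookbehind parameter
       (the length of the longest lookbehind, in characters) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the code unit offset at which a UTF-8 validity check would begin,
   given the starting offset of the match and the longest lookbehind. */
size_t utf8_check_start(const unsigned char *subject, size_t start_offset,
                        size_t max_lookbehind)
{
  size_t pos = start_offset;

  assert(subject != NULL || start_offset == 0);
  while (max_lookbehind > 0 && pos > 0) {
    pos--;                                   /* step back one code unit */
    /* Keep going while we are on a continuation byte (10xxxxxx). */
    while (pos > 0 && (subject[pos] & 0xc0) == 0x80) pos--;
    max_lookbehind--;
  }
  return pos;                                /* 0 if the subject runs out */
}
```

       With no lookbehind (max_lookbehind of 0) the result is the starting off-
       set itself; a pattern containing \b would use at least 1.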
11386
11387       In  addition  to checking the format of the string, there is a check to
11388       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
11389       the  surrogate  area. The so-called "non-character" code points are not
11390       excluded because Unicode corrigendum #9 makes it clear that they should
11391       not be.
11392
11393       Characters  in  the "Surrogate Area" of Unicode are reserved for use by
11394       UTF-16, where they are used in pairs to encode code points with  values
11395       greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
11396       are available independently in the  UTF-8  and  UTF-32  encodings.  (In
11397       other  words, the whole surrogate thing is a fudge for UTF-16 which un-
11398       fortunately messes up UTF-8 and UTF-32.)
11399
11400       Setting PCRE2_NO_UTF_CHECK at compile time does not disable  the  error
11401       that  is  given if an escape sequence for an invalid Unicode code point
11402       is encountered in the pattern. If you want to  allow  escape  sequences
11403       such  as  \x{d800}  (a  surrogate code point) you can set the PCRE2_EX-
11404       TRA_ALLOW_SURROGATE_ESCAPES extra option.  However,  this  is  possible
11405       only  in  UTF-8  and  UTF-32 modes, because these values are not repre-
11406       sentable in UTF-16.
11407
11408   Errors in UTF-8 strings
11409
11410       The following negative error codes are given for invalid UTF-8 strings:
11411
11412         PCRE2_ERROR_UTF8_ERR1
11413         PCRE2_ERROR_UTF8_ERR2
11414         PCRE2_ERROR_UTF8_ERR3
11415         PCRE2_ERROR_UTF8_ERR4
11416         PCRE2_ERROR_UTF8_ERR5
11417
11418       The string ends with a truncated UTF-8 character;  the  code  specifies
11419       how  many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
11420       characters to be no longer than 4 bytes, the  encoding  scheme  (origi-
11421       nally  defined  by  RFC  2279)  allows  for  up to 6 bytes, and this is
11422       checked first; hence the possibility of 4 or 5 missing bytes.
11423
11424         PCRE2_ERROR_UTF8_ERR6
11425         PCRE2_ERROR_UTF8_ERR7
11426         PCRE2_ERROR_UTF8_ERR8
11427         PCRE2_ERROR_UTF8_ERR9
11428         PCRE2_ERROR_UTF8_ERR10
11429
11430       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
11431       the  character  do  not have the binary value 0b10 (that is, either the
11432       most significant bit is 0, or the next bit is 1).
11433
11434         PCRE2_ERROR_UTF8_ERR11
11435         PCRE2_ERROR_UTF8_ERR12
11436
11437       A character that is valid by the RFC 2279 rules is either 5 or 6  bytes
11438       long; these code points are excluded by RFC 3629.
11439
11440         PCRE2_ERROR_UTF8_ERR13
11441
11442       A 4-byte character has a value greater than 0x10ffff; these code points
11443       are excluded by RFC 3629.
11444
11445         PCRE2_ERROR_UTF8_ERR14
11446
11447       A 3-byte character has a value in the  range  0xd800  to  0xdfff;  this
11448       range  of  code points is reserved by RFC 3629 for use with UTF-16, and
11449       so is excluded from UTF-8.
11450
11451         PCRE2_ERROR_UTF8_ERR15
11452         PCRE2_ERROR_UTF8_ERR16
11453         PCRE2_ERROR_UTF8_ERR17
11454         PCRE2_ERROR_UTF8_ERR18
11455         PCRE2_ERROR_UTF8_ERR19
11456
11457       A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it  codes
11458       for  a  value that can be represented by fewer bytes, which is invalid.
11459       For example, the two bytes 0xc0, 0xae give the value 0x2e,  whose  cor-
11460       rect coding uses just one byte.
11461
11462         PCRE2_ERROR_UTF8_ERR20
11463
11464       The two most significant bits of the first byte of a character have the
11465       binary value 0b10 (that is, the most significant bit is 1 and the  sec-
11466       ond  is  0). Such a byte can only validly occur as the second or subse-
11467       quent byte of a multi-byte character.
11468
11469         PCRE2_ERROR_UTF8_ERR21
11470
11471       The first byte of a character has the value 0xfe or 0xff. These  values
11472       can never occur in a valid UTF-8 string.
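
       The error classes above can be sketched as a small classifier.  It  re-
       turns  n  for  the  condition  that the text attaches to PCRE2_ERROR_-
       UTF8_ERRn (not the negative constant itself), or 0 for a valid  string.
       The  function  name and the order of the checks are assumptions of this
       sketch and may differ from PCRE2's own validator.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 0 if s[0..len-1] is valid UTF-8, otherwise the number n (1 to 21)
   of the error class described for PCRE2_ERROR_UTF8_ERRn. On a non-zero
   return, *offset is the offset of the first byte of the bad character. */
int utf8_error_number(const unsigned char *s, size_t len, size_t *offset)
{
  size_t i = 0;

  assert(offset != NULL && (s != NULL || len == 0));
  while (i < len) {
    unsigned c = s[i];
    size_t extra;                       /* continuation bytes announced by c */
    size_t remaining = len - i - 1;     /* bytes available after c */

    *offset = i;
    if (c < 0x80) { i++; continue; }    /* ASCII */
    if (c < 0xc0) return 20;            /* isolated 10xxxxxx byte */
    if (c >= 0xfe) return 21;           /* 0xfe and 0xff are never valid */

    extra = c < 0xe0 ? 1 : c < 0xf0 ? 2 : c < 0xf8 ? 3 : c < 0xfc ? 4 : 5;
    if (remaining < extra)              /* truncated: 1 to 5 bytes missing */
      return (int)(extra - remaining);

    for (size_t k = 1; k <= extra; k++) /* 2nd to 6th byte must be 10xxxxxx */
      if ((s[i + k] & 0xc0) != 0x80) return 5 + (int)k;

    unsigned d = s[i + 1];              /* second byte of the character */
    switch (extra) {
      case 1:
        if ((c & 0x3e) == 0) return 15;               /* overlong 2-byte */
        break;
      case 2:
        if (c == 0xe0 && (d & 0x20) == 0) return 16;  /* overlong 3-byte */
        if (c == 0xed && d >= 0xa0) return 14;        /* 0xd800 to 0xdfff */
        break;
      case 3:
        if (c == 0xf0 && (d & 0x30) == 0) return 17;  /* overlong 4-byte */
        if (c > 0xf4 || (c == 0xf4 && d > 0x8f)) return 13; /* > 0x10ffff */
        break;
      case 4:
        if (c == 0xf8 && (d & 0x38) == 0) return 18;  /* overlong 5-byte */
        return 11;                                    /* 5-byte character */
      default:
        if (c == 0xfc && (d & 0x3c) == 0) return 19;  /* overlong 6-byte */
        return 12;                                    /* 6-byte character */
    }
    i += extra + 1;
  }
  return 0;
}
```

       For the overlong example in the text, the two bytes 0xc0, 0xae give  15
       under this sketch, corresponding to PCRE2_ERROR_UTF8_ERR15.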
11473
11474   Errors in UTF-16 strings
11475
11476       The  following  negative  error  codes  are  given  for  invalid UTF-16
11477       strings:
11478
11479         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
11480         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
11481         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
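
       The three UTF-16 conditions can be sketched in the same way. The  func-
       tion  below  returns  1, 2, or 3 for the condition listed against PCRE2_-
       ERROR_UTF16_ERR1, ERR2, or ERR3, or 0 for a  valid  string.  Its  name
       and return convention are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if s[0..n-1] is valid UTF-16 in host byte order, otherwise 1, 2,
   or 3 as described above; *offset then holds the offending code unit. */
int utf16_error_number(const uint16_t *s, size_t n, size_t *offset)
{
  assert(offset != NULL && (s != NULL || n == 0));
  for (size_t i = 0; i < n; i++) {
    uint16_t c = s[i];
    if ((c & 0xf800) != 0xd800) continue;        /* not a surrogate */
    *offset = i;
    if ((c & 0x0400) != 0) return 3;             /* isolated low surrogate */
    if (i + 1 >= n) return 1;                    /* high surrogate at end */
    if ((s[i + 1] & 0xfc00) != 0xdc00) return 2; /* not followed by low */
    i++;                                         /* skip the low surrogate */
  }
  return 0;
}
```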
11482
11483
11484   Errors in UTF-32 strings
11485
11486       The following  negative  error  codes  are  given  for  invalid  UTF-32
11487       strings:
11488
11489         PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
11490         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
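
       For UTF-32 each code unit is checked on its own. A minimal sketch, with
       a  hypothetical function name, returning 1 or 2 for the conditions listed
       against PCRE2_ERROR_UTF32_ERR1 and ERR2, or 0 for a valid string:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if s[0..n-1] is valid UTF-32, otherwise 1 (surrogate) or
   2 (greater than 0x10ffff); *offset then holds the offending code unit. */
int utf32_error_number(const uint32_t *s, size_t n, size_t *offset)
{
  assert(offset != NULL && (s != NULL || n == 0));
  for (size_t i = 0; i < n; i++) {
    uint32_t c = s[i];
    if ((c & 0xfffff800u) == 0xd800u) { *offset = i; return 1; }
    if (c > 0x10ffffu) { *offset = i; return 2; }
    /* Non-character code points such as 0xffff are deliberately accepted. */
  }
  return 0;
}
```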
11491
11492
11493MATCHING IN INVALID UTF STRINGS
11494
11495       You can run pattern matches on subject strings that may contain invalid
11496       UTF sequences if you  call  pcre2_compile()  with  the  PCRE2_MATCH_IN-
11497       VALID_UTF  option.  This  is  supported by pcre2_match(), including JIT
11498       matching, but not by pcre2_dfa_match(). When PCRE2_MATCH_INVALID_UTF is
11499       set,  it  forces  PCRE2_UTF  to be set as well. Note, however, that the
11500       pattern itself must be a valid UTF string.
11501
11502       Setting PCRE2_MATCH_INVALID_UTF does not  affect  what  pcre2_compile()
11503       generates,  but  if pcre2_jit_compile() is subsequently called, it does
11504       generate different code. If JIT is not used, the option affects the be-
11505       haviour of the interpretive code in pcre2_match(). When PCRE2_MATCH_IN-
11506       VALID_UTF is set at compile  time,  PCRE2_NO_UTF_CHECK  is  ignored  at
11507       match time.
11508
11509       In  this  mode,  an  invalid  code  unit  sequence in the subject never
11510       matches any pattern item. It does not match  dot,  it  does  not  match
11511       \p{Any},  it does not even match negative items such as [^X]. A lookbe-
11512       hind assertion fails if it encounters an invalid sequence while  moving
11513       the  current  point backwards. In other words, an invalid UTF code unit
11514       sequence acts as a barrier which no match can cross.
11515
11516       You can also think of this as the subject being split up into fragments
11517       of  valid UTF, delimited internally by invalid code unit sequences. The
11518       pattern is matched fragment by fragment. The  result  of  a  successful
11519       match,  however,  is  given  as code unit offsets in the entire subject
11520       string in the usual way. There are a few points to consider:
11521
11522       The internal boundaries are not interpreted as the beginnings  or  ends
11523       of  lines  and  so  do not match circumflex or dollar characters in the
11524       pattern.
11525
11526       If pcre2_match() is called with an offset that  points  to  an  invalid
11527       UTF-sequence,  that  sequence  is  skipped, and the match starts at the
11528       next valid UTF character, or the end of the subject.
11529
11530       At internal fragment boundaries, \b and \B behave in the same way as at
11531       the  beginning  and end of the subject. For example, a sequence such as
11532       \bWORD\b would match an instance of WORD that is surrounded by  invalid
11533       UTF code units.
11534
11535       Using  PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi-
11536       trary data, knowing that any matched  strings  that  are  returned  are
11537       valid UTF. This can be useful when searching for UTF text in executable
11538       or other binary files.
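
       The "fragments of valid UTF" view described in this section can be  il-
       lustrated  for  UTF-8  with  the sketch below. It is a conceptual model
       only, not how pcre2_match() is implemented: the names  are  hypotheti-
       cal,  and  treating  each  invalid  byte  as its own barrier is an as-
       sumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t start, end; } fragment;   /* half-open [start, end) */

/* Length of the valid UTF-8 character at s (RFC 3629 rules), or 0 if the
   bytes at s do not start a valid character. avail must be at least 1. */
static size_t utf8_char_len(const unsigned char *s, size_t avail)
{
  unsigned c = s[0];
  size_t need;                                /* continuation bytes needed */

  if (c < 0x80) return 1;
  if (c >= 0xc2 && c <= 0xdf) need = 1;
  else if (c >= 0xe0 && c <= 0xef) need = 2;
  else if (c >= 0xf0 && c <= 0xf4) need = 3;
  else return 0;                              /* invalid first byte */
  if (avail < need + 1) return 0;             /* truncated */
  for (size_t k = 1; k <= need; k++)
    if ((s[k] & 0xc0) != 0x80) return 0;      /* bad continuation byte */
  if (c == 0xe0 && s[1] < 0xa0) return 0;     /* overlong 3-byte */
  if (c == 0xed && s[1] >= 0xa0) return 0;    /* surrogate */
  if (c == 0xf0 && s[1] < 0x90) return 0;     /* overlong 4-byte */
  if (c == 0xf4 && s[1] > 0x8f) return 0;     /* above 0x10ffff */
  return need + 1;
}

/* Records up to max_out non-empty fragments of valid UTF-8 found in
   s[0..len-1] and returns how many were stored. */
size_t split_valid_fragments(const unsigned char *s, size_t len,
                             fragment *out, size_t max_out)
{
  size_t count = 0, i = 0, frag_start = 0;

  assert((s != NULL || len == 0) && (out != NULL || max_out == 0));
  while (i < len) {
    size_t n = utf8_char_len(s + i, len - i);
    if (n > 0) { i += n; continue; }          /* valid character */
    if (i > frag_start && count < max_out)    /* close the current fragment */
      out[count++] = (fragment){ frag_start, i };
    i++;                                      /* step over the invalid byte */
    frag_start = i;
  }
  if (len > frag_start && count < max_out)    /* trailing fragment */
    out[count++] = (fragment){ frag_start, len };
  return count;
}
```

       The offsets stored in each fragment are code unit offsets in the  whole
       subject,  which  mirrors  the  way  match  results are reported in this
       mode.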
11539
11540
11541AUTHOR
11542
11543       Philip Hazel
11544       Retired from University Computing Service
11545       Cambridge, England.
11546
11547
11548REVISION
11549
11550       Last updated: 22 December 2021
11551       Copyright (c) 1997-2021 University of Cambridge.
11552------------------------------------------------------------------------------
11553
11554
11555