-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------


PCRE2(3)                   Library Functions Manual                   PCRE2(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

INTRODUCTION

       PCRE2 is the name used for a revised API for the PCRE library, which is
       a set of functions, written in C, that implement regular expression
       pattern matching using the same syntax and semantics as Perl, with just
       a few differences. After nearly two decades, the limitations of the
       original API were making development increasingly difficult. The new
       API is more extensible, and it was simplified by abolishing the
       separate "study" optimizing function; in PCRE2, patterns are
       automatically optimized where possible. Since forking from PCRE1, the
       code has been extensively refactored and new features introduced.

       As well as Perl-style regular expression patterns, some features that
       appeared in Python and the original PCRE before they appeared in Perl
       are available using the Python syntax. There is also some support for
       one or two .NET and Oniguruma syntax items, and there are options for
       requesting some minor changes that give better ECMAScript (aka
       JavaScript) compatibility.

       The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or
       32-bit code units, which means that up to three separate libraries may
       be installed.
       The original work to extend PCRE to 16-bit and 32-bit code units was
       done by Zoltan Herczeg and Christian Persch, respectively. In all three
       cases, strings can be interpreted either as one character per code
       unit, or as UTF-encoded Unicode, with support for Unicode general
       category properties. Unicode support is optional at build time (but is
       the default). However, processing strings as UTF code units must be
       enabled explicitly at run time. The version of Unicode in use can be
       discovered by running

         pcre2test -C

       The three libraries contain identical sets of functions, with names
       ending in _8, _16, or _32, respectively (for example,
       pcre2_compile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8,
       16, or 32, a program that uses just one code unit width can be written
       using generic names such as pcre2_compile(), and the documentation is
       written assuming that this is the case.

       In addition to the Perl-compatible matching function, PCRE2 contains an
       alternative function that matches the same compiled patterns in a
       different way. In certain circumstances, the alternative function has
       some advantages. For a discussion of the two matching algorithms, see
       the pcre2matching page.

       Details of exactly which Perl regular expression features are and are
       not supported by PCRE2 are given in separate documents. See the
       pcre2pattern and pcre2compat pages. There is a syntax summary in the
       pcre2syntax page.

       Some features of PCRE2 can be included, excluded, or changed when the
       library is built. The pcre2_config() function makes it possible for a
       client to discover which features are available. The features
       themselves are described in the pcre2build page. Documentation about
       building PCRE2 for various operating systems can be found in the README
       and NON-AUTOTOOLS_BUILD files in the source distribution.

       The libraries contain a number of undocumented internal functions and
       data tables that are used by more than one of the exported external
       functions, but which are not intended for use by external callers.
       Their names all begin with "_pcre2", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external symbols are exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


SECURITY CONSIDERATIONS

       If you are using PCRE2 in a non-UTF application that permits users to
       supply arbitrary patterns for compilation, you should be aware of a
       feature that allows users to turn on UTF support from within a pattern.
       For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
       mode, which interprets patterns and subjects as strings of UTF-8 code
       units instead of individual 8-bit characters. This causes both the
       pattern and any data against which it is matched to be checked for
       UTF-8 validity. If the data string is very long, such a check might use
       enough resources to degrade your application's performance.

       One way of guarding against this possibility is to use the
       pcre2_pattern_info() function to check the compiled pattern's options
       for PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option
       when calling pcre2_compile(). This causes a compile time error if the
       pattern contains a UTF-setting sequence.

       The use of Unicode properties for character types such as \d can also
       be enabled from within the pattern, by specifying "(*UCP)". This
       feature can be disallowed by setting the PCRE2_NEVER_UCP option.

       If your application is one that supports UTF, be aware that validity
       checking can take time.
       If the same data string is to be matched many times, you can use the
       PCRE2_NO_UTF_CHECK option for the second and subsequent matches to
       avoid running redundant checks.

       The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
       to problems, because it may leave the current matching point in the
       middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C
       option can be used by an application to lock out the use of \C, causing
       a compile-time error if it is encountered. It is also possible to build
       PCRE2 with the use of \C permanently disabled.

       Another way that performance can be hit is by running a pattern that
       has a very large search tree against a string that will never match.
       Nested unlimited repeats in a pattern are a common example. PCRE2
       provides some protection against this: see the pcre2_set_match_limit()
       function in the pcre2api page. There is a similar function called
       pcre2_set_depth_limit() that can be used to restrict the amount of
       memory that is used.


USER DOCUMENTATION

       The user documentation for PCRE2 comprises a number of different
       sections. In the "man" format, each of these is a separate "man page".
       In the HTML format, each is a separate page, linked from the index
       page. In the plain text format, the descriptions of the pcre2grep and
       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
       respectively. The remaining sections, except for the pcre2demo section
       (which is a program listing), and the short pages for individual
       functions, are concatenated in pcre2.txt, for ease of searching.
       The sections are as follows:

         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
         pcre2callout       details of the callout feature
         pcre2compat        discussion of Perl compatibility
         pcre2convert       details of pattern conversion functions
         pcre2demo          a demonstration C program that uses PCRE2
         pcre2grep          description of the pcre2grep command (8-bit only)
         pcre2jit           discussion of just-in-time optimization support
         pcre2limits        details of size and other limits
         pcre2matching      discussion of the two matching algorithms
         pcre2partial       details of the partial matching facility
         pcre2pattern       syntax and semantics of supported regular
                              expression patterns
         pcre2perform       discussion of performance issues
         pcre2posix         the POSIX-compatible C API for the 8-bit library
         pcre2sample        discussion of the pcre2demo program
         pcre2serialize     details of pattern serialization
         pcre2syntax        quick syntax reference
         pcre2test          description of the pcre2test command
         pcre2unicode       discussion of Unicode and UTF support

       In the "man" and HTML formats, there is also a short page for each C
       library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

       Putting an actual email address here is a spam magnet. If you want to
       email me, use my two initials, followed by the two digits 10, at the
       domain cam.ac.uk.


REVISION

       Last updated: 11 July 2018
       Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------


PCRE2API(3)                Library Functions Manual                PCRE2API(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

       #include <pcre2.h>

       PCRE2 is a new API for PCRE, starting at release 10.0.
       This document contains a description of all its native functions. See
       the pcre2 document for an overview of all the PCRE2 documentation.


PCRE2 NATIVE API BASIC FUNCTIONS

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       void pcre2_match_data_free(pcre2_match_data *match_data);


PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);


PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       void pcre2_general_context_free(pcre2_general_context *gcontext);


PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const unsigned char *tables);

       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
         uint32_t extra_options);

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
         uint32_t value);


PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       void pcre2_substring_list_free(PCRE2_SPTR *list);

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);


PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);


PCRE2 NATIVE API JIT FUNCTIONS

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);


PCRE2 NATIVE API SERIALIZATION FUNCTIONS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(const pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);


PCRE2 NATIVE API AUXILIARY FUNCTIONS

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       const unsigned char *pcre2_maketables(pcre2_general_context *gcontext);

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       int pcre2_config(uint32_t what, void *where);


PCRE2 NATIVE API OBSOLETE FUNCTIONS

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       These functions became obsolete at release 10.30 and are retained only
       for backward compatibility. They should not be used in new code. The
       first is replaced by pcre2_set_depth_limit(); the second is no longer
       needed and has no effect (it always returns zero).


PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS

       pcre2_convert_context *pcre2_convert_context_create(
         pcre2_general_context *gcontext);

       pcre2_convert_context *pcre2_convert_context_copy(
         pcre2_convert_context *cvcontext);

       void pcre2_convert_context_free(pcre2_convert_context *cvcontext);

       int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
         uint32_t escape_char);

       int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
         uint32_t separator_char);

       int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, PCRE2_UCHAR **buffer,
         PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);

       void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);

       These functions provide a way of converting non-PCRE2 patterns into
       patterns that can be processed by pcre2_compile(). This facility is
       experimental and may be changed in future releases. At present, "globs"
       and POSIX basic and extended patterns can be converted. Details are
       given in the pcre2convert documentation.


PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

       There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
       code units, respectively. However, there is just one header file,
       pcre2.h. This contains the function prototypes and other definitions
       for all three libraries. One, two, or all three can be installed
       simultaneously. On Unix-like systems the libraries are called
       libpcre2-8, libpcre2-16, and libpcre2-32, and they can also co-exist
       with the original PCRE libraries.

       Character strings are passed to and from a PCRE2 library as a sequence
       of unsigned integers in code units of the appropriate width.
       Every PCRE2 function comes in three different forms, one for each
       library, for example:

         pcre2_compile_8()
         pcre2_compile_16()
         pcre2_compile_32()

       There are also three different sets of data types:

         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32

       The UCHAR types define unsigned code units of the appropriate widths.
       For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
       types are constant pointers to the equivalent UCHAR types, that is,
       they are pointers to vectors of unsigned code units.

       Many applications use only one code unit width. For their convenience,
       macros are defined whose names are the generic forms such as
       pcre2_compile() and PCRE2_SPTR. These macros use the value of the macro
       PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific
       function and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by
       default. An application must define it to be 8, 16, or 32 before
       including pcre2.h in order to make use of the generic names.

       Applications that use more than one code unit width can be linked with
       more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
       be 0 before including pcre2.h, and then use the real function names.
       Any code that is to be included in an environment where the value of
       PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function
       names. (Unfortunately, it is not possible in C code to save and restore
       the value of a macro.)

       If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a
       compiler error occurs.

       When using multiple libraries in an application, you must take care
       when processing any particular pattern to use only functions from a
       single library.
       For example, if you want to run a match using a pattern that was
       compiled with pcre2_compile_16(), you must do so with pcre2_match_16(),
       not pcre2_match_8() or pcre2_match_32().

       In the function summaries above, and in the rest of this document and
       other PCRE2 documents, functions and data types are described using
       their generic names, without the _8, _16, or _32 suffix.


PCRE2 API OVERVIEW

       PCRE2 has its own native API, which is described in this document.
       There are also some wrapper functions for the 8-bit library that
       correspond to the POSIX regular expression API, but they do not give
       access to all the functionality of PCRE2. They are described in the
       pcre2posix documentation. Both these APIs define a set of C function
       calls.

       The native API C data types, function prototypes, option values, and
       error codes are defined in the header file pcre2.h, which also contains
       definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
       numbers for the library. Applications can use these to include support
       for different releases of PCRE2.

       In a Windows environment, if you want to statically link an application
       program against a non-dll PCRE2 library, you must define PCRE2_STATIC
       before including pcre2.h.

       The functions pcre2_compile() and pcre2_match() are used for compiling
       and matching regular expressions in a Perl-compatible manner. A sample
       program that demonstrates the simplest way of using them is provided in
       the file called pcre2demo.c in the PCRE2 source distribution. A listing
       of this program is given in the pcre2demo documentation, and the
       pcre2sample documentation describes how to compile and run it.

       The compiling and matching functions recognize various options that are
       passed as bits in an options argument.
       There are also some more complicated parameters such as custom memory
       management functions and resource limits that are passed in "contexts"
       (which are just memory blocks, described below). Simple applications do
       not need to make use of contexts.

       Just-in-time (JIT) compiler support is an optional feature of PCRE2
       that can be built in appropriate hardware environments. It greatly
       speeds up the matching performance of many patterns. Programs can
       request that it be used if available by calling pcre2_jit_compile()
       after a pattern has been successfully compiled by pcre2_compile(). This
       does nothing if JIT support is not available.

       More complicated programs might need to make use of the specialist
       functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and
       pcre2_jit_stack_assign() in order to control the JIT code's memory
       usage.

       JIT matching is automatically used by pcre2_match() if it is available,
       unless the PCRE2_NO_JIT option is set. There is also a direct interface
       for JIT matching, which gives improved performance at the expense of
       less sanity checking. The JIT-specific functions are discussed in the
       pcre2jit documentation.

       A second matching function, pcre2_dfa_match(), which is not
       Perl-compatible, is also provided. This uses a different algorithm for
       the matching. The alternative algorithm finds all possible matches (at
       a given point in the subject), and scans the subject just once (unless
       there are lookaround assertions). However, this algorithm does not
       return captured substrings. A description of the two matching
       algorithms and their advantages and disadvantages is given in the
       pcre2matching documentation. There is no JIT support for
       pcre2_dfa_match().

       In addition to the main compiling and matching functions, there are
       convenience functions for extracting captured substrings from a subject
       string that has been matched by pcre2_match(). They are:

         pcre2_substring_copy_byname()
         pcre2_substring_copy_bynumber()
         pcre2_substring_get_byname()
         pcre2_substring_get_bynumber()
         pcre2_substring_list_get()
         pcre2_substring_length_byname()
         pcre2_substring_length_bynumber()
         pcre2_substring_nametable_scan()
         pcre2_substring_number_from_name()

       pcre2_substring_free() and pcre2_substring_list_free() are also
       provided, to free memory used for extracted strings. If either of these
       functions is called with a NULL argument, the function returns
       immediately without doing anything.

       The function pcre2_substitute() can be called to match a pattern and
       return a copy of the subject string with substitutions for parts that
       were matched.

       Functions whose names begin with pcre2_serialize_ are used for saving
       compiled patterns on disc or elsewhere, and reloading them later.

       Finally, there are functions for finding out information about a
       compiled pattern (pcre2_pattern_info()) and about the configuration
       with which PCRE2 was built (pcre2_config()).

       Functions with names ending with _free() are used for freeing memory
       blocks of various sorts. In all cases, if one of these functions is
       called with a NULL argument, it does nothing.


STRING LENGTHS AND OFFSETS

       The PCRE2 API uses string lengths and offsets into strings of code
       units in several places. These values are always of type PCRE2_SIZE,
       which is an unsigned integer type, currently always defined as size_t.
       The largest value that can be stored in such a type (that is
       ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
       strings and unset offsets.
       Therefore, the longest string that can be handled is one less than this
       maximum.


NEWLINES

       PCRE2 supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a single LF
       (linefeed) character, the two-character sequence CRLF, any of the three
       preceding, or any Unicode newline sequence. The Unicode newline
       sequences are the three just mentioned, plus the single characters VT
       (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
       U+0085), LS (line separator, U+2028), and PS (paragraph separator,
       U+2029).

       Each of the first three conventions is used by at least one operating
       system as its standard newline sequence. When PCRE2 is built, a default
       can be specified. If it is not, the default is set to LF, which is the
       Unix standard. However, the newline convention can be changed by an
       application when calling pcre2_compile(), or it can be specified by
       special text at the start of the pattern itself; this overrides any
       other settings. See the pcre2pattern page for details of the special
       character sequences.

       In the PCRE2 documentation the word "newline" is used to mean "the
       character or pair of characters that indicate a line break". The choice
       of newline convention affects the handling of the dot, circumflex, and
       dollar metacharacters, the handling of #-comments in /x mode, and, when
       CRLF is a recognized line ending sequence, the match position
       advancement for a non-anchored pattern. There is more detail about this
       in the section on pcre2_match() options below.

       The choice of newline convention does not affect the interpretation of
       the \n or \r escape sequences, nor does it affect what \R matches; this
       has its own separate convention.
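
       To make the five conventions concrete, the following C sketch reports
       how many code points form a line break at the start of a sequence. It
       is an illustration only and is not how PCRE2 itself is implemented. The
       enum nl_convention and the function newline_length() are hypothetical
       names, and the input is assumed to be already decoded into 32-bit code
       points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the five conventions described above. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY } nl_convention;

/* Return the number of code points (0, 1, or 2) that form a newline at the
   start of p[0..len-1] under the given convention; 0 means "no newline". */
static size_t newline_length(const uint32_t *p, size_t len, nl_convention c)
{
  int crlf;
  assert(p != NULL || len == 0);
  if (len == 0) return 0;
  crlf = (len >= 2 && p[0] == 0x0D && p[1] == 0x0A);

  switch (c)
    {
    case NL_CR:   return (p[0] == 0x0D) ? 1 : 0;
    case NL_LF:   return (p[0] == 0x0A) ? 1 : 0;
    case NL_CRLF: return crlf ? 2 : 0;
    case NL_ANYCRLF:
      if (crlf) return 2;
      return (p[0] == 0x0D || p[0] == 0x0A) ? 1 : 0;
    case NL_ANY:
      if (crlf) return 2;
      switch (p[0])
        {
        /* LF, VT, FF, CR, NEL, LS, PS */
        case 0x0A: case 0x0B: case 0x0C: case 0x0D:
        case 0x85: case 0x2028: case 0x2029:
          return 1;
        default:
          return 0;
        }
    }
  return 0;
}
```

       Under the CRLF convention a lone LF yields 0 in this sketch, so it
       would be treated as an ordinary character rather than a line break.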


MULTITHREADING

       In a multithreaded application it is important to keep thread-specific
       data separate from data that can be shared between threads. The PCRE2
       library code itself is thread-safe: it contains no static or global
       variables. The API is designed to be fairly simple for non-threaded
       applications while at the same time ensuring that multithreaded
       applications can use it.

       There are several different blocks of data that are used to pass
       information between the application and the PCRE2 libraries.

   The compiled pattern

       A pointer to the compiled form of a pattern is returned to the user
       when pcre2_compile() is successful. The data in the compiled pattern is
       fixed, and does not change when the pattern is matched. Therefore, it
       is thread-safe, that is, the same compiled pattern can be used by more
       than one thread simultaneously. For example, an application can compile
       all its patterns at the start, before forking off multiple threads that
       use them. However, if the just-in-time (JIT) optimization feature is
       being used, it needs separate memory stack areas for each thread. See
       the pcre2jit documentation for more details.

       In a more complicated situation, where patterns are compiled only when
       they are first needed, but are still shared between threads, pointers
       to compiled patterns must be protected from simultaneous writing by
       multiple threads, at least until a pattern has been compiled. The logic
       can be something like this:

         Get a read-only (shared) lock (mutex) for pointer
         if (pointer == NULL)
           {
           Get a write (unique) lock for pointer
           pointer = pcre2_compile(...
           }
         Release the lock
         Use pointer in pcre2_match()

       Of course, testing for compilation errors should also be included in
       the code.
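
       The locking logic above might be sketched in C as follows. This is a
       simplified variant, assuming POSIX threads, that uses one plain mutex
       instead of separate shared and unique locks. The names lazy_pattern,
       LAZY_PATTERN_INIT, lazy_get(), and stub_compile() are hypothetical and
       are not part of the PCRE2 API; the compile_fn callback stands in for an
       application wrapper around pcre2_compile() so that the sketch builds
       without pcre2.h.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical callback: compiles the pattern text and returns the compiled
   code, or NULL on error. A real one would wrap pcre2_compile(). */
typedef void *(*compile_fn)(const char *pattern_text);

typedef struct {
  pthread_mutex_t lock;   /* guards the "compiled" pointer */
  const char *text;       /* pattern source */
  void *compiled;         /* NULL until the first successful compile */
} lazy_pattern;

#define LAZY_PATTERN_INIT(t) { PTHREAD_MUTEX_INITIALIZER, (t), NULL }

/* Return the shared compiled pattern, compiling it on first use. Returns
   NULL if compilation fails; a later call will then try again. */
static void *lazy_get(lazy_pattern *lp, compile_fn compile)
{
  void *result;
  assert(lp != NULL && compile != NULL);
  pthread_mutex_lock(&lp->lock);
  if (lp->compiled == NULL)
    lp->compiled = compile(lp->text);   /* one thread at a time gets here */
  result = lp->compiled;
  pthread_mutex_unlock(&lp->lock);
  return result;   /* the compiled data is fixed, so no lock is needed to use it */
}

/* Stand-in compiler for demonstration only: counts calls and returns a
   dummy non-NULL pointer instead of real compiled code. */
static int stub_calls = 0;
static void *stub_compile(const char *pattern_text)
{
  static int dummy_code;
  (void)pattern_text;
  stub_calls++;
  return &dummy_code;
}
```

       Calling lazy_get() twice on the same lazy_pattern invokes the compile
       callback only once; every later caller receives the stored pointer. A
       read/write lock, as in the outline above, would let threads that only
       read the pointer avoid serializing on the mutex.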

       If JIT is being used, but the JIT compilation is not being done
       immediately (perhaps waiting to see if the pattern is used often
       enough), similar logic is required. JIT compilation updates a pointer
       within the compiled code block, so a thread must gain unique write
       access to the pointer before calling pcre2_jit_compile().
       Alternatively, pcre2_code_copy() or pcre2_code_copy_with_tables() can
       be used to obtain a private copy of the compiled code before calling
       the JIT compiler.

   Context blocks

       The next main section below introduces the idea of "contexts" in which
       PCRE2 functions are called. A context is nothing more than a collection
       of parameters that control the way PCRE2 operates. Grouping a number of
       parameters together in a context is a convenient way of passing them to
       a PCRE2 function without using lots of arguments. The parameters that
       are stored in contexts are in some sense "advanced features" of the
       API. Many straightforward applications will not need to use contexts.

       In a multithreaded application, if the parameters in a context are
       values that are never changed, the same context can be used by all the
       threads. However, if any thread needs to change any value in a context,
       it must make its own thread-specific copy.

   Match blocks

       The matching functions need a block of memory for storing the results
       of a match. This includes details of what was matched, as well as
       additional information such as the name of a (*MARK) setting. Each
       thread must provide its own copy of this memory.


PCRE2 CONTEXTS

       Some PCRE2 functions have a lot of parameters, many of which are used
       only by specialist applications, for example, those that use custom
       memory management or non-standard character tables.
       To keep function argument lists at a reasonable size, and at the same
       time to keep the API extensible, "uncommon" parameters are passed to
       certain functions in a context instead of directly. A context is just a
       block of memory that holds the parameter values. Applications that do
       not need to adjust any of the context parameters can pass NULL when a
       context pointer is required.

       There are three different types of context: a general context that is
       relevant for several PCRE2 operations, a compile-time context, and a
       match-time context.

   The general context

       At present, this context just contains pointers to (and data for)
       external memory management functions that are called from several
       places in the PCRE2 library. The context is named "general" rather than
       specifically "memory" because in future other fields may be added. If
       you do not want to supply your own custom memory management functions,
       you do not need to bother with a general context. A general context is
       created by:

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       The two function pointers specify custom memory management functions,
       whose prototypes are:

         void *private_malloc(PCRE2_SIZE, void *);
         void  private_free(void *, void *);

       Whenever code in PCRE2 calls these functions, the final argument is the
       value of memory_data. Either of the first two arguments of the creation
       function may be NULL, in which case the system memory management
       functions malloc() and free() are used. (This is not currently useful,
       as there are no other fields in a general context, but in future there
       might be.) The private_malloc() function is used (if supplied) to
       obtain memory for storing the context, and all three values are saved
       as part of the context.

       Whenever PCRE2 creates a data block of any kind, the block contains a
       pointer to the free() function that matches the malloc() function that
       was used. When the time comes to free the block, this function is
       called.

       A general context can be copied by calling:

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       The memory used for a general context should be freed by calling:

       void pcre2_general_context_free(pcre2_general_context *gcontext);

       If this function is passed a NULL argument, it returns immediately
       without doing anything.

   The compile context

       A compile context is required if you want to provide an external
       function for stack checking during compilation or to change the default
       values of any of the following compile-time parameters:

         What \R matches (Unicode newlines or CR, LF, CRLF only)
         PCRE2's character tables
         The newline character sequence
         The compile time nested parentheses limit
         The maximum length of the pattern string
         The extra options bits (none set by default)

       A compile context is also required if you are using custom memory
       management. If none of these apply, just pass NULL as the context
       argument of pcre2_compile().

       A compile context is created, copied, and freed by the following
       functions:

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       A compile context is created with default values for its parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only
       CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
       Unicode line ending sequence. The value is used by the JIT compiler and
       by the two interpreted matching functions, pcre2_match() and
       pcre2_dfa_match().

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const unsigned char *tables);

       The value must be the result of a call to pcre2_maketables(), whose
       only argument is a general context. This function builds a set of
       character tables in the current locale.

       int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
         uint32_t extra_options);

       As PCRE2 has developed, almost all the 32 option bits that are
       available in the options argument of pcre2_compile() have been used up.
       To avoid running out, the compile context contains a set of extra
       option bits which are used for some newer, assumed rarer, options. This
       function sets those bits. It always sets all the bits (either on or
       off); it does not combine the new value with any existing setting. The
       available options are defined in the section entitled "Extra compile
       options" below.

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       This sets a maximum length, in code units, for any pattern string that
       is compiled with this context. If the pattern is longer, an error is
       generated. This facility is provided so that applications that accept
       patterns from external sources can limit their size. The default is the
       largest number that a PCRE2_SIZE variable can hold, which is
       effectively unlimited.

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       This specifies which characters or character sequences are to be
       recognized as newlines.
       The value must be one of PCRE2_NEWLINE_CR (carriage return only),
       PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
       two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
       of the above), PCRE2_NEWLINE_ANY (any Unicode newline sequence), or
       PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).

       A pattern can override the value set in the compile context by starting
       with a sequence such as (*CRLF). See the pcre2pattern page for details.

       When a pattern is compiled with the PCRE2_EXTENDED or
       PCRE2_EXTENDED_MORE option, the newline convention affects the
       recognition of the end of internal comments starting with #. The value
       is saved with the compiled pattern for subsequent use by the JIT
       compiler and by the two interpreted matching functions, pcre2_match()
       and pcre2_dfa_match().

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       This parameter adjusts the limit, set when PCRE2 is built (default
       250), on the depth of parenthesis nesting in a pattern. This limit
       stops rogue patterns using up too much system stack when being
       compiled. The limit applies to parentheses of all kinds, not just
       capturing parentheses.

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);

       There is at least one application that runs PCRE2 in threads with very
       limited system stack, where running out of stack is to be avoided at
       all costs. The parenthesis limit above cannot take account of how much
       stack is actually available during compilation. For finer control, you
       can supply a function that is called whenever pcre2_compile() starts to
       compile a parenthesized part of a pattern. This function can check the
       actual stack size (or anything else that it wants to, of course).

       The first argument to the guard function gives the current depth of
       nesting, and the second is user data that is set up by the last
       argument of pcre2_set_compile_recursion_guard(). The guard function
       should return zero if all is well, or non-zero to force an error.

   The match context

       A match context is required if you want to:

         Set up a callout function
         Set an offset limit for matching an unanchored pattern
         Change the limit on the amount of heap used when matching
         Change the backtracking match limit
         Change the backtracking depth limit
         Set custom memory management specifically for the match

       If none of these apply, just pass NULL as the context argument of
       pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().

       A match context is created, copied, and freed by the following
       functions:

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       A match context is created with default values for its parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       This sets up a "callout" function for PCRE2 to call at specified points
       during a matching operation. Details are given in the pcre2callout
       documentation.

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       The offset_limit parameter limits how far an unanchored search can
       advance in the subject string. The default value is PCRE2_UNSET.
       The pcre2_match() and pcre2_dfa_match() functions return
       PCRE2_ERROR_NOMATCH if a match with a starting point before or at the
       given offset is not found. The pcre2_substitute() function makes no
       more substitutions.

       For example, if the pattern /abc/ is matched against "123abc" with an
       offset limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match
       can never be found if the startoffset argument of pcre2_match(),
       pcre2_dfa_match(), or pcre2_substitute() is greater than the offset
       limit set in the match context.

       When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT
       option when calling pcre2_compile() so that when JIT is in use,
       different code can be compiled. If a match is started with a
       non-default offset limit when PCRE2_USE_OFFSET_LIMIT is not set, an
       error is generated.

       The offset limit facility can be used to track progress when searching
       large subject strings or to limit the extent of global substitutions.
       See also the PCRE2_FIRSTLINE option, which requires a match to start
       before or at the first newline that follows the start of matching in
       the subject. If this is set with an offset limit, a match must occur in
       the first line and also within the offset limit. In other words,
       whichever limit comes first is used.

       int pcre2_set_heap_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The heap_limit parameter specifies, in units of kibibytes (1024 bytes),
       the maximum amount of heap memory that pcre2_match() may use to hold
       backtracking information when running an interpretive match. This limit
       also applies to pcre2_dfa_match(), which may use the heap when
       processing patterns with a lot of nested pattern recursion or
       lookarounds or atomic groups.
       This limit does not apply to matching with the JIT optimization, which
       has its own memory control arrangements (see the pcre2jit documentation
       for more details). If the limit is reached, the negative error code
       PCRE2_ERROR_HEAPLIMIT is returned. The default limit can be set when
       PCRE2 is built; if it is not, the default is set very large and is
       essentially "unlimited".

       A value for the heap limit may also be supplied by an item at the start
       of a pattern of the form

         (*LIMIT_HEAP=ddd)

       where ddd is a decimal number. However, such a setting is ignored
       unless ddd is less than the limit set by the caller of pcre2_match()
       or, if no such limit is set, less than the default.

       The pcre2_match() function starts out using a 20KiB vector on the
       system stack for recording backtracking points. The more nested
       backtracking points there are (that is, the deeper the search tree),
       the more memory is needed. Heap memory is used only if the initial
       vector is too small. If the heap limit is set to a value less than 21
       (in particular, zero) no heap memory will be used. In this case, only
       patterns that do not have a lot of nested backtracking can be
       successfully processed.

       Similarly, for pcre2_dfa_match(), a vector on the system stack is used
       when processing pattern recursions, lookarounds, or atomic groups, and
       only if this is not big enough is heap memory used. In this case, too,
       setting a value of zero disables the use of the heap.

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The match_limit parameter provides a means of preventing PCRE2 from
       using up too many computing resources when processing patterns that are
       not going to match, but which have a very large number of possibilities
       in their search trees. The classic example is a pattern that uses
       nested unlimited repeats.

       There is an internal counter in pcre2_match() that is incremented each
       time round its main matching loop. If this value reaches the match
       limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
       This has the effect of limiting the amount of backtracking that can
       take place. For patterns that are not anchored, the count restarts from
       zero for each position in the subject string. This limit also applies
       to pcre2_dfa_match(), though the counting is done in a different way.

       When pcre2_match() is called with a pattern that was successfully
       processed by pcre2_jit_compile(), the way in which matching is executed
       is entirely different. However, there is still the possibility of
       runaway matching that goes on for a very long time, and so the
       match_limit value is also used in this case (but in a different way) to
       limit how long the matching can continue.

       The default value for the limit can be set when PCRE2 is built; if it
       is not, the built-in default is 10 million, which handles all but the
       most extreme cases. A value for the match limit may also be supplied by
       an item at the start of a pattern of the form

         (*LIMIT_MATCH=ddd)

       where ddd is a decimal number. However, such a setting is ignored
       unless ddd is less than the limit set by the caller of pcre2_match() or
       pcre2_dfa_match() or, if no such limit is set, less than the default.

       int pcre2_set_depth_limit(pcre2_match_context *mcontext,
         uint32_t value);

       This parameter limits the depth of nested backtracking in
       pcre2_match(). Each time a nested backtracking point is passed, a new
       memory "frame" is used to remember the state of matching at that point.
       Thus, this parameter indirectly limits the amount of memory that is
       used in a match.
       However, because the size of each memory "frame" depends on the number
       of capturing parentheses, the actual memory limit varies from pattern
       to pattern. This limit was more useful in versions before 10.30, where
       function recursion was used for backtracking.

       The depth limit is not relevant, and is ignored, when matching is done
       using JIT compiled code. However, it is supported by pcre2_dfa_match(),
       which uses it to limit the depth of nested internal recursive function
       calls that implement atomic groups, lookaround assertions, and pattern
       recursions. This limits, indirectly, the amount of system stack that is
       used. It was more useful in versions before 10.32, when stack memory
       was used for local workspace vectors for recursive function calls. From
       version 10.32, only local variables are allocated on the stack and as
       each call uses only a few hundred bytes, even a small stack can support
       quite a lot of recursion.

       If the depth of internal recursive function calls is great enough,
       local workspace vectors are allocated on the heap from version 10.32
       onwards, so the depth limit also indirectly limits the amount of heap
       memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when
       matched to a very long string using pcre2_dfa_match(), can use a great
       deal of memory. However, it is probably better to limit heap usage
       directly by calling pcre2_set_heap_limit().

       The default value for the depth limit can be set when PCRE2 is built;
       if it is not, the default is set to the same value as the default for
       the match limit. If the limit is exceeded, pcre2_match() or
       pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth
       limit may also be supplied by an item at the start of a pattern of the
       form

         (*LIMIT_DEPTH=ddd)

       where ddd is a decimal number.
       However, such a setting is ignored unless ddd is less than the limit
       set by the caller of pcre2_match() or pcre2_dfa_match() or, if no such
       limit is set, less than the default.


CHECKING BUILD-TIME OPTIONS

       int pcre2_config(uint32_t what, void *where);

       The function pcre2_config() makes it possible for a PCRE2 client to
       discover which optional features have been compiled into the PCRE2
       library. The pcre2build documentation has more details about these
       optional features.

       The first argument for pcre2_config() specifies which information is
       required. The second argument is a pointer to memory into which the
       information is placed. If NULL is passed, the function returns the
       amount of memory that is needed for the requested information. For
       calls that return numerical values, the value is in bytes; when
       requesting these values, where should point to appropriately aligned
       memory. For calls that return strings, the required length is given in
       code units, not counting the terminating zero.

       When requesting information, the returned value from pcre2_config() is
       non-negative on success, or the negative error code
       PCRE2_ERROR_BADOPTION if the value in the first argument is not
       recognized. The following information is available:

         PCRE2_CONFIG_BSR

       The output is a uint32_t integer whose value indicates what character
       sequences the \R escape sequence matches by default. A value of
       PCRE2_BSR_UNICODE means that \R matches any Unicode line ending
       sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
       LF, or CRLF. The default can be overridden when a pattern is compiled.

         PCRE2_CONFIG_COMPILED_WIDTHS

       The output is a uint32_t integer whose lower bits indicate which code
       unit widths were selected when PCRE2 was built.
       The 1-bit indicates 8-bit support, and the 2-bit and 4-bit indicate
       16-bit and 32-bit support, respectively.

         PCRE2_CONFIG_DEPTHLIMIT

       The output is a uint32_t integer that gives the default limit for the
       depth of nested backtracking in pcre2_match() or the depth of nested
       recursions, lookarounds, and atomic groups in pcre2_dfa_match().
       Further details are given with pcre2_set_depth_limit() above.

         PCRE2_CONFIG_HEAPLIMIT

       The output is a uint32_t integer that gives, in kibibytes, the default
       limit for the amount of heap memory used by pcre2_match() or
       pcre2_dfa_match(). Further details are given with
       pcre2_set_heap_limit() above.

         PCRE2_CONFIG_JIT

       The output is a uint32_t integer that is set to one if support for
       just-in-time compiling is available; otherwise it is set to zero.

         PCRE2_CONFIG_JITTARGET

       The where argument should point to a buffer that is at least 48 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) The buffer is filled with a
       string that contains the name of the architecture for which the JIT
       compiler is configured, for example "x86 32bit (little endian +
       unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is
       returned, otherwise the number of code units used is returned. This is
       the length of the string, plus one unit for the terminating zero.

         PCRE2_CONFIG_LINKSIZE

       The output is a uint32_t integer that contains the number of bytes used
       for internal linkage in compiled regular expressions. When PCRE2 is
       configured, the value can be set to 2, 3, or 4, with the default being
       2. This is the value that is returned by pcre2_config().
       However, when the 16-bit library is compiled, a value of 3 is rounded
       up to 4, and when the 32-bit library is compiled, internal linkages
       always use 4 bytes, so the configured value is not relevant.

       The default value of 2 for the 8-bit and 16-bit libraries is sufficient
       for all but the most massive patterns, since it allows the size of the
       compiled pattern to be up to 65535 code units. Larger values allow
       larger regular expressions to be compiled by those two libraries, but
       at the expense of slower matching.

         PCRE2_CONFIG_MATCHLIMIT

       The output is a uint32_t integer that gives the default match limit for
       pcre2_match(). Further details are given with pcre2_set_match_limit()
       above.

         PCRE2_CONFIG_NEWLINE

       The output is a uint32_t integer whose value specifies the default
       character sequence that is recognized as meaning "newline". The values
       are:

         PCRE2_NEWLINE_CR       Carriage return (CR)
         PCRE2_NEWLINE_LF       Linefeed (LF)
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY      Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
         PCRE2_NEWLINE_NUL      The NUL character (binary zero)

       The default should normally correspond to the standard sequence for
       your operating system.

         PCRE2_CONFIG_NEVER_BACKSLASH_C

       The output is a uint32_t integer that is set to one if the use of \C
       was permanently disabled when PCRE2 was built; otherwise it is set to
       zero.

         PCRE2_CONFIG_PARENSLIMIT

       The output is a uint32_t integer that gives the maximum depth of
       nesting of parentheses (of any kind) in a pattern. This limit is
       imposed to cap the amount of system stack used when a pattern is
       compiled. It is specified when PCRE2 is built; the default is 250. This
       limit does not take into account the stack that may already be used by
       the calling application.
       For finer control over compilation stack usage, see
       pcre2_set_compile_recursion_guard().

         PCRE2_CONFIG_STACKRECURSE

       This parameter is obsolete and should not be used in new code. The
       output is a uint32_t integer that is always set to zero.

         PCRE2_CONFIG_UNICODE_VERSION

       The where argument should point to a buffer that is at least 24 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) If PCRE2 has been compiled
       without Unicode support, the buffer is filled with the text "Unicode
       not supported". Otherwise, the Unicode version string (for example,
       "8.0.0") is inserted. The number of code units used is returned. This
       is the length of the string plus one unit for the terminating zero.

         PCRE2_CONFIG_UNICODE

       The output is a uint32_t integer that is set to one if Unicode support
       is available; otherwise it is set to zero. Unicode support implies UTF
       support.

         PCRE2_CONFIG_VERSION

       The where argument should point to a buffer that is at least 24 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) The buffer is filled with the
       PCRE2 version string, zero-terminated. The number of code units used is
       returned. This is the length of the string plus one unit for the
       terminating zero.


COMPILING A PATTERN

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       The pcre2_compile() function compiles a pattern into an internal form.
       The pattern is defined by a pointer to a string of code units and a
       length (in code units). If the pattern is zero-terminated, the length
       can be specified as PCRE2_ZERO_TERMINATED. The function returns a
       pointer to a block of memory that contains the compiled pattern and
       related data, or NULL if an error occurred.

       If the compile context argument ccontext is NULL, memory for the
       compiled pattern is obtained by calling malloc(). Otherwise, it is
       obtained from the same memory function that was used for the compile
       context. The caller must free the memory by calling pcre2_code_free()
       when it is no longer needed. If pcre2_code_free() is called with a
       NULL argument, it returns immediately, without doing anything.

       The function pcre2_code_copy() makes a copy of the compiled code in new
       memory, using the same memory allocator as was used for the original.
       However, if the code has been processed by the JIT compiler (see
       below), the JIT information cannot be copied (because it is
       position-dependent). The new copy can initially be used only for
       non-JIT matching, though it can be passed to pcre2_jit_compile() if
       required. If pcre2_code_copy() is called with a NULL argument, it
       returns NULL.

       The pcre2_code_copy() function provides a way for individual threads in
       a multithreaded application to acquire a private copy of shared
       compiled code. However, it does not make a copy of the character tables
       used by the compiled pattern; the new pattern code points to the same
       tables as the original code. (See "Locale Support" below for details
       of these character tables.) In many applications the same tables are
       used throughout, so this behaviour is appropriate. Nevertheless, there
       are occasions when a copy of a compiled pattern and the relevant tables
       are needed. The pcre2_code_copy_with_tables() function provides this
       facility.
       Copies of both the code and the tables are made, with the new code
       pointing to the new tables. The memory for the new tables is
       automatically freed when pcre2_code_free() is called for the new copy
       of the compiled code. If pcre2_code_copy_with_tables() is called with a
       NULL argument, it returns NULL.

       NOTE: When one of the matching functions is called, pointers to the
       compiled pattern and the subject string are set in the match data block
       so that they can be referenced by the substring extraction functions.
       After running a match, you must not free a compiled pattern (or a
       subject string) until after all operations on the match data block have
       taken place.

       The options argument for pcre2_compile() contains various bit settings
       that affect the compilation. It should be zero if no options are
       required. The available options are described below. Some of them (in
       particular, those that are compatible with Perl, but some others as
       well) can also be set and unset from within the pattern (see the
       detailed description in the pcre2pattern documentation).

       For those options that can be different in different parts of the
       pattern, the contents of the options argument specify their settings at
       the start of compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
       PCRE2_NO_UTF_CHECK options can be set at the time of matching as well
       as at compile time.

       Other, less frequently required compile-time parameters (for example,
       the newline setting) can be provided in a compile context (as described
       above).

       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL
       immediately. Otherwise, the variables to which these point are set to
       an error code and an offset (number of code units) within the pattern,
       respectively, when pcre2_compile() returns NULL because a compilation
       error has occurred.
       The values are not defined when compilation is successful and
       pcre2_compile() returns a non-NULL value.

       There are nearly 100 positive error codes that pcre2_compile() may
       return if it finds an error in the pattern. There are also some
       negative error codes that are used for invalid UTF strings. These are
       the same as given by pcre2_match() and pcre2_dfa_match(), and are
       described in the pcre2unicode page. There is no separate documentation
       for the positive error codes, because the textual error messages that
       are obtained by calling the pcre2_get_error_message() function (see
       "Obtaining a textual error message" below) should be self-explanatory.
       Macro names starting with PCRE2_ERROR_ are defined for both positive
       and negative error codes in pcre2.h.

       The value returned in erroroffset is an indication of where in the
       pattern the error occurred. It is not necessarily the furthest point in
       the pattern that was read. For example, after the error "lookbehind
       assertion is not fixed length", the error offset points to the start of
       the failing assertion. For an invalid UTF-8 or UTF-16 string, the
       offset is that of the first code unit of the failing character.

       Some errors are not detected until the whole pattern has been scanned;
       in these cases, the offset passed back is the length of the pattern.
       Note that the offset is in code units, not characters, even in a UTF
       mode. It may sometimes point into the middle of a UTF-8 or UTF-16
       character.

       This code fragment shows a typical straightforward call to
       pcre2_compile():

         pcre2_code *re;
         PCRE2_SIZE erroffset;
         int errorcode;
         re = pcre2_compile(
           "^A.*Z",                /* the pattern */
           PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
           0,                      /* default options */
           &errorcode,             /* for error code */
           &erroffset,             /* for error offset */
           NULL);                  /* no compile context */

       The following names for option bits are defined in the pcre2.h header
       file:

         PCRE2_ANCHORED

       If this bit is set, the pattern is forced to be "anchored", that is, it
       is constrained to match only at the first matching point in the string
       that is being searched (the "subject string"). This effect can also be
       achieved by appropriate constructs in the pattern itself, which is the
       only way to do it in Perl.

         PCRE2_ALLOW_EMPTY_CLASS

       By default, for compatibility with Perl, a closing square bracket that
       immediately follows an opening one is treated as a data character for
       the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
       class, which therefore contains no characters and so can never match.

         PCRE2_ALT_BSUX

       This option requests alternative handling of three escape sequences,
       which makes PCRE2's behaviour more like ECMAScript (aka JavaScript).
       When it is set:

       (1) \U matches an upper case "U" character; by default \U causes a
       compile time error (Perl uses \U to upper case subsequent characters).

       (2) \u matches a lower case "u" character unless it is followed by four
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, \u causes a compile time error (Perl
       uses it to upper case the following character).

       (3) \x matches a lower case "x" character unless it is followed by
       two hexadecimal digits, in which case the hexadecimal number
       defines the code point to match. By default, as in Perl, a
       hexadecimal number is always expected after \x, but it may have
       zero, one, or two digits (so, for example, \xz matches a binary
       zero character followed by z).

         PCRE2_ALT_CIRCUMFLEX

       In multiline mode (when PCRE2_MULTILINE is set), the circumflex
       metacharacter matches at the start of the subject (unless
       PCRE2_NOTBOL is set), and also after any internal newline. However,
       it does not match after a newline at the end of the subject, for
       compatibility with Perl. If you want a multiline circumflex also to
       match after a terminating newline, you must set
       PCRE2_ALT_CIRCUMFLEX.

         PCRE2_ALT_VERBNAMES

       By default, for compatibility with Perl, the name in any verb
       sequence such as (*MARK:NAME) is any sequence of characters that
       does not include a closing parenthesis. The name is not processed
       in any way, and it is not possible to include a closing parenthesis
       in the name. However, if the PCRE2_ALT_VERBNAMES option is set,
       normal backslash processing is applied to verb names and only an
       unescaped closing parenthesis terminates the name. A closing
       parenthesis can be included in a name either as \) or between \Q
       and \E. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set
       with PCRE2_ALT_VERBNAMES, unescaped whitespace in verb names is
       skipped and #-comments are recognized, exactly as in the rest of
       the pattern.

         PCRE2_AUTO_CALLOUT

       If this bit is set, pcre2_compile() automatically inserts callout
       items, all with number 255, before each pattern item, except
       immediately before or after an explicit callout in the pattern. For
       discussion of the callout facility, see the pcre2callout
       documentation.
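
       The difference that PCRE2_ALT_CIRCUMFLEX makes, as described above,
       can be expressed as a small position test. This is a model written
       for illustration only, not code taken from PCRE2: it assumes that
       LF is the only newline sequence and that PCRE2_NOTBOL is not set,
       and the function name multiline_circumflex_ok() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Model only, not PCRE2 code: report whether a multiline circumflex could
   match at code-unit position pos. Assumes LF is the only newline and that
   PCRE2_NOTBOL is not set. A non-zero alt_circumflex models the effect of
   PCRE2_ALT_CIRCUMFLEX. */
static int multiline_circumflex_ok(const char *subject, size_t length,
  size_t pos, int alt_circumflex)
{
  assert(subject != NULL && pos <= length);
  if (pos == 0) return 1;                         /* start of the subject */
  if (subject[pos - 1] != '\n') return 0;         /* not after a newline */
  if (pos == length) return alt_circumflex != 0;  /* after final newline */
  return 1;                                       /* after internal newline */
}
```

       With the subject "a\nb\n", positions 0 and 2 are accepted in both
       cases, whereas position 4 (after the terminating newline) is
       accepted only when alt_circumflex is non-zero.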

         PCRE2_CASELESS

       If this bit is set, letters in the pattern match both upper and
       lower case letters in the subject. It is equivalent to Perl's /i
       option, and it can be changed within a pattern by a (?i) option
       setting. If PCRE2_UTF is set, Unicode properties are used for all
       characters with more than one other case, and for all characters
       whose code points are greater than U+007F. For lower valued
       characters with only one other case, a lookup table is used for
       speed. When PCRE2_UTF is not set, a lookup table is used for all
       code points less than 256, and higher code points (available only
       in 16-bit or 32-bit mode) are treated as not having another case.

         PCRE2_DOLLAR_ENDONLY

       If this bit is set, a dollar metacharacter in the pattern matches
       only at the end of the subject string. Without this option, a
       dollar also matches immediately before a newline at the end of the
       string (but not before any other newlines). The
       PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
       There is no equivalent to this option in Perl, and no way to set it
       within a pattern.

         PCRE2_DOTALL

       If this bit is set, a dot metacharacter in the pattern matches any
       character, including one that indicates a newline. However, it only
       ever matches one character, even if newlines are coded as CRLF.
       Without this option, a dot does not match when the current position
       in the subject is at a newline. This option is equivalent to Perl's
       /s option, and it can be changed within a pattern by a (?s) option
       setting. A negative class such as [^a] always matches newline
       characters, and the \N escape sequence always matches a non-newline
       character, independent of the setting of PCRE2_DOTALL.

         PCRE2_DUPNAMES

       If this bit is set, names used to identify capturing subpatterns
       need not be unique.
       This can be helpful for certain types of pattern when it is known
       that only one instance of the named subpattern can ever be matched.
       There are more details of named subpatterns below; see also the
       pcre2pattern documentation.

         PCRE2_ENDANCHORED

       If this bit is set, the end of any pattern match must be right at
       the end of the string being searched (the "subject string"). If the
       pattern match succeeds by reaching (*ACCEPT), but does not reach
       the end of the subject, the match fails at the current starting
       point. For unanchored patterns, a new match is then tried at the
       next starting point. However, if the match succeeds by reaching the
       end of the pattern, but not the end of the subject, backtracking
       occurs and an alternative match may be found. Consider these two
       patterns:

         .(*ACCEPT)|..
         .|..

       If matched against "abc" with PCRE2_ENDANCHORED set, the first
       matches "c" whereas the second matches "bc". The effect of
       PCRE2_ENDANCHORED can also be achieved by appropriate constructs in
       the pattern itself, which is the only way to do it in Perl.

       For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies
       only to the first (that is, the longest) matched string. Other
       parallel matches, which are necessarily substrings of the first
       one, must obviously end before the end of the subject.

         PCRE2_EXTENDED

       If this bit is set, most white space characters in the pattern are
       totally ignored except when escaped or inside a character class.
       However, white space is not allowed within sequences such as (?>
       that introduce various parenthesized subpatterns, nor within
       numerical quantifiers such as {1,3}. Ignorable white space is
       permitted between an item and a following quantifier and between a
       quantifier and a following + that indicates possessiveness.
       PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
       changed within a pattern by a (?x) option setting.

       When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED
       recognizes as white space only those characters with code points
       less than 256 that are flagged as white space in its low-character
       table. The table is normally created by pcre2_maketables(), which
       uses the isspace() function to identify space characters. In most
       ASCII environments, the relevant characters are those with code
       points 0x0009 (tab), 0x000A (linefeed), 0x000B (vertical tab),
       0x000C (formfeed), 0x000D (carriage return), and 0x0020 (space).

       When PCRE2 is compiled with Unicode support, in addition to these
       characters, five more Unicode "Pattern White Space" characters are
       recognized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E
       (left-to-right mark), U+200F (right-to-left mark), U+2028 (line
       separator), and U+2029 (paragraph separator). This set of
       characters is the same as recognized by Perl's /x option. Note that
       the horizontal and vertical space characters that are matched by
       the \h and \v escapes in patterns are a much bigger set.

       As well as ignoring most white space, PCRE2_EXTENDED also causes
       characters between an unescaped # outside a character class and the
       next newline, inclusive, to be ignored, which makes it possible to
       include comments inside complicated patterns. Note that the end of
       this type of comment is a literal newline sequence in the pattern;
       escape sequences that happen to represent a newline do not count.
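
       The two lists of ignorable white space characters given above can
       be written as a single predicate. This is a sketch for
       illustration, not PCRE2's own code: it assumes an ASCII environment
       with the default character tables, and the function name
       is_extended_white_space() is hypothetical.

```c
#include <stdint.h>

/* Model only, not PCRE2 code: return 1 if code point c is treated as
   ignorable white space by PCRE2_EXTENDED outside a character class.
   Assumes an ASCII environment with default tables. A non-zero
   unicode_support models a library built with Unicode support. */
static int is_extended_white_space(uint32_t c, int unicode_support)
{
  /* Tab, linefeed, vertical tab, formfeed, carriage return, and space */
  if ((c >= 0x0009 && c <= 0x000D) || c == 0x0020) return 1;
  if (!unicode_support) return 0;
  /* The five additional "Pattern White Space" characters */
  return c == 0x0085 || c == 0x200E || c == 0x200F ||
         c == 0x2028 || c == 0x2029;
}
```

       Note that U+00A0 (no-break space) is rejected in both cases: it is
       matched by \h, but it is not in either list.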

       Which characters are interpreted as newlines can be specified by a
       setting in the compile context that is passed to pcre2_compile() or
       by a special sequence at the start of the pattern, as described in
       the section entitled "Newline conventions" in the pcre2pattern
       documentation. A default is defined when PCRE2 is built.

         PCRE2_EXTENDED_MORE

       This option has the effect of PCRE2_EXTENDED, but, in addition,
       unescaped space and horizontal tab characters are ignored inside a
       character class. Note: only these two characters are ignored, not
       the full set of pattern white space characters that are ignored
       outside a character class. PCRE2_EXTENDED_MORE is equivalent to
       Perl's /xx option, and it can be changed within a pattern by a
       (?xx) option setting.

         PCRE2_FIRSTLINE

       If this option is set, the start of an unanchored pattern match
       must be before or at the first newline in the subject string
       following the start of matching, though the matched text may
       continue over the newline. If startoffset is non-zero, the limiting
       newline is not necessarily the first newline in the subject. For
       example, if the subject string is "abc\nxyz" (where \n represents a
       single-character newline) a pattern match for "yz" succeeds with
       PCRE2_FIRSTLINE if startoffset is greater than 3. See also
       PCRE2_USE_OFFSET_LIMIT, which provides a more general limiting
       facility. If PCRE2_FIRSTLINE is set with an offset limit, a match
       must occur in the first line and also within the offset limit. In
       other words, whichever limit comes first is used.

         PCRE2_LITERAL

       If this option is set, all meta-characters in the pattern are
       disabled, and it is treated as a literal string. Matching literal
       strings with a regular expression engine is not the most efficient
       way of doing it.
       If you are doing a lot of literal matching and are worried about
       efficiency, you should consider using other approaches. The only
       other main options that are allowed with PCRE2_LITERAL are:
       PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
       PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_NO_START_OPTIMIZE,
       PCRE2_NO_UTF_CHECK, PCRE2_UTF, and PCRE2_USE_OFFSET_LIMIT. The
       extra options PCRE2_EXTRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are
       also supported. Any other options cause an error.

         PCRE2_MATCH_UNSET_BACKREF

       If this option is set, a backreference to an unset subpattern group
       matches an empty string (by default this causes the current
       matching alternative to fail). A pattern such as (\1)(a) succeeds
       when this option is set (assuming it can find an "a" in the
       subject), whereas it fails by default, for Perl compatibility.
       Setting this option makes PCRE2 behave more like ECMAScript (aka
       JavaScript).

         PCRE2_MULTILINE

       By default, for the purposes of matching "start of line" and "end
       of line", PCRE2 treats the subject string as consisting of a single
       line of characters, even if it actually contains newlines. The
       "start of line" metacharacter (^) matches only at the start of the
       string, and the "end of line" metacharacter ($) matches only at the
       end of the string, or before a terminating newline (except when
       PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless
       PCRE2_DOTALL is set, the "any character" metacharacter (.) does not
       match at a newline. This behaviour (for ^, $, and dot) is the same
       as Perl.

       When PCRE2_MULTILINE is set, the "start of line" and "end of line"
       constructs match immediately following or immediately before
       internal newlines in the subject string, respectively, as well as
       at the very start and end.
       This is equivalent to Perl's /m option, and it can be changed
       within a pattern by a (?m) option setting. Note that the "start of
       line" metacharacter does not match after a newline at the end of
       the subject, for compatibility with Perl. However, you can change
       this by setting the PCRE2_ALT_CIRCUMFLEX option. If there are no
       newlines in a subject string, or no occurrences of ^ or $ in a
       pattern, setting PCRE2_MULTILINE has no effect.

         PCRE2_NEVER_BACKSLASH_C

       This option locks out the use of \C in the pattern that is being
       compiled. This escape can cause unpredictable behaviour in UTF-8 or
       UTF-16 modes, because it may leave the current matching point in
       the middle of a multi-code-unit character. This option may be
       useful in applications that process patterns from external sources.
       Note that there is also a build-time option that permanently locks
       out the use of \C.

         PCRE2_NEVER_UCP

       This option locks out the use of Unicode properties for handling
       \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character
       classes, as described for the PCRE2_UCP option below. In
       particular, it prevents the creator of the pattern from enabling
       this facility by starting the pattern with (*UCP). This option may
       be useful in applications that process patterns from external
       sources. The option combination PCRE2_UCP and PCRE2_NEVER_UCP
       causes an error.

         PCRE2_NEVER_UTF

       This option locks out interpretation of the pattern as UTF-8,
       UTF-16, or UTF-32, depending on which library is in use. In
       particular, it prevents the creator of the pattern from switching
       to UTF interpretation by starting the pattern with (*UTF). This
       option may be useful in applications that process patterns from
       external sources. The combination of PCRE2_UTF and PCRE2_NEVER_UTF
       causes an error.

         PCRE2_NO_AUTO_CAPTURE

       If this option is set, it disables the use of numbered capturing
       parentheses in the pattern. Any opening parenthesis that is not
       followed by ? behaves as if it were followed by ?: but named
       parentheses can still be used for capturing (and they acquire
       numbers in the usual way). This is the same as Perl's /n option.
       Note that, when this option is set, references to capturing groups
       (backreferences or recursion/subroutine calls) may only refer to
       named groups, though the reference can be by name or by number.

         PCRE2_NO_AUTO_POSSESS

       If this option is set, it disables "auto-possessification", which
       is an optimization that, for example, turns a+b into a++b in order
       to avoid backtracks into a+ that can never be successful. However,
       if callouts are in use, auto-possessification means that some
       callouts are never taken. You can set this option if you want the
       matching functions to do a full unoptimized search and run all the
       callouts, but it is mainly provided for testing purposes.

         PCRE2_NO_DOTSTAR_ANCHOR

       If this option is set, it disables an optimization that is applied
       when .* is the first significant item in a top-level branch of a
       pattern, and all the other branches also start with .* or with \A
       or \G or ^. The optimization is automatically disabled for .* if it
       is inside an atomic group or a capturing group that is the subject
       of a backreference, or if the pattern contains (*PRUNE) or (*SKIP).
       When the optimization is not disabled, such a pattern is
       automatically anchored if PCRE2_DOTALL is set for all the .* items
       and PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact
       that any match must start either at the start of the subject or
       following a newline is remembered. Like other optimizations, this
       can cause callouts to be skipped.

         PCRE2_NO_START_OPTIMIZE

       This is an option whose main effect is at matching time. It does
       not change what pcre2_compile() generates, but it does affect the
       output of the JIT compiler.

       There are a number of optimizations that may occur at the start of
       a match, in order to speed up the process. For example, if it is
       known that an unanchored match must start with a specific code unit
       value, the matching code searches the subject for that value, and
       fails immediately if it cannot find it, without actually running
       the main matching function. This means that a special item such as
       (*COMMIT) at the start of a pattern is not considered until after a
       suitable starting point for the match has been found. Also, when
       callouts or (*MARK) items are in use, these "start-up"
       optimizations can cause them to be skipped if the pattern is never
       actually used. The start-up optimizations are in effect a pre-scan
       of the subject that takes place before the pattern is run.

       The PCRE2_NO_START_OPTIMIZE option disables the start-up
       optimizations, possibly causing performance to suffer, but ensuring
       that in cases where the result is "no match", the callouts do
       occur, and that items such as (*COMMIT) and (*MARK) are considered
       at every possible starting position in the subject string.

       Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a
       matching operation. Consider the pattern

         (*COMMIT)ABC

       When this is compiled, PCRE2 records the fact that a match must
       start with the character "A". Suppose the subject string is
       "DEFABC". The start-up optimization scans along the subject, finds
       "A" and runs the first match attempt from there. The (*COMMIT) item
       means that the pattern must match the current starting position,
       which in this case, it does.
       However, if the same match is run with PCRE2_NO_START_OPTIMIZE set,
       the initial scan along the subject string does not happen. The
       first match attempt is run starting from "D" and when this fails,
       (*COMMIT) prevents any further matches being tried, so the overall
       result is "no match".

       There are also other start-up optimizations. For example, a minimum
       length for the subject may be recorded. Consider the pattern

         (*MARK:A)(X|Y)

       The minimum length for a match is one character. If the subject is
       "ABC", there will be attempts to match "ABC", "BC", and "C". An
       attempt to match an empty string at the end of the subject does not
       take place, because PCRE2 knows that the subject is now too short,
       and so the (*MARK) is never encountered. In this case, the
       optimization does not affect the overall match result, which is
       still "no match", but it does affect the auxiliary information that
       is returned.

         PCRE2_NO_UTF_CHECK

       When PCRE2_UTF is set, the validity of the pattern as a UTF string
       is automatically checked. There are discussions about the validity
       of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the
       pcre2unicode document. If an invalid UTF sequence is found,
       pcre2_compile() returns a negative error code.

       If you know that your pattern is a valid UTF string, and you want
       to skip this check for performance reasons, you can set the
       PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
       invalid UTF string as a pattern is undefined. It may cause your
       program to crash or loop.

       Note that this option can also be passed to pcre2_match() and
       pcre2_dfa_match(), to suppress UTF validity checking of the subject
       string.
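
       Since behaviour is undefined for invalid input once the check is
       suppressed, an application that sets PCRE2_NO_UTF_CHECK needs some
       other assurance that its strings are valid. One possibility is to
       validate each string once, when it enters the program. The
       following is a minimal sketch of such a check for UTF-8, written
       from the general rules of RFC 3629; it is not PCRE2's own
       validator, which may differ in detail and reports specific error
       codes. The function name utf8_is_valid() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): return 1 if the length code
   units at s form well-formed UTF-8 as defined by RFC 3629, else 0. */
static int utf8_is_valid(const unsigned char *s, size_t length)
{
  size_t i = 0;
  assert(s != NULL || length == 0);
  while (i < length)
    {
    unsigned char c = s[i];
    size_t extra, k;
    uint32_t cp, min;

    if (c < 0x80) { i++; continue; }              /* ASCII */
    if ((c & 0xE0) == 0xC0)      { extra = 1; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
    else return 0;            /* stray continuation or 5/6-byte lead */

    if (extra > length - i - 1) return 0;         /* truncated character */
    for (k = 1; k <= extra; k++)
      {
      if ((s[i + k] & 0xC0) != 0x80) return 0;    /* not a continuation */
      cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
      }
    if (cp < min) return 0;                       /* overlong encoding */
    if (cp > 0x10FFFF) return 0;                  /* beyond Unicode */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate */
    i += extra + 1;
    }
  return 1;
}
```

       A string that passes a check of this kind once can then be matched
       repeatedly with PCRE2_NO_UTF_CHECK set, which is the situation that
       the option is intended for.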

       Note also that setting PCRE2_NO_UTF_CHECK at compile time does not
       disable the error that is given if an escape sequence for an
       invalid Unicode code point is encountered in the pattern. In
       particular, the so-called "surrogate" code points (0xd800 to
       0xdfff) are invalid. If you want to allow escape sequences such as
       \x{d800} you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
       option, as described in the section entitled "Extra compile
       options" below. However, this is possible only in UTF-8 and UTF-32
       modes, because these values are not representable in UTF-16.

         PCRE2_UCP

       This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s,
       \W, \w, and some of the POSIX character classes. By default, only
       ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
       properties are used instead to classify characters. More details
       are given in the section on generic character types in the
       pcre2pattern page. If you set PCRE2_UCP, matching one of the items
       it affects takes much longer. The option is available only if PCRE2
       has been compiled with Unicode support (which is the default).

         PCRE2_UNGREEDY

       This option inverts the "greediness" of the quantifiers so that
       they are not greedy by default, but become greedy if followed by
       "?". It is not compatible with Perl. It can also be set by a (?U)
       option setting within the pattern.

         PCRE2_USE_OFFSET_LIMIT

       This option must be set for pcre2_compile() if
       pcre2_set_offset_limit() is going to be used to set a non-default
       offset limit in a match context for matches that use this pattern.
       An error is generated if an offset limit is set without this
       option. For more details, see the description of
       pcre2_set_offset_limit() in the section that describes match
       contexts. See also the PCRE2_FIRSTLINE option above.

         PCRE2_UTF

       This option causes PCRE2 to regard both the pattern and the subject
       strings that are subsequently processed as strings of UTF
       characters instead of single-code-unit strings. It is available
       when PCRE2 is built to include Unicode support (which is the
       default). If Unicode support is not available, the use of this
       option provokes an error. Details of how PCRE2_UTF changes the
       behaviour of PCRE2 are given in the pcre2unicode page. In
       particular, note that it changes the way PCRE2_CASELESS handles
       characters with code points greater than 127.

   Extra compile options

       Unlike the main compile-time options, the extra options are not
       saved with the compiled pattern. The option bits that can be set in
       a compile context by calling the pcre2_set_compile_extra_options()
       function are as follows:

         PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES

       This option applies when compiling a pattern in UTF-8 or UTF-32
       mode. It is forbidden in UTF-16 mode, and ignored in non-UTF modes.
       Unicode "surrogate" code points in the range 0xd800 to 0xdfff are
       used in pairs in UTF-16 to encode code points with values in the
       range 0x10000 to 0x10ffff. The surrogates cannot therefore be
       represented in UTF-16. They can be represented in UTF-8 and UTF-32,
       but are defined as invalid code points, and cause errors if
       encountered in a UTF-8 or UTF-32 string that is being checked for
       validity by PCRE2.

       These values also cause errors if encountered in escape sequences
       such as \x{d912} within a pattern. However, it seems that some
       applications, when using PCRE2 to check for unwanted characters in
       UTF-8 strings, explicitly test for the surrogates using escape
       sequences. The PCRE2_NO_UTF_CHECK option does not disable the error
       that occurs, because it applies only to the testing of input
       strings for UTF validity.

       If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set,
       surrogate code point values in UTF-8 and UTF-32 patterns no longer
       provoke errors and are incorporated in the compiled pattern.
       However, they can only match subject characters if the matching
       function is called with PCRE2_NO_UTF_CHECK set.

         PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL

       This is a dangerous option. Use with care. By default, an
       unrecognized escape such as \j or a malformed one such as \x{2z}
       causes a compile-time error when detected by pcre2_compile(). Perl
       is somewhat inconsistent in handling such items: for example, \j is
       treated as a literal "j", and non-hexadecimal digits in \x{} are
       just ignored, though warnings are given in both cases if Perl's
       warning switch is enabled. However, a malformed octal number after
       \o{ always causes an error in Perl.

       If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
       pcre2_compile(), all unrecognized or erroneous escape sequences are
       treated as single-character escapes. For example, \j is a literal
       "j" and \x{2z} is treated as the literal string "x{2z}". Setting
       this option means that typos in patterns may go undetected and have
       unexpected results. This is a dangerous option. Use with care.

         PCRE2_EXTRA_MATCH_LINE

       This option is provided for use by the -x option of pcre2grep. It
       causes the pattern only to match complete lines. This is achieved
       by automatically inserting the code for "^(?:" at the start of the
       compiled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is
       set, the matched line may be in the middle of the subject string.
       This option can be used with PCRE2_LITERAL.

         PCRE2_EXTRA_MATCH_WORD

       This option is provided for use by the -w option of pcre2grep.
       It causes the pattern only to match strings that have a word
       boundary at the start and the end. This is achieved by
       automatically inserting the code for "\b(?:" at the start of the
       compiled pattern and ")\b" at the end. The option may be used with
       PCRE2_LITERAL. However, it is ignored if PCRE2_EXTRA_MATCH_LINE is
       also set.


JUST-IN-TIME (JIT) COMPILATION

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);

       These functions provide support for JIT compilation, which, if the
       just-in-time compiler is available, further processes a compiled
       pattern into machine code that executes much faster than the
       pcre2_match() interpretive matching function. Full details are
       given in the pcre2jit documentation.

       JIT compilation is a heavyweight optimization. It can take some
       time for patterns to be analyzed, and for one-off matches and
       simple patterns the benefit of faster execution might be offset by
       a much slower compilation time. Most (but not all) patterns can be
       optimized by the JIT compiler.


LOCALE SUPPORT

       PCRE2 handles caseless matching, and determines whether characters
       are letters, digits, or whatever, by reference to a set of tables,
       indexed by character code point.
       This applies only to characters whose code points are less than
       256. By default, higher-valued code points never match escapes such
       as \w or \d. However, if PCRE2 is built with Unicode support, all
       characters can be tested with \p and \P, or, alternatively, the
       PCRE2_UCP option can be set when a pattern is compiled; this causes
       \w and friends to use Unicode property support instead of the
       built-in tables.

       The use of locales with Unicode is discouraged. If you are handling
       characters with code points greater than 127, you should either use
       Unicode support, or use locales, but not try to mix the two.

       PCRE2 contains an internal set of character tables that are used by
       default. These are sufficient for many applications. Normally, the
       internal tables recognize only ASCII characters. However, when
       PCRE2 is built, it is possible to cause the internal tables to be
       rebuilt in the default "C" locale of the local system, which may
       cause them to be different.

       The internal tables can be overridden by tables supplied by the
       application that calls PCRE2. These may be created in a different
       locale from the default. As more and more applications change to
       using Unicode, the need for this locale support is expected to die
       away.

       External tables are built by calling the pcre2_maketables()
       function, in the relevant locale. The result can be passed to
       pcre2_compile() as often as necessary, by creating a compile
       context and calling pcre2_set_character_tables() to set the tables
       pointer therein.
       For example, to build and use tables that are appropriate for the
       French locale (where accented characters with values greater than
       127 are treated as letters), the following code could be used:

         setlocale(LC_CTYPE, "fr_FR");
         tables = pcre2_maketables(NULL);
         ccontext = pcre2_compile_context_create(NULL);
         pcre2_set_character_tables(ccontext, tables);
         re = pcre2_compile(..., ccontext);

       The locale name "fr_FR" is used on Linux and other Unix-like
       systems; if you are using Windows, the name for the French locale
       is "french". It is the caller's responsibility to ensure that the
       memory containing the tables remains available for as long as it is
       needed.

       The pointer that is passed (via the compile context) to
       pcre2_compile() is saved with the compiled pattern, and the same
       tables are used by pcre2_match() and pcre2_dfa_match(). Thus, for
       any single pattern, compilation and matching both happen in the
       same locale, but different patterns can be processed in different
       locales.


INFORMATION ABOUT A COMPILED PATTERN

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       The pcre2_pattern_info() function returns general information about
       a compiled pattern. For information about callouts, see the next
       section. The first argument for pcre2_pattern_info() is a pointer
       to the compiled pattern. The second argument specifies which piece
       of information is required, and the third argument is a pointer to
       a variable to receive the data. If the third argument is NULL, the
       first argument is ignored, and the function returns the size in
       bytes of the variable that is required for the information
       requested.
       Otherwise, the yield of the function is zero for success, or one of
       the following negative numbers:

         PCRE2_ERROR_NULL           the argument code was NULL
         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
         PCRE2_ERROR_BADOPTION      the value of what was invalid
         PCRE2_ERROR_UNSET          the requested field is not set

       The "magic number" is placed at the start of each compiled pattern
       as a simple check against passing an arbitrary memory pointer. Here
       is a typical call of pcre2_pattern_info(), to obtain the length of
       the compiled pattern:

         int rc;
         size_t length;
         rc = pcre2_pattern_info(
           re,               /* result of pcre2_compile() */
           PCRE2_INFO_SIZE,  /* what is required */
           &length);         /* where to put the data */

       The possible values for the second argument are defined in pcre2.h,
       and are as follows:

         PCRE2_INFO_ALLOPTIONS
         PCRE2_INFO_ARGOPTIONS
         PCRE2_INFO_EXTRAOPTIONS

       Return copies of the pattern's options. The third argument should
       point to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly
       the options that were passed to pcre2_compile(), whereas
       PCRE2_INFO_ALLOPTIONS returns the compile options as modified by
       any top-level (*XXX) option settings such as (*UTF) at the start of
       the pattern itself. PCRE2_INFO_EXTRAOPTIONS returns the extra
       options that were set in the compile context by calling the
       pcre2_set_compile_extra_options() function.

       For example, if the pattern /(*UTF)abc/ is compiled with the
       PCRE2_EXTENDED option, the result for PCRE2_INFO_ALLOPTIONS is
       PCRE2_EXTENDED and PCRE2_UTF. Option settings such as (?i) that can
       change within a pattern do not affect the result of
       PCRE2_INFO_ALLOPTIONS, even if they appear right at the start of
       the pattern. (This was different in some earlier releases.)

       A pattern compiled without PCRE2_ANCHORED is automatically anchored by
       PCRE2 if the first significant item in every top-level branch is one
       of the following:

         ^     unless PCRE2_MULTILINE is set
         \A    always
         \G    always
         .*    sometimes - see below

       When .* is the first significant item, anchoring is possible only when
       all the following are true:

         .* is not in an atomic group
         .* is not in a capturing group that is the subject
              of a backreference
         PCRE2_DOTALL is in force for .*
         Neither (*PRUNE) nor (*SKIP) appears in the pattern
         PCRE2_NO_DOTSTAR_ANCHOR is not set

       For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
       the options returned for PCRE2_INFO_ALLOPTIONS.

         PCRE2_INFO_BACKREFMAX

       Return the number of the highest backreference in the pattern. The
       third argument should point to a uint32_t variable. Named subpatterns
       acquire numbers as well as names, and these count towards the highest
       backreference. Backreferences such as \4 or \g{12} match the captured
       characters of the given group, but in addition, the check that a
       capturing group is set in a conditional subpattern such as (?(3)a|b)
       is also a backreference. Zero is returned if there are no
       backreferences.

         PCRE2_INFO_BSR

       The output is a uint32_t integer whose value indicates what character
       sequences the \R escape sequence matches. A value of PCRE2_BSR_UNICODE
       means that \R matches any Unicode line ending sequence; a value of
       PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.

         PCRE2_INFO_CAPTURECOUNT

       Return the highest capturing subpattern number in the pattern. In
       patterns where (?| is not used, this is also the total number of
       capturing subpatterns. The third argument should point to a uint32_t
       variable.

         PCRE2_INFO_DEPTHLIMIT

       If the pattern set a backtracking depth limit by including an item of
       the form (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
       third argument should point to a uint32_t integer. If no such value
       has been set, the call to pcre2_pattern_info() returns the error
       PCRE2_ERROR_UNSET. Note that this limit will only be used during
       matching if it is less than the limit set or defaulted by the caller
       of the match function.

         PCRE2_INFO_FIRSTBITMAP

       In the absence of a single first code unit for a non-anchored pattern,
       pcre2_compile() may construct a 256-bit table that defines a fixed set
       of values for the first code unit in any match. For example, a pattern
       that starts with [abc] results in a table with three bits set. When
       code unit values greater than 255 are supported, the flag bit for 255
       means "any code unit of value 255 or above". If such a table was
       constructed, a pointer to it is returned. Otherwise NULL is returned.
       The third argument should point to a const uint8_t * variable.

         PCRE2_INFO_FIRSTCODETYPE

       Return information about the first code unit of any matched string,
       for a non-anchored pattern. The third argument should point to a
       uint32_t variable. If there is a fixed first value, for example, the
       letter "c" from a pattern such as (cat|cow|coyote), 1 is returned, and
       the value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is
       no fixed first value, but it is known that a match can occur only at
       the start of the subject or following a newline in the subject, 2 is
       returned. Otherwise, and for anchored patterns, 0 is returned.

         PCRE2_INFO_FIRSTCODEUNIT

       Return the value of the first code unit of any matched string for a
       pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
       The third argument should point to a uint32_t variable. In the 8-bit
       library, the value is always less than 256. In the 16-bit library the
       value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
       value can be up to 0x10ffff, and up to 0xffffffff when not using
       UTF-32 mode.

         PCRE2_INFO_FRAMESIZE

       Return the size (in bytes) of the data frames that are used to
       remember backtracking positions when the pattern is processed by
       pcre2_match() without the use of JIT. The third argument should point
       to a size_t variable. The frame size depends on the number of
       capturing parentheses in the pattern. Each additional capturing group
       adds two PCRE2_SIZE variables.

         PCRE2_INFO_HASBACKSLASHC

       Return 1 if the pattern contains any instances of \C, otherwise 0. The
       third argument should point to a uint32_t variable.

         PCRE2_INFO_HASCRORLF

       Return 1 if the pattern contains any explicit matches for CR or LF
       characters, otherwise 0. The third argument should point to a uint32_t
       variable. An explicit match is either a literal CR or LF character, or
       \r or \n or one of the equivalent hexadecimal or octal escape
       sequences.

         PCRE2_INFO_HEAPLIMIT

       If the pattern set a heap memory limit by including an item of the
       form (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third
       argument should point to a uint32_t integer. If no such value has been
       set, the call to pcre2_pattern_info() returns the error
       PCRE2_ERROR_UNSET. Note that this limit will only be used during
       matching if it is less than the limit set or defaulted by the caller
       of the match function.

         PCRE2_INFO_JCHANGED

       Return 1 if the (?J) or (?-J) option setting is used in the pattern,
       otherwise 0. The third argument should point to a uint32_t variable.
       (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option,
       respectively.

         PCRE2_INFO_JITSIZE

       If the compiled pattern was successfully processed by
       pcre2_jit_compile(), return the size of the JIT compiled code,
       otherwise return zero. The third argument should point to a size_t
       variable.

         PCRE2_INFO_LASTCODETYPE

       Returns 1 if there is a rightmost literal code unit that must exist in
       any matched string, other than at its start. The third argument should
       point to a uint32_t variable. If there is no such value, 0 is
       returned. When 1 is returned, the code unit value itself can be
       retrieved using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last
       literal value is recorded only if it follows something of variable
       length. For example, for the pattern /^a\d+z\d+/ the returned value is
       1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/
       the returned value is 0.

         PCRE2_INFO_LASTCODEUNIT

       Return the value of the rightmost literal code unit that must exist in
       any matched string, other than at its start, for a pattern where
       PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third
       argument should point to a uint32_t variable.

         PCRE2_INFO_MATCHEMPTY

       Return 1 if the pattern might match an empty string, otherwise 0. The
       third argument should point to a uint32_t variable. When a pattern
       contains recursive subroutine calls it is not always possible to
       determine whether or not it can match an empty string. PCRE2 takes a
       cautious approach and returns 1 in such cases.

         PCRE2_INFO_MATCHLIMIT

       If the pattern set a match limit by including an item of the form
       (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third
       argument should point to a uint32_t integer.
       If no such value has been set, the call to pcre2_pattern_info()
       returns the error PCRE2_ERROR_UNSET. Note that this limit will only be
       used during matching if it is less than the limit set or defaulted by
       the caller of the match function.

         PCRE2_INFO_MAXLOOKBEHIND

       Return the number of characters (not code units) in the longest
       lookbehind assertion in the pattern. The third argument should point
       to a uint32_t integer. This information is useful when doing
       multi-segment matching using the partial matching facilities. Note
       that the simple assertions \b and \B require a one-character
       lookbehind. \A also registers a one-character lookbehind, though it
       does not actually inspect the previous character. This is to ensure
       that at least one character from the old segment is retained when a
       new segment is processed. Otherwise, if there are no lookbehinds in
       the pattern, \A might match incorrectly at the start of a second or
       subsequent segment.

         PCRE2_INFO_MINLENGTH

       If a minimum length for matching subject strings was computed, its
       value is returned. Otherwise the returned value is 0. The value is a
       number of characters, which in UTF mode may be different from the
       number of code units. The third argument should point to a uint32_t
       variable. The value is a lower bound to the length of any matching
       string. There may not be any strings of that length that do actually
       match, but every string that does match is at least that long.

         PCRE2_INFO_NAMECOUNT
         PCRE2_INFO_NAMEENTRYSIZE
         PCRE2_INFO_NAMETABLE

       PCRE2 supports the use of named as well as numbered capturing
       parentheses. The names are just an additional way of identifying the
       parentheses, which still acquire numbers.
       Several convenience functions such as pcre2_substring_get_byname() are
       provided for extracting captured substrings by name. It is also
       possible to extract the data directly, by first converting the name to
       a number in order to access the correct pointers in the output vector
       (described with pcre2_match() below). To do the conversion, you need
       to use the name-to-number map, which is described by these three
       values.

       The map consists of a number of fixed-size entries.
       PCRE2_INFO_NAMECOUNT gives the number of entries, and
       PCRE2_INFO_NAMEENTRYSIZE gives the size of each entry in code units;
       both of these return a uint32_t value. The entry size depends on the
       length of the longest name.

       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the
       table. This is a PCRE2_SPTR pointer to a block of code units. In the
       8-bit library, the first two bytes of each entry are the number of the
       capturing parenthesis, most significant byte first. In the 16-bit
       library, the pointer points to 16-bit code units, the first of which
       contains the parenthesis number. In the 32-bit library, the pointer
       points to 32-bit code units, the first of which contains the
       parenthesis number. The rest of the entry is the corresponding name,
       zero terminated.

       The names are in alphabetical order. If (?| is used to create multiple
       groups with the same number, as described in the section on duplicate
       subpattern numbers in the pcre2pattern page, the groups may be given
       the same name, but there is only one entry in the table. Different
       names for groups of the same number are not permitted.

       Duplicate names for subpatterns with different numbers are permitted,
       but only if PCRE2_DUPNAMES is set. They appear in the table in the
       order in which they were found in the pattern.
       In the absence of (?| this is the order of increasing number; when (?|
       is used this is not necessarily the case because later subpatterns may
       have lower numbers.

       As a simple example of the name/number table, consider the following
       pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
       is set, so white space - including newlines - is ignored):

         (?<date> (?<year>(\d\d)?\d\d) -
         (?<month>\d\d) - (?<day>\d\d) )

       There are four named subpatterns, so the table has four entries, and
       each entry in the table is eight bytes long. The table is as follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes
       shown as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??

       When writing code to extract data from named subpatterns using the
       name-to-number map, remember that the length of the entries is likely
       to be different for each compiled pattern.

         PCRE2_INFO_NEWLINE

       The output is one of the following uint32_t values:

         PCRE2_NEWLINE_CR        Carriage return (CR)
         PCRE2_NEWLINE_LF        Linefeed (LF)
         PCRE2_NEWLINE_CRLF      Carriage return, linefeed (CRLF)
         PCRE2_NEWLINE_ANY       Any Unicode line ending
         PCRE2_NEWLINE_ANYCRLF   Any of CR, LF, or CRLF
         PCRE2_NEWLINE_NUL       The NUL character (binary zero)

       This identifies the character sequence that will be recognized as
       meaning "newline" while matching.

         PCRE2_INFO_SIZE

       Return the size of the compiled pattern in bytes (for all three
       libraries). The third argument should point to a size_t variable. This
       value includes the size of the general data block that precedes the
       code units of the compiled pattern itself.
       The value that is used when pcre2_compile() is getting memory in which
       to place the compiled pattern may be slightly larger than the value
       returned by this option, because there are cases where the code that
       calculates the size has to over-estimate. Processing a pattern with
       the JIT compiler does not alter the value returned by this option.


INFORMATION ABOUT A PATTERN'S CALLOUTS

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       A script language that supports the use of string arguments in
       callouts might like to scan all the callouts in a pattern before
       running the match. This can be done by calling
       pcre2_callout_enumerate(). The first argument is a pointer to a
       compiled pattern, the second points to a callback function, and the
       third is arbitrary user data. The callback function is called for
       every callout in the pattern in the order in which they appear. Its
       first argument is a pointer to a callout enumeration block, and its
       second argument is the user_data value that was passed to
       pcre2_callout_enumerate(). The contents of the callout enumeration
       block are described in the pcre2callout documentation, which also
       gives further details about callouts.


SERIALIZATION AND PRECOMPILING

       It is possible to save compiled patterns on disc or elsewhere, and
       reload them later, subject to a number of restrictions. The host on
       which the patterns are reloaded must be running the same version of
       PCRE2, with the same code unit width, and must also have the same
       endianness, pointer width, and PCRE2_SIZE type. Before compiled
       patterns can be saved, they must be converted to a "serialized" form,
       which in the case of PCRE2 is really just a bytecode dump.
       The functions whose names begin with pcre2_serialize_ are used for
       converting to and from the serialized form. They are described in the
       pcre2serialize documentation. Note that PCRE2 serialization does not
       convert compiled patterns to an abstract format like Java or .NET
       serialization.


THE MATCH DATA BLOCK

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       void pcre2_match_data_free(pcre2_match_data *match_data);

       Information about a successful or unsuccessful match is placed in a
       match data block, which is an opaque structure that is accessed by
       function calls. In particular, the match data block contains a vector
       of offsets into the subject string that define the matched part of the
       subject and any substrings that were captured. This is known as the
       ovector.

       Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
       you must create a match data block by calling one of the creation
       functions above. For pcre2_match_data_create(), the first argument is
       the number of pairs of offsets in the ovector. One pair of offsets is
       required to identify the string that matched the whole pattern, with
       an additional pair for each captured substring. For example, a value
       of 4 creates enough space to record the matched portion of the subject
       plus three captured substrings. A minimum of one pair is imposed by
       pcre2_match_data_create(), so it is always possible to return the
       overall matched string.

       The second argument of pcre2_match_data_create() is a pointer to a
       general context, which can specify custom memory management for
       obtaining the memory for the match data block.
       If you are not using custom memory management, pass NULL, which causes
       malloc() to be used.

       For pcre2_match_data_create_from_pattern(), the first argument is a
       pointer to a compiled pattern. The ovector is created to be exactly
       the right size to hold all the substrings a pattern might capture. The
       second argument is again a pointer to a general context, but in this
       case if NULL is passed, the memory is obtained using the same
       allocator that was used for the compiled pattern (custom or default).

       A match data block can be used many times, with the same or different
       compiled patterns. You can extract information from a match data block
       after a match operation has finished, using functions that are
       described in the sections on matched strings and other match data
       below.

       When a call of pcre2_match() fails, valid data is available in the
       match block only when the error is PCRE2_ERROR_NOMATCH,
       PCRE2_ERROR_PARTIAL, or one of the error codes for an invalid UTF
       string. Exactly what is available depends on the error, and is
       detailed below.

       When one of the matching functions is called, pointers to the compiled
       pattern and the subject string are set in the match data block so that
       they can be referenced by the extraction functions. After running a
       match, you must not free a compiled pattern or a subject string until
       after all operations on the match data block (for that match) have
       taken place.

       When a match data block itself is no longer needed, it should be freed
       by calling pcre2_match_data_free(). If this function is called with a
       NULL argument, it returns immediately, without doing anything.


MATCHING A PATTERN: THE TRADITIONAL FUNCTION

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       The function pcre2_match() is called to match a subject string against
       a compiled pattern, which is passed in the code argument. You can call
       pcre2_match() with the same code argument as many times as you like,
       in order to find multiple matches in the subject string or to match
       different subject strings with the same pattern.

       This function is the main matching facility of the library, and it
       operates in a Perl-like manner. For specialist use there is also an
       alternative matching function, which is described below in the section
       about the pcre2_dfa_match() function.

       Here is an example of a simple call to pcre2_match():

         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_match(
           re,                         /* result of pcre2_compile() */
           (PCRE2_SPTR)"some string",  /* the subject string */
           11,                         /* the length of the subject string */
           0,                          /* start at offset 0 in the subject */
           0,                          /* default options */
           md,                         /* the match data block */
           NULL);                      /* a match context; NULL means use defaults */

       If the subject string is zero-terminated, the length can be given as
       PCRE2_ZERO_TERMINATED. A match context must be provided if certain
       less common matching parameters are to be changed. For details, see
       the section on the match context above.

   The string to be matched by pcre2_match()

       The subject string is passed to pcre2_match() as a pointer in subject,
       a length in length, and a starting offset in startoffset. The length
       and offset are in code units, not characters.
       That is, they are in bytes for the 8-bit library, 16-bit code units
       for the 16-bit library, and 32-bit code units for the 32-bit library,
       whether or not UTF processing is enabled.

       If startoffset is greater than the length of the subject,
       pcre2_match() returns PCRE2_ERROR_BADOFFSET. When the starting offset
       is zero, the search for a match starts at the beginning of the
       subject, and this is by far the most common case. In UTF-8 or UTF-16
       mode, the starting offset must point to the start of a character, or
       to the end of the subject (in UTF-32 mode, one code unit equals one
       character, so all offsets are valid). Like the pattern string, the
       subject may contain binary zeros.

       A non-zero starting offset is useful when searching for another match
       in the same subject by calling pcre2_match() again after a previous
       success. Setting startoffset differs from passing over a shortened
       string and setting PCRE2_NOTBOL in the case of a pattern that begins
       with any kind of lookbehind. For example, consider the pattern

         \Biss\B

       which finds occurrences of "iss" in the middle of words. (\B matches
       only if the current position in the subject is not a word boundary.)
       When applied to the string "Mississippi" the first call to
       pcre2_match() finds the first occurrence. If pcre2_match() is called
       again with just the remainder of the subject, namely "issippi", it
       does not match, because \B is always false at the start of the
       subject, which is deemed to be a word boundary. However, if
       pcre2_match() is passed the entire string again, but with startoffset
       set to 4, it finds the second occurrence of "iss" because it is able
       to look behind the starting point to discover that it is preceded by a
       letter.

       Finding all the matches in a subject is tricky when the pattern can
       match an empty string.
       It is possible to emulate Perl's /g behaviour by first trying the
       match again at the same offset, with the PCRE2_NOTEMPTY_ATSTART and
       PCRE2_ANCHORED options, and then if that fails, advancing the starting
       offset and trying an ordinary match again. There is some code that
       demonstrates how to do this in the pcre2demo sample program. In the
       most general case, you have to check to see if the newline convention
       recognizes CRLF as a newline, and if so, and the current character is
       CR followed by LF, advance the starting offset by two characters
       instead of one.

       If a non-zero starting offset is passed when the pattern is anchored,
       a single attempt to match at the given offset is made. This can only
       succeed if the pattern does not require the match to be at the start
       of the subject. In other words, the anchoring must be the result of
       setting the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL,
       not by starting the pattern with ^ or \A.

   Option bits for pcre2_match()

       The unused bits of the options argument for pcre2_match() must be
       zero. The only bits that may be set are PCRE2_ANCHORED,
       PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
       PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK,
       PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their action is described
       below.

       Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not
       supported by the just-in-time (JIT) compiler. If it is set, JIT
       matching is disabled and the interpretive code in pcre2_match() is
       run. Apart from PCRE2_NO_JIT (obviously), the remaining options are
       supported for JIT matching.

         PCRE2_ANCHORED

       The PCRE2_ANCHORED option limits pcre2_match() to matching at the
       first matching position.
       If a pattern was compiled with PCRE2_ANCHORED, or turned out to be
       anchored by virtue of its contents, it cannot be made unanchored at
       matching time. Note that setting the option at match time disables JIT
       matching.

         PCRE2_ENDANCHORED

       If the PCRE2_ENDANCHORED option is set, any string that pcre2_match()
       matches must be right at the end of the subject string. Note that
       setting the option at match time disables JIT matching.

         PCRE2_NOTBOL

       This option specifies that the first character of the subject string
       is not the beginning of a line, so the circumflex metacharacter should
       not match before it. Setting this without having set PCRE2_MULTILINE
       at compile time causes circumflex never to match. This option affects
       only the behaviour of the circumflex metacharacter. It does not affect
       \A.

         PCRE2_NOTEOL

       This option specifies that the end of the subject string is not the
       end of a line, so the dollar metacharacter should not match it nor
       (except in multiline mode) a newline immediately before it. Setting
       this without having set PCRE2_MULTILINE at compile time causes dollar
       never to match. This option affects only the behaviour of the dollar
       metacharacter. It does not affect \Z or \z.

         PCRE2_NOTEMPTY

       An empty string is not considered to be a valid match if this option
       is set. If there are alternatives in the pattern, they are tried. If
       all the alternatives match the empty string, the entire match fails.
       For example, if the pattern

         a?b?

       is applied to a string not beginning with "a" or "b", it matches an
       empty string at the start of the subject. With PCRE2_NOTEMPTY set,
       this match is not valid, so pcre2_match() searches further into the
       string for occurrences of "a" or "b".

         PCRE2_NOTEMPTY_ATSTART

       This is like PCRE2_NOTEMPTY, except that it locks out an empty string
       match only at the first matching position, that is, at the start of
       the subject plus the starting offset. An empty string match later in
       the subject is permitted. If the pattern is anchored, such a match can
       occur only if the pattern contains \K.

         PCRE2_NO_JIT

       By default, if a pattern has been successfully processed by
       pcre2_jit_compile(), JIT is automatically used when pcre2_match() is
       called with options that JIT supports. Setting PCRE2_NO_JIT disables
       the use of JIT; it forces matching to be done by the interpreter.

         PCRE2_NO_UTF_CHECK

       When PCRE2_UTF is set at compile time, the validity of the subject as
       a UTF string is checked by default when pcre2_match() is subsequently
       called. If a non-zero starting offset is given, the check is applied
       only to that part of the subject that could be inspected during
       matching, and there is a check that the starting offset points to the
       first code unit of a character or to the end of the subject. If there
       are no lookbehind assertions in the pattern, the check starts at the
       starting offset. Otherwise, it starts at the length of the longest
       lookbehind before the starting offset, or at the start of the subject
       if there are not that many characters before the starting offset. Note
       that the sequences \b and \B are one-character lookbehinds.

       The check is carried out before any other processing takes place, and
       a negative error code is returned if the check fails. There are
       several UTF error codes for each code unit width, corresponding to
       different problems with the code unit sequence. There are discussions
       about the validity of UTF-8 strings, UTF-16 strings, and UTF-32
       strings in the pcre2unicode page.
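
       The two rules above (the starting offset must fall on a character
       boundary, and the check begins a number of characters before it) can
       be sketched for UTF-8 in plain C. This is an illustration of the
       rules only, not the code PCRE2 uses; both function names are
       hypothetical, and the second function assumes the subject is valid
       UTF-8 and the offset already lies on a character boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/* Returns 1 if `offset` is acceptable as a starting offset: it is either the
   first code unit of a character or the end of the subject. */
int utf8_offset_is_char_start(const uint8_t *subject, size_t length,
                              size_t offset)
{
    assert(subject != NULL);
    if (offset > length)
        return 0;
    if (offset == length)
        return 1;
    return !is_continuation(subject[offset]);
}

/* Returns the offset at which a validity check would begin: `max_lookbehind`
   characters before `offset`, or 0 if the subject has fewer characters than
   that before the starting offset. */
size_t utf8_check_start(const uint8_t *subject, size_t offset,
                        uint32_t max_lookbehind)
{
    size_t p = offset;

    assert(subject != NULL);
    while (max_lookbehind > 0 && p > 0) {
        p--;                                   /* step into previous char */
        while (p > 0 && is_continuation(subject[p]))
            p--;                               /* back up to its lead byte */
        max_lookbehind--;
    }
    return p;
}
```

       The max_lookbehind argument corresponds to the value that
       PCRE2_INFO_MAXLOOKBEHIND reports for a compiled pattern.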

       If you know that your subject is valid, and you want to skip these
       checks for performance reasons, you can set the PCRE2_NO_UTF_CHECK
       option when calling pcre2_match(). You might want to do this for the
       second and subsequent calls to pcre2_match() if you are making
       repeated calls to find other matches in the same subject string.

       Warning: When PCRE2_NO_UTF_CHECK is set, the effect of passing an
       invalid string as a subject, or an invalid value of startoffset, is
       undefined. Your program may crash or loop indefinitely.

         PCRE2_PARTIAL_HARD
         PCRE2_PARTIAL_SOFT

       These options turn on the partial matching feature. A partial match
       occurs if the end of the subject string is reached successfully, but
       there are not enough subject characters to complete the match. If this
       happens when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set,
       matching continues by testing any remaining alternatives. Only if no
       complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
       PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that
       the caller is prepared to handle a partial match, but only if no
       complete match can be found.

       If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
       case, if a partial match is found, pcre2_match() immediately returns
       PCRE2_ERROR_PARTIAL, without considering any other alternatives. In
       other words, when PCRE2_PARTIAL_HARD is set, a partial match is
       considered to be more important than an alternative complete match.

       There is a more detailed discussion of partial and multi-segment
       matching, with examples, in the pcre2partial documentation.


NEWLINE HANDLING WHEN MATCHING

       When PCRE2 is built, a default newline convention is set; this is
       usually the standard convention for the operating system.
       The default can be overridden in a compile context by calling
       pcre2_set_newline(). It can also be overridden by starting a pattern
       string with, for example, (*CRLF), as described in the section on
       newline conventions in the pcre2pattern page. During matching, the
       newline choice affects the behaviour of the dot, circumflex, and
       dollar metacharacters. It may also alter the way the match starting
       position is advanced after a match failure for an unanchored pattern.

       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY
       is set as the newline convention, and a match attempt for an
       unanchored pattern fails when the current starting position is at a
       CRLF sequence, and the pattern contains no explicit matches for CR or
       LF characters, the match position is advanced by two characters
       instead of one, in other words, to after the CRLF.

       The above rule is a compromise that makes the most common cases work
       as expected. For example, if the pattern is .+A (and the PCRE2_DOTALL
       option is not set), it does not match the string "\r\nA" because,
       after failing at the start, it skips both the CR and the LF before
       retrying. However, the pattern [\r\n]A does match that string, because
       it contains an explicit CR or LF reference, and so advances only by
       one character after the first failure.

       An explicit match for CR or LF is either a literal appearance of one
       of those characters in the pattern, or one of the \r or \n or
       equivalent octal or hexadecimal escape sequences. Implicit matches
       such as [^X] do not count, nor does \s, even though it includes CR and
       LF in the characters that it matches.

       Notwithstanding the above, anomalous effects may still occur when CRLF
       is a valid newline sequence and explicit \r or \n escapes appear in
       the pattern.
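
       The advancing rule described in this section can be expressed as a
       small decision function. The sketch below is an illustration only and
       is not taken from the PCRE2 sources; the enum, the function name, and
       the parameters are hypothetical, and it assumes the 8-bit library in
       non-UTF mode, so that one character is one code unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the newline conventions; these are not the
   PCRE2_NEWLINE_xxx constants from pcre2.h. */
enum newline_kind { NL_CR, NL_LF, NL_CRLF, NL_ANY, NL_ANYCRLF, NL_NUL };

/* Returns the next starting position after a failed match attempt at `pos`
   for an unanchored pattern. `pattern_has_cr_or_lf` plays the role of the
   value reported by PCRE2_INFO_HASCRORLF. */
size_t advance_after_failure(const char *subject, size_t length, size_t pos,
                             enum newline_kind nl, int pattern_has_cr_or_lf)
{
    int crlf_is_newline = (nl == NL_CRLF || nl == NL_ANY || nl == NL_ANYCRLF);

    assert(subject != NULL && pos < length);

    /* Skip over a whole CRLF only when the convention recognizes it and the
       pattern has no explicit CR or LF match. */
    if (crlf_is_newline && !pattern_has_cr_or_lf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;

    return pos + 1;
}
```

       For the subject "\r\nA" this gives position 2 for a pattern such as
       .+A and position 1 for [\r\n]A, which matches the behaviour described
       in the example above.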


HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       In general, a pattern matches a certain portion of the subject, and in
       addition, further substrings from the subject may be picked out by
       parenthesized parts of the pattern. Following the usage in Jeffrey
       Friedl's book, this is called "capturing" in what follows, and the
       phrase "capturing subpattern" or "capturing group" is used for a
       fragment of a pattern that picks out a substring. PCRE2 supports
       several other kinds of parenthesized subpattern that do not cause
       substrings to be captured. The pcre2_pattern_info() function can be
       used to find out how many capturing subpatterns there are in a
       compiled pattern.

       You can use auxiliary functions for accessing captured substrings by
       number or by name, as described in sections below.

       Alternatively, you can make direct use of the vector of PCRE2_SIZE
       values, called the ovector, which contains the offsets of captured
       strings. It is part of the match data block. The function
       pcre2_get_ovector_pointer() returns the address of the ovector, and
       pcre2_get_ovector_count() returns the number of pairs of values it
       contains.

       Within the ovector, the first in each pair of values is set to the
       offset of the first code unit of a substring, and the second is set to
       the offset of the first code unit after the end of a substring. These
       values are always code unit offsets, not character offsets. That is,
       they are byte offsets in the 8-bit library, 16-bit offsets in the
       16-bit library, and 32-bit offsets in the 32-bit library.

       After a partial match (error return PCRE2_ERROR_PARTIAL), only the
       first pair of offsets (that is, ovector[0] and ovector[1]) are set.
       They identify the part of the subject that was partially matched. See
       the pcre2partial documentation for details of partial matching.

       After a fully successful match, the first pair of offsets identifies
       the portion of the subject string that was matched by the entire
       pattern. The next pair is used for the first captured substring, and
       so on. The value returned by pcre2_match() is one more than the highest
       numbered pair that has been set. For example, if two substrings have
       been captured, the returned value is 3. If there are no captured
       substrings, the return value from a successful match is 1, indicating
       that just the first pair of offsets has been set.

       If a pattern uses the \K escape sequence within a positive assertion,
       the reported start of a successful match can be greater than the end of
       the match. For example, if the pattern (?=ab\K) is matched against
       "ab", the start and end offset values for the match are 2 and 0.

       If a capturing subpattern group is matched repeatedly within a single
       match operation, it is the last portion of the subject that it matched
       that is returned.

       If the ovector is too small to hold all the captured substring offsets,
       as much as possible is filled in, and the function returns a value of
       zero. If captured substrings are not of interest, pcre2_match() may be
       called with a match data block whose ovector is of minimum length (that
       is, one pair).

       It is possible for capturing subpattern number n+1 to match some part
       of the subject when subpattern n has not been used at all. For example,
       if the string "abc" is matched against the pattern (a|(z))(bc) the
       return from the function is 4, and subpatterns 1 and 3 are matched, but
       2 is not. When this happens, both values in the offset pairs
       corresponding to unused subpatterns are set to PCRE2_UNSET.
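
       The rules above for reading the ovector can be sketched without
       linking against the library. The example below is an illustration
       only: it uses a plain size_t array in place of the real ovector, and
       SKETCH_UNSET is a stand-in for PCRE2_UNSET. The function name
       print_groups() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_UNSET (~(size_t)0)   /* stand-in for PCRE2_UNSET */

/* Print every group covered by the return value rc of a successful match
   (rc is one more than the highest numbered pair that was set). Returns
   the number of groups that hold an ordinary substring. */
int print_groups(const char *subject, const size_t *ovector, int rc)
{
    int printed = 0;
    assert(rc >= 1);
    for (int i = 0; i < rc; i++) {
        size_t start = ovector[2 * i];      /* first code unit of group i */
        size_t end = ovector[2 * i + 1];    /* first code unit after it */
        if (start == SKETCH_UNSET) {
            printf("%d: <unset>\n", i);
            continue;
        }
        if (start > end) {                  /* possible with \K in an assertion */
            printf("%d: <start after end>\n", i);
            continue;
        }
        printf("%d: %.*s\n", i, (int)(end - start), subject + start);
        printed++;
    }
    return printed;
}
```

       For the (a|(z))(bc) example, the ovector would hold the pairs (0,3),
       (0,1), (unset,unset), (1,3) with a return value of 4, so the sketch
       reports groups 0, 1, and 3 and marks group 2 as unset.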

       Offset values that correspond to unused subpatterns at the end of the
       expression are also set to PCRE2_UNSET. For example, if the string
       "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3
       are not matched. The return from the function is 2, because the
       highest used capturing subpattern number is 1. The offsets for the
       second and third capturing subpatterns (assuming the vector is large
       enough, of course) are set to PCRE2_UNSET.

       Elements in the ovector that do not correspond to capturing parentheses
       in the pattern are never changed. That is, if a pattern contains n
       capturing parentheses, no more than ovector[0] to ovector[2n+1] are
       set by pcre2_match(). The other elements retain whatever values they
       previously had. After a failed match attempt, the contents of the
       ovector are unchanged.


OTHER INFORMATION ABOUT A MATCH

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);

       As well as the offsets in the ovector, other information about a match
       is retained in the match data block and can be retrieved by the above
       functions in appropriate circumstances. If they are called at other
       times, the result is undefined.

       After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
       failure to match (PCRE2_ERROR_NOMATCH), a (*MARK), (*PRUNE), or (*THEN)
       name may be available. The function pcre2_get_mark() can be called to
       access this name. The same function applies to all three verbs. It
       returns a pointer to the zero-terminated name, which is within the
       compiled pattern. If no name is available, NULL is returned. The
       length of the name (excluding the terminating zero) is stored in the
       code unit that precedes the name.
       You should use this length instead of relying on the terminating zero
       if the name might contain a binary zero.

       After a successful match, the name that is returned is the last
       (*MARK), (*PRUNE), or (*THEN) name encountered on the matching path
       through the pattern. Instances of (*PRUNE) and (*THEN) without names
       are ignored. Thus, for example, if the matching path contains
       (*MARK:A)(*PRUNE), the name "A" is returned. After a "no match" or a
       partial match, the last encountered name is returned. For example,
       consider this pattern:

         ^(*MARK:A)((*MARK:B)a|b)c

       When it matches "bc", the returned name is A. The B mark is "seen" in
       the first branch of the group, but it is not on the matching path. On
       the other hand, when this pattern fails to match "bx", the returned
       name is B.

       Warning: By default, certain start-of-match optimizations are used to
       give a fast "no match" result in some situations. For example, if the
       anchoring is removed from the pattern above, there is an initial check
       for the presence of "c" in the subject before running the matching
       engine. This check fails for "bx", causing a match failure without
       seeing any marks. You can disable the start-of-match optimizations by
       setting the PCRE2_NO_START_OPTIMIZE option for pcre2_compile() or
       starting the pattern with (*NO_START_OPT).

       After a successful match, a partial match, or one of the invalid UTF
       errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
       be called. After a successful or partial match it returns the code unit
       offset of the character at which the match started. For a non-partial
       match, this can be different to the value of ovector[0] if the pattern
       contains the \K escape sequence. After a partial match, however, this
       value is always the same as ovector[0] because \K does not affect the
       result of a partial match.
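
       The point made above about the length of a mark name can be shown
       with a small sketch. It assumes 8-bit code units and uses a
       hand-built buffer in place of a compiled pattern; the function name
       mark_name_length() is hypothetical. In a real program the pointer
       would be the non-NULL value returned by pcre2_get_mark().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for 8-bit code units: mark points at the first code
   unit of a (*MARK), (*PRUNE), or (*THEN) name. The code unit just before
   it holds the name length, excluding the terminating zero. */
size_t mark_name_length(const uint8_t *mark)
{
    assert(mark != NULL);
    return (size_t)mark[-1];
}
```

       For a stored name of the three code units 'A', zero, 'B', the code
       unit before the name holds 3, whereas scanning for the terminating
       zero would stop after the first code unit.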

       After a UTF check failure, pcre2_get_startchar() can be used to obtain
       the code unit offset of the invalid UTF character. Details are given in
       the pcre2unicode page.


ERROR RETURNS FROM pcre2_match()

       If pcre2_match() fails, it returns a negative number. This can be
       converted to a text string by calling the pcre2_get_error_message()
       function (see "Obtaining a textual error message" below). Negative
       error codes are also returned by other functions, and are documented
       with them. The codes are given names in the header file. If UTF
       checking is in force and an invalid UTF subject string is detected,
       one of a number of UTF-specific negative error codes is returned.
       Details are given in the pcre2unicode page. The following are the
       other errors that may be returned by pcre2_match():

         PCRE2_ERROR_NOMATCH

       The subject string did not match the pattern.

         PCRE2_ERROR_PARTIAL

       The subject string did not match, but it did match partially. See the
       pcre2partial documentation for details of partial matching.

         PCRE2_ERROR_BADMAGIC

       PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
       to catch the case when it is passed a junk pointer. This is the error
       that is returned when the magic number is not present.

         PCRE2_ERROR_BADMODE

       This error is given when a compiled pattern is passed to a function in
       a library of a different code unit width, for example, a pattern
       compiled by the 8-bit library is passed to a 16-bit or 32-bit library
       function.

         PCRE2_ERROR_BADOFFSET

       The value of startoffset was greater than the length of the subject.

         PCRE2_ERROR_BADOPTION

       An unrecognized bit was set in the options argument.

         PCRE2_ERROR_BADUTFOFFSET

       The UTF code unit sequence that was passed as a subject was checked and
       found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the
       value of startoffset did not point to the beginning of a UTF character
       or the end of the subject.

         PCRE2_ERROR_CALLOUT

       This error is never generated by pcre2_match() itself. It is provided
       for use by callout functions that want to cause pcre2_match() or
       pcre2_callout_enumerate() to return a distinctive error code. See the
       pcre2callout documentation for details.

         PCRE2_ERROR_DEPTHLIMIT

       The nested backtracking depth limit was reached.

         PCRE2_ERROR_HEAPLIMIT

       The heap limit was reached.

         PCRE2_ERROR_INTERNAL

       An unexpected internal error has occurred. This error could be caused
       by a bug in PCRE2 or by overwriting of the compiled pattern.

         PCRE2_ERROR_JIT_STACKLIMIT

       This error is returned when a pattern that was successfully studied
       using JIT is being matched, but the memory available for the
       just-in-time processing stack is not large enough. See the pcre2jit
       documentation for more details.

         PCRE2_ERROR_MATCHLIMIT

       The backtracking match limit was reached.

         PCRE2_ERROR_NOMEMORY

       If a pattern contains many nested backtracking points, heap memory is
       used to remember them. This error is given when the memory allocation
       function (default or custom) fails. Note that a different error,
       PCRE2_ERROR_HEAPLIMIT, is given if the amount of memory needed exceeds
       the heap limit.

         PCRE2_ERROR_NULL

       Either the code, subject, or match_data argument was passed as NULL.

         PCRE2_ERROR_RECURSELOOP

       This error is returned when pcre2_match() detects a recursion loop
       within the pattern.
       Specifically, it means that either the whole pattern or a subpattern
       has been called recursively for the second time at the same position
       in the subject string. Some simple patterns that might do this are
       detected and faulted at compile time, but more complicated cases, in
       particular mutual recursions between two different subpatterns, cannot
       be detected until matching is attempted.


OBTAINING A TEXTUAL ERROR MESSAGE

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       A text message for an error code from any PCRE2 function (compile,
       match, or auxiliary) can be obtained by calling
       pcre2_get_error_message(). The code is passed as the first argument,
       with the remaining two arguments specifying a code unit buffer and its
       length in code units, into which the text message is placed. The
       message is returned in code units of the appropriate width for the
       library that is being used.

       The returned message is terminated with a trailing zero, and the
       function returns the number of code units used, excluding the trailing
       zero. If the error number is unknown, the negative error code
       PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the
       message is truncated (but still with a trailing zero), and the
       negative error code PCRE2_ERROR_NOMEMORY is returned. None of the
       messages are very long; a buffer size of 120 code units is ample.


EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       Captured substrings can be accessed directly by using the ovector as
       described above. For convenience, auxiliary functions are provided for
       extracting captured substrings as new, separate, zero-terminated
       strings. A substring that contains a binary zero is correctly extracted
       and has a further zero added on the end, but the result is not, of
       course, a C string.

       The functions in this section identify substrings by number. The number
       zero refers to the entire matched substring, with higher numbers
       referring to substrings captured by parenthesized groups. After a
       partial match, only substring zero is available. An attempt to extract
       any other substring gives the error PCRE2_ERROR_PARTIAL. The next
       section describes similar functions for extracting captured substrings
       by name.

       If a pattern uses the \K escape sequence within a positive assertion,
       the reported start of a successful match can be greater than the end of
       the match. For example, if the pattern (?=ab\K) is matched against
       "ab", the start and end offset values for the match are 2 and 0. In
       this situation, calling these functions with a zero substring number
       extracts a zero-length empty string.

       You can find the length in code units of a captured substring without
       extracting it by calling pcre2_substring_length_bynumber().
       The first argument is a pointer to the match data block, the second is
       the group number, and the third is a pointer to a variable into which
       the length is placed. If you just want to know whether or not the
       substring has been captured, you can pass the third argument as NULL.

       The pcre2_substring_copy_bynumber() function copies a captured
       substring into a supplied buffer, whereas
       pcre2_substring_get_bynumber() copies it into new memory, obtained
       using the same memory allocation function that was used for the match
       data block. The first two arguments of these functions are a pointer
       to the match data block and a capturing group number.

       The final arguments of pcre2_substring_copy_bynumber() are a pointer to
       the buffer and a pointer to a variable that contains its length in code
       units. This is updated to contain the actual number of code units used
       for the extracted substring, excluding the terminating zero.

       For pcre2_substring_get_bynumber() the third and fourth arguments point
       to variables that are updated with a pointer to the new memory and the
       number of code units that comprise the substring, again excluding the
       terminating zero. When the substring is no longer needed, the memory
       should be freed by calling pcre2_substring_free().

       The return value from all these functions is zero for success, or a
       negative error code. If the pattern match failed, the match failure
       code is returned. If a substring number greater than zero is used
       after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
       error codes are:

         PCRE2_ERROR_NOMEMORY

       The buffer was too small for pcre2_substring_copy_bynumber(), or the
       attempt to get memory failed for pcre2_substring_get_bynumber().

         PCRE2_ERROR_NOSUBSTRING

       There is no substring with that number in the pattern, that is, the
       number is greater than the number of capturing parentheses.

         PCRE2_ERROR_UNAVAILABLE

       The substring number, though not greater than the number of captures in
       the pattern, is greater than the number of slots in the ovector, so the
       substring could not be captured.

         PCRE2_ERROR_UNSET

       The substring did not participate in the match. For example, if the
       pattern is (abc)|(def) and the subject is "def", and the ovector
       contains at least two capturing slots, substring number 1 is unset.


EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);

       void pcre2_substring_list_free(PCRE2_SPTR *list);

       The pcre2_substring_list_get() function extracts all available
       substrings and builds a list of pointers to them. It also (optionally)
       builds a second list that contains their lengths (in code units),
       excluding a terminating zero that is added to each of them. All this is
       done in a single block of memory that is obtained using the same memory
       allocation function that was used to get the match data block.

       This function must be called only after a successful match. If called
       after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.

       The address of the memory block is returned via listptr, which is also
       the start of the list of string pointers. The end of the list is marked
       by a NULL pointer. The address of the list of lengths is returned via
       lengthsptr. If your strings do not contain binary zeros and you do not
       therefore need the lengths, you may supply NULL as the lengthsptr
       argument to disable the creation of a list of lengths.
       The yield of the function is zero if all went well, or
       PCRE2_ERROR_NOMEMORY if the memory block could not be obtained. When
       the list is no longer needed, it should be freed by calling
       pcre2_substring_list_free().

       If this function encounters a substring that is unset, which can happen
       when capturing subpattern number n+1 matches some part of the subject,
       but subpattern n has not been used at all, it returns an empty string.
       This can be distinguished from a genuine zero-length substring by
       inspecting the appropriate offset in the ovector, which contains
       PCRE2_UNSET for unset substrings, or by calling
       pcre2_substring_length_bynumber().


EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       To extract a substring by name, you first have to find the associated
       number. For example, for this pattern:

         (a+)b(?<xxx>\d+)...

       the number of the subpattern called "xxx" is 2. If the name is known to
       be unique (PCRE2_DUPNAMES was not set), you can find the number from
       the name by calling pcre2_substring_number_from_name(). The first
       argument is the compiled pattern, and the second is the name. The
       yield of the function is the subpattern number,
       PCRE2_ERROR_NOSUBSTRING if there is no subpattern of that name, or
       PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one subpattern of
       that name.
       Given the number, you can extract the substring directly from the
       ovector, or use one of the "bynumber" functions described above.

       For convenience, there are also "byname" functions that correspond to
       the "bynumber" functions, the only difference being that the second
       argument is a name instead of a number. If PCRE2_DUPNAMES is set and
       there are duplicate names, these functions scan all the groups with the
       given name, and return the first named string that is set.

       If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
       returned. If all groups with the name have numbers that are greater
       than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is
       returned. If there is at least one group with a slot in the ovector,
       but no group is found to be set, PCRE2_ERROR_UNSET is returned.

       Warning: If the pattern uses the (?| feature to set up multiple
       subpatterns with the same number, as described in the section on
       duplicate subpattern numbers in the pcre2pattern page, you cannot use
       names to distinguish the different subpatterns, because names are not
       included in the compiled code. The matching process uses only numbers.
       For this reason, the use of different names for subpatterns of the
       same number causes an error at compile time.


CREATING A NEW STRING WITH SUBSTITUTIONS

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);

       This function calls pcre2_match() and then makes a copy of the subject
       string in outputbuffer, replacing the part that was matched with the
       replacement string, whose length is supplied in rlength.
       This can be given as PCRE2_ZERO_TERMINATED for a zero-terminated
       string. Matches in which a \K item in a lookahead in the pattern causes
       the match to end before it starts are not supported, and give rise to
       an error return. For global replacements, matches in which \K in a
       lookbehind causes the match to start earlier than the point that was
       reached in the previous iteration are also not supported.

       The first seven arguments of pcre2_substitute() are the same as for
       pcre2_match(), except that the partial matching options are not
       permitted, and match_data may be passed as NULL, in which case a match
       data block is obtained and freed within this function, using memory
       management functions from the match context, if provided, or else
       those that were used to allocate memory for the compiled code.

       If an external match_data block is provided, its contents afterwards
       are those set by the final call to pcre2_match(), which will have ended
       in a matching error. The contents of the ovector within the match data
       block may or may not have been changed.

       The outlengthptr argument must point to a variable that contains the
       length, in code units, of the output buffer. If the function is
       successful, the value is updated to contain the length of the new
       string, excluding the trailing zero that is automatically added.

       If the function is not successful, the value set via outlengthptr
       depends on the type of error. For syntax errors in the replacement
       string, the value is the offset in the replacement string where the
       error was detected. For other errors, the value is PCRE2_UNSET by
       default. This includes the case of the output buffer being too small,
       unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which
       case the value is the minimum length needed, including space for the
       trailing zero.
       Note that in order to compute the required length, pcre2_substitute()
       has to simulate all the matching and copying, instead of giving an
       error return as soon as the buffer overflows. Note also that the
       length is in code units, not bytes.

       In the replacement string, which is interpreted as a UTF string in UTF
       mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
       option is set, a dollar character is an escape character that can
       specify the insertion of characters from capturing groups or (*MARK),
       (*PRUNE), or (*THEN) items in the pattern. The following forms are
       always recognized:

         $$                  insert a dollar character
         $<n> or ${<n>}      insert the contents of group <n>
         $*MARK or ${*MARK}  insert a (*MARK), (*PRUNE), or (*THEN) name

       Either a group number or a group name can be given for <n>. Curly
       brackets are required only if the following character would be
       interpreted as part of the number or name. The number may be zero to
       include the entire matched string. For example, if the pattern a(b)c
       is matched with "=abc=" and the replacement string "+$1$0$1+", the
       result is "=+babcb+=".

       $*MARK inserts the name from the last encountered (*MARK), (*PRUNE), or
       (*THEN) on the matching path that has a name. (*MARK) must always
       include a name, but (*PRUNE) and (*THEN) need not. For example, in the
       case of (*MARK:A)(*PRUNE) the name inserted is "A", but for
       (*MARK:A)(*PRUNE:B) the relevant name is "B". This facility can be
       used to perform simple simultaneous substitutions, as this pcre2test
       example shows:

         /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
             apple lemon
          2: pear orange

       As well as the usual options for pcre2_match(), a number of additional
       options can be set in the options argument of pcre2_substitute().
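
       The simple dollar forms listed above can be modelled in plain C. The
       following is a minimal sketch, not the library's implementation: it
       handles only $$ and numbered groups ($<n> and ${<n>}), it ignores
       group names and $*MARK, and the function name expand_simple() and
       its parameters are hypothetical. Group texts are passed in as
       ordinary C strings, with NULL standing for an unset group.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model: expand replacement into out (outsize code units,
   including the trailing zero). groups[i] is the text of group i, or NULL
   if it is unset. Returns 0 on success, or -1 for a syntax error, an
   unknown or unset group, or an output buffer that is too small. */
int expand_simple(const char *replacement, const char *const *groups,
                  size_t ngroups, char *out, size_t outsize)
{
    size_t used = 0;
    assert(replacement != NULL && out != NULL);
    for (const char *p = replacement; *p != '\0'; ) {
        const char *text = p;       /* default: copy one literal character */
        size_t len = 1;
        if (*p != '$') {
            p++;
        } else if (p[1] == '$') {   /* $$ inserts a single dollar */
            p += 2;
        } else {
            int braced = (p[1] == '{');
            char *end;
            p += braced ? 2 : 1;
            if (!isdigit((unsigned char)*p)) return -1;
            unsigned long n = strtoul(p, &end, 10);
            p = end;
            if (braced && *p++ != '}') return -1;
            if (n >= ngroups || groups[n] == NULL) return -1;
            text = groups[n];
            len = strlen(text);
        }
        if (used + len + 1 > outsize) return -1;
        memcpy(out + used, text, len);
        used += len;
    }
    if (outsize == 0) return -1;
    out[used] = '\0';
    return 0;
}
```

       With group 0 set to "abc" and group 1 set to "b", the replacement
       "+$1$0$1+" expands to "+babcb+", which is the part that replaces
       "abc" in the "=abc=" example above. Writing "${1}0" rather than
       "$10" keeps the following digit out of the group number.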

       PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
       string, replacing every matching substring. If this option is not set,
       only the first matching substring is replaced. The search for matches
       takes place in the original subject string (that is, previous
       replacements do not affect it). Iteration is implemented by advancing
       the startoffset value for each search, which is always passed the
       entire subject string. If an offset limit is set in the match context,
       searching stops when that limit is reached.

       You can restrict the effect of a global substitution to a portion of
       the subject string by setting either or both of startoffset and an
       offset limit. Here is a pcre2test example:

         /B/g,replace=!,use_offset_limit
         ABC ABC ABC ABC\=offset=3,offset_limit=12
          2: ABC A!C A!C ABC

       When continuing with global substitutions after matching a substring
       with zero length, an attempt to find a non-empty match at the same
       offset is performed. If this is not successful, the offset is advanced
       by one character except when CRLF is a valid newline sequence and the
       next two characters are CR, LF. In this case, the offset is advanced
       by two characters.

       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
       buffer is too small. The default action is to return
       PCRE2_ERROR_NOMEMORY immediately. If this option is set, however,
       pcre2_substitute() continues to go through the motions of matching and
       substituting (without, of course, writing anything) in order to
       compute the size of buffer that is needed. This value is passed back
       via the outlengthptr variable, with the result of the function still
       being PCRE2_ERROR_NOMEMORY.

       Passing a buffer size of zero is a permitted way of finding out how
       much memory is needed for a given substitution.
       However, this does mean that the entire operation is carried out
       twice. Depending on the application, it may be more efficient to
       allocate a large buffer and free the excess afterwards, instead of
       using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.

       PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups
       that do not appear in the pattern to be treated as unset groups. This
       option should be used with care, because it means that a typo in a
       group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING
       error.

       PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including
       unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be
       treated as empty strings when inserted as described above. If this
       option is not set, an attempt to insert an unset group causes the
       PCRE2_ERROR_UNSET error. This option does not influence the extended
       substitution syntax described below.

       PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
       replacement string. Without this option, only the dollar character is
       special, and only the group insertion forms listed above are valid.
       When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:

       Firstly, backslash in a replacement string is interpreted as an escape
       character. The usual forms such as \n or \x{ddd} can be used to specify
       particular character codes, and backslash followed by any
       non-alphanumeric character quotes that character. Extended quoting can
       be coded using \Q...\E, exactly as in pattern strings.

       There are also four escape sequences for forcing the case of inserted
       letters. The insertion mechanism has three states: no case forcing,
       force upper case, and force lower case.
       The escape sequences change the current state: \U and \L change to
       upper or lower case forcing, respectively, and \E (when not
       terminating a \Q quoted sequence) reverts to no case forcing. The
       sequences \u and \l force the next character (if it is a letter) to
       upper or lower case, respectively, and then the state automatically
       reverts to no case forcing. Case forcing applies to all inserted
       characters, including those from captured groups and letters within
       \Q...\E quoted sequences.

       Note that case forcing sequences such as \U...\E do not nest. For
       example, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the
       final \E has no effect.

       The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
       flexibility to group substitution. The syntax is similar to that used
       by Bash:

         ${<n>:-<string>}
         ${<n>:+<string1>:<string2>}

       As before, <n> may be a group number or a name. The first form
       specifies a default value. If group <n> is set, its value is inserted;
       if not, <string> is expanded and the result inserted. The second form
       specifies strings that are expanded and inserted when group <n> is set
       or unset, respectively. The first form is just a convenient shorthand
       for

         ${<n>:+${<n>}:<string>}

       Backslash can be used to escape colons and closing curly brackets in
       the replacement strings. A change of the case forcing state within a
       replacement string remains in force afterwards, as shown in this
       pcre2test example:

         /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
             body
          1: hello
             somebody
          1: HELLO

       The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
       substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause
       unknown groups in the extended syntax forms to be treated as unset.
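
       The case forcing states described above behave like a small state
       machine, which the following sketch models. It is an illustration
       under stated assumptions rather than the library's code: it
       recognizes only \U, \L, \E, \u, and \l, copies every other character
       unchanged, uses the C locale's toupper() and tolower(), and lets
       \u or \l expire after the next character whether or not that
       character is a letter. The names apply_case_forcing() and enum force
       are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum force { FORCE_NONE, FORCE_UPPER, FORCE_LOWER,
             FORCE_UPPER_ONCE, FORCE_LOWER_ONCE };

/* Hypothetical model: copy in to out, applying the case forcing escapes.
   out must have room for strlen(in) + 1 characters. Returns the number of
   characters written, excluding the trailing zero. */
size_t apply_case_forcing(const char *in, char *out)
{
    enum force state = FORCE_NONE;
    size_t n = 0;
    assert(in != NULL && out != NULL);
    for (const char *p = in; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0' && strchr("ULEul", p[1]) != NULL) {
            switch (*++p) {         /* an escape only changes the state */
            case 'U': state = FORCE_UPPER; break;
            case 'L': state = FORCE_LOWER; break;
            case 'E': state = FORCE_NONE; break;
            case 'u': state = FORCE_UPPER_ONCE; break;
            default:  state = FORCE_LOWER_ONCE; break;   /* 'l' */
            }
            continue;
        }
        unsigned char c = (unsigned char)*p;
        if (state == FORCE_UPPER || state == FORCE_UPPER_ONCE)
            c = (unsigned char)toupper(c);
        else if (state == FORCE_LOWER || state == FORCE_LOWER_ONCE)
            c = (unsigned char)tolower(c);
        if (state == FORCE_UPPER_ONCE || state == FORCE_LOWER_ONCE)
            state = FORCE_NONE;     /* \u and \l affect one character only */
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

       Because each escape simply overwrites the single state variable,
       nothing is stacked, which is why "\Uaa\LBB\Ecc\E" yields "AAbbcc"
       and the final \E changes nothing.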
3270 3271 If successful, pcre2_substitute() returns the number of replacements 3272 that were made. This may be zero if no matches were found, and is never 3273 greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set. 3274 3275 In the event of an error, a negative error code is returned. Except for 3276 PCRE2_ERROR_NOMATCH (which is never returned), errors from 3277 pcre2_match() are passed straight back. 3278 3279 PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser- 3280 tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set. 3281 3282 PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ- 3283 ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) 3284 when the simple (non-extended) syntax is used and PCRE2_SUBSTI- 3285 TUTE_UNSET_EMPTY is not set. 3286 3287 PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big 3288 enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size 3289 of buffer that is needed is returned via outlengthptr. Note that this 3290 does not happen by default. 3291 3292 PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in 3293 the replacement string, with more particular errors being 3294 PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP- 3295 MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI- 3296 TUTION (syntax error in extended group substitution), and 3297 PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it started 3298 or the match started earlier than the current position in the subject, 3299 which can happen if \K is used in an assertion). 3300 3301 As for all PCRE2 errors, a text message that describes the error can be 3302 obtained by calling the pcre2_get_error_message() function (see 3303 "Obtaining a textual error message" above). 


DUPLICATE SUBPATTERN NAMES

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       When a pattern is compiled with the PCRE2_DUPNAMES option, names for
       subpatterns are not required to be unique. Duplicate names are always
       allowed for subpatterns with the same number, created by using the (?|
       feature. Indeed, if such subpatterns are named, they are required to
       use the same names.

       Normally, patterns with duplicate names are such that in any one match,
       only one of the named subpatterns participates. An example is shown in
       the pcre2pattern documentation.

       When duplicates are present, pcre2_substring_copy_byname() and
       pcre2_substring_get_byname() return the first substring corresponding
       to the given name that is set. Only if none are set is
       PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
       function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
       duplicate names.

       If you want to get full details of all captured substrings for a given
       name, you must use the pcre2_substring_nametable_scan() function. The
       first argument is the compiled pattern, and the second is the name. If
       the third and fourth arguments are NULL, the function returns a group
       number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.

       When the third and fourth arguments are not NULL, they must be pointers
       to variables that are updated by the function. After it has run, they
       point to the first and last entries in the name-to-number table for the
       given name, and the function returns the length of each entry in code
       units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
       no entries for the given name.

       The format of the name table is described above in the section entitled
       Information about a pattern.
Given all the relevant entries for the 3343 name, you can extract each of their numbers, and hence the captured 3344 data. 3345 3346 3347FINDING ALL POSSIBLE MATCHES AT ONE POSITION 3348 3349 The traditional matching function uses a similar algorithm to Perl, 3350 which stops when it finds the first match at a given point in the sub- 3351 ject. If you want to find all possible matches, or the longest possible 3352 match at a given position, consider using the alternative matching 3353 function (see below) instead. If you cannot use the alternative func- 3354 tion, you can kludge it up by making use of the callout facility, which 3355 is described in the pcre2callout documentation. 3356 3357 What you have to do is to insert a callout right at the end of the pat- 3358 tern. When your callout function is called, extract and save the cur- 3359 rent matched substring. Then return 1, which forces pcre2_match() to 3360 backtrack and try other alternatives. Ultimately, when it runs out of 3361 matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH. 3362 3363 3364MATCHING A PATTERN: THE ALTERNATIVE FUNCTION 3365 3366 int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, 3367 PCRE2_SIZE length, PCRE2_SIZE startoffset, 3368 uint32_t options, pcre2_match_data *match_data, 3369 pcre2_match_context *mcontext, 3370 int *workspace, PCRE2_SIZE wscount); 3371 3372 The function pcre2_dfa_match() is called to match a subject string 3373 against a compiled pattern, using a matching algorithm that scans the 3374 subject string just once (not counting lookaround assertions), and does 3375 not backtrack. This has different characteristics to the normal algo- 3376 rithm, and is not compatible with Perl. Some of the features of PCRE2 3377 patterns are not supported. Nevertheless, there are times when this 3378 kind of matching can be useful. 
       For a discussion of the two matching algorithms, and a list of features
       that pcre2_dfa_match() does not support, see the pcre2matching
       documentation.

       The arguments for the pcre2_dfa_match() function are the same as for
       pcre2_match(), plus two extras. The ovector within the match data block
       is used in a different way, and this is described below. The other
       common arguments are used in the same way as for pcre2_match(), so
       their description is not repeated here.

       The two additional arguments provide workspace for the function. The
       workspace vector should contain at least 20 elements. It is used for
       keeping track of multiple paths through the pattern tree. More
       workspace is needed for patterns and subjects where there are a lot of
       potential matches.

       Here is an example of a simple call to pcre2_dfa_match():

         int wspace[20];
         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_dfa_match(
           re,             /* result of pcre2_compile() */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           md,             /* the match data block */
           NULL,           /* a match context; NULL means use defaults */
           wspace,         /* working space vector */
           20);            /* number of elements (NOT size in bytes) */

   Option bits for pcre2_dfa_match()

       The unused bits of the options argument for pcre2_dfa_match() must be
       zero. The only bits that may be set are PCRE2_ANCHORED,
       PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
       PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
       PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but
       the last four of these are exactly the same as for pcre2_match(), so
       their description is not repeated here.

         PCRE2_PARTIAL_HARD
         PCRE2_PARTIAL_SOFT

       These have the same general effect as they do for pcre2_match(), but
       the details are slightly different. When PCRE2_PARTIAL_HARD is set for
       pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
       subject is reached and there is still at least one matching possibility
       that requires additional characters. This happens even if some complete
       matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
       return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
       if the end of the subject is reached, there have been no complete
       matches, but there is still at least one matching possibility. The
       portion of the string that was inspected when the longest partial match
       was found is set as the first matching string in both cases. There is a
       more detailed discussion of partial and multi-segment matching, with
       examples, in the pcre2partial documentation.

         PCRE2_DFA_SHORTEST

       Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
       stop as soon as it has found one match. Because of the way the
       alternative algorithm works, this is necessarily the shortest possible
       match at the first possible matching point in the subject string.

         PCRE2_DFA_RESTART

       When pcre2_dfa_match() returns a partial match, it is possible to call
       it again, with additional subject characters, and have it continue with
       the same match. The PCRE2_DFA_RESTART option requests this action; when
       it is set, the workspace and wscount arguments must reference the same
       vector as before because data about the match so far is left in it
       after a partial match. There is more discussion of this facility in the
       pcre2partial documentation.
3452 3453 Successful returns from pcre2_dfa_match() 3454 3455 When pcre2_dfa_match() succeeds, it may have matched more than one sub- 3456 string in the subject. Note, however, that all the matches from one run 3457 of the function start at the same point in the subject. The shorter 3458 matches are all initial substrings of the longer matches. For example, 3459 if the pattern 3460 3461 <.*> 3462 3463 is matched against the string 3464 3465 This is <something> <something else> <something further> no more 3466 3467 the three matched strings are 3468 3469 <something> <something else> <something further> 3470 <something> <something else> 3471 <something> 3472 3473 On success, the yield of the function is a number greater than zero, 3474 which is the number of matched substrings. The offsets of the sub- 3475 strings are returned in the ovector, and can be extracted by number in 3476 the same way as for pcre2_match(), but the numbers bear no relation to 3477 any capturing groups that may exist in the pattern, because DFA match- 3478 ing does not support group capture. 3479 3480 Calls to the convenience functions that extract substrings by name 3481 return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used 3482 after a DFA match. The convenience functions that extract substrings by 3483 number never return PCRE2_ERROR_NOSUBSTRING. 3484 3485 The matched strings are stored in the ovector in reverse order of 3486 length; that is, the longest matching string is first. If there were 3487 too many matches to fit into the ovector, the yield of the function is 3488 zero, and the vector is filled with the longest matches. 3489 3490 NOTE: PCRE2's "auto-possessification" optimization usually applies to 3491 character repeats at the end of a pattern (as well as internally). For 3492 example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA 3493 matching, this means that only one possible match is found. 
If you 3494 really do want multiple matches in such cases, either use an ungreedy 3495 repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when 3496 compiling. 3497 3498 Error returns from pcre2_dfa_match() 3499 3500 The pcre2_dfa_match() function returns a negative number when it fails. 3501 Many of the errors are the same as for pcre2_match(), as described 3502 above. There are in addition the following errors that are specific to 3503 pcre2_dfa_match(): 3504 3505 PCRE2_ERROR_DFA_UITEM 3506 3507 This return is given if pcre2_dfa_match() encounters an item in the 3508 pattern that it does not support, for instance, the use of \C in a UTF 3509 mode or a backreference. 3510 3511 PCRE2_ERROR_DFA_UCOND 3512 3513 This return is given if pcre2_dfa_match() encounters a condition item 3514 that uses a backreference for the condition, or a test for recursion in 3515 a specific group. These are not supported. 3516 3517 PCRE2_ERROR_DFA_WSSIZE 3518 3519 This return is given if pcre2_dfa_match() runs out of space in the 3520 workspace vector. 3521 3522 PCRE2_ERROR_DFA_RECURSE 3523 3524 When a recursive subpattern is processed, the matching function calls 3525 itself recursively, using private memory for the ovector and workspace. 3526 This error is given if the internal ovector is not large enough. This 3527 should be extremely rare, as a vector of size 1000 is used. 3528 3529 PCRE2_ERROR_DFA_BADRESTART 3530 3531 When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option, 3532 some plausibility checks are made on the contents of the workspace, 3533 which should contain data about the previous partial match. If any of 3534 these checks fail, this error is given. 3535 3536 3537SEE ALSO 3538 3539 pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3), 3540 pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3). 3541 3542 3543AUTHOR 3544 3545 Philip Hazel 3546 University Computing Service 3547 Cambridge, England. 
3548 3549 3550REVISION 3551 3552 Last updated: 07 September 2018 3553 Copyright (c) 1997-2018 University of Cambridge. 3554------------------------------------------------------------------------------ 3555 3556 3557PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3) 3558 3559 3560 3561NAME 3562 PCRE2 - Perl-compatible regular expressions (revised API) 3563 3564BUILDING PCRE2 3565 3566 PCRE2 is distributed with a configure script that can be used to build 3567 the library in Unix-like environments using the applications known as 3568 Autotools. Also in the distribution are files to support building using 3569 CMake instead of configure. The text file README contains general 3570 information about building with Autotools (some of which is repeated 3571 below), and also has some comments about building on various operating 3572 systems. There is a lot more information about building PCRE2 without 3573 using Autotools (including information about using CMake and building 3574 "by hand") in the text file called NON-AUTOTOOLS-BUILD. You should 3575 consult this file as well as the README file if you are building in a 3576 non-Unix-like environment. 3577 3578 3579PCRE2 BUILD-TIME OPTIONS 3580 3581 The rest of this document describes the optional features of PCRE2 that 3582 can be selected when the library is compiled. It assumes use of the 3583 configure script, where the optional features are selected or dese- 3584 lected by providing options to configure before running the make com- 3585 mand. However, the same options can be selected in both Unix-like and 3586 non-Unix-like environments if you are using CMake instead of configure 3587 to build PCRE2. 3588 3589 If you are not using Autotools or CMake, option selection can be done 3590 by editing the config.h file, or by passing parameter settings to the 3591 compiler, as described in NON-AUTOTOOLS-BUILD. 
3592 3593 The complete list of options for configure (which includes the standard 3594 ones such as the selection of the installation directory) can be 3595 obtained by running 3596 3597 ./configure --help 3598 3599 The following sections include descriptions of "on/off" options whose 3600 names begin with --enable or --disable. Because of the way that config- 3601 ure works, --enable and --disable always come in pairs, so the comple- 3602 mentary option always exists as well, but as it specifies the default, 3603 it is not described. Options that specify values have names that start 3604 with --with. At the end of a configure run, a summary of the configura- 3605 tion is output. 3606 3607 3608BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES 3609 3610 By default, a library called libpcre2-8 is built, containing functions 3611 that take string arguments contained in arrays of bytes, interpreted 3612 either as single-byte characters, or UTF-8 strings. You can also build 3613 two other libraries, called libpcre2-16 and libpcre2-32, which process 3614 strings that are contained in arrays of 16-bit and 32-bit code units, 3615 respectively. These can be interpreted either as single-unit characters 3616 or UTF-16/UTF-32 strings. To build these additional libraries, add one 3617 or both of the following to the configure command: 3618 3619 --enable-pcre2-16 3620 --enable-pcre2-32 3621 3622 If you do not want the 8-bit library, add 3623 3624 --disable-pcre2-8 3625 3626 as well. At least one of the three libraries must be built. Note that 3627 the POSIX wrapper is for the 8-bit library only, and that pcre2grep is 3628 an 8-bit program. Neither of these are built if you select only the 3629 16-bit or 32-bit libraries. 3630 3631 3632BUILDING SHARED AND STATIC LIBRARIES 3633 3634 The Autotools PCRE2 building process uses libtool to build both shared 3635 and static libraries by default. 
You can suppress an unwanted library 3636 by adding one of 3637 3638 --disable-shared 3639 --disable-static 3640 3641 to the configure command. 3642 3643 3644UNICODE AND UTF SUPPORT 3645 3646 By default, PCRE2 is built with support for Unicode and UTF character 3647 strings. To build it without Unicode support, add 3648 3649 --disable-unicode 3650 3651 to the configure command. This setting applies to all three libraries. 3652 It is not possible to build one library with Unicode support, and 3653 another without, in the same configuration. 3654 3655 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, 3656 UTF-16 or UTF-32. To do that, applications that use the library can set 3657 the PCRE2_UTF option when they call pcre2_compile() to compile a pat- 3658 tern. Alternatively, patterns may be started with (*UTF) unless the 3659 application has locked this out by setting PCRE2_NEVER_UTF. 3660 3661 UTF support allows the libraries to process character code points up to 3662 0x10ffff in the strings that they handle. Unicode support also gives 3663 access to the Unicode properties of characters, using pattern escapes 3664 such as \P, \p, and \X. Only the general category properties such as Lu 3665 and Nd are supported. Details are given in the pcre2pattern documenta- 3666 tion. 3667 3668 Pattern escapes such as \d and \w do not by default make use of Unicode 3669 properties. The application can request that they do by setting the 3670 PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a 3671 pattern may also request this by starting with (*UCP). 3672 3673 3674DISABLING THE USE OF \C 3675 3676 The \C escape sequence, which matches a single code unit, even in a UTF 3677 mode, can cause unpredictable behaviour because it may leave the cur- 3678 rent matching point in the middle of a multi-code-unit character. The 3679 application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C 3680 option when calling pcre2_compile(). 
There is also a build-time option 3681 3682 --enable-never-backslash-C 3683 3684 (note the upper case C) which locks out the use of \C entirely. 3685 3686 3687JUST-IN-TIME COMPILER SUPPORT 3688 3689 Just-in-time (JIT) compiler support is included in the build by speci- 3690 fying 3691 3692 --enable-jit 3693 3694 This support is available only for certain hardware architectures. If 3695 this option is set for an unsupported architecture, a building error 3696 occurs. If in doubt, use 3697 3698 --enable-jit=auto 3699 3700 which enables JIT only if the current hardware is supported. You can 3701 check if JIT is enabled in the configuration summary that is output at 3702 the end of a configure run. If you are enabling JIT under SELinux you 3703 may also want to add 3704 3705 --enable-jit-sealloc 3706 3707 which enables the use of an execmem allocator in JIT that is compatible 3708 with SELinux. This has no effect if JIT is not enabled. See the 3709 pcre2jit documentation for a discussion of JIT usage. When JIT support 3710 is enabled, pcre2grep automatically makes use of it, unless you add 3711 3712 --disable-pcre2grep-jit 3713 3714 to the "configure" command. 3715 3716 3717NEWLINE RECOGNITION 3718 3719 By default, PCRE2 interprets the linefeed (LF) character as indicating 3720 the end of a line. This is the normal newline character on Unix-like 3721 systems. You can compile PCRE2 to use carriage return (CR) instead, by 3722 adding 3723 3724 --enable-newline-is-cr 3725 3726 to the configure command. There is also an --enable-newline-is-lf 3727 option, which explicitly specifies linefeed as the newline character. 3728 3729 Alternatively, you can specify that line endings are to be indicated by 3730 the two-character sequence CRLF (CR immediately followed by LF). If you 3731 want this, add 3732 3733 --enable-newline-is-crlf 3734 3735 to the configure command. 
There is a fourth option, specified by 3736 3737 --enable-newline-is-anycrlf 3738 3739 which causes PCRE2 to recognize any of the three sequences CR, LF, or 3740 CRLF as indicating a line ending. A fifth option, specified by 3741 3742 --enable-newline-is-any 3743 3744 causes PCRE2 to recognize any Unicode newline sequence. The Unicode 3745 newline sequences are the three just mentioned, plus the single charac- 3746 ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line, 3747 U+0085), LS (line separator, U+2028), and PS (paragraph separator, 3748 U+2029). The final option is 3749 3750 --enable-newline-is-nul 3751 3752 which causes NUL (binary zero) to be set as the default line-ending 3753 character. 3754 3755 Whatever default line ending convention is selected when PCRE2 is built 3756 can be overridden by applications that use the library. At build time 3757 it is recommended to use the standard for your operating system. 3758 3759 3760WHAT \R MATCHES 3761 3762 By default, the sequence \R in a pattern matches any Unicode newline 3763 sequence, independently of what has been selected as the line ending 3764 sequence. If you specify 3765 3766 --enable-bsr-anycrlf 3767 3768 the default is changed so that \R matches only CR, LF, or CRLF. What- 3769 ever is selected when PCRE2 is built can be overridden by applications 3770 that use the library. 3771 3772 3773HANDLING VERY LARGE PATTERNS 3774 3775 Within a compiled pattern, offset values are used to point from one 3776 part to another (for example, from an opening parenthesis to an alter- 3777 nation metacharacter). By default, in the 8-bit and 16-bit libraries, 3778 two-byte values are used for these offsets, leading to a maximum size 3779 for a compiled pattern of around 64 thousand code units. This is suffi- 3780 cient to handle all but the most gigantic patterns. 
Nevertheless, some 3781 people do want to process truly enormous patterns, so it is possible to 3782 compile PCRE2 to use three-byte or four-byte offsets by adding a set- 3783 ting such as 3784 3785 --with-link-size=3 3786 3787 to the configure command. The value given must be 2, 3, or 4. For the 3788 16-bit library, a value of 3 is rounded up to 4. In these libraries, 3789 using longer offsets slows down the operation of PCRE2 because it has 3790 to load additional data when handling them. For the 32-bit library the 3791 value is always 4 and cannot be overridden; the value of --with-link- 3792 size is ignored. 3793 3794 3795LIMITING PCRE2 RESOURCE USAGE 3796 3797 The pcre2_match() function increments a counter each time it goes round 3798 its main loop. Putting a limit on this counter controls the amount of 3799 computing resource used by a single call to pcre2_match(). The limit 3800 can be changed at run time, as described in the pcre2api documentation. 3801 The default is 10 million, but this can be changed by adding a setting 3802 such as 3803 3804 --with-match-limit=500000 3805 3806 to the configure command. This setting also applies to the 3807 pcre2_dfa_match() matching function, and to JIT matching (though the 3808 counting is done differently). 3809 3810 The pcre2_match() function starts out using a 20KiB vector on the sys- 3811 tem stack to record backtracking points. The more nested backtracking 3812 points there are (that is, the deeper the search tree), the more memory 3813 is needed. If the initial vector is not large enough, heap memory is 3814 used, up to a certain limit, which is specified in kibibytes (units of 3815 1024 bytes). The limit can be changed at run time, as described in the 3816 pcre2api documentation. The default limit (in effect unlimited) is 20 3817 million. You can change this by a setting such as 3818 3819 --with-heap-limit=500 3820 3821 which limits the amount of heap to 500 KiB. 
       This limit applies only to interpretive matching in pcre2_match() and
       pcre2_dfa_match(), which may also use the heap for internal workspace
       when processing complicated patterns. This limit does not apply when
       JIT (which has its own memory arrangements) is used.

       You can also explicitly limit the depth of nested backtracking in the
       pcre2_match() interpreter. This limit defaults to the value that is set
       for --with-match-limit. You can set a lower default limit by adding,
       for example,

         --with-match-limit-depth=10000

       to the configure command. This value can be overridden at run time.
       This depth limit indirectly limits the amount of heap memory that is
       used, but because the size of each backtracking "frame" depends on the
       number of capturing parentheses in a pattern, the amount of heap that
       is used before the limit is reached varies from pattern to pattern.
       This limit was more useful in versions before 10.30, where function
       recursion was used for backtracking.

       As well as applying to pcre2_match(), the depth limit also controls the
       depth of recursive function calls in pcre2_dfa_match(). These are used
       for lookaround assertions, atomic groups, and recursion within
       patterns. The limit does not apply to JIT matching.


CREATING CHARACTER TABLES AT BUILD TIME

       PCRE2 uses fixed tables for processing characters whose code points are
       less than 256. By default, PCRE2 is built with a set of tables that are
       distributed in the file src/pcre2_chartables.c.dist. These tables are
       for ASCII codes only. If you add

         --enable-rebuild-chartables

       to the configure command, the distributed tables are no longer used.
       Instead, a program called dftables is compiled and run. This outputs
       the source for a new set of tables, created in the default locale of
       your C run-time system.
This method of replacing the tables does not work if 3861 you are cross compiling, because dftables is run on the local host. If 3862 you need to create alternative tables when cross compiling, you will 3863 have to do so "by hand". 3864 3865 3866USING EBCDIC CODE 3867 3868 PCRE2 assumes by default that it will run in an environment where the 3869 character code is ASCII or Unicode, which is a superset of ASCII. This 3870 is the case for most computer operating systems. PCRE2 can, however, be 3871 compiled to run in an 8-bit EBCDIC environment by adding 3872 3873 --enable-ebcdic --disable-unicode 3874 3875 to the configure command. This setting implies --enable-rebuild-charta- 3876 bles. You should only use it if you know that you are in an EBCDIC 3877 environment (for example, an IBM mainframe operating system). 3878 3879 It is not possible to support both EBCDIC and UTF-8 codes in the same 3880 version of the library. Consequently, --enable-unicode and --enable- 3881 ebcdic are mutually exclusive. 3882 3883 The EBCDIC character that corresponds to an ASCII LF is assumed to have 3884 the value 0x15 by default. However, in some EBCDIC environments, 0x25 3885 is used. In such an environment you should use 3886 3887 --enable-ebcdic-nl25 3888 3889 as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR 3890 has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and 3891 0x25 is not chosen as LF is made to correspond to the Unicode NEL char- 3892 acter (which, in Unicode, is 0x85). 3893 3894 The options that select newline behaviour, such as --enable-newline-is- 3895 cr, and equivalent run-time options, refer to these character values in 3896 an EBCDIC environment. 3897 3898 3899PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS 3900 3901 By default, on non-Windows systems, pcre2grep supports the use of call- 3902 outs with string arguments within the patterns it is matching, in order 3903 to run external scripts. 
For details, see the pcre2grep documentation. 3904 This support can be disabled by adding --disable-pcre2grep-callout to 3905 the configure command. 3906 3907 3908PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT 3909 3910 By default, pcre2grep reads all files as plain text. You can build it 3911 so that it recognizes files whose names end in .gz or .bz2, and reads 3912 them with libz or libbz2, respectively, by adding one or both of 3913 3914 --enable-pcre2grep-libz 3915 --enable-pcre2grep-libbz2 3916 3917 to the configure command. These options naturally require that the rel- 3918 evant libraries are installed on your system. Configuration will fail 3919 if they are not. 3920 3921 3922PCRE2GREP BUFFER SIZE 3923 3924 pcre2grep uses an internal buffer to hold a "window" on the file it is 3925 scanning, in order to be able to output "before" and "after" lines when 3926 it finds a match. The default starting size of the buffer is 20KiB. The 3927 buffer itself is three times this size, but because of the way it is 3928 used for holding "before" lines, the longest line that is guaranteed to 3929 be processable is the notional buffer size. If a longer line is encoun- 3930 tered, pcre2grep automatically expands the buffer, up to a specified 3931 maximum size, whose default is 1MiB or the starting size, whichever is 3932 the larger. You can change the default parameter values by adding, for 3933 example, 3934 3935 --with-pcre2grep-bufsize=51200 3936 --with-pcre2grep-max-bufsize=2097152 3937 3938 to the configure command. The caller of pcre2grep can override these 3939 values by using --buffer-size and --max-buffer-size on the command 3940 line. 
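
       The buffer arithmetic described above can be worked through in a few
       lines of C. This is only a model of the stated rules, not pcre2grep
       code; the structure and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical record of the sizes that follow from a starting size. */
struct grep_buffer_plan {
  size_t allocated;        /* memory actually used for the buffer */
  size_t guaranteed_line;  /* longest line sure to be processable */
  size_t default_max;      /* default ceiling for automatic expansion */
};

/* Model of the rules in the text: the buffer is three times the notional
   size, and the default maximum is 1MiB or the starting size, whichever
   is the larger. */
struct grep_buffer_plan plan_grep_buffer(size_t bufsize)
{
  const size_t one_mib = (size_t)1024 * 1024;
  struct grep_buffer_plan plan;

  plan.allocated = 3 * bufsize;
  plan.guaranteed_line = bufsize;
  plan.default_max = (bufsize > one_mib) ? bufsize : one_mib;
  return plan;
}
```

       For the default starting size of 20KiB (20480 bytes), the model gives
       an allocation of 61440 bytes, a guaranteed line length of 20480 bytes,
       and a default maximum of 1048576 bytes.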


PCRE2TEST OPTION FOR LIBREADLINE SUPPORT

       If you add one of

         --enable-pcre2test-libreadline
         --enable-pcre2test-libedit

       to the configure command, pcre2test is linked with the libreadline or
       libedit library, respectively, and when its input is from a terminal,
       it reads it using the readline() function. This provides line-editing
       and history facilities. Note that libreadline is GPL-licensed, so if
       you distribute a binary of pcre2test linked in this way, there may be
       licensing issues. These can be avoided by linking instead with libedit,
       which has a BSD licence.

       Setting --enable-pcre2test-libreadline causes the -lreadline option to
       be added to the pcre2test build. In many operating environments with a
       system-installed readline library this is sufficient. However, in some
       environments (e.g. if an unmodified distribution version of readline is
       in use), some extra configuration may be necessary. The INSTALL file
       for libreadline says this:

         "Readline uses the termcap functions, but does not link with
         the termcap or curses library itself, allowing applications
         which link with readline the to choose an appropriate library."

       If your environment has not been set up so that an appropriate library
       is automatically included, you may need to add something like

         LIBS="-lncurses"

       immediately before the configure command.


INCLUDING DEBUGGING CODE

       If you add

         --enable-debug

       to the configure command, additional debugging code is included in the
       build. This feature is intended for use by the PCRE2 maintainers.


DEBUGGING WITH VALGRIND SUPPORT

       If you add

         --enable-valgrind

       to the configure command, PCRE2 will use valgrind annotations to mark
       certain memory regions as unaddressable.
       This allows it to detect invalid memory accesses, and is mostly useful
       for debugging PCRE2 itself.


CODE COVERAGE REPORTING

       If your C compiler is gcc, you can build a version of PCRE2 that can
       generate a code coverage report for its test suite. To enable this, you
       must install lcov version 1.6 or above. Then specify

         --enable-coverage

       to the configure command and build PCRE2 in the usual way.

       Note that using ccache (a caching C compiler) is incompatible with code
       coverage reporting. If you have configured ccache to run automatically
       on your system, you must set the environment variable

         CCACHE_DISABLE=1

       before running make to build PCRE2, so that ccache is not used.

       When --enable-coverage is used, the following additional targets are
       added to the Makefile:

         make coverage

       This creates a fresh coverage report for the PCRE2 test suite. It is
       equivalent to running "make coverage-reset", "make coverage-baseline",
       "make check", and then "make coverage-report".

         make coverage-reset

       This zeroes the coverage counters, but does nothing else.

         make coverage-baseline

       This captures baseline coverage information.

         make coverage-report

       This creates the coverage report.

         make coverage-clean-report

       This removes the generated coverage report without cleaning the
       coverage data itself.

         make coverage-clean-data

       This removes the captured coverage data without removing the coverage
       files created at compile time (*.gcno).

         make coverage-clean

       This cleans all coverage data including the generated coverage report.
       For more information about code coverage, see the gcov and lcov
       documentation.


SUPPORT FOR FUZZERS

       There is a special option for use by people who want to run fuzzing
       tests on PCRE2:

         --enable-fuzz-support

       At present this applies only to the 8-bit library. If set, it causes an
       extra library called libpcre2-fuzzsupport.a to be built, but not
       installed. This contains a single function called
       LLVMFuzzerTestOneInput() whose arguments are a pointer to a string and
       the length of the string. When called, this function tries to compile
       the string as a pattern, and if that succeeds, to match it. This is
       done both with no options and with some random option bits that are
       generated from the string.

       Setting --enable-fuzz-support also causes a binary called
       pcre2fuzzcheck to be created. This is normally run under valgrind or
       used when PCRE2 is compiled with address sanitizing enabled. It calls
       the fuzzing function and outputs information about what it is doing.
       The input strings are specified by arguments: if an argument starts
       with "=" the rest of it is a literal input string. Otherwise, it is
       assumed to be a file name, and the contents of the file are the test
       string.


OBSOLETE OPTION

       In versions of PCRE2 prior to 10.30, there were two ways of handling
       backtracking in the pcre2_match() function. The default was to use the
       system stack, but if

         --disable-stack-for-recursion

       was set, memory on the heap was used. From release 10.30 onwards this
       has changed (the stack is no longer used) and this option now does
       nothing except give a warning.


SEE ALSO

       pcre2api(3), pcre2-config(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 26 April 2018
       Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------


PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

SYNOPSIS

       #include <pcre2.h>

       int (*pcre2_callout)(pcre2_callout_block *, void *);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);


DESCRIPTION

       PCRE2 provides a feature called "callout", which is a means of
       temporarily passing control to the caller of PCRE2 in the middle of
       pattern matching. The caller of PCRE2 provides an external function by
       putting its entry point in a match context (see pcre2_set_callout() in
       the pcre2api documentation).

       Within a regular expression, (?C<arg>) indicates a point at which the
       external function is to be called. Different callout points can be
       identified by putting a number less than 256 after the letter C. The
       default value is zero. Alternatively, the argument may be a delimited
       string. The starting delimiter must be one of ` ' " ^ % # $ { and the
       ending delimiter is the same as the start, except for {, where the
       ending delimiter is }. If the ending delimiter is needed within the
       string, it must be doubled. For example, this pattern has two callout
       points:

         (?C1)abc(?C"some ""arbitrary"" text")def

       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
       PCRE2 automatically inserts callouts, all with number 255, before each
       item in the pattern except for immediately before or after an explicit
       callout.
       For example, if PCRE2_AUTO_CALLOUT is used with the pattern

         A(?C3)B

       it is processed as if it were

         (?C255)A(?C3)B(?C255)

       Here is a more complicated example:

         A(\d{2}|--)

       With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were

         (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

       Notice that there is a callout before and after each parenthesis and
       alternation bar. If the pattern contains a conditional group whose
       condition is an assertion, an automatic callout is inserted immediately
       before the condition. Such a callout may also be inserted explicitly,
       for example:

         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)

       This applies only to assertion conditions (because they are themselves
       independent groups).

       Callouts can be useful for tracking the progress of pattern matching.
       The pcre2test program has a pattern qualifier (/auto_callout) that sets
       automatic callouts. When any callouts are present, the output from
       pcre2test indicates how the pattern is being matched. This is useful
       information when you are trying to optimize the performance of a
       particular pattern.


MISSING CALLOUTS

       You should be aware that, because of optimizations in the way PCRE2
       compiles and matches patterns, callouts sometimes do not happen exactly
       as you might expect.

   Auto-possessification

       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
       that what follows cannot be part of the repeat. For example, a+[bc] is
       compiled as if it were a++[bc].
       The pcre2test output when this pattern is compiled with PCRE2_ANCHORED
       and PCRE2_AUTO_CALLOUT and then applied to the string "aaaa" is:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
         No match

       This indicates that when matching [bc] fails, there is no backtracking
       into a+ (because it is being treated as a++) and therefore the callouts
       that would be taken for the backtracks do not occur. You can disable
       the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
       pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In
       this case, the output changes to this:

         --->aaaa
          +0 ^        a+
          +2 ^   ^    [bc]
          +2 ^  ^     [bc]
          +2 ^ ^      [bc]
          +2 ^^       [bc]
         No match

       This time, when matching [bc] fails, the matcher backtracks into a+ and
       tries again, repeatedly, until a+ itself fails.

   Automatic .* anchoring

       By default, an optimization is applied when .* is the first significant
       item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
       any character, the pattern is automatically anchored. If PCRE2_DOTALL
       is not set, a match can start only after an internal newline or at the
       beginning of the subject, and pcre2_compile() remembers this. If a
       pattern has more than one top-level branch, automatic anchoring occurs
       if all branches are anchorable.

       This optimization is disabled, however, if .* is in an atomic group or
       if there is a backreference to the capturing group in which it appears.
       It is also disabled if the pattern contains (*PRUNE) or (*SKIP).
       However, the presence of callouts does not affect it.

       For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
       and applied to the string "aa", the pcre2test output is:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
         No match

       This shows that all match attempts start at the beginning of the
       subject. In other words, the pattern is anchored. You can disable this
       optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
       starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the
       output changes to:

         --->aa
          +0 ^      .*
          +2 ^ ^    \d
          +2 ^^     \d
          +2 ^      \d
          +0  ^     .*
          +2  ^^    \d
          +2  ^     \d
         No match

       This shows more match attempts, starting at the second subject
       character. Another optimization, described in the next section, means
       that there is no subsequent attempt to match with an empty subject.

   Other optimizations

       Other optimizations that provide fast "no match" results also affect
       callouts. For example, if the pattern is

         ab(?C4)cd

       PCRE2 knows that any matching string must contain the letter "d". If
       the subject string is "abyz", the lack of "d" means that matching
       doesn't ever start, and the callout is never reached. However, with
       "abyd", though the result is still no match, the callout is obeyed.

       For most patterns PCRE2 also knows the minimum length of a matching
       string, and will immediately give a "no match" return without actually
       running a match if the subject is not long enough, or, for unanchored
       patterns, if it has been scanned far enough.

       You can disable these optimizations by passing the
       PCRE2_NO_START_OPTIMIZE option to pcre2_compile(), or by starting the
       pattern with (*NO_START_OPT). This slows down the matching process, but
       does ensure that callouts such as the example above are obeyed.
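
       The effect of the two start-of-match checks described above on the
       ab(?C4)cd example can be modelled by the following sketch. It is an
       illustration only: the function name and its parameters are
       hypothetical, it is not part of the PCRE2 API, and the checks that
       PCRE2 actually performs are more elaborate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model of two start-of-match checks. Returns 1 if a match
   attempt (and therefore any callout) could start, or 0 if "no match"
   can be reported without running the matcher at all. */
int match_attempt_would_start(const char *subject, size_t length,
                              char required_char, size_t min_length)
{
    assert(subject != NULL);
    if (length < min_length)
        return 0;    /* the subject is too short to match */
    if (memchr(subject, required_char, length) == NULL)
        return 0;    /* a character that every match needs is absent */
    return 1;
}
```

       For the pattern above, the required character is "d" and the minimum
       length is 4. The model returns 0 for "abyz", so the callout is never
       reached, and 1 for "abyd", so the callout is obeyed even though the
       overall result is still no match.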


THE CALLOUT INTERFACE

       During matching, when PCRE2 reaches a callout point, if an external
       function is provided in the match context, it is called. This applies
       to normal, DFA, and JIT matching. The first argument to the callout
       function is a pointer to a pcre2_callout block. The second argument
       is the void * callout data that was supplied when the callout was set
       up by calling pcre2_set_callout() (see the pcre2api documentation). The
       callout block structure contains the following fields, not necessarily
       in this order:

         uint32_t      version;
         uint32_t      callout_number;
         uint32_t      capture_top;
         uint32_t      capture_last;
         uint32_t      callout_flags;
         PCRE2_SIZE   *offset_vector;
         PCRE2_SPTR    mark;
         PCRE2_SPTR    subject;
         PCRE2_SIZE    subject_length;
         PCRE2_SIZE    start_match;
         PCRE2_SIZE    current_position;
         PCRE2_SIZE    pattern_position;
         PCRE2_SIZE    next_item_length;
         PCRE2_SIZE    callout_string_offset;
         PCRE2_SIZE    callout_string_length;
         PCRE2_SPTR    callout_string;

       The version field contains the version number of the block format. The
       current version is 2; the three callout string fields were added for
       version 1, and the callout_flags field for version 2. If you are
       writing an application that might use an earlier release of PCRE2, you
       should check the version number before accessing any of these fields.
       The version number will increase in future if more fields are added,
       but the intention is never to remove any of the existing fields.

   Fields for numerical callouts

       For a numerical callout, callout_string is NULL, and callout_number
       contains the number of the callout, in the range 0-255. This is the
       number that follows (?C for callouts that are part of the pattern; it
       is 255 for automatically generated callouts.
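
       The version checks and the distinction between kinds of callout can
       be sketched as follows. The helper names are hypothetical and are not
       part of the PCRE2 API; in a real callout function the arguments would
       be taken from the version, callout_number, and callout_string fields
       of the callout block, with callout_string read only when the version
       check allows it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    CALLOUT_NUMERIC,     /* (?Cn) with n in the range 0-254 */
    CALLOUT_NUMBER_255,  /* an automatic callout, or an explicit (?C255) */
    CALLOUT_STRING       /* a callout with a string argument */
} callout_kind;

/* The three callout string fields exist from block version 1 onwards. */
int block_has_string_fields(uint32_t version) { return version >= 1; }

/* The callout_flags field exists from block version 2 onwards. */
int block_has_callout_flags(uint32_t version) { return version >= 2; }

/* Pass NULL for callout_string when the block is too old to contain it. */
callout_kind classify_callout(uint32_t callout_number,
                              const void *callout_string)
{
    if (callout_string != NULL)
        return CALLOUT_STRING;
    assert(callout_number <= 255);
    return (callout_number == 255) ? CALLOUT_NUMBER_255 : CALLOUT_NUMERIC;
}
```

       Note that the number 255 alone cannot distinguish an automatic
       callout from an explicit (?C255) in the pattern.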

   Fields for string callouts

       For callouts with string arguments, callout_number is always zero, and
       callout_string points to the string that is contained within the
       compiled pattern. Its length is given by callout_string_length.
       Duplicated ending delimiters that were present in the original pattern
       string have been turned into single characters, but there is no other
       processing of the callout string argument. An additional code unit
       containing binary zero is present after the string, but is not included
       in the length. The delimiter that was used to start the string is also
       stored within the pattern, immediately before the string itself. You
       can access this delimiter as callout_string[-1] if you need it.

       The callout_string_offset field is the code unit offset to the start of
       the callout argument string within the original pattern string. This is
       provided for the benefit of applications such as script languages that
       might need to report errors in the callout string within the pattern.

   Fields for all callouts

       The remaining fields in the callout block are the same for both kinds
       of callout.

       The offset_vector field is a pointer to a vector of capturing offsets
       (the "ovector"). You may read the elements in this vector, but you must
       not change any of them.

       For calls to pcre2_match(), the offset_vector field is not (since
       release 10.30) a pointer to the actual ovector that was passed to the
       matching function in the match data block. Instead it points to an
       internal ovector of a size large enough to hold all possible captured
       substrings in the pattern. Note that whenever a recursion or subroutine
       call within a pattern completes, the capturing state is reset to what
       it was before.
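
       The treatment of doubled ending delimiters in string callouts,
       described under "Fields for string callouts" above, can be modelled
       by the following sketch. It might be of use to an application that
       works on the original pattern text, for example when reporting an
       error at callout_string_offset. The function names are hypothetical
       and are not part of the PCRE2 API, and the code assumes 8-bit code
       units held in a zero-terminated C string.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the ending delimiter that belongs to a starting delimiter, or 0
   if the character is not one of the permitted starting delimiters. */
char callout_end_delimiter(char start)
{
    switch (start) {
    case '`': case '\'': case '"': case '^':
    case '%': case '#': case '$':
        return start;
    case '{':
        return '}';
    default:
        return 0;
    }
}

/* Copies the callout argument that begins at text[0] (the character just
   after the starting delimiter) into dst, turning each doubled ending
   delimiter into a single character. Copying stops at the first ending
   delimiter that is not doubled. Returns the length stored, with dst
   zero-terminated, or (size_t)-1 if the argument is unterminated or does
   not fit into dst. */
size_t collapse_callout_string(const char *text, char end,
                               char *dst, size_t dst_size)
{
    size_t n = 0;
    assert(text != NULL && dst != NULL);
    for (size_t i = 0; text[i] != '\0'; i++) {
        if (text[i] == end) {
            if (text[i + 1] != end) {      /* not doubled: end of argument */
                if (n >= dst_size) return (size_t)-1;
                dst[n] = '\0';
                return n;
            }
            i++;                           /* skip the second of the pair */
        }
        if (n + 1 >= dst_size) return (size_t)-1;
        dst[n++] = text[i];
    }
    return (size_t)-1;                     /* no ending delimiter was found */
}
```

       Applied to the text that follows the opening quote in the pattern
       (?C1)abc(?C"some ""arbitrary"" text")def, this stores the string
       some "arbitrary" text.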

       The capture_last field contains the number of the most recently
       captured substring, and the capture_top field contains one more than
       the number of the highest numbered captured substring so far. If no
       substrings have yet been captured, the value of capture_last is 0 and
       the value of capture_top is 1. The values of these fields do not always
       differ by one; for example, when the callout in the pattern
       ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.

       The contents of ovector[2] to ovector[<capture_top>*2-1] can be
       inspected in order to extract substrings that have been matched so far,
       in the same way as extracting substrings after a match has completed.
       The values in ovector[0] and ovector[1] are always PCRE2_UNSET because
       the match is by definition not complete. Substrings that have not been
       captured but whose numbers are less than capture_top also have both of
       their ovector slots set to PCRE2_UNSET.

       For DFA matching, the offset_vector field points to the ovector that
       was passed to the matching function in the match data block for
       callouts at the top level, but to an internal ovector during the
       processing of pattern recursions, lookarounds, and atomic groups.
       However, these ovectors hold no useful information because
       pcre2_dfa_match() does not support substring capturing. The value of
       capture_top is always 1 and the value of capture_last is always 0 for
       DFA matching.

       The subject and subject_length fields contain copies of the values that
       were passed to the matching function.

       The start_match field normally contains the offset within the subject
       at which the current match attempt started. However, if the escape
       sequence \K has been encountered, this value is changed to reflect the
       modified starting point.
       If the pattern is not anchored, the callout function may be called
       several times from the same point in the pattern for different starting
       points in the subject.

       The current_position field contains the offset within the subject of
       the current match pointer.

       The pattern_position field contains the offset in the pattern string to
       the next item to be matched.

       The next_item_length field contains the length of the next item to be
       processed in the pattern string. When the callout is at the end of the
       pattern, the length is zero. When the callout precedes an opening
       parenthesis, the length includes meta characters that follow the
       parenthesis. For example, in a callout before an assertion such as
       (?=ab) the length is 3. For an alternation bar or a closing
       parenthesis, the length is one, unless a closing parenthesis is
       followed by a quantifier, in which case its length is included. (This
       changed in release 10.23. In earlier releases, before an opening
       parenthesis the length was that of the entire subpattern, and before an
       alternation bar or a closing parenthesis the length was zero.)

       The pattern_position and next_item_length fields are intended to help
       in distinguishing between different automatic callouts, which all have
       the same callout number. However, they are set for all callouts, and
       are used by pcre2test to show the next item to be matched when
       displaying callout information.

       In callouts from pcre2_match() the mark field contains a pointer to the
       zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
       (*THEN) item in the match, or NULL if no such items have been passed.
       Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
       previous (*MARK). In callouts from the DFA matching function this field
       always contains NULL.
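
       As an illustration of the pattern_position and next_item_length
       fields described above, the following sketch extracts the text of
       the next pattern item. The function name is hypothetical and is not
       part of the PCRE2 API. The callout block does not contain the
       pattern string, so the sketch assumes that the application has kept
       its own copy (for example, by passing it as the callout data), and
       that 8-bit code units are in use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the next pattern item into buf, zero-terminated and truncated if
   necessary, and returns buf. The position and length arguments would
   come from the callout block in a real callout function. */
char *next_pattern_item(const char *pattern, size_t pattern_length,
                        size_t pattern_position, size_t next_item_length,
                        char *buf, size_t buf_size)
{
    size_t n = next_item_length;
    assert(pattern != NULL && buf != NULL && buf_size > 0);
    assert(pattern_position <= pattern_length);
    if (n > pattern_length - pattern_position)
        n = pattern_length - pattern_position;   /* defensive clamp */
    if (n > buf_size - 1)
        n = buf_size - 1;                        /* leave room for the zero */
    memcpy(buf, pattern + pattern_position, n);
    buf[n] = '\0';
    return buf;
}
```

       For a callout just before the assertion in the pattern A(?=ab)B,
       where pattern_position is 1 and next_item_length is 3, the result is
       the string "(?=". At the end of the pattern the length is zero and
       the result is an empty string.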

       The callout_flags field is always zero in callouts from
       pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
       JIT is used, the following bits may be set:

         PCRE2_CALLOUT_STARTMATCH

       This is set for the first callout after the start of matching for each
       new starting position in the subject.

         PCRE2_CALLOUT_BACKTRACK

       This is set if there has been a matching backtrack since the previous
       callout, or since the start of matching if this is the first callout
       from a pcre2_match() run.

       Both bits are set when a backtrack has caused a "bumpalong" to a new
       starting position in the subject. Output from pcre2test does not
       indicate the presence of these bits unless the callout_extra modifier
       is set.

       The information in the callout_flags field is provided so that
       applications can track and tell their users how matching with
       backtracking is done. This can be useful when trying to optimize
       patterns, or just to understand how PCRE2 works. There is no support in
       pcre2_dfa_match() because there is no backtracking in DFA matching, and
       there is no support in JIT because JIT is all about maximizing matching
       performance. In both these cases the callout_flags field is always
       zero.


RETURN VALUES FROM CALLOUTS

       The external callout function returns an integer to PCRE2. If the value
       is zero, matching proceeds as normal. If the value is greater than
       zero, matching fails at the current point, but the testing of other
       matching possibilities goes ahead, just as if a lookahead assertion had
       failed. If the value is less than zero, the match is abandoned, and the
       matching function returns the negative value.

       Negative values should normally be chosen from the set of
       PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
       standard "no match" failure.
       The error number PCRE2_ERROR_CALLOUT is reserved for use by callout
       functions; it will never be used by PCRE2 itself.


CALLOUT ENUMERATION

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       A script language that supports the use of string arguments in callouts
       might like to scan all the callouts in a pattern before running the
       match. This can be done by calling pcre2_callout_enumerate(). The first
       argument is a pointer to a compiled pattern, the second points to a
       callback function, and the third is arbitrary user data. The callback
       function is called for every callout in the pattern in the order in
       which they appear. Its first argument is a pointer to a callout
       enumeration block, and its second argument is the user_data value that
       was passed to pcre2_callout_enumerate(). The data block contains the
       following fields:

         version                Block version number
         pattern_position       Offset to next item in pattern
         next_item_length       Length of next item in pattern
         callout_number         Number for numbered callouts
         callout_string_offset  Offset to string within pattern
         callout_string_length  Length of callout string
         callout_string         Points to callout string or is NULL

       The version number is currently 0. It will increase if new fields are
       ever added to the block. The remaining fields are the same as their
       namesakes in the pcre2_callout block that is used for callouts during
       matching, as described above.

       Note that the value of pattern_position is unique for each callout.
       However, if a callout occurs inside a group that is quantified with a
       non-zero minimum or a fixed maximum, the group is replicated inside the
       compiled pattern. For example, a pattern such as /(a){2}/ is compiled
       as if it were /(a)(a)/.
       This means that the callout will be enumerated more than once, but with
       the same value for pattern_position in each case.

       The callback function should normally return zero. If it returns a
       non-zero value, scanning the pattern stops, and that value is returned
       from pcre2_callout_enumerate().


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 26 April 2018
       Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------


PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

DIFFERENCES BETWEEN PCRE2 AND PERL

       This document describes the differences in the ways that PCRE2 and Perl
       handle regular expressions. The differences described here are with
       respect to Perl versions 5.26, but as both Perl and PCRE2 are
       continually changing, the information may sometimes be out of date.

       1. PCRE2 has only a subset of Perl's Unicode support. Details of what
       it does have are given in the pcre2unicode page.

       2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized
       assertions, but they do not mean what you might think. For example,
       (?!a){3} does not assert that the next three characters are not "a". It
       just asserts that the next character is not "a" three times (in
       principle; PCRE2 optimizes this to run the assertion just once). Perl
       allows some repeat quantifiers on other assertions, for example, \b*
       (but not \b{3}), but these do not seem to have any use.

       3.
       Capturing subpatterns that occur inside negative lookaround assertions
       are counted, but their entries in the offsets vector are set only
       when a negative assertion is a condition that has a matching branch
       (that is, the condition is false).

       4. The following Perl escape sequences are not supported: \F, \l, \L,
       \u, \U, and \N when followed by a character name. \N on its own,
       matching a non-newline character, and \N{U+dd..}, matching a Unicode
       code point, are supported. The escapes that modify the case of
       following letters are implemented by Perl's general string-handling and
       are not part of its pattern matching engine. If any of these are
       encountered by PCRE2, an error is generated by default. However, if the
       PCRE2_ALT_BSUX option is set, \U and \u are interpreted as ECMAScript
       interprets them.

       5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
       is built with Unicode support (the default). The properties that can be
       tested with \p and \P are limited to the general category properties
       such as Lu and Nd, script names such as Greek or Han, and the derived
       properties Any and L&. PCRE2 does support the Cs (surrogate) property,
       which Perl does not; the Perl documentation says "Because Perl hides
       the need for the user to understand the internal representation of
       Unicode characters, there is no need to implement the somewhat messy
       concept of surrogates."

       6. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
       in between are treated as literals. However, this is slightly different
       from Perl in that $ and @ are also handled as literals inside the
       quotes. In Perl, they cause variable interpolation (but of course PCRE2
       does not have variables).
       Also, Perl does "double-quotish backslash interpolation" on any
       backslashes between \Q and \E which, its documentation says, "may lead
       to confusing results". PCRE2 treats a backslash between \Q and \E just
       like any other character. Note the following examples:

           Pattern            PCRE2 matches     Perl matches

           \Qabc$xyz\E        abc$xyz           abc followed by the
                                                  contents of $xyz
           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
           \QA\B\E            A\B               A\B
           \Q\\E              \                 \\E

       The \Q...\E sequence is recognized both inside and outside character
       classes.

       7. Fairly obviously, PCRE2 does not support the (?{code}) and
       (??{code}) constructions. However, PCRE2 does have a "callout" feature,
       which allows an external function to be called during pattern matching.
       See the pcre2callout documentation for details.

       8. Subroutine calls (whether recursive or not) were treated as atomic
       groups up to PCRE2 release 10.23, but from release 10.30 this changed,
       and backtracking into subroutine calls is now supported, as in Perl.

       9. If any of the backtracking control verbs are used in a subpattern
       that is called as a subroutine (whether or not recursively), their
       effect is confined to that subpattern; it does not extend to the
       surrounding pattern. This is not always the case in Perl. In
       particular, if (*THEN) is present in a group that is called as a
       subroutine, its action is limited to that group, even if the group does
       not contain any | characters. Note that such subpatterns are processed
       as anchored at the point where they are tested.

       10. If a pattern contains more than one backtracking control verb, the
       first one that is backtracked onto acts. For example, in the pattern
       A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
       in C triggers (*PRUNE).
       Perl's behaviour is more complex; in many cases it is the same as
       PCRE2, but there are cases where it differs.

       11. Most backtracking verbs in assertions have their normal actions.
       They are not confined to the assertion.

       12. There are some differences that are concerned with the settings of
       captured strings when part of a pattern is repeated. For example,
       matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
       unset, but in PCRE2 it is set to "b".

       13. PCRE2's handling of duplicate subpattern numbers and duplicate
       subpattern names is not as general as Perl's. This is a consequence of
       the fact that PCRE2 works internally just with numbers, using an
       external table to translate between numbers and names. In particular, a
       pattern such as (?|(?<a>A)|(?<b>B)), where the two capturing
       parentheses have the same number but different names, is not supported,
       and causes an error at compile time. If it were allowed, it would not
       be possible to distinguish which parentheses matched, because both
       names map to capturing subpattern number 1. To avoid this confusing
       situation, an error is given at compile time.

       14. Perl used to recognize comments in some places that PCRE2 does not,
       for example, between the ( and ? at the start of a subpattern. If the
       /x modifier is set, Perl allowed white space between ( and ? though the
       latest Perls give an error (for a while it was just deprecated). There
       may still be some cases where Perl behaves differently.

       15. Perl, when in warning mode, gives warnings for character classes
       such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as
       literals. PCRE2 has no warning features, so it gives an error in these
       cases because they are almost certainly user mistakes.

       16.
       In PCRE2, the upper/lower case character properties Lu and Ll are
       not affected when case-independent matching is specified. For example,
       \p{Lu} always matches an upper case letter. I think Perl has changed in
       this respect; in the release at the time of writing (5.24), \p{Lu} and
       \p{Ll} match all letters, regardless of case, when case independence is
       specified.

       17. PCRE2 provides some extensions to the Perl regular expression
       facilities. Perl 5.10 includes new features that are not in earlier
       versions of Perl, some of which (such as named parentheses) were in
       PCRE2 for some time before. This list is with respect to Perl 5.26:

       (a) Although lookbehind assertions in PCRE2 must match fixed length
       strings, each alternative branch of a lookbehind assertion can match a
       different length of string. Perl requires them all to have the same
       length.

       (b) From PCRE2 10.23, backreferences to groups of fixed length are
       supported in lookbehinds, provided that there is no possibility of
       referencing a non-unique number or name. Perl does not support
       backreferences in lookbehinds.

       (c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
       $ meta-character matches only at the very end of the string.

       (d) A backslash followed by a letter with no special meaning is
       faulted. (Perl can be made to issue a warning.)

       (e) If PCRE2_UNGREEDY is set, the greediness of the repetition
       quantifiers is inverted, that is, by default they are not greedy, but
       if followed by a question mark they are.

       (f) PCRE2_ANCHORED can be used at matching time to force a pattern to
       be tried only at the first matching position in the subject string.

       (g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and
       PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.

       (h) The \R escape sequence can be restricted to match only CR, LF, or
       CRLF by the PCRE2_BSR_ANYCRLF option.

       (i) The callout facility is PCRE2-specific. Perl supports codeblocks
       and variable interpolation, but not general hooks on every match.

       (j) The partial matching facility is PCRE2-specific.

       (k) The alternative matching function (pcre2_dfa_match()) matches in a
       different way and is not Perl-compatible.

       (l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
       at the start of a pattern that set overall options that cannot be
       changed within the pattern.

       18. The Perl /a modifier restricts \d numbers to pure ASCII, and the
       /aa modifier restricts /i case-insensitive matching to pure ASCII,
       ignoring Unicode rules. This separation cannot be represented with
       PCRE2_UCP.

       19. Perl has different limits than PCRE2. See the pcre2limit
       documentation for details. With release 5.10, Perl moved from recursion
       to iteration, keeping the intermediate matches on the heap, which is
       ~10% slower but does not fall into any stack-overflow limit. PCRE2 made
       a similar change at release 10.30, and also has many build-time and
       run-time customizable limits.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 28 July 2018
       Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------


PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 JUST-IN-TIME COMPILER SUPPORT

       Just-in-time compiling is a heavyweight optimization that can greatly
       speed up pattern matching.
       However, it comes at the cost of extra processing
       before the match is performed, so it is of most benefit when
       the same pattern is going to be matched many times. This does not
       necessarily mean many calls of a matching function; if the pattern is
       not anchored, matching attempts may take place many times at various
       positions in the subject, even for a single call. Therefore, if the
       subject string is very long, it may still pay to use JIT even for
       one-off matches. JIT support is available for all of the 8-bit, 16-bit
       and 32-bit PCRE2 libraries.

       JIT support applies only to the traditional Perl-compatible matching
       function. It does not apply when the DFA matching function is being
       used. The code for this support was written by Zoltan Herczeg.


AVAILABILITY OF JIT SUPPORT

       JIT support is an optional feature of PCRE2. The "configure" option
       --enable-jit (or equivalent CMake option) must be set when PCRE2 is
       built if you want to use JIT. The support is limited to the following
       hardware platforms:

         ARM 32-bit (v5, v7, and Thumb2)
         ARM 64-bit
         Intel x86 32-bit and 64-bit
         MIPS 32-bit and 64-bit
         Power PC 32-bit and 64-bit
         SPARC 32-bit

       If --enable-jit is set on an unsupported platform, compilation fails.

       A program can tell if JIT support is available by calling
       pcre2_config() with the PCRE2_CONFIG_JIT option. The result is 1 when
       JIT is available, and 0 otherwise. However, a simple program does not
       need to check this in order to use JIT. The API is implemented in a way
       that falls back to the interpretive code if JIT is not available. For
       programs that need the best possible performance, there is also a "fast
       path" API that is JIT-specific.


SIMPLE USE OF JIT

       To make use of the JIT support in the simplest way, all you have to do
       is to call pcre2_jit_compile() after successfully compiling a pattern
       with pcre2_compile(). This function has two arguments: the first is the
       compiled pattern pointer that was returned by pcre2_compile(), and the
       second is zero or more of the following option bits:
       PCRE2_JIT_COMPLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.

       If JIT support is not available, a call to pcre2_jit_compile() does
       nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
       pattern is passed to the JIT compiler, which turns it into machine code
       that executes much faster than the normal interpretive code, but yields
       exactly the same results. The returned value from pcre2_jit_compile()
       is zero on success, or a negative error code.

       There is a limit to the size of pattern that JIT supports, imposed by
       the size of machine stack that it uses. The exact rules are not
       documented because they may change at any time, in particular, when new
       optimizations are introduced. If a pattern is too big, a call to
       pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.

       PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for
       complete matches. If you want to run partial matches using the
       PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you
       should set one or both of the other options as well as, or instead of
       PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
       for each of the three modes (normal, soft partial, hard partial). When
       pcre2_match() is called, the appropriate code is run if it is
       available. Otherwise, the pattern is matched using interpretive code.

       You can call pcre2_jit_compile() multiple times for the same compiled
       pattern.
       It does nothing if it has previously compiled code for any of
       the option bits. For example, you can call it once with
       PCRE2_JIT_COMPLETE and (perhaps later, when you find you need partial
       matching) again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD.
       This time it will ignore PCRE2_JIT_COMPLETE and just compile code for
       partial matching. If pcre2_jit_compile() is called with no option bits
       set, it immediately returns zero. This is an alternative way of testing
       whether JIT is available.

       At present, it is not possible to free JIT compiled code except when
       the entire compiled pattern is freed by calling pcre2_code_free().

       In some circumstances you may need to call additional functions. These
       are described in the section entitled "Controlling the JIT stack"
       below.

       There are some pcre2_match() options that are not supported by JIT, and
       there are also some pattern items that JIT cannot handle. Details are
       given below. In both cases, matching automatically falls back to the
       interpretive code. If you want to know whether JIT was actually used
       for a particular match, you should arrange for a JIT callback function
       to be set up as described in the section entitled "Controlling the JIT
       stack" below, even if you do not need to supply a non-default JIT
       stack. Such a callback function is called whenever JIT code is about to
       be obeyed. If the match-time options are not right for JIT execution,
       the callback function is not obeyed.

       If the JIT compiler finds an unsupported item, no JIT data is
       generated. You can find out if JIT matching is available after
       compiling a pattern by calling pcre2_pattern_info() with the
       PCRE2_INFO_JITSIZE option. A non-zero result means that JIT compilation
       was successful.
       A result of 0 means that JIT support is not available, or the pattern
       was not processed by pcre2_jit_compile(), or the JIT compiler was not
       able to handle the pattern.


UNSUPPORTED OPTIONS AND PATTERN ITEMS

       The pcre2_match() options that are supported for JIT matching are
       PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
       PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
       PCRE2_ANCHORED option is not supported at match time.

       If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the
       use of JIT, forcing matching by the interpreter code.

       The only unsupported pattern items are \C (match a single data unit)
       when running in a UTF mode, and a callout immediately before an
       assertion condition in a conditional group.


RETURN VALUES FROM JIT MATCHING

       When a pattern is matched using JIT matching, the return values are the
       same as those given by the interpretive pcre2_match() code, with the
       addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
       that the memory used for the JIT stack was insufficient. See
       "Controlling the JIT stack" below for a discussion of JIT stack usage.

       The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
       searching a very large pattern tree goes on for too long, as it is in
       the same circumstance when JIT is not used, but the details of exactly
       what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
       is never returned when JIT matching is used.


CONTROLLING THE JIT STACK

       When the compiled JIT code runs, it needs a block of memory to use as a
       stack. By default, it uses 32KiB on the machine stack. However, some
       large or complicated patterns need more than this. The error
       PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack.
       Three functions are provided for managing blocks of memory for use as
       JIT stacks. There is further discussion about the use of JIT stacks in
       the section entitled "JIT stack FAQ" below.

       The pcre2_jit_stack_create() function creates a JIT stack. Its
       arguments are a starting size, a maximum size, and a general context
       (for memory allocation functions, or NULL for standard memory
       allocation). It returns a pointer to an opaque structure of type
       pcre2_jit_stack, or NULL if there is an error. The
       pcre2_jit_stack_free() function is used to free a stack that is no
       longer needed. If its argument is NULL, this function returns
       immediately, without doing anything. (For the technically minded: the
       address space is allocated by mmap or VirtualAlloc.) A maximum stack
       size of 512KiB to 1MiB should be more than enough for any pattern.

       The pcre2_jit_stack_assign() function specifies which stack JIT code
       should use. Its arguments are as follows:

         pcre2_match_context  *mcontext
         pcre2_jit_callback    callback
         void                 *data

       The first argument is a pointer to a match context. When this is
       subsequently passed to a matching function, its information determines
       which JIT stack is used. If this argument is NULL, the function returns
       immediately, without doing anything. There are three cases for the
       values of the other two options:

         (1) If callback is NULL and data is NULL, an internal 32KiB block
             on the machine stack is used. This is the default when a match
             context is created.

         (2) If callback is NULL and data is not NULL, data must be
             a pointer to a valid JIT stack, the result of calling
             pcre2_jit_stack_create().

         (3) If callback is not NULL, it must point to a function that is
             called with data as an argument at the start of matching, in
             order to set up a JIT stack.
             If the return from the callback
             function is NULL, the internal 32KiB stack is used; otherwise the
             return value must be a valid JIT stack, the result of calling
             pcre2_jit_stack_create().

       A callback function is obeyed whenever JIT code is about to be run; it
       is not obeyed when pcre2_match() is called with options that are
       incompatible with JIT matching. A callback function can therefore be
       used to determine whether a match operation was executed by JIT or by
       the interpreter.

       You may safely use the same JIT stack for more than one pattern (either
       by assigning directly or by callback), as long as the patterns are
       matched sequentially in the same thread. Currently, the only way to set
       up non-sequential matches in one thread is to use callouts: if a
       callout function starts another match, that match must use a different
       JIT stack to the one used for currently suspended match(es).

       In a multithread application, if you do not specify a JIT stack, or if
       you assign or pass back NULL from a callback, that is thread-safe,
       because each thread has its own machine stack. However, if you assign
       or pass back a non-NULL JIT stack, this must be a different stack for
       each thread so that the application is thread-safe.

       Strictly speaking, even more is allowed. You can assign the same
       non-NULL stack to a match context that is used by any number of
       patterns, as long as they are not used for matching by multiple threads
       at the same time. For example, you could use the same stack in all
       compiled patterns, with a global mutex in the callback to wait until
       the stack is available for use. However, this is an inefficient
       solution, and not recommended.

       This is a suggestion for how a multithreaded program that needs to set
       up non-default JIT stacks might operate:

         During thread initialization
           thread_local_var = pcre2_jit_stack_create(...)

         During thread exit
           pcre2_jit_stack_free(thread_local_var)

         Use a one-line callback function
           return thread_local_var

       All the functions described in this section do nothing if JIT is not
       available.


JIT STACK FAQ

       (1) Why do we need JIT stacks?

       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
       where the local data of the current node is pushed before checking its
       child nodes. Allocating real machine stack on some platforms is
       difficult. For example, on PowerPC the stack chain needs to be updated
       every time we extend the stack. Although this is possible, the overhead
       of updating it decreases performance. So we do the recursion in memory.

       (2) Why don't we simply allocate blocks of memory with malloc()?

       Modern operating systems have a nice feature: they can reserve an
       address space instead of allocating memory. We can safely allocate
       memory pages inside this address space, so the stack could grow without
       moving memory data (this is important because of pointers). Thus we can
       allocate 1MiB address space, and use only a single memory page (usually
       4KiB) if that is enough. However, we can still grow up to 1MiB anytime
       if needed.

       (3) Who "owns" a JIT stack?

       The owner of the stack is the user program, not the JIT studied pattern
       or anything else. The user program must ensure that if a stack is being
       used by pcre2_match(), (that is, it is assigned to a match context that
       is passed to the pattern currently running), that stack must not be
       used by any other threads (to avoid overwriting the same memory area).
       The best practice for multithreaded programs is to allocate a stack for
       each thread, and return this stack through the JIT callback function.

       (4) When should a JIT stack be freed?

       You can free a JIT stack at any time, as long as it will not be used by
       pcre2_match() again. When you assign the stack to a match context, only
       a pointer is set. There is no reference counting or any other magic.
       You can free compiled patterns, contexts, and stacks in any order,
       anytime. Just do not call pcre2_match() with a match context pointing
       to an already freed stack, as that will cause a segmentation fault.
       (Also, do not free a stack currently used by pcre2_match() in another
       thread.) You can also replace the stack in a context at any time when
       it is not in use. You should free the previous stack before assigning a
       replacement.

       (5) Should I allocate/free a stack every time before/after calling
       pcre2_match()?

       No, because this is too costly in terms of resources. However, you
       could implement a scheme that releases the stack if it has not been
       used for, say, two minutes. The JIT callback can help to achieve this
       without keeping a list of patterns.

       (6) OK, the stack is for long term memory allocation. But what happens
       if a pattern causes stack overflow with a stack of 1MiB? Is that 1MiB
       kept until the stack is freed?

       Especially on embedded systems, it might be a good idea to release
       memory sometimes without freeing the stack. There is no API for this at
       the moment. Probably a function call which returns the currently
       allocated memory for any stack and another which allows releasing
       memory (shrinking the stack) would be a good idea if someone needs
       this.

       (7) This is too much of a headache. Isn't there any better solution for
       JIT stack handling?

       No, thanks to Windows.
       If POSIX threads were used everywhere, we could
       throw out this complicated API.


FREEING JIT SPECULATIVE MEMORY

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       The JIT executable allocator does not free all memory when it is
       possible. It expects new allocations, and keeps some free memory around
       to improve allocation speed. However, in low memory conditions, it
       might be better to free all possible memory. You can cause this to
       happen by calling pcre2_jit_free_unused_memory(). Its argument is a
       general context, for custom memory management, or NULL for standard
       memory management.


EXAMPLE CODE

       This is a single-threaded example that specifies a JIT stack without
       using a callback. A real program should include error checking after
       all the function calls.

         int rc;
         int errornumber;        /* set by pcre2_compile() on failure */
         PCRE2_SIZE erroffset;   /* set by pcre2_compile() on failure */
         pcre2_code *re;
         pcre2_match_data *match_data;
         pcre2_match_context *mcontext;
         pcre2_jit_stack *jit_stack;

         /* pattern, subject, and length are assumed to be supplied by
            the surrounding program. */
         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
           &errornumber, &erroffset, NULL);
         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
         mcontext = pcre2_match_context_create(NULL);
         /* Start with 32KiB and allow growth up to 512KiB. */
         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
         /* Room for 10 offset pairs, using standard memory allocation. */
         match_data = pcre2_match_data_create(10, NULL);
         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
         /* Process result */

         pcre2_code_free(re);
         pcre2_match_data_free(match_data);
         pcre2_match_context_free(mcontext);
         pcre2_jit_stack_free(jit_stack);


JIT FAST PATH API

       Because the API described above falls back to interpreted matching when
       JIT is not available, it is convenient for programs that are written
       for general use in many environments. However, calling JIT via
       pcre2_match() does have a performance impact.
       Programs that are written
       for use where JIT is known to be available, and which need the best
       possible performance, can instead use a "fast path" API to call JIT
       matching directly instead of calling pcre2_match() (obviously only for
       patterns that have been successfully processed by pcre2_jit_compile()).

       The fast path function is called pcre2_jit_match(), and it takes
       exactly the same arguments as pcre2_match(). The return values are also
       the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
       complete) is requested that was not compiled. Unsupported option bits
       (for example, PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT
       option.

       When you call pcre2_match(), as well as testing for invalid options, a
       number of other sanity checks are performed on the arguments. For
       example, if the subject pointer is NULL, an immediate error is given.
       Also, unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
       for validity. In the interests of speed, these checks do not happen on
       the JIT fast path, and if invalid data is passed, the result is
       undefined.

       Bypassing the sanity checks and the pcre2_match() wrapping can give
       speedups of more than 10%.


SEE ALSO

       pcre2api(3)


AUTHOR

       Philip Hazel (FAQ by Zoltan Herczeg)
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 28 June 2018
       Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------


PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

SIZE AND OTHER LIMITATIONS

       There are some size limitations in PCRE2 but it is hoped that they will
       never in practice be relevant.

       The maximum size of a compiled pattern is approximately 64 thousand
       code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
       the default internal linkage size, which is 2 bytes for these
       libraries. If you want to process regular expressions that are truly
       enormous, you can compile PCRE2 with an internal linkage size of 3 or 4
       (when building the 16-bit library, 3 is rounded up to 4). See the
       README file in the source distribution and the pcre2build documentation
       for details. In these cases the limit is substantially larger.
       However, the speed of execution is slower. In the 32-bit library, the
       internal linkage size is always 4.

       The maximum length of a source pattern string is essentially unlimited;
       it is the largest number a PCRE2_SIZE variable can hold. However, the
       program that calls pcre2_compile() can specify a smaller limit.

       The maximum length (in code units) of a subject string is one less than
       the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an
       unsigned integer type, usually defined as size_t. Its maximum value
       (that is ~(PCRE2_SIZE)0) is reserved as a special indicator for
       zero-terminated strings and unset offsets.

       All values in repeating quantifiers must be less than 65536.

       The maximum length of a lookbehind assertion is 65535 characters.

       There is no limit to the number of parenthesized subpatterns, but there
       can be no more than 65535 capturing subpatterns. There is, however, a
       limit to the depth of nesting of parenthesized subpatterns of all
       kinds. This is imposed in order to limit the amount of system stack
       used at compile time. The default limit can be specified when PCRE2 is
       built; if not, the default is set to 250. An application can change
       this limit by calling pcre2_set_parens_nest_limit() to set the limit in
       a compile context.

       The maximum length of name for a named subpattern is 32 code units, and
       the maximum number of named subpatterns is 10000.

       The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or
       (*THEN) verb is 255 code units for the 8-bit library and 65535 code
       units for the 16-bit and 32-bit libraries.

       The maximum length of a string argument to a callout is the largest
       number a 32-bit unsigned integer can hold.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 30 March 2017
       Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------


PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 MATCHING ALGORITHMS

       This document describes the two different algorithms that are available
       in PCRE2 for matching a compiled regular expression against a given
       subject string. The "standard" algorithm is the one provided by the
       pcre2_match() function. This works in the same way as Perl's matching
       function, and provides a Perl-compatible matching operation. The
       just-in-time (JIT) optimization that is described in the pcre2jit
       documentation is compatible with this function.

       An alternative algorithm is provided by the pcre2_dfa_match() function;
       it operates in a different way, and is not Perl-compatible. This
       alternative has advantages and disadvantages compared with the standard
       algorithm, and these are described below.

       When there is only one possible way in which a given subject string can
       match a pattern, the two algorithms give the same answer. A difference
       arises, however, when there are multiple possibilities.
       For example, if
       the pattern

         ^<.*>

       is matched against the string

         <something> <something else> <something further>

       there are three possible answers. The standard algorithm finds only one
       of them, whereas the alternative algorithm finds all three.


REGULAR EXPRESSIONS AS TREES

       The set of strings that are matched by a regular expression can be
       represented as a tree structure. An unlimited repetition in the pattern
       makes the tree of infinite size, but it is still a tree. Matching the
       pattern to a given subject string (from a given starting point) can be
       thought of as a search of the tree. There are two ways to search a
       tree: depth-first and breadth-first, and these correspond to the two
       matching algorithms provided by PCRE2.


THE STANDARD MATCHING ALGORITHM

       In the terminology of Jeffrey Friedl's book "Mastering Regular
       Expressions", the standard algorithm is an "NFA algorithm". It conducts
       a depth-first search of the pattern tree. That is, it proceeds along a
       single path through the tree, checking that the subject matches what is
       required. When there is a mismatch, the algorithm tries any
       alternatives at the current point, and if they all fail, it backs up to
       the previous branch point in the tree, and tries the next alternative
       branch at that level. This often involves backing up (moving to the
       left) in the subject string as well. The order in which repetition
       branches are tried is controlled by the greedy or ungreedy nature of
       the quantifier.

       If a leaf node is reached, a matching string has been found, and at
       that point the algorithm stops. Thus, if there is more than one
       possible match, this algorithm returns the first one that it finds.
       Whether
       this is the shortest, the longest, or some intermediate length depends
       on the way the greedy and ungreedy repetition quantifiers are specified
       in the pattern.

       Because it ends up with a single path through the tree, it is
       relatively straightforward for this algorithm to keep track of the
       substrings that are matched by portions of the pattern in parentheses.
       This provides support for capturing parentheses and backreferences.


THE ALTERNATIVE MATCHING ALGORITHM

       This algorithm conducts a breadth-first search of the tree. Starting
       from the first matching point in the subject, it scans the subject
       string from left to right, once, character by character, and as it does
       this, it remembers all the paths through the tree that represent valid
       matches. In Friedl's terminology, this is a kind of "DFA algorithm",
       though it is not implemented as a traditional finite state machine (it
       keeps multiple states active simultaneously).

       Although the general principle of this matching algorithm is that it
       scans the subject string only once, without backtracking, there is one
       exception: when a lookaround assertion is encountered, the characters
       following or preceding the current point have to be independently
       inspected.

       The scan continues until either the end of the subject is reached, or
       there are no more unterminated paths. At this point, terminated paths
       represent the different matching possibilities (if there are none, the
       match has failed). Thus, if there is more than one possible match,
       this algorithm finds all of them, and in particular, it finds the
       longest. The matches are returned in decreasing order of length. There
       is an option to stop the algorithm after the first match (which is
       necessarily the shortest) is found.

       Note that all the matches that are found start at the same point in the
       subject. If the pattern

         cat(er(pillar)?)?

       is matched against the string "the caterpillar catchment", the result
       is the three strings "caterpillar", "cater", and "cat" that start at
       the fifth character of the subject. The algorithm does not
       automatically move on to find matches that start at later positions.

       PCRE2's "auto-possessification" optimization usually applies to
       character repeats at the end of a pattern (as well as internally). For
       example, the pattern "a\d+" is compiled as if it were "a\d++" because
       there is no point even considering the possibility of backtracking into
       the repeated digits. For DFA matching, this means that only one
       possible match is found. If you really do want multiple matches in such
       cases, either use an ungreedy repeat ("a\d+?") or set the
       PCRE2_NO_AUTO_POSSESS option when compiling.

       There are a number of features of PCRE2 regular expressions that are
       not supported by the alternative matching algorithm. They are as
       follows:

       1. Because the algorithm finds all possible matches, the greedy or
       ungreedy nature of repetition quantifiers is not relevant (though it
       may affect auto-possessification, as just described). During matching,
       greedy and ungreedy quantifiers are treated in exactly the same way.
       However, possessive quantifiers can make a difference when what follows
       could also match what is quantified, for example in a pattern like
       this:

         ^a++\w!

       This pattern matches "aaab!" but not "aaa!", which would be matched by
       a non-possessive quantifier. Similarly, if an atomic group is present,
       it is matched as if it were a standalone pattern at the current point,
       and the longest match is then "locked in" for the rest of the overall
       pattern.

       2. When dealing with multiple paths through the tree simultaneously, it
       is not straightforward to keep track of captured substrings for the
       different matching possibilities, and PCRE2's implementation of this
       algorithm does not attempt to do this. This means that no captured
       substrings are available.

       3. Because no substrings are captured, backreferences within the
       pattern are not supported, and cause errors if encountered.

       4. For the same reason, conditional expressions that use a
       backreference as the condition or test for a specific group recursion
       are not supported.

       5. Because many paths through the tree may be active, the \K escape
       sequence, which resets the start of the match when encountered (but may
       be on some paths and not on others), is not supported. It causes an
       error if encountered.

       6. Callouts are supported, but the value of the capture_top field is
       always 1, and the value of the capture_last field is always 0.

       7. The \C escape sequence, which (in the standard algorithm) always
       matches a single code unit, even in a UTF mode, is not supported in
       these modes, because the alternative algorithm moves through the
       subject string one character (not code unit) at a time, for all active
       paths through the tree.

       8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
       are not supported. (*FAIL) is supported, and behaves like a failing
       negative assertion.


ADVANTAGES OF THE ALTERNATIVE ALGORITHM

       Using the alternative matching algorithm provides the following
       advantages:

       1. All possible matches (at a single point in the subject) are
       automatically found, and in particular, the longest match is found. To
       find more than one match using the standard algorithm, you have to do
       kludgy things with callouts.

       2. Because the alternative algorithm scans the subject string just
       once, and never needs to backtrack (except for lookbehinds), it is
       possible to pass very long subject strings to the matching function in
       several pieces, checking for partial matching each time. Although it is
       also possible to do multi-segment matching using the standard
       algorithm, by retaining partially matched substrings, it is more
       complicated. The pcre2partial documentation gives details of partial
       matching and discusses multi-segment matching.


DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

       The alternative algorithm suffers from a number of disadvantages:

       1. It is substantially slower than the standard algorithm. This is
       partly because it has to search for all possible matches, but is also
       because it is less susceptible to optimization.

       2. Capturing parentheses and backreferences are not supported.

       3. Although atomic groups are supported, their use does not provide the
       performance advantage that it does for the standard algorithm.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 29 September 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)



NAME
       PCRE2 - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE2

       In normal use of PCRE2, if the subject string that is passed to a
       matching function matches as far as it goes, but is too short to match
       the entire pattern, PCRE2_ERROR_NOMATCH is returned. There are
       circumstances where it might be helpful to distinguish this case from
       other cases in which there is no match.

       Consider, for example, an application where a human is required to type
       in data for a field with specific formatting requirements. An example
       might be a date in the form ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the user's keystrokes one by one, and can check
       that what has been typed so far is potentially valid, it is able to
       raise an error as soon as a mistake is made, by beeping and not
       reflecting the character that has been typed, for example. This immedi-
       ate feedback is likely to be a better user interface than a check that
       is delayed until the entire string has been entered. Partial matching
       can also be useful when the subject string is very long and is not all
       available at once.

       PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
       PCRE2_PARTIAL_HARD options, which can be set when calling a matching
       function. The difference between the two options is whether or not a
       partial match is preferred to an alternative complete match, though the
       details differ between the two types of matching function. If both
       options are set, PCRE2_PARTIAL_HARD takes precedence.

       If you want to use partial matching with just-in-time optimized code,
       you must call pcre2_jit_compile() with one or both of these options:

         PCRE2_JIT_PARTIAL_SOFT
         PCRE2_JIT_PARTIAL_HARD

       PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
       tial matches on the same pattern. If the appropriate JIT mode has not
       been compiled, interpretive matching code is used.

       Setting a partial matching option disables two of PCRE2's standard
       optimizations. PCRE2 remembers the last literal code unit in a pattern,
       and abandons matching immediately if it is not present in the subject
       string. This optimization cannot be used for a subject string that
       might match only partially. PCRE2 also knows the minimum length of a
       matching string, and does not bother to run the matching function on
       shorter strings. This optimization is also disabled for partial match-
       ing.


PARTIAL MATCHING USING pcre2_match()

       A partial match occurs during a call to pcre2_match() when the end of
       the subject string is reached successfully, but matching cannot con-
       tinue because more characters are needed. However, at least one charac-
       ter in the subject must have been inspected. This character need not
       form part of the final matched string; lookbehind assertions and the \K
       escape sequence provide ways of inspecting characters before the start
       of a matched string. The requirement for inspecting at least one char-
       acter exists because an empty string can always be matched; without
       such a restriction there would always be a partial match of an empty
       string at the end of the subject.

       When a partial match is returned, the first two elements in the ovector
       point to the portion of the subject that was matched, but the values in
       the rest of the ovector are undefined. The appearance of \K in the pat-
       tern has no effect for a partial match. Consider this pattern:

         /abc\K123/

       If it is matched against "456abc123xyz" the result is a complete match,
       and the ovector defines the matched string as "123", because \K resets
       the "start of match" point. However, if a partial match is requested
       and the subject string is "456abc12", a partial match is found for the
       string "abc12", because all these characters are needed for a subse-
       quent re-match with additional characters.

       What happens when a partial match is identified depends on which of the
       two partial matching options is set.

   PCRE2_PARTIAL_SOFT WITH pcre2_match()

       If PCRE2_PARTIAL_SOFT is set when pcre2_match() identifies a partial
       match, the partial match is remembered, but matching continues as nor-
       mal, and other alternatives in the pattern are tried. If no complete
       match can be found, PCRE2_ERROR_PARTIAL is returned instead of
       PCRE2_ERROR_NOMATCH.

       This option is "soft" because it prefers a complete match over a par-
       tial match. All the various matching items in a pattern behave as if
       the subject string is potentially complete. For example, \z, \Z, and $
       match at the end of the subject, as normal, and for \b and \B the end
       of the subject is treated as a non-alphanumeric.

       If there is more than one partial match, the first one that was found
       provides the data that is returned. Consider this pattern:

         /123\w+X|dogY/

       If this is matched against the subject string "abc123dog", both alter-
       natives fail to match, but the end of the subject is reached during
       matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
       and 9, identifying "123dog" as the first partial match that was found.
       (In this example, there are two partial matches, because "dog" on its
       own partially matches the second alternative.)

   PCRE2_PARTIAL_HARD WITH pcre2_match()

       If PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
       returned as soon as a partial match is found, without continuing to
       search for possible complete matches. This option is "hard" because it
       prefers an earlier partial match over a later complete match. For this
       reason, the assumption is made that the end of the supplied subject
       string may not be the true end of the available data, and so, if \z,
       \Z, \b, \B, or $ are encountered at the end of the subject, the result
       is PCRE2_ERROR_PARTIAL, provided that at least one character in the
       subject has been inspected.

   Comparing hard and soft partial matching

       The difference between the two partial matching options can be illus-
       trated by a pattern such as:

         /dog(sbody)?/

       This matches either "dog" or "dogsbody", greedily (that is, it prefers
       the longer string if possible). If it is matched against the string
       "dog" with PCRE2_PARTIAL_SOFT, it yields a complete match for "dog".
       However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
       TIAL. On the other hand, if the pattern is made ungreedy the result is
       different:

         /dog(sbody)??/

       In this case the result is always a complete match because that is
       found first, and matching never continues after finding a complete
       match. It might be easier to follow this explanation by thinking of the
       two patterns like this:

         /dog(sbody)?/    is the same as  /dogsbody|dog/
         /dog(sbody)??/   is the same as  /dog|dogsbody/

       The second pattern will never match "dogsbody", because it will always
       find the shorter match first.


PARTIAL MATCHING USING pcre2_dfa_match()

       The DFA functions move along the subject string character by character,
       without backtracking, searching for all possible matches simultane-
       ously. If the end of the subject is reached before the end of the pat-
       tern, there is the possibility of a partial match, again provided that
       at least one character has been inspected.

       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
       there have been no complete matches. Otherwise, the complete matches
       are returned. However, if PCRE2_PARTIAL_HARD is set, a partial match
       takes precedence over any complete matches. The portion of the string
       that was matched when the longest partial match was found is set as the
       first matching string.

       Because the DFA functions always search for all possible matches, and
       there is no difference between greedy and ungreedy repetition, their
       behaviour is different from the standard functions when PCRE2_PAR-
       TIAL_HARD is set. Consider the string "dog" matched against the
       ungreedy pattern shown above:

         /dog(sbody)??/

       Whereas the standard function stops as soon as it finds the complete
       match for "dog", the DFA function also finds the partial match for
       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.


PARTIAL MATCHING AND WORD BOUNDARIES

       If a pattern ends with one of the sequences \b or \B, which test for
       word boundaries, partial matching with PCRE2_PARTIAL_SOFT can give
       counter-intuitive results. Consider this pattern:

         /\bcat\b/

       This matches "cat", provided there is a word boundary at either end. If
       the subject string is "the cat", the comparison of the final "t" with a
       following character cannot take place, so a partial match is found.
       However, normal matching carries on, and \b matches at the end of the
       subject when the last character is a letter, so a complete match is
       found. The result, therefore, is not PCRE2_ERROR_PARTIAL. Using
       PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
       then the partial match takes precedence.


EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST

       If the partial_soft (or ps) modifier is present on a pcre2test data
       line, the PCRE2_PARTIAL_SOFT option is used for the match. Here is a
       run of pcre2test that uses the date example quoted above:

         re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
       data> 25jun04\=ps
        0: 25jun04
        1: jun
       data> 25dec3\=ps
       Partial match: 25dec3
       data> 3ju\=ps
       Partial match: 3ju
       data> 3juj\=ps
       No match
       data> j\=ps
       No match

       The first data string is matched completely, so pcre2test shows the
       matched substrings. The remaining four strings do not match the com-
       plete pattern, but the first two are partial matches. Similar output is
       obtained if DFA matching is used.

       If the partial_hard (or ph) modifier is present on a pcre2test data
       line, the PCRE2_PARTIAL_HARD option is set for the match.


MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()

       When a partial match has been found using a DFA matching function, it
       is possible to continue the match by providing additional subject data
       and calling the function again with the same compiled regular expres-
       sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
       same working space as before, because this is where details of the pre-
       vious partial match are stored. Here is an example using pcre2test:

         re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
       data> 23ja\=dfa,ps
       Partial match: 23ja
       data> n05\=dfa,dfa_restart
        0: n05

       The first call has "23ja" as the subject, and requests partial match-
       ing; the second call has "n05" as the subject for the continued
       (restarted) match. Notice that when the match is complete, only the
       last part is shown; PCRE2 does not retain the previously partially-
       matched string. It is up to the calling program to do that if it needs
       to.

       That means that, for an unanchored pattern, if a continued match fails,
       it is not possible to try again at a new starting point. All this
       facility is capable of doing is continuing with the previous match
       attempt. In the previous example, if the second set of data is "ug23"
       the result is no match, even though there would be a match for "aug23"
       if the entire string were given at once. Depending on the application,
       this may or may not be what you want. The only way to allow for start-
       ing again at the next character is to retain the matched part of the
       subject and try a new complete match.

       You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
       PCRE2_DFA_RESTART to continue partial matching over multiple segments.
       This facility can be used to pass very long subject strings to the DFA
       matching functions.


MULTI-SEGMENT MATCHING WITH pcre2_match()

       Unlike the DFA function, it is not possible to restart the previous
       match with a new segment of data when using pcre2_match(). Instead, new
       data must be added to the previous subject string, and the entire match
       re-run, starting from the point where the partial match occurred. Ear-
       lier data can be discarded.

       It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
       not treat the end of a segment as the end of the subject when matching
       \z, \Z, \b, \B, and $. Consider an unanchored pattern that matches
       dates:

         re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
       data> The date is 23ja\=ph
       Partial match: 23ja

       At this stage, an application could discard the text preceding "23ja",
       add on text from the next segment, and call the matching function
       again. Unlike the DFA matching function, the entire matching string
       must always be available, and the complete matching process occurs for
       each call, so more memory and more processing time are needed.
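
       The discard-and-append step described above can be sketched in C as
       follows. This is only an illustration of the buffer handling: the
       function name retain_and_append() and its parameters are hypotheti-
       cal and are not part of PCRE2. It assumes that keep_from is the
       start offset of the partial match (the first ovector element after
       PCRE2_ERROR_PARTIAL) and that the pattern contains no lookbehind,
       so nothing before that offset needs to be kept.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2. Drops the bytes of buf that
   precede keep_from, then appends the next segment. buf holds len
   bytes and has room for cap bytes. Returns the new length, or
   (size_t)-1 if the arguments are inconsistent or the result would
   not fit in cap bytes. */
size_t retain_and_append(char *buf, size_t len, size_t cap,
                         size_t keep_from,
                         const char *seg, size_t seg_len)
{
    if (len > cap || keep_from > len)
        return (size_t)-1;

    size_t kept = len - keep_from;      /* bytes of the partial match */
    if (seg_len > cap - kept)
        return (size_t)-1;

    /* The regions may overlap, so memmove() is required here. */
    memmove(buf, buf + keep_from, kept);
    memcpy(buf + kept, seg, seg_len);
    return kept + seg_len;
}
```

       In the pcre2test example above, the partial match "23ja" starts at
       offset 12 of "The date is 23ja", so calling the helper with
       keep_from set to 12 and the next segment "n05 ok" leaves
       "23jan05 ok" in the buffer. The application would then call
       pcre2_match() again on the new buffer contents, starting at offset
       zero.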


ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains a test for the beginning of a line, you need
       to pass the PCRE2_NOTBOL option when the subject string for any call
       does not start at the beginning of a line. There is also a PCRE2_NOTEOL
       option, but in practice when doing multi-segment matching you should be
       using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.

       2. If a pattern contains a lookbehind assertion, characters that pre-
       cede the start of the partial match may have been inspected during the
       matching process. When using pcre2_match(), sufficient characters must
       be retained for the next match attempt. You can ensure that enough
       characters are retained by doing the following:

       Before doing any matching, find the length of the longest lookbehind in
       the pattern by calling pcre2_pattern_info() with the
       PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
       characters, not code units. After a partial match, moving back from the
       ovector[0] offset in the subject by the number of characters given for
       the maximum lookbehind gets you to the earliest character that must be
       retained. In a non-UTF or a 32-bit situation, moving back is just a
       subtraction, but in UTF-8 or UTF-16 you have to count characters while
       moving back through the code units.

       Characters before the point you have now reached can be discarded, and
       after the next segment has been added to what is retained, you should
       run the next match with the startoffset argument set so that the match
       begins at the same point as before.

       For example, if the pattern "(?<=123)abc" is partially matched against
       the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
       mum lookbehind count is 3, so all characters before offset 2 can be
       discarded. The value of startoffset for the next match should be 3.
       When pcre2test displays a partial match, it indicates the lookbehind
       characters with '<' characters:

         re> "(?<=123)abc"
       data> xx123ab\=ph
       Partial match: 123ab
                      <<<

       3. Because a partial match must always contain at least one character,
       what might be considered a partial match of an empty string actually
       gives a "no match" result. For example:

         re> /c(?<=abc)x/
       data> ab\=ps
       No match

       If the next segment begins "cx", a match should be found, but this will
       only happen if characters from the previous segment are retained. For
       this reason, a "no match" result should be interpreted as "partial
       match of an empty string" when the pattern contains lookbehinds.

       4. Matching a subject string that is split into multiple segments may
       not always produce exactly the same result as matching over one single
       long string, especially when PCRE2_PARTIAL_SOFT is used. The section
       "Partial Matching and Word Boundaries" above describes an issue that
       arises if the pattern ends with \b or \B. Another kind of difference
       may occur when there are multiple matching possibilities, because (for
       PCRE2_PARTIAL_SOFT) a partial match result is given only when there are
       no completed matches. This means that as soon as the shortest match has
       been found, continuation to a new subject segment is no longer possi-
       ble. Consider this pcre2test example:

         re> /dog(sbody)?/
       data> dogsb\=ps
        0: dog
       data> do\=ps,dfa
       Partial match: do
       data> gsb\=ps,dfa,dfa_restart
        0: g
       data> dogsbody\=dfa
        0: dogsbody
        1: dog

       The first data line passes the string "dogsb" to a standard matching
       function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
       a partial match for "dogsbody", the result is not PCRE2_ERROR_PARTIAL,
       because the shorter string "dog" is a complete match. Similarly, when
       the subject is presented to a DFA matching function in several parts
       ("do" and "gsb" being the first two) the match stops when "dog" has
       been found, and it is not possible to continue. On the other hand, if
       "dogsbody" is presented as a single string, a DFA matching function
       finds both matches.

       Because of these problems, it is best to use PCRE2_PARTIAL_HARD when
       matching multi-segment data. The example above then behaves differ-
       ently:

         re> /dog(sbody)?/
       data> dogsb\=ph
       Partial match: dogsb
       data> do\=ps,dfa
       Partial match: do
       data> gsb\=ph,dfa,dfa_restart
       Partial match: gsb

       5. Patterns that contain alternatives at the top level which do not all
       start with the same pattern item may not work as expected when
       PCRE2_DFA_RESTART is used. For example, consider this pattern:

         1234|3789

       If the first part of the subject is "ABC123", a partial match of the
       first alternative is found at offset 3. There is no partial match for
       the second alternative, because such a match does not start at the same
       point in the subject string. Attempting to continue with the string
       "7890" does not yield a match because only those alternatives that
       match at one point in the subject are remembered. The problem arises
       because the start of the second alternative matches within the first
       alternative. There is no problem with anchored patterns or patterns
       such as:

         1234|ABCD

       where no string can be a partial match for both alternatives. This is
       not a problem if a standard matching function is used, because the
       entire match has to be rerun each time:

         re> /1234|3789/
       data> ABC123\=ph
       Partial match: 123
       data> 1237890
        0: 3789

       Of course, instead of using PCRE2_DFA_RESTART, the same technique of
       re-running the entire match can also be used with the DFA matching
       function. Another possibility is to work with two buffers. If a partial
       match at offset n in the first buffer is followed by "no match" when
       PCRE2_DFA_RESTART is used on the second buffer, you can then try a new
       match starting at offset n+1 in the first buffer.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 22 December 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions that are supported
       by PCRE2 are described in detail below. There is a quick-reference syn-
       tax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax
       and semantics as closely as it can. PCRE2 also supports some alterna-
       tive regular expression syntax (which does not conflict with the Perl
       syntax) in order to provide some compatibility with regular expressions
       in Python, .NET, and Oniguruma.

       Perl's regular expressions are described in its own documentation, and
       regular expressions in general are covered in a number of books, some
       of which have copious examples. Jeffrey Friedl's "Mastering Regular
       Expressions", published by O'Reilly, covers regular expressions in
       great detail. This description of PCRE2's regular expressions is
       intended as reference material.

       This document discusses the patterns that are supported by PCRE2 when
       its main matching function, pcre2_match(), is used. PCRE2 also has an
       alternative matching function, pcre2_dfa_match(), which matches using a
       different algorithm that is not Perl-compatible. Some of the features
       discussed below are not available when DFA matching is used. The advan-
       tages and disadvantages of the alternative function, and how it differs
       from the normal function, are discussed in the pcre2matching page.


SPECIAL START-OF-PATTERN ITEMS

       A number of options that can be passed to pcre2_compile() can also be
       set by special items at the start of a pattern. These are not Perl-com-
       patible, but are provided to make these options accessible to pattern
       writers who are not able to change the program that processes the pat-
       tern. Any number of these items may appear, but they must all be
       together right at the start of the pattern string, and the letters must
       be in upper case.

   UTF support

       In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
       as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
       can be specified for the 32-bit library, in which case it constrains
       the character values to valid Unicode code points. To process UTF
       strings, PCRE2 must be built to include Unicode support (which is the
       default). When using UTF strings you must either call the compiling
       function with the PCRE2_UTF option, or the pattern must start with the
       special sequence (*UTF), which is equivalent to setting the relevant
       option. How setting a UTF mode affects pattern matching is mentioned in
       several places below. There is also a summary of features in the
       pcre2unicode page.
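
       The distinction between characters and code units in a UTF-8 string
       can be illustrated with a short C sketch. The function name
       utf8_char_count() is hypothetical and is not part of PCRE2, and the
       code assumes that its input is already valid UTF-8. It relies on
       the fact that every UTF-8 continuation byte has the bit pattern
       10xxxxxx, so counting the remaining bytes counts the characters.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Counts the characters in a
   string of len 8-bit code units that is assumed to be valid UTF-8.
   Continuation bytes (10xxxxxx) are skipped; every other byte starts
   a new character. */
size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

       For instance, "caf" followed by the two bytes C3 A9 (e with an
       acute accent) is five code units but four characters. Without a
       UTF mode, the same five bytes are treated as five separate charac-
       ters.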
5926 5927 Some applications that allow their users to supply patterns may wish to 5928 restrict them to non-UTF data for security reasons. If the 5929 PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not 5930 allowed, and its appearance in a pattern causes an error. 5931 5932 Unicode property support 5933 5934 Another special sequence that may appear at the start of a pattern is 5935 (*UCP). This has the same effect as setting the PCRE2_UCP option: it 5936 causes sequences such as \d and \w to use Unicode properties to deter- 5937 mine character types, instead of recognizing only characters with codes 5938 less than 256 via a lookup table. 5939 5940 Some applications that allow their users to supply patterns may wish to 5941 restrict them for security reasons. If the PCRE2_NEVER_UCP option is 5942 passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in 5943 a pattern causes an error. 5944 5945 Locking out empty string matching 5946 5947 Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same 5948 effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option 5949 to whichever matching function is subsequently called to match the pat- 5950 tern. These options lock out the matching of empty strings, either 5951 entirely, or only at the start of the subject. 5952 5953 Disabling auto-possessification 5954 5955 If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as 5956 setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making 5957 quantifiers possessive when what follows cannot match the repeated 5958 item. For example, by default a+b is treated as a++b. For more details, 5959 see the pcre2api documentation. 5960 5961 Disabling start-up optimizations 5962 5963 If a pattern starts with (*NO_START_OPT), it has the same effect as 5964 setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti- 5965 mizations for quickly reaching "no match" results. 
For more details, 5966 see the pcre2api documentation. 5967 5968 Disabling automatic anchoring 5969 5970 If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect 5971 as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza- 5972 tions that apply to patterns whose top-level branches all start with .* 5973 (match any number of arbitrary characters). For more details, see the 5974 pcre2api documentation. 5975 5976 Disabling JIT compilation 5977 5978 If a pattern that starts with (*NO_JIT) is successfully compiled, an 5979 attempt by the application to apply the JIT optimization by calling 5980 pcre2_jit_compile() is ignored. 5981 5982 Setting match resource limits 5983 5984 The pcre2_match() function contains a counter that is incremented every 5985 time it goes round its main loop. The caller of pcre2_match() can set a 5986 limit on this counter, which therefore limits the amount of computing 5987 resource used for a match. The maximum depth of nested backtracking can 5988 also be limited; this indirectly restricts the amount of heap memory 5989 that is used, but there is also an explicit memory limit that can be 5990 set. 5991 5992 These facilities are provided to catch runaway matches that are pro- 5993 voked by patterns with huge matching trees (a typical example is a pat- 5994 tern with nested unlimited repeats applied to a long string that does 5995 not match). When one of these limits is reached, pcre2_match() gives an 5996 error return. The limits can also be set by items at the start of the 5997 pattern of the form 5998 5999 (*LIMIT_HEAP=d) 6000 (*LIMIT_MATCH=d) 6001 (*LIMIT_DEPTH=d) 6002 6003 where d is any number of decimal digits. However, the value of the set- 6004 ting must be less than the value set (or defaulted) by the caller of 6005 pcre2_match() for it to have any effect. In other words, the pattern 6006 writer can lower the limits set by the programmer, but not raise them. 
6007 If there is more than one setting of one of these limits, the lower 6008 value is used. The heap limit is specified in kibibytes (units of 1024 6009 bytes). 6010 6011 Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This 6012 name is still recognized for backwards compatibility. 6013 6014 The heap limit applies only when the pcre2_match() or pcre2_dfa_match() 6015 interpreters are used for matching. It does not apply to JIT. The match 6016 limit is used (but in a different way) when JIT is being used, or when 6017 pcre2_dfa_match() is called, to limit computing resource usage by those 6018 matching functions. The depth limit is ignored by JIT but is relevant 6019 for DFA matching, which uses function recursion for recursions within 6020 the pattern and for lookaround assertions and atomic groups. In this 6021 case, the depth limit controls the depth of such recursion. 6022 6023 Newline conventions 6024 6025 PCRE2 supports six different conventions for indicating line breaks in 6026 strings: a single CR (carriage return) character, a single LF (line- 6027 feed) character, the two-character sequence CRLF, any of the three pre- 6028 ceding, any Unicode newline sequence, or the NUL character (binary 6029 zero). The pcre2api page has further discussion about newlines, and 6030 shows how to set the newline convention when calling pcre2_compile(). 6031 6032 It is also possible to specify a newline convention by starting a pat- 6033 tern string with one of the following sequences: 6034 6035 (*CR) carriage return 6036 (*LF) linefeed 6037 (*CRLF) carriage return, followed by linefeed 6038 (*ANYCRLF) any of the three above 6039 (*ANY) all Unicode newline sequences 6040 (*NUL) the NUL character (binary zero) 6041 6042 These override the default and the options given to the compiling func- 6043 tion. For example, on a Unix system where LF is the default newline 6044 sequence, the pattern 6045 6046 (*CR)a.b 6047 6048 changes the convention to CR. 
That pattern matches "a\nb" because LF is 6049 no longer a newline. If more than one of these settings is present, the 6050 last one is used. 6051 6052 The newline convention affects where the circumflex and dollar asser- 6053 tions are true. It also affects the interpretation of the dot metachar- 6054 acter when PCRE2_DOTALL is not set, and the behaviour of \N when not 6055 followed by an opening brace. However, it does not affect what the \R 6056 escape sequence matches. By default, this is any Unicode newline 6057 sequence, for Perl compatibility. However, this can be changed; see the 6058 next section and the description of \R in the section entitled "Newline 6059 sequences" below. A change of \R setting can be combined with a change 6060 of newline convention. 6061 6062 Specifying what \R matches 6063 6064 It is possible to restrict \R to match only CR, LF, or CRLF (instead of 6065 the complete set of Unicode line endings) by setting the option 6066 PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by 6067 starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI- 6068 CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE. 6069 6070 6071EBCDIC CHARACTER CODES 6072 6073 PCRE2 can be compiled to run in an environment that uses EBCDIC as its 6074 character code instead of ASCII or Unicode (typically a mainframe sys- 6075 tem). In the sections below, character code values are ASCII or Uni- 6076 code; in an EBCDIC environment these characters may have different code 6077 values, and there are no code points greater than 255. 6078 6079 6080CHARACTERS AND METACHARACTERS 6081 6082 A regular expression is a pattern that is matched against a subject 6083 string from left to right. Most characters stand for themselves in a 6084 pattern, and match the corresponding characters in the subject. As a 6085 trivial example, the pattern 6086 6087 The quick brown fox 6088 6089 matches a portion of a subject string that is identical to itself. 
When 6090 caseless matching is specified (the PCRE2_CASELESS option), letters are 6091 matched independently of case. 6092 6093 The power of regular expressions comes from the ability to include 6094 alternatives and repetitions in the pattern. These are encoded in the 6095 pattern by the use of metacharacters, which do not stand for themselves 6096 but instead are interpreted in some special way. 6097 6098 There are two different sets of metacharacters: those that are recog- 6099 nized anywhere in the pattern except within square brackets, and those 6100 that are recognized within square brackets. Outside square brackets, 6101 the metacharacters are as follows: 6102 6103 \ general escape character with several uses 6104 ^ assert start of string (or line, in multiline mode) 6105 $ assert end of string (or line, in multiline mode) 6106 . match any character except newline (by default) 6107 [ start character class definition 6108 | start of alternative branch 6109 ( start subpattern 6110 ) end subpattern 6111 ? extends the meaning of ( 6112 also 0 or 1 quantifier 6113 also quantifier minimizer 6114 * 0 or more quantifier 6115 + 1 or more quantifier 6116 also "possessive quantifier" 6117 { start min/max quantifier 6118 6119 Part of a pattern that is in square brackets is called a "character 6120 class". In a character class the only metacharacters are: 6121 6122 \ general escape character 6123 ^ negate the class, but only if the first character 6124 - indicates character range 6125 [ POSIX character class (only if followed by POSIX 6126 syntax) 6127 ] terminates the character class 6128 6129 The following sections describe the use of each of the metacharacters. 6130 6131 6132BACKSLASH 6133 6134 The backslash character has several uses. Firstly, if it is followed by 6135 a character that is not a number or a letter, it takes away any special 6136 meaning that character may have. 
       This use of backslash as an escape character applies both inside and
       outside character classes.

       For example, if you want to match a * character, you must write \* in
       the pattern. This escaping action applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a
       backslash, you write \\.

       In a UTF mode, only ASCII numbers and letters have any special meaning
       after a backslash. All other characters (in particular, those whose
       code points are greater than 127) are treated as literals.

       If a pattern is compiled with the PCRE2_EXTENDED option, most white
       space in the pattern (other than in a character class), and characters
       between a # outside a character class and the next newline, inclusive,
       are ignored. An escaping backslash can be used to include a white space
       or # character as part of the pattern.

       If you want to remove the special meaning from a sequence of
       characters, you can do so by putting them between \Q and \E. This is
       different from Perl in that $ and @ are handled as literals in \Q...\E
       sequences in PCRE2, whereas in Perl, $ and @ cause variable
       interpolation. Also, Perl does "double-quotish backslash interpolation"
       on any backslashes between \Q and \E which, its documentation says,
       "may lead to confusing results". PCRE2 treats a backslash between \Q
       and \E just like any other character. Note the following examples:

         Pattern            PCRE2 matches   Perl matches

         \Qabc$xyz\E        abc$xyz         abc followed by the
                                              contents of $xyz
         \Qabc\$xyz\E       abc\$xyz        abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
         \QA\B\E            A\B             A\B
         \Q\\E              \               \\E

       The \Q...\E sequence is recognized both inside and outside character
       classes.
       An isolated \E that is not preceded by \Q is ignored. If \Q is not
       followed by \E later in the pattern, the literal interpretation
       continues to the end of the pattern (that is, \E is assumed at the
       end). If the isolated \Q is inside a character class, this causes an
       error, because the character class is not terminated by a closing
       square bracket.

   Non-printing characters

       A second use of backslash provides a way of encoding non-printing
       characters in patterns in a visible manner. There is no restriction on
       the appearance of non-printing characters in a pattern, but when a
       pattern is being prepared by text editing, it is often easier to use
       one of the following escape sequences than the binary character it
       represents. In an ASCII or Unicode environment, these escapes are as
       follows:

         \a          alarm, that is, the BEL character (hex 07)
         \cx         "control-x", where x is any printable ASCII character
         \e          escape (hex 1B)
         \f          form feed (hex 0C)
         \n          linefeed (hex 0A)
         \r          carriage return (hex 0D)
         \t          tab (hex 09)
         \0dd        character with octal code 0dd
         \ddd        character with octal code ddd, or backreference
         \o{ddd..}   character with octal code ddd..
         \xhh        character with hex code hh
         \x{hhh..}   character with hex code hhh..
         \N{U+hhh..} character with Unicode hex code point hhh..
         \uhhhh      character with hex code hhhh (when PCRE2_ALT_BSUX is set)

       The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF
       option is set, that is, when PCRE2 is operating in a Unicode mode. Perl
       also uses \N{name} to specify characters by Unicode name; PCRE2 does
       not support this. Note that when \N is not followed by an opening
       brace (curly bracket) it has an entirely different meaning, matching
       any character that is not a newline.
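
       As an illustration of the \cx entry in the table above, the ASCII
       rule (a lower case letter is converted to upper case, then bit 6,
       hex 40, is inverted) can be sketched in C. The function name
       control_escape_value is hypothetical and is not part of the PCRE2
       API; this is a sketch of the documented arithmetic, not PCRE2's own
       code.

```c
#include <assert.h>

/* Illustrative helper (an assumption of this sketch, not a PCRE2 function):
   returns the code value that \cx denotes in an ASCII environment, or -1
   where a compile-time error is documented (code unit < 32 or > 126). */
int control_escape_value(unsigned int x)
{
    if (x < 32 || x > 126)
        return -1;                      /* not a printable ASCII character */
    if (x >= 'a' && x <= 'z')
        x -= 'a' - 'A';                 /* lower case letter -> upper case */
    int value = (int)(x ^ 0x40u);       /* invert bit 6 (hex 40) */
    assert(value >= 0 && value <= 0x7F);
    return value;
}
```

       With this sketch, 'A' and 'a' both give hex 01, '{' gives hex 3B,
       and ';' gives hex 7B.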

       The precise effect of \cx on ASCII characters is as follows: if x is a
       lower case letter, it is converted to upper case. Then bit 6 of the
       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
       (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
       hex 7B (; is 3B). If the code unit following \c has a value less than
       32 or greater than 126, a compile-time error occurs.

       When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
       \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
       The \c escape is processed as specified for Perl in the perlebcdic
       document. The only characters that are allowed after \c are A-Z, a-z,
       or one of @, [, \, ], ^, _, or ?. Any other character provokes a
       compile-time error. The sequence \c@ encodes character code 0; after \c
       the letters (in either case) encode characters 1-26 (hex 01 to hex 1A);
       [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
       becomes either 255 (hex FF) or 95 (hex 5F).

       Thus, apart from \c?, these escapes generate the same character code
       values as they do in an ASCII environment, though the meanings of the
       values mostly differ. For example, \cG always generates code value 7,
       which is BEL in ASCII but DEL in EBCDIC.

       The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
       but because 127 is not a control character in EBCDIC, Perl makes it
       generate the APC character. Unfortunately, there are several variants
       of EBCDIC. In most of them the APC character has the value 255 (hex
       FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
       certain other characters have POSIX-BC values, PCRE2 makes \c? generate
       95; otherwise it generates 255.

       After \0 up to two further octal digits are read. If there are fewer
       than two digits, just those that are present are used.
       Thus the sequence \0\x\015 specifies two binary zeros followed by a CR
       character (code value 13). Make sure you supply two digits after the
       initial zero if the pattern character that follows is itself an octal
       digit.

       The escape \o must be followed by a sequence of octal digits, enclosed
       in braces. An error occurs if this is not the case. This escape is a
       recent addition to Perl; it provides a way of specifying character code
       points as octal numbers greater than 0777, and it also allows octal
       numbers and backreferences to be unambiguously specified.

       For greater clarity and unambiguity, it is best to avoid following \ by
       a digit greater than zero. Instead, use \o{} or \x{} to specify
       numerical character code points, and \g{} to specify backreferences.
       The following paragraphs describe the old, ambiguous syntax.

       The handling of a backslash followed by a digit other than 0 is
       complicated, and Perl has changed over time, causing PCRE2 also to
       change.

       Outside a character class, PCRE2 reads the digit and any following
       digits as a decimal number. If the number is less than 10, begins with
       the digit 8 or 9, or if there are at least that many previous capturing
       left parentheses in the expression, the entire sequence is taken as a
       backreference. A description of how this works is given later,
       following the discussion of parenthesized subpatterns. Otherwise, up to
       three octal digits are read to form a character code.

       Inside a character class, PCRE2 handles \8 and \9 as the literal
       characters "8" and "9", and otherwise reads up to three octal digits
       following the backslash, using them to generate a data character. Any
       subsequent digits stand for themselves.
       For example, outside a character class:

         \040   is another way of writing an ASCII space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a backreference
         \11    might be a backreference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a backreference, otherwise the
                   character with octal code 113
         \377   might be a backreference, otherwise
                   the value 255 (decimal)
         \81    is always a backreference

       Note that octal values of 100 or greater that are specified using this
       syntax must not be introduced by a leading zero, because no more than
       three octal digits are ever read.

       By default, after \x that is not followed by {, from zero to two
       hexadecimal digits are read (letters can be in upper or lower case).
       Any number of hexadecimal digits may appear between \x{ and }. If a
       character other than a hexadecimal digit appears between \x{ and }, or
       if there is no terminating }, an error occurs.

       If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
       just described only when it is followed by two hexadecimal digits.
       Otherwise, it matches a literal "x" character. In this mode, support
       for code points greater than 255 is provided by \u, which must be
       followed by four hexadecimal digits; otherwise it matches a literal "u"
       character.

       Characters whose value is less than 256 can be defined by either of the
       two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no
       difference in the way they are handled. For example, \xdc is exactly
       the same as \x{dc} (or \u00dc in PCRE2_ALT_BSUX mode).
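
       The rule for a backslash followed by a non-zero digit outside a
       character class, as described earlier in this section, can be
       sketched in C as follows. The type digit_escape and the function
       classify_digit_escape are hypothetical names used only for this
       illustration; the code restates the documented rule and is not taken
       from the PCRE2 source. The argument s points at the first digit after
       the backslash, and groups_so_far is assumed to be the number of
       capturing left parentheses seen so far.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch; not a PCRE2 structure. */
typedef struct {
    int is_backref;      /* 1: backreference, 0: octal character code */
    unsigned int value;  /* group number, or character code */
    size_t consumed;     /* number of digits used after the backslash */
} digit_escape;

digit_escape classify_digit_escape(const char *s, unsigned int groups_so_far)
{
    assert(s != NULL && s[0] >= '1' && s[0] <= '9');

    digit_escape r = {0, 0, 0};
    unsigned int number = 0;
    size_t n = 0;

    /* Read the digit and any following digits as a decimal number. */
    while (s[n] >= '0' && s[n] <= '9') {
        if (number < 100000000u)        /* saturate to avoid overflow */
            number = number * 10 + (unsigned int)(s[n] - '0');
        n++;
    }

    /* Less than 10, starts with 8 or 9, or enough previous groups. */
    if (number < 10 || s[0] >= '8' || number <= groups_so_far) {
        r.is_backref = 1;
        r.value = number;
        r.consumed = n;
        return r;
    }

    /* Otherwise, up to three octal digits form a character code. */
    while (r.consumed < 3 && s[r.consumed] >= '0' && s[r.consumed] <= '7') {
        r.value = r.value * 8 + (unsigned int)(s[r.consumed] - '0');
        r.consumed++;
    }
    return r;
}
```

       Under these assumptions, "11" with no previous groups yields the
       character code 9 (a tab), whereas "11" with eleven previous groups
       yields a backreference to group 11.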

   Constraints on character values

       Characters that are specified using octal or hexadecimal numbers are
       limited to certain values, as follows:

         8-bit non-UTF mode    no greater than 0xff
         16-bit non-UTF mode   no greater than 0xffff
         32-bit non-UTF mode   no greater than 0xffffffff
         All UTF modes         no greater than 0x10ffff and a valid code point

       Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
       (the so-called "surrogate" code points). The check for these can be
       disabled by the caller of pcre2_compile() by setting the option
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in
       UTF-8 and UTF-32 modes, because these values are not representable in
       UTF-16.

   Escape sequences in character classes

       All the sequences that define a single character value can be used both
       inside and outside character classes. In addition, inside a character
       class, \b is interpreted as the backspace character (hex 08).

       When not followed by an opening brace, \N is not allowed in a character
       class. \B, \R, and \X are not special inside a character class. Like
       other unrecognized alphabetic escape sequences, they cause an error.
       Outside a character class, these sequences have different meanings.

   Unsupported escape sequences

       In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
       string handler and used to modify the case of following characters. By
       default, PCRE2 does not support these escape sequences. However, if the
       PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be
       used to define a character by code point, as described above.

   Absolute and relative backreferences

       The sequence \g followed by a signed or unsigned number, optionally
       enclosed in braces, is an absolute or relative backreference.
       A named backreference can be coded as \g{name}. Backreferences are
       discussed later, following the discussion of parenthesized subpatterns.

   Absolute and relative subroutine calls

       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes, is
       an alternative syntax for referencing a subpattern as a "subroutine".
       Details are discussed later. Note that \g{...} (Perl syntax) and
       \g<...> (Oniguruma syntax) are not synonymous. The former is a
       backreference; the latter is a subroutine call.

   Generic character types

       Another use of backslash is for specifying generic character types:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal white space character
         \H     any character that is not a horizontal white space character
         \N     any character that is not a newline
         \s     any white space character
         \S     any character that is not a white space character
         \v     any vertical white space character
         \V     any character that is not a vertical white space character
         \w     any "word" character
         \W     any "non-word" character

       The \N escape sequence has the same meaning as the "." metacharacter
       when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
       the meaning of \N. Note that when \N is followed by an opening brace it
       has a different meaning. See the section entitled "Non-printing
       characters" above for details. Perl also uses \N{name} to specify
       characters by Unicode name; PCRE2 does not support this.

       Each pair of lower and upper case escape sequences partitions the
       complete set of characters into two disjoint sets. Any given character
       matches one, and only one, of each pair. The sequences can appear both
       inside and outside character classes.
       They each match one character of the appropriate type. If the current
       matching point is at the end of the subject string, all of them fail,
       because there is no character to match.

       The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
       (13), and space (32), which are defined as white space in the "C"
       locale. This list may vary if locale-specific matching is taking place.
       For example, in some locales the "non-breaking space" character (\xA0)
       is recognized as white space, and in others the VT character is not.

       A "word" character is an underscore or any character that is a letter
       or digit. By default, the definition of letters and digits is
       controlled by PCRE2's low-valued character tables, and may vary if
       locale-specific matching is taking place (see "Locale support" in the
       pcre2api page). For example, in a French locale such as "fr_FR" in
       Unix-like systems, or "french" in Windows, some character codes greater
       than 127 are used for accented letters, and these are then matched by
       \w. The use of locales with Unicode is discouraged.

       By default, characters whose code points are greater than 127 never
       match \d, \s, or \w, and always match \D, \S, and \W, although this may
       be different for characters in the range 128-255 when locale-specific
       matching is happening. These escape sequences retain their original
       meanings from before Unicode support was available, mainly for
       efficiency reasons. If the PCRE2_UCP option is set, the behaviour is
       changed so that Unicode properties are used to determine character
       types, as follows:

         \d  any character that matches \p{Nd} (decimal digit)
         \s  any character that matches \p{Z} or \h or \v
         \w  any character that matches \p{L} or \p{N}, plus underscore

       The upper case escapes match the inverse sets of characters.
       Note that \d matches only decimal digits, whereas \w matches any
       Unicode digit, as well as any Unicode letter, and underscore. Note also
       that PCRE2_UCP affects \b and \B, because they are defined in terms of
       \w and \W. Matching these sequences is noticeably slower when PCRE2_UCP
       is set.

       The sequences \h, \H, \v, and \V, in contrast to the other sequences,
       which match only ASCII characters by default, always match a specific
       list of code points, whether or not PCRE2_UCP is set. The horizontal
       space characters are:

         U+0009     Horizontal tab (HT)
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space

       The vertical space characters are:

         U+000A     Linefeed (LF)
         U+000B     Vertical tab (VT)
         U+000C     Form feed (FF)
         U+000D     Carriage return (CR)
         U+0085     Next line (NEL)
         U+2028     Line separator
         U+2029     Paragraph separator

       In 8-bit, non-UTF-8 mode, only the characters with code points less
       than 256 are relevant.

   Newline sequences

       Outside a character class, by default, the escape sequence \R matches
       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
       to the following:

         (?>\r\n|\n|\x0b|\f|\r|\x85)

       This is an example of an "atomic group", details of which are given
       below.
       This particular group matches either the two-character sequence CR
       followed by LF, or one of the single characters LF (linefeed, U+000A),
       VT (vertical tab, U+000B), FF (form feed, U+000C), CR (carriage return,
       U+000D), or NEL (next line, U+0085). Because this is an atomic group,
       the two-character sequence is treated as a single unit that cannot be
       split.

       In other modes, two additional characters whose code points are greater
       than 255 are added: LS (line separator, U+2028) and PS (paragraph
       separator, U+2029). Unicode support is not needed for these characters
       to be recognized.

       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the complete set of Unicode line endings) by setting the option
       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for
       "backslash R".) This can be made the default when PCRE2 is built; if
       this is the case, the other behaviour can be requested via the
       PCRE2_BSR_UNICODE option. It is also possible to specify these settings
       by starting a pattern string with one of the following sequences:

         (*BSR_ANYCRLF)   CR, LF, or CRLF only
         (*BSR_UNICODE)   any Unicode newline sequence

       These override the default and the options given to the compiling
       function. Note that these special settings, which are not
       Perl-compatible, are recognized only at the very start of a pattern,
       and that they must be in upper case. If more than one of them is
       present, the last one is used. They can be combined with a change of
       newline convention; for example, a pattern can start with:

         (*ANY)(*BSR_ANYCRLF)

       They can also be combined with the (*UTF) or (*UCP) special sequences.
       Inside a character class, \R is treated as an unrecognized escape
       sequence, and causes an error.
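
       The behaviour of \R in 8-bit non-UTF mode, with and without the
       BSR_ANYCRLF restriction, can be sketched as a small C function. The
       name bsr_match_length and its anycrlf flag are hypothetical and exist
       only for this illustration; the function restates the atomic group
       (?>\r\n|\n|\x0b|\f|\r|\x85) by hand and is not PCRE2's
       implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: length of the newline sequence that \R would match at
   subject[pos] in 8-bit non-UTF mode, or 0 if there is none. A non-zero
   anycrlf restricts the match to CR, LF, or CRLF. */
size_t bsr_match_length(const unsigned char *subject, size_t length,
                        size_t pos, int anycrlf)
{
    assert(subject != NULL && pos <= length);

    if (pos == length)
        return 0;                       /* no character left to match */

    unsigned char c = subject[pos];
    if (c == '\r')                      /* CRLF is taken as a single unit */
        return (pos + 1 < length && subject[pos + 1] == '\n') ? 2 : 1;
    if (c == '\n')
        return 1;
    if (!anycrlf && (c == 0x0b || c == 0x0c || c == 0x85))
        return 1;                       /* VT, FF, or NEL */
    return 0;
}
```

       Because CRLF is tested before a lone CR and the result is never
       revisited, the two-character sequence is consumed as one unit, which
       mirrors the effect of the atomic group.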

   Unicode character properties

       When PCRE2 is built with Unicode support (the default), three
       additional escape sequences that match characters with specific
       properties are available. In 8-bit non-UTF-8 mode, these sequences are
       of course limited to testing characters whose code points are less than
       256, but they do work in this mode. In 32-bit non-UTF mode, code points
       greater than 0x10ffff (the Unicode limit) may be encountered. These are
       all treated as being in the Common script and with an unassigned type.
       The extra escape sequences are:

         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       a Unicode extended grapheme cluster

       The property names represented by xx above are limited to the Unicode
       script names, the general category properties, "Any", which matches any
       character (including newline), and some special PCRE2 properties
       (described in the next section). Other Perl properties such as
       "InMusicalSymbols" are not supported by PCRE2. Note that \P{Any} does
       not match any characters, so always causes a match failure.

       Sets of Unicode characters are defined as belonging to certain scripts.
       A character from one of these sets can be matched using a script name.
       For example:

         \p{Greek}
         \P{Han}

       Those that are not part of an identified script are lumped together as
       "Common".
       The current list of scripts is:

       Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan,
       Balinese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo,
       Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,
       Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform,
       Cypriot, Cyrillic, Deseret, Devanagari, Dogra, Duployan,
       Egyptian_Hieroglyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic,
       Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul,
       Hanifi_Rohingya, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic,
       Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,
       Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki,
       Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,
       Lydian, Mahajani, Makasar, Malayalam, Mandaic, Manichaean, Marchen,
       Masaram_Gondi, Medefaidrin, Meetei_Mayek, Mende_Kikakui,
       Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
       Multani, Myanmar, Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham,
       Ol_Chiki, Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic,
       Old_Persian, Old_Sogdian, Old_South_Arabian, Old_Turkic, Oriya, Osage,
       Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
       Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada,
       Shavian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
       Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
       Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan,
       Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.

       Each character has exactly one Unicode general category property,
       specified by a two-letter abbreviation. For compatibility with Perl,
       negation can be specified by including a circumflex between the opening
       brace and the property name.
       For example, \p{^Lu} is the same as \P{Lu}.

       If only one letter is specified with \p or \P, it includes all the
       general category properties that start with that letter. In this case,
       in the absence of negation, the curly brackets in the escape sequence
       are optional; these two examples have the same effect:

         \p{L}
         \pL

       The following general category property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       The special property L& is also supported: it matches a character that
       has the Lu, Ll, or Lt property, in other words, a letter that is not
       classified as a modifier or "other".

       The Cs (Surrogate) property applies only to characters in the range
       U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
       so cannot be tested by PCRE2, unless UTF validity checking has been
       turned off (see the discussion of PCRE2_NO_UTF_CHECK in the pcre2api
       page). Perl does not support the Cs property.
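
       The way a general category name inside \p{...} selects characters
       (one letter covers every category starting with that letter, L&
       covers Lu, Ll, and Lt, and a leading circumflex negates) can be
       sketched in C. The function category_property_matches is a
       hypothetical name for this illustration only. It assumes the caller
       already knows the character's two-letter general category, and it
       does not handle script names or PCRE2's special properties.

```c
#include <assert.h>
#include <string.h>

/* Illustrative only: does the property name written inside \p{...} select
   a character whose two-letter general category is 'category'? */
int category_property_matches(const char *prop, const char *category)
{
    assert(prop != NULL && category != NULL && strlen(category) == 2);

    int negate = 0;
    if (prop[0] == '^') {               /* \p{^Lu} is the same as \P{Lu} */
        negate = 1;
        prop++;
    }

    int hit;
    if (strcmp(prop, "Any") == 0)
        hit = 1;                        /* matches any character */
    else if (strcmp(prop, "L&") == 0)
        hit = category[0] == 'L' &&
              (category[1] == 'u' || category[1] == 'l' || category[1] == 't');
    else if (strlen(prop) == 1)
        hit = prop[0] == category[0];   /* \pL covers Ll, Lm, Lo, Lt, Lu */
    else
        hit = strcmp(prop, category) == 0;

    return negate ? !hit : hit;
}
```

       A \P{xx} test is simply the logical inverse of the value returned for
       \p{xx}.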

       The long synonyms for property names that Perl supports (such as
       \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
       any of these properties with "Is".

       No character that is in the Unicode table has the Cn (unassigned)
       property. Instead, this property is assumed for any code point that is
       not in the Unicode table.

       Specifying caseless matching does not affect these escape sequences.
       For example, \p{Lu} always matches only upper case letters. This is
       different from the behaviour of current versions of Perl.

       Matching characters by Unicode property is not fast, because PCRE2 has
       to do a multistage table lookup in order to find a character's
       property. That is why the traditional escape sequences such as \d and
       \w do not use Unicode properties in PCRE2 by default, though you can
       make them do so by setting the PCRE2_UCP option or by starting the
       pattern with (*UCP).

   Extended grapheme clusters

       The \X escape matches any number of Unicode characters that form an
       "extended grapheme cluster", and treats the sequence as an atomic group
       (see below). Unicode supports various kinds of composite character by
       giving each character a grapheme breaking property, and having rules
       that use these properties to define the boundaries of extended grapheme
       clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
       Text Segmentation". Unicode 11.0.0 abandoned the use of some previous
       properties that had been used for emojis. Instead it introduced
       various emoji-specific properties. PCRE2 uses only the Extended
       Pictographic property.

       \X always matches at least one character. Then it decides whether to
       add additional characters according to the following rules for ending a
       cluster:

       1. End at the end of the subject string.

       2. Do not end between CR and LF; otherwise end after any control
       character.

       3. Do not break Hangul (a Korean script) syllable sequences. Hangul
       characters are of five types: L, V, T, LV, and LVT. An L character may
       be followed by an L, V, LV, or LVT character; an LV or V character may
       be followed by a V or T character; an LVT or T character may be
       followed only by a T character.

       4. Do not end before extending characters or spacing marks or the
       "zero-width joiner" character. Characters with the "mark" property
       always have the "extend" grapheme breaking property.

       5. Do not end after prepend characters.

       6. Do not break within emoji modifier sequences or emoji zwj sequences.
       That is, do not break between characters with the Extended_Pictographic
       property. Extend and ZWJ characters are allowed between the
       characters.

       7. Do not break within emoji flag sequences. That is, do not break
       between regional indicator (RI) characters if there are an odd number
       of RI characters before the break point.

       8. Otherwise, end the cluster.

   PCRE2's additional properties

       As well as the standard Unicode properties described above, PCRE2
       supports four more that make it possible to convert traditional escape
       sequences such as \w and \s to use Unicode properties. PCRE2 uses these
       non-standard, non-Perl properties internally when PCRE2_UCP is set.
       However, they may also be used explicitly. These properties are:

         Xan   Any alphanumeric character
         Xps   Any POSIX space character
         Xsp   Any Perl space character
         Xwd   Any Perl "word" character

       Xan matches characters that have either the L (letter) or the N
       (number) property. Xps matches the characters tab, linefeed, vertical
       tab, form feed, or carriage return, and any other character that has
       the Z (separator) property.
       Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for
       Perl compatibility, but Perl changed. Xwd matches the same characters
       as Xan, plus underscore.

       There is another non-standard property, Xuc, which matches any
       character that can be represented by a Universal Character Name in C++
       and other programming languages. These are the characters $, @, `
       (grave accent), and all characters with Unicode code points greater
       than or equal to U+00A0, except for the surrogates U+D800 to U+DFFF.
       Note that most base (ASCII) characters are excluded. (Universal
       Character Names are of the form \uHHHH or \UHHHHHHHH where H is a
       hexadecimal digit. Note that the Xuc property does not match these
       sequences but the characters that they represent.)

   Resetting the match start

       In normal use, the escape sequence \K causes any previously matched
       characters not to be included in the final matched sequence that is
       returned. For example, the pattern:

         foo\Kbar

       matches "foobar", but reports that it has matched "bar". \K does not
       interact with anchoring in any way. The pattern:

         ^foo\Kbar

       matches only when the subject begins with "foobar" (in single line
       mode), though it again reports the matched string as "bar". This
       feature is similar to a lookbehind assertion (described below).
       However, in this case, the part of the subject before the real match
       does not have to be of fixed length, as lookbehind assertions do. The
       use of \K does not interfere with the setting of captured substrings.
       For example, when the pattern

         (foo)\Kbar

       matches "foobar", the first substring is still set to "foo".

       Perl documents that the use of \K within assertions is "not well
       defined". In PCRE2, \K is acted upon when it occurs inside positive
       assertions, but is ignored in negative assertions.
       Note that when a pattern such as (?=ab\K) matches, the reported start
       of the match can be greater than the end of the match. Using \K in a
       lookbehind assertion at the start of a pattern can also lead to odd
       effects. For example, consider this pattern:

         (?<=\Kfoo)bar

       If the subject is "foobar", a call to pcre2_match() with a starting
       offset of 3 succeeds and reports the matching string as "foobar", that
       is, the start of the reported match is earlier than where the match
       started.

   Simple assertions

       The final use of backslash is for certain simple assertions. An
       assertion specifies a condition that has to be met at a particular
       point in a match, without consuming any characters from the subject
       string. The use of subpatterns for more complicated assertions is
       described below. The backslashed assertions are:

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at the start of the subject
         \Z     matches at the end of the subject
                 also matches before a newline at the end of the subject
         \z     matches only at the end of the subject
         \G     matches at the first matching position in the subject

       Inside a character class, \b has a different meaning; it matches the
       backspace character. If any other of these assertions appears in a
       character class, an "invalid escape sequence" error is generated.

       A word boundary is a position in the subject string where the current
       character and the previous character do not both match \w or \W (i.e.
       one matches \w and the other matches \W), or the start or end of the
       string if the first or last character matches \w, respectively. In a
       UTF mode, the meanings of \w and \W can be changed by setting the
       PCRE2_UCP option. When this is done, it also affects \b and \B.
       Neither PCRE2 nor Perl has a separate "start of word" or "end of
       word" metasequence. However, whatever follows \b normally determines
       which it is. For example, the fragment \ba matches "a" at the start
       of a word.

       The \A, \Z, and \z assertions differ from the traditional circumflex
       and dollar (described in the next section) in that they only ever
       match at the very start and end of the subject string, whatever
       options are set. Thus, they are independent of multiline mode. These
       three assertions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL
       options, which affect only the behaviour of the circumflex and dollar
       metacharacters. However, if the startoffset argument of pcre2_match()
       is non-zero, indicating that matching is to start at a point other
       than the beginning of the subject, \A can never match. The difference
       between \Z and \z is that \Z matches before a newline at the end of
       the string as well as at the very end, whereas \z matches only at the
       end.

       The \G assertion is true only when the current matching position is
       at the start point of the matching process, as specified by the
       startoffset argument of pcre2_match(). It differs from \A when the
       value of startoffset is non-zero. By calling pcre2_match() multiple
       times with appropriate arguments, you can mimic Perl's /g option, and
       it is in this kind of implementation where \G can be useful.

       Note, however, that PCRE2's implementation of \G, being true at the
       starting character of the matching process, is subtly different from
       Perl's, which defines it as true at the end of the previous match. In
       Perl, these can be different when the previously matched string was
       empty. Because PCRE2 does just one match at a time, it cannot
       reproduce this behaviour.
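
       The repeated-call technique described above can be sketched as
       follows. This is a minimal illustration in Python, not a PCRE2
       program: Python's re module has no \G, but Pattern.match(string, pos)
       succeeds only at pos, which stands in here for a \G-anchored pattern
       run with a non-zero startoffset. The names TOKEN and tokenize are
       hypothetical.

```python
import re

# Hypothetical token pattern: digits, lower-case letters, or white space.
# No alternative can match the empty string, so every match advances.
TOKEN = re.compile(r"\d+|[a-z]+|\s+")

def tokenize(subject):
    """Collect adjacent tokens, stopping at the first position with no match."""
    tokens = []
    pos = 0
    while pos < len(subject):
        # match() is anchored at pos, much as \G is anchored at startoffset.
        m = TOKEN.match(subject, pos)
        if m is None:
            break
        tokens.append(m.group())
        pos = m.end()  # the next attempt starts where this match ended
    return tokens, pos
```

       For the subject "ab 12!x" the loop collects three tokens and stops at
       offset 5, because "!" matches none of the alternatives there.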

       If all the alternatives of a pattern begin with \G, the expression is
       anchored to the starting match position, and the "anchored" flag is
       set in the compiled regular expression.


CIRCUMFLEX AND DOLLAR

       The circumflex and dollar metacharacters are zero-width assertions.
       That is, they test for a particular condition being true without
       consuming any characters from the subject string. These two
       metacharacters are concerned with matching the starts and ends of
       lines. If the newline convention is set so that only the
       two-character sequence CRLF is recognized as a newline, isolated CR
       and LF characters are treated as ordinary data characters, and are
       not recognized as newlines.

       Outside a character class, in the default matching mode, the
       circumflex character is an assertion that is true only if the current
       matching point is at the start of the subject string. If the
       startoffset argument of pcre2_match() is non-zero, or if PCRE2_NOTBOL
       is set, circumflex can never match if the PCRE2_MULTILINE option is
       unset. Inside a character class, circumflex has an entirely different
       meaning (see below).

       Circumflex need not be the first character of the pattern if a number
       of alternatives are involved, but it should be the first thing in
       each alternative in which it appears if the pattern is ever to match
       that branch. If all possible alternatives start with a circumflex,
       that is, if the pattern is constrained to match only at the start of
       the subject, it is said to be an "anchored" pattern. (There are also
       other constructs that can cause a pattern to be anchored.)

       The dollar character is an assertion that is true only if the current
       matching point is at the end of the subject string, or immediately
       before a newline at the end of the string (by default), unless
       PCRE2_NOTEOL is set.
       Note, however, that it does not actually match the newline. Dollar
       need not be the last character of the pattern if a number of
       alternatives are involved, but it should be the last item in any
       branch in which it appears. Dollar has no special meaning in a
       character class.

       The meaning of dollar can be changed so that it matches only at the
       very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
       compile time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar metacharacters are changed
       if the PCRE2_MULTILINE option is set. When this is the case, a dollar
       character matches before any newlines in the string, as well as at
       the very end, and a circumflex matches immediately after internal
       newlines as well as at the start of the subject string. It does not
       match after a newline that ends the string, for compatibility with
       Perl. However, this can be changed by setting the
       PCRE2_ALT_CIRCUMFLEX option.

       For example, the pattern /^abc$/ matches the subject string
       "def\nabc" (where \n represents a newline) in multiline mode, but not
       otherwise. Consequently, patterns that are anchored in single line
       mode because all branches start with ^ are not anchored in multiline
       mode, and a match for circumflex is possible when the startoffset
       argument of pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY
       option is ignored if PCRE2_MULTILINE is set.

       When the newline convention (see "Newline conventions" below)
       recognizes the two-character sequence CRLF as a newline, this is
       preferred, even if the single characters CR and LF are also
       recognized as newlines. For example, if the newline convention is
       "any", a multiline mode circumflex matches before "xyz" in the string
       "abc\r\nxyz" rather than after CR, even though CR on its own is a
       valid newline.
       (It also matches at the very start of the string, of course.)

       Note that the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes, and if all branches of a
       pattern start with \A it is always anchored, whether or not
       PCRE2_MULTILINE is set.


FULL STOP (PERIOD, DOT) AND \N

       Outside a character class, a dot in the pattern matches any one
       character in the subject string except (by default) a character that
       signifies the end of a line.

       When a line ending is defined as a single character, dot never
       matches that character; when the two-character sequence CRLF is used,
       dot does not match CR if it is immediately followed by LF, but
       otherwise it matches all characters (including isolated CRs and LFs).
       When any Unicode line endings are being recognized, dot does not
       match CR or LF or any of the other line ending characters.

       The behaviour of dot with regard to newlines can be changed. If the
       PCRE2_DOTALL option is set, a dot matches any one character, without
       exception. If the two-character sequence CRLF is present in the
       subject string, it takes two dots to match it.

       The handling of dot is entirely independent of the handling of
       circumflex and dollar, the only relationship being that they both
       involve newlines. Dot has no special meaning in a character class.

       The escape sequence \N when not followed by an opening brace behaves
       like a dot, except that it is not affected by the PCRE2_DOTALL
       option. In other words, it matches any character except one that
       signifies the end of a line.

       When \N is followed by an opening brace it has a different meaning.
       See the section entitled "Non-printing characters" above for details.
       Perl also uses \N{name} to specify characters by Unicode name; PCRE2
       does not support this.
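
       The circumflex, dollar, and dot rules from the last two sections can
       be checked with a short Python sketch. It uses Python's re module,
       not PCRE2, on the assumption that re.MULTILINE and re.DOTALL behave
       like PCRE2_MULTILINE and PCRE2_DOTALL for these cases. Python
       recognizes only LF as a newline, so the CRLF rules are not covered.

```python
import re

# Single-line mode: ^ and $ see only the ends of the subject.
single = re.search(r"^abc$", "def\nabc")

# Multiline mode: ^ also matches just after the internal newline.
multi = re.search(r"^abc$", "def\nabc", re.MULTILINE)

# By default, $ matches before a newline that ends the subject,
# but the newline itself is not part of the match.
before_final_newline = re.search(r"abc$", "abc\n")

# Dot does not match a newline unless the "dotall" option is set.
dot_default = re.search(r"a.c", "a\nc")
dot_all = re.search(r"a.c", "a\nc", re.DOTALL)
```

       Here single and dot_default are None, while the other three searches
       succeed.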


MATCHING A SINGLE CODE UNIT

       Outside a character class, the escape sequence \C matches any one
       code unit, whether or not a UTF mode is set. In the 8-bit library,
       one code unit is one byte; in the 16-bit library it is a 16-bit unit;
       in the 32-bit library it is a 32-bit unit. Unlike a dot, \C always
       matches line-ending characters. The feature is provided in Perl in
       order to match individual bytes in UTF-8 mode, but it is unclear how
       it can usefully be used.

       Because \C breaks up characters into individual code units, matching
       one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
       string may start with a malformed UTF character. This has undefined
       results, because PCRE2 assumes that it is matching character by
       character in a valid UTF string (by default it checks the subject
       string's validity at the start of processing unless the
       PCRE2_NO_UTF_CHECK option is used).

       An application can lock out the use of \C by setting the
       PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
       possible to build PCRE2 with the use of \C permanently disabled.

       PCRE2 does not allow \C to appear in lookbehind assertions (described
       below) in UTF-8 or UTF-16 modes, because this would make it
       impossible to calculate the length of the lookbehind. Neither the
       alternative matching function pcre2_dfa_match() nor the JIT optimizer
       supports \C in these UTF modes. The former gives a match-time error;
       the latter fails to optimize and so the match is always run using the
       interpreter.

       In the 32-bit library, however, \C is always supported (when not
       explicitly locked out) because it always matches a single code unit,
       whether or not UTF-32 is specified.

       In general, the \C escape sequence is best avoided.
       However, one way of using it that avoids the problem of malformed
       UTF-8 or UTF-16 characters is to use a lookahead to check the length
       of the next character, as in this pattern, which could be used with a
       UTF-8 string (ignore white space and line breaks):

         (?| (?=[\x00-\x7f])(\C) |
             (?=[\x80-\x{7ff}])(\C)(\C) |
             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

       In this example, a group that starts with (?| resets the capturing
       parentheses numbers in each alternative (see "Duplicate Subpattern
       Numbers" below). The assertions at the start of each branch check the
       next UTF-8 character for values whose encoding uses 1, 2, 3, or 4
       bytes, respectively. The character's individual bytes are then
       captured by the appropriate number of \C groups.


SQUARE BRACKETS AND CHARACTER CLASSES

       An opening square bracket introduces a character class, terminated by
       a closing square bracket. A closing square bracket on its own is not
       special by default. If a closing square bracket is required as a
       member of the class, it should be the first data character in the
       class (after an initial circumflex, if present) or escaped with a
       backslash. This means that, by default, an empty class cannot be
       defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a
       closing square bracket at the start does end the (empty) class.

       A character class matches a single character in the subject. A
       matched character must be in the set of characters defined by the
       class, unless the first character in the class definition is a
       circumflex, in which case the subject character must not be in the
       set defined by the class. If a circumflex is actually required as a
       member of the class, ensure it is not the first character, or escape
       it with a backslash.

       For example, the character class [aeiou] matches any lower case
       vowel, while [^aeiou] matches any character that is not a lower case
       vowel. Note that a circumflex is just a convenient notation for
       specifying the characters that are in the class by enumerating those
       that are not. A class that starts with a circumflex is not an
       assertion; it still consumes a character from the subject string, and
       therefore it fails if the current pointer is at the end of the
       string.

       Characters in a class may be specified by their code points using \o,
       \x, or \N{U+hh..} in the usual way. When caseless matching is set,
       any letters in a class represent both their upper case and lower case
       versions, so for example, a caseless [aeiou] matches "A" as well as
       "a", and a caseless [^aeiou] does not match "A", whereas a caseful
       version would.

       Characters that might indicate line breaks are never treated in any
       special way when matching character classes, whatever line-ending
       sequence is in use, and whatever setting of the PCRE2_DOTALL and
       PCRE2_MULTILINE options is used. A class such as [^a] always matches
       one of these characters.

       The generic character type escape sequences \d, \D, \h, \H, \p, \P,
       \s, \S, \v, \V, \w, and \W may appear in a character class, and add
       the characters that they match to the class. For example, [\dABCDEF]
       matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option
       affects the meanings of \d, \s, \w and their upper case partners,
       just as it does when they appear outside a character class, as
       described in the section entitled "Generic character types" above.
       The escape sequence \b has a different meaning inside a character
       class; it matches the backspace character. The sequences \B, \R, and
       \X are not special inside a character class.
       Like any other unrecognized escape sequences, they cause an error.
       The same is true for \N when not followed by an opening brace.

       The minus (hyphen) character can be used to specify a range of
       characters in a character class. For example, [d-m] matches any
       letter between d and m, inclusive. If a minus character is required
       in a class, it must be escaped with a backslash or appear in a
       position where it cannot be interpreted as indicating a range,
       typically as the first or last character in the class, or immediately
       after a range. For example, [b-d-z] matches letters in the range b to
       d, a hyphen character, or z.

       Perl treats a hyphen as a literal if it appears before or after a
       POSIX class (see below) or before or after a character type escape
       such as \d or \H. However, unless the hyphen is the last character in
       the class, Perl outputs a warning in its warning mode, as this is
       most likely a user error. As PCRE2 has no facility for warning, an
       error is given in these cases.

       It is not possible to have the literal character "]" as the end
       character of a range. A pattern such as [W-]46] is interpreted as a
       class of two characters ("W" and "-") followed by a literal string
       "46]", so it would match "W46]" or "-46]". However, if the "]" is
       escaped with a backslash it is interpreted as the end of range, so
       [W-\]46] is interpreted as a class containing a range followed by two
       other characters. The octal or hexadecimal representation of "]" can
       also be used to end a range.

       Ranges normally include all code points between the start and end
       characters, inclusive. They can also be used for code points
       specified numerically, for example [\000-\037]. Ranges can include
       any characters that are valid for the current mode.
       In any UTF mode, the so-called "surrogate" characters (those whose
       code points lie between 0xd800 and 0xdfff inclusive) may not be
       specified explicitly by default (the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check).
       However, ranges such as [\x{d7ff}-\x{e000}], which include the
       surrogates, are always permitted.

       There is a special case in EBCDIC environments for ranges whose end
       points are both specified as literal letters in the same case. For
       compatibility with Perl, EBCDIC code points within the range that are
       not letters are omitted. For example, [h-k] matches only four
       characters, even though the codes for h and k are 0x88 and 0x92, a
       range of 11 code points. However, if the range is specified
       numerically, for example, [\x88-\x92] or [h-\x92], all code points
       are included.

       If a range that includes letters is used when caseless matching is
       set, it matches the letters in either case. For example, [W-c] is
       equivalent to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF
       mode, if character tables for a French locale are in use, [\xc8-\xcb]
       matches accented E characters in both cases.

       A circumflex can conveniently be used with the upper case character
       types to specify a more restricted set of characters than the
       matching lower case type. For example, the class [^\W_] matches any
       letter or digit, but not underscore, whereas [\w] includes
       underscore. A positive character class should be read as "something
       OR something OR ..." and a negative class as "NOT something AND NOT
       something AND NOT ...".
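
       Several of the class examples above can be tried out in a short
       Python sketch. It relies on Python's re module, which is assumed to
       parse these particular classes in the same way as PCRE2; the helper
       name members is hypothetical.

```python
import re

def members(cls, candidates):
    """Return the candidate characters that the class matches."""
    return [ch for ch in candidates if re.fullmatch(cls, ch)]

# A hyphen directly after a range is a literal: b, c, d, "-", or z.
after_range = members(r"[b-d-z]", "abcdez-")

# "]" cannot end a range unescaped, so this is [W-] followed by "46]".
literal_tail = re.fullmatch(r"[W-]46]", "-46]")

# Escaped, "]" ends the range W..]; "4" and "6" are extra members.
escaped_end = members(r"[W-\]46]", "VWZ[]46-")

# Negated upper-case type: letters and digits, but not underscore.
no_underscore = members(r"[^\W_]", "a7_ ")

# A negated class still consumes a character, so it fails on "".
at_end = re.search(r"[^aeiou]", "")
```

       The last search returns None: although the empty subject contains no
       vowel, there is no character for [^aeiou] to consume.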

       The only metacharacters that are recognized in character classes are
       backslash, hyphen (only where it can be interpreted as specifying a
       range), circumflex (only at the start), opening square bracket (only
       when it can be interpreted as introducing a POSIX class name, or for
       a special compatibility feature - see the next two sections), and the
       terminating closing square bracket. However, escaping other
       non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation for character classes. This uses
       names enclosed by [: and :] within the enclosing square brackets.
       PCRE2 also supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported
       class names are:

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits and space
         space    white space (the same as \s from PCRE 8.34)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The default "space" characters are HT (9), LF (10), VT (11), FF (12),
       CR (13), and space (32). If locale-specific matching is taking place,
       the list of space characters may be different; there may be fewer or
       more of them. "Space" and \s match the same set of characters.

       The name "word" is a Perl extension, and "blank" is a GNU extension
       from Perl 5.8. Another Perl extension is negation, which is indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit.
       PCRE2 (and Perl) also recognize the POSIX syntax [.ch.] and [=ch=]
       where "ch" is a "collating element", but these are not supported, and
       an error is given if they are encountered.

       By default, characters with values greater than 127 do not match any
       of the POSIX character classes, although this may be different for
       characters in the range 128-255 when locale-specific matching is
       happening. However, if the PCRE2_UCP option is passed to
       pcre2_compile(), some of the classes are changed so that Unicode
       character properties are used. This is achieved by replacing certain
       POSIX classes with other sequences, as follows:

         [:alnum:]  becomes  \p{Xan}
         [:alpha:]  becomes  \p{L}
         [:blank:]  becomes  \h
         [:cntrl:]  becomes  \p{Cc}
         [:digit:]  becomes  \p{Nd}
         [:lower:]  becomes  \p{Ll}
         [:space:]  becomes  \p{Xps}
         [:upper:]  becomes  \p{Lu}
         [:word:]   becomes  \p{Xwd}

       Negated versions, such as [:^alpha:] use \P instead of \p. Three
       other POSIX classes are handled specially in UCP mode:

       [:graph:] This matches characters that have glyphs that mark the page
                 when printed. In Unicode property terms, it matches all
                 characters with the L, M, N, P, S, or Cf properties, except
                 for:

                   U+061C           Arabic Letter Mark
                   U+180E           Mongolian Vowel Separator
                   U+2066 - U+2069  Various "isolate"s

       [:print:] This matches the same characters as [:graph:] plus space
                 characters that are not controls, that is, characters with
                 the Zs property.

       [:punct:] This matches all characters that have the Unicode P
                 (punctuation) property, plus those characters with code
                 points less than 256 that have the S (Symbol) property.

       The other POSIX classes are unchanged, and match only characters with
       code points less than 256.
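
       As a rough aid to reading the class table in this section, the
       default membership for code points 0 to 127 can be written out as
       Python sets. This is only a sketch under stated assumptions: no
       PCRE2_UCP, no locale-specific tables, and "cntrl", "punct", "graph",
       and "print" taken to follow the usual C-locale definitions. The names
       POSIX and in_class are hypothetical and nothing here calls PCRE2.

```python
import string

# Assumed default membership for code points 0-127 (no UCP, no locale).
POSIX = {
    "alnum": set(string.ascii_letters + string.digits),
    "alpha": set(string.ascii_letters),
    "ascii": {chr(c) for c in range(128)},
    "blank": set(" \t"),
    "cntrl": {chr(c) for c in range(32)} | {chr(127)},
    "digit": set(string.digits),
    "lower": set(string.ascii_lowercase),
    "punct": set(string.punctuation),
    "space": set("\t\n\v\f\r "),  # HT, LF, VT, FF, CR, space
    "upper": set(string.ascii_uppercase),
    "word": set(string.ascii_letters + string.digits + "_"),
    "xdigit": set(string.hexdigits),
}
POSIX["graph"] = POSIX["alnum"] | POSIX["punct"]  # printing, no space
POSIX["print"] = POSIX["graph"] | {" "}           # printing, with space

def in_class(name, ch):
    """Test [:name:] or, with a leading ^, the negated [:^name:]."""
    if name.startswith("^"):
        return ch not in POSIX[name[1:]]
    return ch in POSIX[name]
```

       With these sets, a class such as [12[:^digit:]] corresponds to the
       test ch in "12" or in_class("^digit", ch).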


COMPATIBILITY FEATURE FOR WORD BOUNDARIES

       In the POSIX.2 compliant library that was included in 4.4BSD Unix,
       the ugly syntax [[:<:]] and [[:>:]] is used for matching "start of
       word" and "end of word". PCRE2 treats these items as follows:

         [[:<:]]  is converted to  \b(?=\w)
         [[:>:]]  is converted to  \b(?<=\w)

       Only these exact character sequences are recognized. A sequence such
       as [a[:<:]b] provokes an error for an unrecognized POSIX class name.
       This support is not compatible with Perl. It is provided to help
       migrations from other environments, and is best not used in any new
       patterns. Note that \b matches at the start and the end of a word
       (see "Simple assertions" above), and in a Perl-style pattern the
       preceding or following character normally shows which is wanted,
       without the need for the assertions that are used above in order to
       give exactly the POSIX behaviour.


VERTICAL BAR

       Vertical bar characters are used to separate alternative patterns.
       For example, the pattern

         gilbert|sullivan

       matches either "gilbert" or "sullivan". Any number of alternatives
       may appear, and an empty alternative is permitted (matching the empty
       string). The matching process tries each alternative in turn, from
       left to right, and the first one that succeeds is used. If the
       alternatives are within a subpattern (defined below), "succeeds"
       means matching the rest of the main pattern as well as the
       alternative in the subpattern.


INTERNAL OPTION SETTING

       The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
       PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
       options can be changed from within the pattern by a sequence of
       letters enclosed between "(?" and ")". These options are
       Perl-compatible, and are described in detail in the pcre2api
       documentation.
       The option letters are:

         i  for PCRE2_CASELESS
         m  for PCRE2_MULTILINE
         n  for PCRE2_NO_AUTO_CAPTURE
         s  for PCRE2_DOTALL
         x  for PCRE2_EXTENDED
         xx for PCRE2_EXTENDED_MORE

       For example, (?im) sets caseless, multiline matching. It is also
       possible to unset these options by preceding the relevant letters
       with a hyphen, for example (?-im). The two "extended" options are not
       independent; unsetting either one cancels the effects of both of
       them.

       A combined setting and unsetting such as (?im-sx), which sets
       PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
       PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the
       options string. If a letter appears both before and after the hyphen,
       the option is unset. An empty options setting "(?)" is allowed.
       Needless to say, it has no effect.

       If the first character following (? is a circumflex, it causes all of
       the above options to be unset. Thus, (?^) is equivalent to (?-imnsx).
       Letters may follow the circumflex to cause some options to be
       re-instated, but a hyphen may not appear.

       The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
       changed in the same way as the Perl-compatible options by using the
       characters J and U respectively. However, these are not unset by
       (?^).

       When one of these option changes occurs at top level (that is, not
       inside subpattern parentheses), the change applies to the remainder
       of the pattern that follows. An option change within a subpattern
       (see below for a description of subpatterns) affects only that part
       of the subpattern that follows it, so

         (a(?i)b)c

       matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
       not used). By this means, options can be made to have different
       settings in different parts of the pattern.
       Any changes made in one alternative do carry on into subsequent
       branches within the same subpattern. For example,

         (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though when matching "C" the
       first branch is abandoned before the option setting. This is because
       the effects of option settings happen at compile time. There would be
       some very weird behaviour otherwise.

       As a convenient shorthand, if any option settings are required at the
       start of a non-capturing subpattern (see the next section), the
       option letters may appear between the "?" and the ":". Thus the two
       patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings.

       Note: There are other PCRE2-specific options that can be set by the
       application when the compiling function is called. The pattern can
       contain special leading sequences such as (*CRLF) to override what
       the application has set or what has been defaulted. Details are given
       in the section entitled "Newline sequences" above. There are also the
       (*UTF) and (*UCP) leading sequences that can be used to set UTF and
       Unicode property modes; they are equivalent to setting the PCRE2_UTF
       and PCRE2_UCP options, respectively. However, the application can set
       the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the
       use of the (*UTF) and (*UCP) sequences.


SUBPATTERNS

       Subpatterns are delimited by parentheses (round brackets), which can
       be nested. Turning part of a pattern into a subpattern does two
       things:

       1. It localizes a set of alternatives. For example, the pattern

         cat(aract|erpillar|)

       matches "cataract", "caterpillar", or "cat". Without the parentheses,
       it would match "cataract", "erpillar" or an empty string.

       2. It sets up the subpattern as a capturing subpattern.
       This means that, when the whole pattern matches, the portion of the
       subject string that matched the subpattern is passed back to the
       caller, separately from the portion that matched the whole pattern.
       (This applies only to the traditional matching function; the DFA
       matching function does not support capturing.)

       Opening parentheses are counted from left to right (starting from 1)
       to obtain numbers for the capturing subpatterns. For example, if the
       string "the red king" is matched against the pattern

         the ((red|white) (king|queen))

       the captured substrings are "red king", "red", and "king", and are
       numbered 1, 2, and 3, respectively.

       The fact that plain parentheses fulfil two functions is not always
       helpful. There are often times when a grouping subpattern is required
       without a capturing requirement. If an opening parenthesis is
       followed by a question mark and a colon, the subpattern does not do
       any capturing, and is not counted when computing the number of any
       subsequent capturing subpatterns. For example, if the string "the
       white queen" is matched against the pattern

         the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are
       numbered 1 and 2. The maximum number of capturing subpatterns is
       65535.

       As a convenient shorthand, if any option settings are required at the
       start of a non-capturing subpattern, the option letters may appear
       between the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches
       are tried from left to right, and options are not reset until the end
       of the subpattern is reached, an option setting in one branch does
       affect subsequent branches, so the above patterns match "SUNDAY" as
       well as "Saturday".
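
       The numbering and grouping rules in this section can be seen in a
       short Python sketch. It uses Python's re module as a stand-in, on the
       assumption that it numbers groups and handles (?: and (?i: in the
       same way as PCRE2 for these patterns. Only the (?i:...) spelling is
       shown, because recent Python versions reject an option setting such
       as (?i) that is not at the start of the pattern.

```python
import re

# Capturing parentheses are numbered by their opening parenthesis.
m1 = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing and is skipped in the count.
m2 = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

# An empty alternative lets the group match nothing at all.
m3 = re.fullmatch(r"cat(aract|erpillar|)", "cat")

# Option letters between "?" and ":" apply to every branch of the group.
m4 = re.fullmatch(r"(?i:saturday|sunday)", "SUNDAY")
```

       Calling m1.groups() and m2.groups() returns the captured substrings
       in number order, which makes the effect of (?: on the count easy to
       compare.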


DUPLICATE SUBPATTERN NUMBERS

       Perl 5.10 introduced a feature whereby each alternative in a
       subpattern uses the same numbers for its capturing parentheses. Such
       a subpattern starts with (?| and is itself a non-capturing
       subpattern. For example, consider this pattern:

         (?|(Sat)ur|(Sun))day

       Because the two alternatives are inside a (?| group, both sets of
       capturing parentheses are numbered one. Thus, when the pattern
       matches, you can look at captured substring number one, whichever
       alternative matched. This construct is useful when you want to
       capture part, but not all, of one of a number of alternatives. Inside
       a (?| group, parentheses are numbered as usual, but the number is
       reset at the start of each branch. The numbers of any capturing
       parentheses that follow the subpattern start after the highest number
       used in any branch. The following example is taken from the Perl
       documentation. The numbers underneath show in which buffer the
       captured content will be stored.

         # before  ---------------branch-reset----------- after
         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
         # 1            2         2  3        2     3     4

       A backreference to a numbered subpattern uses the most recent value
       that is set for that number by any subpattern. The following pattern
       matches "abcabc" or "defdef":

         /(?|(abc)|(def))\1/

       In contrast, a subroutine call to a numbered subpattern always refers
       to the first one in the pattern with the given number. The following
       pattern matches "abcabc" or "defabc":

         /(?|(abc)|(def))(?1)/

       A relative reference such as (?-1) is no different: it is just a
       convenient way of computing an absolute group number.

       If a condition test for a subpattern's having matched refers to a
       non-unique number, the test is true if any of the subpatterns of that
       number have matched.
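
       The numbering rule for (?| groups can be made concrete with a small
       Python sketch. It does not call PCRE2 (and Python's re module has no
       branch reset); it merely walks a pattern and assigns numbers
       according to the rule stated above. The function name number_groups
       is hypothetical, and the sketch is deliberately simplified: it
       assumes balanced parentheses and knows only plain "(", "(?:", and
       "(?|", so escapes, character classes, and other "(?" constructs are
       not handled.

```python
def number_groups(pattern):
    """Return the number given to each capturing '(' in pattern order."""
    numbers = []
    next_num = 1   # number that the next capturing group will receive
    stack = []     # one entry per currently open group
    i = 0
    while i < len(pattern):
        if pattern.startswith("(?|", i):
            # Branch reset: remember where numbering restarts per branch.
            stack.append({"reset": True, "start": next_num, "high": next_num})
            i += 3
        elif pattern.startswith("(?:", i):
            stack.append({"reset": False})
            i += 3
        elif pattern[i] == "(":
            numbers.append(next_num)
            next_num += 1
            stack.append({"reset": False})
            i += 1
        elif pattern[i] == "|":
            if stack and stack[-1]["reset"]:
                top = stack[-1]
                top["high"] = max(top["high"], next_num)
                next_num = top["start"]   # reset at the start of a branch
            i += 1
        elif pattern[i] == ")":
            top = stack.pop()
            if top["reset"]:
                # Continue after the highest number used in any branch.
                next_num = max(top["high"], next_num)
            i += 1
        else:
            i += 1
    return numbers
```

       Applied to the Perl example with its white space removed,
       "(a)(?|x(y)z|(p(q)r)|(t)u(v))(z)", the function returns the same
       sequence of numbers as the bottom row of the diagram.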

       An alternative approach to using this "branch reset" feature is to
       use duplicate named subpatterns, as described in the next section.


NAMED SUBPATTERNS

       Identifying capturing parentheses by number is simple, but it can be
       very hard to keep track of the numbers in complicated patterns.
       Furthermore, if an expression is modified, the numbers may change. To
       help with this difficulty, PCRE2 supports the naming of capturing
       subpatterns. This feature was not added to Perl until release 5.10.
       Python had the feature earlier, and PCRE1 introduced it at release
       4.0, using the Python syntax. PCRE2 supports both the Perl and the
       Python syntax.

       In PCRE2, a capturing subpattern can be named in one of three ways:
       (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in
       Python. Names consist of up to 32 alphanumeric characters and
       underscores, but must start with a non-digit. References to capturing
       parentheses from other parts of the pattern, such as backreferences,
       recursion, and conditions, can all be made by name as well as by
       number.

       Named capturing parentheses are allocated numbers as well as names,
       exactly as if the names were not present. In both PCRE2 and Perl,
       capturing subpatterns are primarily identified by numbers; any names
       are just aliases for these numbers. The PCRE2 API provides function
       calls for extracting the complete name-to-number translation table
       from a compiled pattern, as well as convenience functions for
       extracting captured substrings by name.

       Warning: When more than one subpattern has the same number, as
       described in the previous section, a name given to one of them
       applies to all of them. Perl allows identically numbered subpatterns
       to have different names.
       Consider this pattern, where there are two capturing subpatterns, both
       numbered 1:

         (?|(?<AA>aa)|(?<BB>bb))

       Perl allows this, with both names AA and BB as aliases of group 1.
       Thus, after a successful match, both names yield the same value (either
       "aa" or "bb").

       In an attempt to reduce confusion, PCRE2 does not allow the same group
       number to be associated with more than one name. The example above
       provokes a compile-time error. However, there is still scope for
       confusion. Consider this pattern:

         (?|(?<AA>aa)|(bb))

       Although the second subpattern numbered 1 is not explicitly named, the
       name AA is still an alias for subpattern 1. Whether the pattern matches
       "aa" or "bb", a reference by name to group AA yields the matched
       string.

       By default, a name must be unique within a pattern, except that
       duplicate names are permitted for subpatterns with the same number, for
       example:

         (?|(?<AA>aa)|(?<AA>bb))

       The duplicate name constraint can be disabled by setting the
       PCRE2_DUPNAMES option at compile time, or by the use of (?J) within the
       pattern. Duplicate names can be useful for patterns where only one
       instance of the named parentheses can match. Suppose you want to match
       the name of a weekday, either as a 3-letter abbreviation or as the full
       name, and in both cases you want to extract the abbreviation. This
       pattern (ignoring the line breaks) does the job:

         (?<DN>Mon|Fri|Sun)(?:day)?|
         (?<DN>Tue)(?:sday)?|
         (?<DN>Wed)(?:nesday)?|
         (?<DN>Thu)(?:rsday)?|
         (?<DN>Sat)(?:urday)?

       There are five capturing substrings, but only one is ever set after a
       match. The convenience functions for extracting the data by name
       return the substring for the first (and in this example, the only)
       subpattern of that name that matched.
       This saves searching to find which numbered subpattern it was. (An
       alternative way of solving this problem is to use a "branch reset"
       subpattern, as described in the previous section.)

       If you make a backreference to a non-unique named subpattern from
       elsewhere in the pattern, the subpatterns to which the name refers are
       checked in the order in which they appear in the overall pattern. The
       first one that is set is used for the reference. For example, this
       pattern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":

         (?:(?<n>foo)|(?<n>bar))\k<n>

       If you make a subroutine call to a non-unique named subpattern, the one
       that corresponds to the first occurrence of the name is used. In the
       absence of duplicate numbers this is the one with the lowest number.

       If you use a named reference in a condition test (see the section about
       conditions below), either to check whether a subpattern has matched, or
       to check for recursion, all subpatterns with the same name are tested.
       If the condition is true for any one of them, the overall condition is
       true. This is the same behaviour as testing by number. For further
       details of the interfaces for handling named subpatterns, see the
       pcre2api documentation.
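
       The way names act as aliases for numbers can be observed with Python's
       re module. It is used here purely as a stand-in: it is a different
       engine from PCRE2, it accepts only the (?P<name>...) spelling, and it
       does not permit duplicate names. The pattern and the variable names
       below are illustrative assumptions.

```python
import re

# Two named groups; they are still numbered 1 and 2 from left to right.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# The name-to-number translation table of the compiled pattern.
name_table = dict(date.groupindex)

match = date.fullmatch("2024-07")

# Extracting by name and by the aliased number gives the same substring.
by_name = match.group("year")
by_number = match.group(name_table["year"])

print(name_table, by_name, by_number)
```

       In a C program that uses PCRE2 itself, the corresponding table and
       by-name extraction functions are those described in the pcre2api
       documentation.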


REPETITION

       Repetition is specified by quantifiers, which can follow any of the
       following items:

         a literal data character
         the dot metacharacter
         the \C escape sequence
         the \X escape sequence
         the \R escape sequence
         an escape such as \d or \pL that matches a single character
         a character class
         a backreference
         a parenthesized subpattern (including most assertions)
         a subroutine call to a subpattern (recursive or otherwise)

       The general repetition quantifier specifies a minimum and maximum
       number of permitted matches, by giving the two numbers in curly
       brackets (braces), separated by a comma. The numbers must be less than
       65536, and the first must be less than or equal to the second. For
       example:

         z{2,4}

       matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
       special character. If the second number is omitted, but the comma is
       present, there is no upper limit; if the second number and the comma
       are both omitted, the quantifier specifies an exact number of required
       matches. Thus

         [aeiou]{3,}

       matches at least 3 successive vowels, but may match many more, whereas

         \d{8}

       matches exactly 8 digits. An opening curly bracket that appears in a
       position where a quantifier is not allowed, or one that does not match
       the syntax of a quantifier, is taken as a literal character. For
       example, {,6} is not a quantifier, but a literal string of four
       characters.

       In UTF modes, quantifiers apply to characters rather than to individual
       code units. Thus, for example, \x{100}{2} matches two characters, each
       of which is represented by a two-byte sequence in a UTF-8 string.
       Similarly, \X{3} matches three Unicode extended grapheme clusters, each
       of which may be several code units long (and they may be of different
       lengths).
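
       The three quantifier examples above behave the same way in Python's re
       module, which serves here only as a convenient stand-in for trying them
       out (it is not PCRE2, and it differs on details such as {,6}, which
       Python does treat as a quantifier). The helper name full is an
       assumption made for this sketch.

```python
import re


def full(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None


# z{2,4}: between two and four repetitions, tried on "z" up to "zzzzz".
z_results = [full(r"z{2,4}", "z" * n) for n in range(1, 6)]

# [aeiou]{3,}: at least three vowels, with no upper limit.
vowels = full(r"[aeiou]{3,}", "aeiouaeiou")

# \d{8}: exactly eight digits, no more and no fewer.
digits8 = full(r"\d{8}", "20240131")
digits7 = full(r"\d{8}", "2024013")

print(z_results, vowels, digits8, digits7)
```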

       The quantifier {0} is permitted, causing the expression to behave as if
       the previous item and the quantifier were not present. This may be
       useful for subpatterns that are referenced as subroutines from
       elsewhere in the pattern (but see also the section entitled "Defining
       subpatterns for use by reference only" below). Items other than
       subpatterns that have a {0} quantifier are omitted from the compiled
       pattern.

       For convenience, the three most common quantifiers have
       single-character abbreviations:

         *    is equivalent to {0,}
         +    is equivalent to {1,}
         ?    is equivalent to {0,1}

       It is possible to construct infinite loops by following a subpattern
       that can match no characters with a quantifier that has no upper limit,
       for example:

         (a?)*

       Earlier versions of Perl and PCRE1 used to give an error at compile
       time for such patterns. However, because there are cases where this can
       be useful, such patterns are now accepted, but if any repetition of the
       subpattern does in fact match no characters, the loop is forcibly
       broken.

       By default, the quantifiers are "greedy", that is, they match as much
       as possible (up to the maximum number of permitted times), without
       causing the rest of the pattern to fail. The classic example of where
       this gives problems is in trying to match comments in C programs. These
       appear between /* and */ and within the comment, individual * and /
       characters may appear. An attempt to match C comments by applying the
       pattern

         /\*.*\*/

       to the string

         /* first comment */ not comment /* second comment */

       fails, because it matches the entire string owing to the greediness of
       the .* item.
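
       The effect is easy to reproduce. The sketch below uses Python's re
       module as a stand-in for PCRE2 (the two engines agree on this
       behaviour) and also runs the non-greedy form that the next paragraph
       describes, for comparison. The variable names are illustrative.

```python
import re

text = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/", so the match is the whole string.
greedy = re.search(r"/\*.*\*/", text).group()

# Non-greedy .*? stops at the first "*/" after each "/*".
lazy = re.findall(r"/\*.*?\*/", text)

print(greedy == text)
print(lazy)
```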

       If a quantifier is followed by a question mark, it ceases to be greedy,
       and instead matches the minimum number of times possible, so the
       pattern

         /\*.*?\*/

       does the right thing with the C comments. The meaning of the various
       quantifiers is not otherwise changed, just the preferred number of
       matches. Do not confuse this use of question mark with its use as a
       quantifier in its own right. Because it has two uses, it can sometimes
       appear doubled, as in

         \d??\d

       which matches one digit by preference, but can match two if that is the
       only way the rest of the pattern matches.

       If the PCRE2_UNGREEDY option is set (an option that is not available in
       Perl), the quantifiers are not greedy by default, but individual ones
       can be made greedy by following them with a question mark. In other
       words, it inverts the default behaviour.

       When a parenthesized subpattern is quantified with a minimum repeat
       count that is greater than 1 or with a limited maximum, more memory is
       required for the compiled pattern, in proportion to the size of the
       minimum or maximum.

       If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
       (equivalent to Perl's /s) is set, thus allowing the dot to match
       newlines, the pattern is implicitly anchored, because whatever follows
       will be tried against every character position in the subject string,
       so there is no point in retrying the overall match at any position
       after the first. PCRE2 normally treats such a pattern as though it were
       preceded by \A.

       In cases where it is known that the subject string contains no
       newlines, it is worth setting PCRE2_DOTALL in order to obtain this
       optimization, or alternatively, using ^ to indicate anchoring
       explicitly.

       However, there are some cases where the optimization cannot be used.
       When .* is inside capturing parentheses that are the subject of a
       backreference elsewhere in the pattern, a match at the start may fail
       where a later one succeeds. Consider, for example:

         (.*)abc\1

       If the subject is "xyz123abc123" the match point is the fourth
       character. For this reason, such a pattern is not implicitly anchored.

       Another case where implicit anchoring is not applied is when the
       leading .* is inside an atomic group. Once again, a match at the start
       may fail where a later one succeeds. Consider this pattern:

         (?>.*?a)b

       It matches "ab" in the subject "aab". The use of the backtracking
       control verbs (*PRUNE) and (*SKIP) also disables this optimization, and
       there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.

       When a capturing subpattern is repeated, the value captured is the
       substring that matched the final iteration. For example, after

         (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee" the value of the captured substring
       is "tweedledee". However, if there are nested capturing subpatterns,
       the corresponding captured values may have been set in previous
       iterations. For example, after

         (a|(b))+

       matches "aba" the value of the second captured substring is "b".


ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

       With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
       repetition, failure of what follows normally causes the repeated item
       to be re-evaluated to see if a different number of repeats allows the
       rest of the pattern to match. Sometimes it is useful to prevent this,
       either to change the nature of the match, or to cause it to fail
       earlier than it otherwise might, when the author of the pattern knows
       there is no point in carrying on.
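
       The re-evaluation described here can be made visible by capturing how
       many digits a repeat ends up with. The sketch uses Python's re module
       as a stand-in for PCRE2; it shows only the ordinary backtracking
       behaviour, not the atomic notation, and the patterns and variable names
       are illustrative assumptions.

```python
import re

subject = "123456"

# Greedy: \d+ first takes all six digits, then gives two back so that
# the following \d\d can match.
greedy = re.fullmatch(r"(\d+)(\d\d)", subject).groups()

# Lazy: \d+? first takes one digit and is extended step by step until
# \d\d can match the last two digits of the subject.
lazy = re.fullmatch(r"(\d+?)(\d\d)", subject).groups()

# When nothing can follow, every shorter repeat is tried before the
# match attempt is finally abandoned.
no_match = re.search(r"\d+foo", "123456bar")

print(greedy, lazy, no_match)
```

       Both repeats settle on the same split of the subject, reached from
       opposite directions.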

       Consider, for example, the pattern \d+foo when applied to the subject
       line

         123456bar

       After matching all 6 digits and then failing to match "foo", the normal
       action of the matcher is to try again with only 5 digits matching the
       \d+ item, and then with 4, and so on, before ultimately failing.
       "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
       the means for specifying that once a subpattern has matched, it is not
       to be re-evaluated in this way.

       If we use atomic grouping for the previous example, the matcher gives
       up immediately on failing to match "foo" the first time. The notation
       is a kind of special parenthesis, starting with (?> as in this example:

         (?>\d+)foo

       This kind of parenthesis "locks up" the part of the pattern it
       contains once it has matched, and a failure further into the pattern is
       prevented from backtracking into it. Backtracking past it to previous
       items, however, works as normal.

       An alternative description is that a subpattern of this type matches
       exactly the string of characters that an identical standalone pattern
       would match, if anchored at the current point in the subject string.

       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
       such as the above example can be thought of as a maximizing repeat that
       must swallow everything it can. So, while both \d+ and \d+? are
       prepared to adjust the number of digits they match in order to make the
       rest of the pattern match, (?>\d+) can only match an entire sequence of
       digits.

       Atomic groups in general can of course contain arbitrarily complicated
       subpatterns, and can be nested. However, when the subpattern for an
       atomic group is just a single repeated item, as in the example above, a
       simpler notation, called a "possessive quantifier" can be used.
       This consists of an additional + character following a quantifier.
       Using this notation, the previous example can be rewritten as

         \d++foo

       Note that a possessive quantifier can be used with an entire group, for
       example:

         (abc|xyz){2,3}+

       Possessive quantifiers are always greedy; the setting of the
       PCRE2_UNGREEDY option is ignored. They are a convenient notation for
       the simpler forms of atomic group. However, there is no difference in
       the meaning of a possessive quantifier and the equivalent atomic group,
       though there may be a performance difference; possessive quantifiers
       should be slightly faster.

       The possessive quantifier syntax is an extension to the Perl 5.8
       syntax. Jeffrey Friedl originated the idea (and the name) in the first
       edition of his book. Mike McCloskey liked it, so implemented it when he
       built Sun's Java package, and PCRE1 copied it from there. It ultimately
       found its way into Perl at release 5.10.

       PCRE2 has an optimization that automatically "possessifies" certain
       simple pattern constructs. For example, the sequence A+B is treated as
       A++B because there is no point in backtracking into a sequence of A's
       when B must follow. This feature can be disabled by the
       PCRE2_NO_AUTO_POSSESS option, or by starting the pattern with
       (*NO_AUTO_POSSESS).

       When a pattern contains an unlimited repeat inside a subpattern that
       can itself be repeated an unlimited number of times, the use of an
       atomic group is the only way to avoid some failing matches taking a
       very long time indeed. The pattern

         (\D+|<\d+>)*[!?]

       matches an unlimited number of substrings that either consist of
       non-digits, or digits enclosed in <>, followed by either ! or ?. When
       it matches, it runs quickly.
       However, if it is applied to

         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it takes a long time before reporting failure. This is because the
       string can be divided between the internal \D+ repeat and the external
       * repeat in a large number of ways, and all have to be tried. (The
       example uses [!?] rather than a single character at the end, because
       both PCRE2 and Perl have an optimization that allows for fast failure
       when a single character is used. They remember the last single
       character that is required for a match, and fail early if it is not
       present in the string.) If the pattern is changed so that it uses an
       atomic group, like this:

         ((?>\D+)|<\d+>)*[!?]

       sequences of non-digits cannot be broken, and failure happens quickly.


BACKREFERENCES

       Outside a character class, a backslash followed by a digit greater than
       0 (and possibly further digits) is a backreference to a capturing
       subpattern earlier (that is, to its left) in the pattern, provided
       there have been that many previous capturing left parentheses.

       However, if the decimal number following the backslash is less than 8,
       it is always taken as a backreference, and causes an error only if
       there are not that many capturing left parentheses in the entire
       pattern. In other words, the parentheses that are referenced need not
       be to the left of the reference for numbers less than 8. A "forward
       backreference" of this type can make sense when a repetition is
       involved and the subpattern to the right has participated in an earlier
       iteration.

       It is not possible to have a numerical "forward backreference" to a
       subpattern whose number is 8 or more using this syntax because a
       sequence such as \50 is interpreted as a character defined in octal.
       See the subsection entitled "Non-printing characters" above for further
       details of the handling of digits following a backslash. There is no
       such problem when named parentheses are used. A backreference to any
       subpattern is possible using named parentheses (see below).

       Another way of avoiding the ambiguity inherent in the use of digits
       following a backslash is to use the \g escape sequence. This escape
       must be followed by a signed or unsigned number, optionally enclosed in
       braces. These examples are all identical:

         (ring), \1
         (ring), \g1
         (ring), \g{1}

       An unsigned number specifies an absolute reference without the
       ambiguity that is present in the older syntax. It is also useful when
       literal digits follow the reference. A signed number is a relative
       reference. Consider this example:

         (abc(def)ghi)\g{-1}

       The sequence \g{-1} is a reference to the most recently started
       capturing subpattern before \g, that is, it is equivalent to \2 in this
       example. Similarly, \g{-2} would be equivalent to \1. The use of
       relative references can be helpful in long patterns, and also in
       patterns that are created by joining together fragments that contain
       references within themselves.

       The sequence \g{+1} is a reference to the next capturing subpattern.
       This kind of forward reference can be useful in patterns that repeat.
       Perl does not support the use of + in this way.

       A backreference matches whatever actually matched the capturing
       subpattern in the current subject string, rather than anything matching
       the subpattern itself (see "Subpatterns as subroutines" below for a way
       of doing that). So the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility", but
       not "sense and responsibility".
       If caseful matching is in force at the time of the backreference, the
       case of letters is relevant. For example,

         ((?i)rah)\s+\1

       matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
       original capturing subpattern is matched caselessly.

       There are several different ways of writing backreferences to named
       subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
       \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
       unified backreference syntax, in which \g can be used for both numeric
       and named references, is also supported. We could rewrite the above
       example in any of the following ways:

         (?<p1>(?i)rah)\s+\k<p1>
         (?'p1'(?i)rah)\s+\k{p1}
         (?P<p1>(?i)rah)\s+(?P=p1)
         (?<p1>(?i)rah)\s+\g{p1}

       A subpattern that is referenced by name may appear in the pattern
       before or after the reference.

       There may be more than one backreference to the same subpattern. If a
       subpattern has not actually been used in a particular match, any
       backreferences to it always fail by default. For example, the pattern

         (a|(bc))\2

       always fails if it starts to match "a" rather than "bc". However, if
       the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a
       backreference to an unset value matches an empty string.

       Because there may be many capturing parentheses in a pattern, all
       digits following a backslash are taken as part of a potential
       backreference number. If the pattern continues with a digit character,
       some delimiter must be used to terminate the backreference. If the
       PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, this can be white
       space. Otherwise, the \g{ syntax or an empty comment (see "Comments"
       below) can be used.
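
       Several of the backreference rules above can be checked with Python's
       re module, used here only as a stand-in for PCRE2. Python supports \1
       and the (?P=name) form but not \g{...} or \k<...> inside a pattern, so
       only the first two are shown. The group name stem and the helper
       matches are assumptions made for this sketch.

```python
import re

# A numbered backreference repeats what group 1 actually matched.
numbered = re.compile(r"(sens|respons)e and \1ibility")

# The same pattern with a named group and the Python-style reference.
named = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

# Group 2 is unset when the first alternative matches, so \2 fails.
unset = re.compile(r"(a|(bc))\2")


def matches(regex, text):
    """True if the whole of text is matched by regex."""
    return regex.fullmatch(text) is not None


for regex in (numbered, named):
    print(matches(regex, "sense and sensibility"),
          matches(regex, "response and responsibility"),
          matches(regex, "sense and responsibility"))

print(matches(unset, "bcbc"), matches(unset, "aa"))
```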

   Recursive backreferences

       A backreference that occurs inside the parentheses to which it refers
       fails when the subpattern is first used, so, for example, (a\1) never
       matches. However, such references can be useful inside repeated
       subpatterns. For example, the pattern

         (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc. At each
       iteration of the subpattern, the backreference matches the character
       string corresponding to the previous iteration. In order for this to
       work, the pattern must be such that the first iteration does not need
       to match the backreference. This can be done using alternation, as in
       the example above, or by a quantifier with a minimum of zero.

       Backreferences of this type cause the group that they reference to be
       treated as an atomic group. Once the whole group has been matched, a
       subsequent matching failure cannot cause backtracking into the middle
       of the group.


ASSERTIONS

       An assertion is a test on the characters following or preceding the
       current matching point that does not consume any characters. The simple
       assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
       above.

       More complicated assertions are coded as subpatterns. There are two
       kinds: those that look ahead of the current position in the subject
       string, and those that look behind it, and in each case an assertion
       may be positive (must succeed for matching to continue) or negative
       (must not succeed for matching to continue). An assertion subpattern is
       matched in the normal way, except that, when matching continues after a
       successful assertion, the matching position in the subject string is as
       it was before the assertion was processed.

       Assertion subpatterns are not capturing subpatterns.
       If an assertion contains capturing subpatterns within it, these are
       counted for the purposes of numbering the capturing subpatterns in the
       whole pattern. Within each branch of an assertion, locally captured
       substrings may be referenced in the usual way. For example, a sequence
       such as (.)\g{-1} can be used to check that two adjacent characters are
       the same.

       When a branch within an assertion fails to match, any substrings that
       were captured are discarded (as happens with any pattern branch that
       fails to match). A negative assertion succeeds only when all its
       branches fail to match; this means that no captured substrings are ever
       retained after a successful negative assertion. When an assertion
       contains a matching branch, what happens depends on the type of
       assertion.

       For a positive assertion, internally captured substrings in the
       successful branch are retained, and matching continues with the next
       pattern item after the assertion. For a negative assertion, a matching
       branch means that the assertion has failed. If the assertion is being
       used as a condition in a conditional subpattern (see below), captured
       substrings are retained, because matching continues with the "no"
       branch of the condition. For other failing negative assertions, control
       passes to the previous backtracking point, thus discarding any captured
       strings within the assertion.

       For compatibility with Perl, most assertion subpatterns may be
       repeated; though it makes no sense to assert the same thing several
       times, the side effect of capturing parentheses may occasionally be
       useful. However, an assertion that forms the condition for a
       conditional subpattern may not be quantified. In practice, for other
       assertions, there are only three cases:

       (1) If the quantifier is {0}, the assertion is never obeyed during
       matching.
       However, it may contain internal capturing parenthesized groups that
       are called from elsewhere via the subroutine mechanism.

       (2) If the quantifier is {0,n} where n is greater than zero, it is
       treated as if it were {0,1}. At run time, the rest of the pattern match
       is tried with and without the assertion, the order depending on the
       greediness of the quantifier.

       (3) If the minimum repetition is greater than zero, the quantifier is
       ignored. The assertion is obeyed just once when encountered during
       matching.

   Lookahead assertions

       Lookahead assertions start with (?= for positive assertions and (?! for
       negative assertions. For example,

         \w+(?=;)

       matches a word followed by a semicolon, but does not include the
       semicolon in the match, and

         foo(?!bar)

       matches any occurrence of "foo" that is not followed by "bar". Note
       that the apparently similar pattern

         (?!foo)bar

       does not find an occurrence of "bar" that is preceded by something
       other than "foo"; it finds any occurrence of "bar" whatsoever, because
       the assertion (?!foo) is always true when the next three characters are
       "bar". A lookbehind assertion is needed to achieve the other effect.

       If you want to force a matching failure at some point in a pattern, the
       most convenient way to do it is with (?!) because an empty string
       always matches, so an assertion that requires there not to be an empty
       string must always fail. The backtracking control verb (*FAIL) or (*F)
       is a synonym for (?!).

   Lookbehind assertions

       Lookbehind assertions start with (?<= for positive assertions and (?<!
       for negative assertions. For example,

         (?<!foo)bar

       does find an occurrence of "bar" that is not preceded by "foo".
       The contents of a lookbehind assertion are restricted such that all the
       strings it matches must have a fixed length. However, if there are
       several top-level alternatives, they do not all have to have the same
       fixed length. Thus

         (?<=bullock|donkey)

       is permitted, but

         (?<!dogs?|cats?)

       causes an error at compile time. Branches that match different length
       strings are permitted only at the top level of a lookbehind assertion.
       This is an extension compared with Perl, which requires all branches to
       match the same length of string. An assertion such as

         (?<=ab(c|de))

       is not permitted, because its single top-level branch can match two
       different lengths, but it is acceptable to PCRE2 if rewritten to use
       two top-level branches:

         (?<=abc|abde)

       In some cases, the escape sequence \K (see above) can be used instead
       of a lookbehind assertion to get round the fixed-length restriction.

       The implementation of lookbehind assertions is, for each alternative,
       to temporarily move the current position back by the fixed length and
       then try to match. If there are insufficient characters before the
       current position, the assertion fails.

       In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
       matches a single code unit even in a UTF mode) to appear in lookbehind
       assertions, because it makes it impossible to calculate the length of
       the lookbehind. The \X and \R escapes, which can match different
       numbers of code units, are never permitted in lookbehinds.

       "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
       lookbehinds, as long as the subpattern matches a fixed-length string.
       However, recursion, that is, a "subroutine" call into a group that is
       already active, is not supported.

       Perl does not support backreferences in lookbehinds.
       PCRE2 does support them, but only if certain conditions are met. The
       PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no use
       of (?| in the pattern (it creates duplicate subpattern numbers), and if
       the backreference is by name, the name must be unique. Of course, the
       referenced subpattern must itself be of fixed length. The following
       pattern matches words containing at least two characters that begin and
       end with the same character:

         \b(\w)\w++(?<=\1)

       Possessive quantifiers can be used in conjunction with lookbehind
       assertions to specify efficient matching of fixed-length strings at the
       end of subject strings. Consider a simple pattern such as

         abcd$

       when applied to a long string that does not match. Because matching
       proceeds from left to right, PCRE2 will look for each "a" in the
       subject and then see if what follows matches the rest of the pattern.
       If the pattern is specified as

         ^.*abcd$

       the initial .* matches the entire string at first, but when this fails
       (because there is no following "a"), it backtracks to match all but the
       last character, then all but the last two characters, and so on. Once
       again the search for "a" covers the entire string, from right to left,
       so we are no better off. However, if the pattern is written as

         ^.*+(?<=abcd)

       there can be no backtracking for the .*+ item because of the possessive
       quantifier; it can match only the entire string. The subsequent
       lookbehind assertion does a single test on the last four characters. If
       it fails, the match fails immediately. For long strings, this approach
       makes a significant difference to the processing time.

   Using multiple assertions

       Several assertions (of any sort) may occur in succession.
       For example,

         (?<=\d{3})(?<!999)foo

       matches "foo" preceded by three digits that are not "999". Notice that
       each of the assertions is applied independently at the same point in
       the subject string. First there is a check that the previous three
       characters are all digits, and then there is a check that the same
       three characters are not "999". This pattern does not match "foo"
       preceded by six characters, the first three of which are digits and the
       last three of which are not "999". For example, it doesn't match
       "123abcfoo". A pattern to do that is

         (?<=\d{3}...)(?<!999)foo

       This time the first assertion looks at the preceding six characters,
       checking that the first three are digits, and then the second assertion
       checks that the preceding three characters are not "999".

       Assertions can be nested in any combination. For example,

         (?<=(?<!foo)bar)baz

       matches an occurrence of "baz" that is preceded by "bar" which in turn
       is not preceded by "foo", while

         (?<=\d{3}(?!999)...)foo

       is another pattern that matches "foo" preceded by three digits and any
       three characters that are not "999".


CONDITIONAL SUBPATTERNS

       It is possible to cause the matching process to obey a subpattern
       conditionally or to choose between two alternative subpatterns,
       depending on the result of an assertion, or whether a specific
       capturing subpattern has already been matched. The two possible forms
       of conditional subpattern are:

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

       If the condition is satisfied, the yes-pattern is used; otherwise the
       no-pattern (if present) is used. An absent no-pattern is equivalent to
       an empty string (it always matches). If there are more than two
       alternatives in the subpattern, a compile-time error occurs.
       Each of the two alternatives may itself contain nested subpatterns of
       any form, including conditional subpatterns; the restriction to two
       alternatives applies only at the level of the condition. This pattern
       fragment is an example where the alternatives are complex:

         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )

       There are five kinds of condition: references to subpatterns,
       references to recursion, two pseudo-conditions called DEFINE and
       VERSION, and assertions.

   Checking for a used subpattern by number

       If the text between the parentheses consists of a sequence of digits,
       the condition is true if a capturing subpattern of that number has
       previously matched. If there is more than one capturing subpattern with
       the same number (see the earlier section about duplicate subpattern
       numbers), the condition is true if any of them have matched. An
       alternative notation is to precede the digits with a plus or minus
       sign. In this case, the subpattern number is relative rather than
       absolute. The most recently opened parentheses can be referenced by
       (?(-1), the next most recent by (?(-2), and so on. Inside loops it can
       also make sense to refer to subsequent groups. The next parentheses to
       be opened can be referenced as (?(+1), and so on. (The value zero in
       any of these forms is not used; it provokes a compile-time error.)

       Consider the following pattern, which contains non-significant white
       space to make it more readable (assume the PCRE2_EXTENDED option) and
       to divide it into three parts for ease of discussion:

         ( \( )? [^()]+ (?(1) \) )

       The first part matches an optional opening parenthesis, and if that
       character is present, sets it as the first captured substring. The
       second part matches one or more characters that are not parentheses.
       The third part is a conditional subpattern that tests whether or not
       the first set of parentheses matched. If they did, that is, if the
       subject started with an opening parenthesis, the condition is true, and
       so the yes-pattern is executed and a closing parenthesis is required.
       Otherwise, since no-pattern is not present, the subpattern matches
       nothing. In other words, this pattern matches a sequence of
       non-parentheses, optionally enclosed in parentheses.

       If you were embedding this pattern in a larger one, you could use a
       relative reference:

         ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...

       This makes the fragment independent of the parentheses in the larger
       pattern.

   Checking for a used subpattern by name

       Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
       used subpattern by name. For compatibility with earlier versions of
       PCRE1, which had this facility before Perl, the syntax (?(name)...) is
       also recognized. Note, however, that undelimited names consisting of
       the letter R followed by digits are ambiguous (see the following
       section).

       Rewriting the above example to use a named subpattern gives this:

         (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )

       If the name used in a condition of this kind is a duplicate, the test
       is applied to all subpatterns of the same name, and is true if any one
       of them has matched.

   Checking for pattern recursion

       "Recursion" in this sense refers to any subroutine-like call from one
       part of the pattern to another, whether or not it is actually
       recursive. See the sections entitled "Recursive patterns" and
       "Subpatterns as subroutines" below for details of recursion and
       subpattern calls.
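
       As an illustrative aside on the parenthesis example from the two
       preceding subsections (checking for a used subpattern by number and by
       name), the sketch below uses Python's standard re module rather than
       PCRE2. That substitution is an assumption made for convenience: Python
       accepts the same (?(1)...) conditional for this simple case, but it
       spells the named forms (?P<OPEN>...) and (?(OPEN)...). The helper name
       whole_match is hypothetical.

```python
import re

# Python's re module stands in for PCRE2 here (an assumption for
# illustration only). re.VERBOSE plays the role of PCRE2_EXTENDED.
BY_NUMBER = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

# Python's spelling of the named group and of the test on that name.
BY_NAME = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)


def whole_match(regex, subject):
    """Return True if regex matches the entire subject string."""
    return regex.fullmatch(subject) is not None


if __name__ == "__main__":
    for text in ["(abc)", "abc", "(abc", "abc)"]:
        print(text, whole_match(BY_NUMBER, text), whole_match(BY_NAME, text))
```

       Under these assumptions, both forms accept "(abc)" and "abc" as whole
       strings and reject "(abc" and "abc)", because the closing parenthesis
       is required exactly when the opening one was captured.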

       If a condition is the string (R), and there is no subpattern with the
       name R, the condition is true if matching is currently in a recursion
       or subroutine call to the whole pattern or any subpattern. If digits
       follow the letter R, and there is no subpattern with that name, the
       condition is true if the most recent call is into a subpattern with the
       given number, which must exist somewhere in the overall pattern. This
       is a contrived example that is equivalent to a+b:

         ((?(R1)a+|(?1)b))

       However, in both cases, if there is a subpattern with a matching name,
       the condition tests for its being set, as described in the section
       above, instead of testing for recursion. For example, creating a group
       with the name R1 by adding (?<R1>) to the above pattern completely
       changes its meaning.

       If a name preceded by ampersand follows the letter R, for example:

         (?(R&name)...)

       the condition is true if the most recent recursion is into a subpattern
       of that name (which must exist within the pattern).

       This condition does not check the entire recursion stack. It tests only
       the current level. If the name used in a condition of this kind is a
       duplicate, the test is applied to all subpatterns of the same name, and
       is true if any one of them is the most recent recursion.

       At "top level", all these recursion test conditions are false.

   Defining subpatterns for use by reference only

       If the condition is the string (DEFINE), the condition is always false,
       even if there is a group with the name DEFINE. In this case, there may
       be only one alternative in the subpattern. It is always skipped if
       control reaches this point in the pattern; the idea of DEFINE is that
       it can be used to define subroutines that can be referenced from
       elsewhere. (The use of subroutines is described below.)
       For example, a pattern to match an IPv4 address such as
       "192.168.23.245" could be written like this (ignore white space and
       line breaks):

         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
         \b (?&byte) (\.(?&byte)){3} \b

       The first part of the pattern is a DEFINE group inside which another
       group named "byte" is defined. This matches an individual component of
       an IPv4 address (a number less than 256). When matching takes place,
       this part of the pattern is skipped because DEFINE acts like a false
       condition. The rest of the pattern uses references to the named group
       to match the four dot-separated components of an IPv4 address,
       insisting on a word boundary at each end.

   Checking the PCRE2 version

       Programs that link with a PCRE2 library can check the version by
       calling pcre2_config() with appropriate arguments. Users of
       applications that do not have access to the underlying code cannot do
       this. A special "condition" called VERSION exists to allow such users
       to discover which version of PCRE2 they are dealing with by using this
       condition to match a string such as "yesno". VERSION must be followed
       either by "=" or ">=" and a version number. For example:

         (?(VERSION>=10.4)yes|no)

       This pattern matches "yes" if the PCRE2 version is greater than or
       equal to 10.4, or "no" otherwise. The fractional part of the version
       number may not contain more than two digits.

   Assertion conditions

       If the condition is not in any of the above formats, it must be an
       assertion. This may be a positive or negative lookahead or lookbehind
       assertion.
       Consider this pattern, again containing non-significant white space,
       and with the two alternatives on the second line:

         (?(?=[^a-z]*[a-z])
         \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

       The condition is a positive lookahead assertion that matches an
       optional sequence of non-letters followed by a letter. In other words,
       it tests for the presence of at least one letter in the subject. If a
       letter is found, the subject is matched against the first alternative;
       otherwise it is matched against the second. This pattern matches
       strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
       letters and dd are digits.

       When an assertion that is a condition contains capturing subpatterns,
       any capturing that occurs in a matching branch is retained afterwards,
       for both positive and negative assertions, because matching always
       continues after the assertion, whether it succeeds or fails. (Compare
       non-conditional assertions, when captures are retained only for
       positive assertions that succeed.)


COMMENTS

       There are two ways of including comments in patterns that are processed
       by PCRE2. In both cases, the start of the comment must not be in a
       character class, nor in the middle of any other sequence of related
       characters such as (?: or a subpattern name or number. The characters
       that make up a comment play no part in the pattern matching.

       The sequence (?# marks the start of a comment that continues up to the
       next closing parenthesis. Nested parentheses are not permitted. If the
       PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped #
       character also introduces a comment, which in this case continues to
       immediately after the next newline character or character sequence in
       the pattern.
       Which characters are interpreted as newlines is controlled by an option
       passed to the compiling function or by a special sequence at the start
       of the pattern, as described in the section entitled "Newline
       conventions" above. Note that the end of this type of comment is a
       literal newline sequence in the pattern; escape sequences that happen
       to represent a newline do not count. For example, consider this pattern
       when PCRE2_EXTENDED is set, and the default newline convention (a
       single linefeed character) is in force:

         abc #comment \n still comment

       On encountering the # character, pcre2_compile() skips along, looking
       for a newline in the pattern. The sequence \n is still literal at this
       stage, so it does not terminate the comment. Only an actual character
       with the code value 0x0a (the default newline) does so.


RECURSIVE PATTERNS

       Consider the problem of matching a string in parentheses, allowing for
       unlimited nested parentheses. Without the use of recursion, the best
       that can be done is to use a pattern that matches up to some fixed
       depth of nesting. It is not possible to handle an arbitrary nesting
       depth.

       For some time, Perl has provided a facility that allows regular
       expressions to recurse (amongst other things). It does this by
       interpolating Perl code in the expression at run time, and the code can
       refer to the expression itself. A Perl pattern using code interpolation
       to solve the parentheses problem can be created like this:

         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       The (?p{...}) item interpolates Perl code at run time, and in this case
       refers recursively to the pattern in which it appears.

       Obviously, PCRE2 cannot support the interpolation of Perl code.
       Instead, it supports special syntax for recursion of the entire
       pattern, and also for individual subpattern recursion. After its
       introduction in PCRE1 and Python, this kind of recursion was
       subsequently introduced into Perl at release 5.10.

       A special item that consists of (? followed by a number greater than
       zero and a closing parenthesis is a recursive subroutine call of the
       subpattern of the given number, provided that it occurs inside that
       subpattern. (If not, it is a non-recursive subroutine call, which is
       described in the next section.) The special item (?R) or (?0) is a
       recursive call of the entire regular expression.

       This PCRE2 pattern solves the nested parentheses problem (assume the
       PCRE2_EXTENDED option is set so that white space is ignored):

         \( ( [^()]++ | (?R) )* \)

       First it matches an opening parenthesis. Then it matches any number of
       substrings which can either be a sequence of non-parentheses, or a
       recursive match of the pattern itself (that is, a correctly
       parenthesized substring). Finally there is a closing parenthesis. Note
       the use of a possessive quantifier to avoid backtracking into sequences
       of non-parentheses.

       If this were part of a larger pattern, you would not want to recurse
       the entire pattern, so instead you could use this:

         ( \( ( [^()]++ | (?1) )* \) )

       We have put the pattern into parentheses, and caused the recursion to
       refer to them instead of the whole pattern.

       In a larger pattern, keeping track of parenthesis numbers can be
       tricky. This is made easier by the use of relative references. Instead
       of (?1) in the pattern above you can write (?-2) to refer to the second
       most recently opened parentheses preceding the recursion. In other
       words, a negative number counts capturing parentheses leftwards from
       the point at which it is encountered.
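
       To make concrete what the recursive pattern \( ( [^()]++ | (?R) )* \)
       accepts, here is a minimal hand-written equivalent in Python. It is
       only a sketch and does not use PCRE2: the function name match_parens
       and its return convention are hypothetical, and Python is used because
       its standard re module has no recursion syntax of its own. Each step
       of the function mirrors one item of the pattern.

```python
def match_parens(subject, pos=0):
    # Hand-written mirror of the recursive pattern, anchored at pos.
    # Returns the index just past the matched text, or None on failure.
    # Hypothetical helper for illustration; this is not PCRE2 code.

    # Opening parenthesis: the leading \( item.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1

    # The repeated group ( [^()]++ | (?R) )*
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            # Counterpart of (?R): match a nested, balanced substring.
            pos = match_parens(subject, pos)
            if pos is None:
                return None
        else:
            # Counterpart of [^()]++ : consume a run of non-parentheses
            # and never give any of it back.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1

    # Closing parenthesis: the trailing \) item.
    if pos < len(subject):
        return pos + 1
    return None


if __name__ == "__main__":
    for text in ["(ab(cd)ef)", "()", "(a(b)", "abc"]:
        print(repr(text), match_parens(text))
```

       With this sketch, "(ab(cd)ef)" and "()" are matched in full, while
       "(a(b)" and "abc" yield None. Because the two alternatives begin with
       different characters, the loop never needs to backtrack, which is the
       effect the possessive quantifier achieves in the pattern.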

       Be aware however, that if duplicate subpattern numbers are in use,
       relative references refer to the earliest subpattern with the
       appropriate number. Consider, for example:

         (?|(a)|(b)) (c) (?-2)

       The first two capturing groups (a) and (b) are both numbered 1, and
       group (c) is number 2. When the reference (?-2) is encountered, the
       second most recently opened parentheses has the number 1, but it is the
       first such group (the (a) group) to which the recursion refers. This
       would be the same if an absolute reference (?1) was used. In other
       words, relative references are just a shorthand for computing a group
       number.

       It is also possible to refer to subsequently opened parentheses, by
       writing references such as (?+2). However, these cannot be recursive
       because the reference is not inside the parentheses that are
       referenced. They are always non-recursive subroutine calls, as
       described in the next section.

       An alternative approach is to use named parentheses. The Perl syntax
       for this is (?&name); PCRE1's earlier syntax (?P>name) is also
       supported. We could rewrite the above example as follows:

         (?<pn> \( ( [^()]++ | (?&pn) )* \) )

       If there is more than one subpattern with the same name, the earliest
       one is used.

       The example pattern that we have been looking at contains nested
       unlimited repeats, and so the use of a possessive quantifier for
       matching strings of non-parentheses is important when applying the
       pattern to strings that do not match. For example, when this pattern is
       applied to

         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

       it yields "no match" quickly.
       However, if a possessive quantifier is not used, the match runs for a
       very long time indeed because there are so many different ways the +
       and * repeats can carve up the subject, and all have to be tested
       before failure can be reported.

       At the end of a match, the values of capturing parentheses are those
       from the outermost level. If you want to obtain intermediate values, a
       callout function can be used (see below and the pcre2callout
       documentation). If the pattern above is matched against

         (ab(cd)ef)

       the value for the inner capturing parentheses (numbered 2) is "ef",
       which is the last value taken on at the top level. If a capturing
       subpattern is not matched at the top level, its final captured value is
       unset, even if it was (temporarily) set at a deeper level during the
       matching process.

       Do not confuse the (?R) item with the condition (R), which tests for
       recursion. Consider this pattern, which matches text in angle
       brackets, allowing for arbitrary nesting. Only digits are allowed in
       nested brackets (that is, when recursing), whereas any characters are
       permitted at the outer level.

         < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

       In this pattern, (?(R) is the start of a conditional subpattern, with
       two different alternatives for the recursive and non-recursive cases.
       The (?R) item is the actual recursive call.

   Differences in recursion processing between PCRE2 and Perl

       Some former differences between PCRE2 and Perl no longer exist.

       Before release 10.30, recursion processing in PCRE2 differed from Perl
       in that a recursive subpattern call was always treated as an atomic
       group. That is, once it had matched some of the subject string, it was
       never re-entered, even if it contained untried alternatives and there
       was a subsequent matching failure.
       (Historical note: PCRE implemented recursion before Perl did.)

       Starting with release 10.30, recursive subroutine calls are no longer
       treated as atomic. That is, they can be re-entered to try unused
       alternatives if there is a matching failure later in the pattern. This
       is now compatible with the way Perl works. If you want a subroutine
       call to be atomic, you must explicitly enclose it in an atomic group.

       Supporting backtracking into recursions simplifies certain types of
       recursive pattern. For example, this pattern matches palindromic
       strings:

         ^((.)(?1)\2|.?)$

       The second branch in the group matches a single central character in
       the palindrome when there are an odd number of characters, or nothing
       when there are an even number of characters, but in order to work it
       has to be able to try the second case when the rest of the pattern
       match fails. If you want to match typical palindromic phrases, the
       pattern has to ignore all non-word characters, which can be done like
       this:

         ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$

       If run with the PCRE2_CASELESS option, this pattern matches phrases
       such as "A man, a plan, a canal: Panama!". Note the use of the
       possessive quantifier *+ to avoid backtracking into sequences of
       non-word characters. Without this, PCRE2 takes a great deal longer (ten
       times or more) to match typical phrases, and Perl takes so long that
       you think it has gone into a loop.

       Another way in which PCRE2 and Perl used to differ in their recursion
       processing is in the handling of captured values. Formerly in Perl,
       when a subpattern was called recursively or as a subroutine (see the
       next section), it had no access to any values that were captured
       outside the recursion, whereas in PCRE2 these values can be referenced.
       Consider this pattern:

         ^(.)(\1|a(?2))

       This pattern matches "bab". The first capturing parentheses match "b",
       then in the second group, when the backreference \1 fails to match "b",
       the second alternative matches "a" and then recurses. In the recursion,
       \1 does now match "b" and so the whole match succeeds. This match used
       to fail in Perl, but in later versions (I tried 5.024) it now works.


SUBPATTERNS AS SUBROUTINES

       If the syntax for a recursive subpattern call (either by number or by
       name) is used outside the parentheses to which it refers, it operates a
       bit like a subroutine in a programming language. More accurately, PCRE2
       treats the referenced subpattern as an independent subpattern which it
       tries to match at the current matching position. The called subpattern
       may be defined before or after the reference. A numbered reference can
       be absolute or relative, as in these examples:

         (...(absolute)...)...(?2)...
         (...(relative)...)...(?-1)...
         (...(?+1)...(relative)...

       An earlier example pointed out that the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility", but
       not "sense and responsibility". If instead the pattern

         (sens|respons)e and (?1)ibility

       is used, it does match "sense and responsibility" as well as the other
       two strings. Another example is given in the discussion of DEFINE
       above.

       Like recursions, subroutine calls used to be treated as atomic, but
       this changed at PCRE2 release 10.30, so backtracking into subroutine
       calls can now occur. However, any capturing parentheses that are set
       during the subroutine call revert to their previous values afterwards.
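
       The difference between the backreference \1 and the subroutine call
       (?1) in the "sense and responsibility" example above can be
       approximated outside PCRE2. The sketch below uses Python's standard re
       module, which has no subroutine-call syntax, so the call is emulated
       by pasting the subpattern text in a second time. That emulation, and
       the names GROUP, BACKREF, SUBROUTINE and matches, are assumptions made
       purely for illustration.

```python
import re

# Text of the reusable subpattern (non-capturing, so that pasting it
# twice does not disturb group numbering).
GROUP = r"(?:sens|respons)"

# Backreference form: \1 must repeat the exact text captured by group 1.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Emulated subroutine call: the subpattern is matched afresh at the
# second position, so it may match different text, much as (?1) would.
SUBROUTINE = re.compile(GROUP + r"e and " + GROUP + r"ibility")


def matches(regex, subject):
    """Return True if regex matches the whole subject string."""
    return regex.fullmatch(subject) is not None


if __name__ == "__main__":
    for text in ["sense and sensibility",
                 "response and responsibility",
                 "sense and responsibility"]:
        print(text, matches(BACKREF, text), matches(SUBROUTINE, text))
```

       Both forms accept "sense and sensibility" and "response and
       responsibility"; only the emulated subroutine form accepts "sense and
       responsibility". Textual pasting is only an approximation: it does not
       reproduce PCRE2's behaviour of reverting captures set during a real
       subroutine call.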

       Processing options such as case-independence are fixed when a
       subpattern is defined, so if it is used as a subroutine, such options
       cannot be changed for different calls. For example, consider this
       pattern:

         (abc)(?i:(?-1))

       It matches "abcabc". It does not match "abcABC" because the change of
       processing option does not affect the called subpattern.

       The behaviour of backtracking control verbs in subpatterns when called
       as subroutines is described in the section entitled "Backtracking verbs
       in subroutines" below.


ONIGURUMA SUBROUTINE SYNTAX

       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
       name or a number enclosed either in angle brackets or single quotes, is
       an alternative syntax for referencing a subpattern as a subroutine,
       possibly recursively. Here are two of the examples used above,
       rewritten using this syntax:

         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
         (sens|respons)e and \g'1'ibility

       PCRE2 supports an extension to Oniguruma: if a number is preceded by a
       plus or a minus sign it is taken as a relative reference. For example:

         (abc)(?i:\g<-1>)

       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
       synonymous. The former is a backreference; the latter is a subroutine
       call.


CALLOUTS

       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
       Perl code to be obeyed in the middle of matching a regular expression.
       This makes it possible, amongst other things, to extract different
       substrings that match the same pair of parentheses when there is a
       repetition.

       PCRE2 provides a similar feature, but of course it cannot obey
       arbitrary Perl code. The feature is called "callout".
       The caller of PCRE2 provides an external function by putting its entry
       point in a match context using the function pcre2_set_callout(), and
       then passing that context to pcre2_match() or pcre2_dfa_match(). If no
       match context is passed, or if the callout entry point is set to NULL,
       callouts are disabled.

       Within a regular expression, (?C<arg>) indicates a point at which the
       external function is to be called. There are two kinds of callout:
       those with a numerical argument and those with a string argument. (?C)
       on its own with no argument is treated as (?C0). A numerical argument
       allows the application to distinguish between different callouts.
       String arguments were added for release 10.20 to make it possible for
       script languages that use PCRE2 to embed short scripts within patterns
       in a similar way to Perl.

       During matching, when PCRE2 reaches a callout point, the external
       function is called. It is provided with the number or string argument
       of the callout, the position in the pattern, and one item of data that
       is also set in the match block. The callout function may cause matching
       to proceed, to backtrack, or to fail.

       By default, PCRE2 implements a number of optimizations at matching
       time, and one side-effect is that sometimes callouts are skipped. If
       you need all possible callouts to happen, you need to set options that
       disable the relevant optimizations. More details, including a complete
       description of the programming interface to the callout function, are
       given in the pcre2callout documentation.

   Callouts with numerical arguments

       If you just want to have a means of identifying different callout
       points, put a number less than 256 after the letter C.
       For example, this pattern has two callout points:

         (?C1)abc(?C2)def

       If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
       callouts are automatically installed before each item in the pattern.
       They are all numbered 255. If there is a conditional group in the
       pattern whose condition is an assertion, an additional callout is
       inserted just before the condition. An explicit callout may also be set
       at this position, as in this example:

         (?(?C9)(?=a)abc|def)

       Note that this applies only to assertion conditions, not to other types
       of condition.

   Callouts with string arguments

       A delimited string may be used instead of a number as a callout
       argument. The starting delimiter must be one of ` ' " ^ % # $ { and the
       ending delimiter is the same as the start, except for {, where the
       ending delimiter is }. If the ending delimiter is needed within the
       string, it must be doubled. For example:

         (?C'ab ''c'' d')xyz(?C{any text})pqr

       The doubling is removed before the string is passed to the callout
       function.


BACKTRACKING CONTROL

       There are a number of special "Backtracking Control Verbs" (to use
       Perl's terminology) that modify the behaviour of backtracking during
       matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
       verbs take either form, possibly behaving differently depending on
       whether or not a name is present.

       By default, for compatibility with Perl, a name is any sequence of
       characters that does not include a closing parenthesis. The name is not
       processed in any way, and it is not possible to include a closing
       parenthesis in the name. This can be changed by setting the
       PCRE2_ALT_VERBNAMES option, but the result is no longer
       Perl-compatible.

       When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
       verb names and only an unescaped closing parenthesis terminates the
       name. However, the only backslash items that are permitted are \Q, \E,
       and sequences such as \x{100} that define character code points.
       Character type escapes such as \d are faulted.

       A closing parenthesis can be included in a name either as \) or between
       \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
       or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
       names is skipped, and #-comments are recognized, exactly as in the rest
       of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
       verb names unless PCRE2_ALT_VERBNAMES is also set.

       The maximum length of a name is 255 in the 8-bit library and 65535 in
       the 16-bit and 32-bit libraries. If the name is empty, that is, if the
       closing parenthesis immediately follows the colon, the effect is as if
       the colon were not there. Any number of these verbs may occur in a
       pattern.

       Since these verbs are specifically related to backtracking, most of
       them can be used only when the pattern is to be matched using the
       traditional matching function, because that uses a backtracking
       algorithm. With the exception of (*FAIL), which behaves like a failing
       negative assertion, the backtracking control verbs cause an error if
       encountered by the DFA matching function.

       The behaviour of these verbs in repeated groups, assertions, and in
       subpatterns called as subroutines (whether or not recursively) is
       documented below.

   Optimizations that affect backtracking verbs

       PCRE2 contains some optimizations that are used to speed up matching by
       running some checks at the start of each match attempt.
       For example, it may know the minimum length of matching subject, or
       that a particular character must be present. When one of these
       optimizations bypasses the running of a match, any included
       backtracking verbs will not, of course, be processed. You can suppress
       the start-of-match optimizations by setting the
       PCRE2_NO_START_OPTIMIZE option when calling pcre2_compile(), or by
       starting the pattern with (*NO_START_OPT). There is more discussion of
       this option in the section entitled "Compiling a pattern" in the
       pcre2api documentation.

       Experiments with Perl suggest that it too has similar optimizations,
       and like PCRE2, turning them off can change the result of a match.

   Verbs that act immediately

       The following verbs act as soon as they are encountered.

          (*ACCEPT) or (*ACCEPT:NAME)

       This verb causes the match to end successfully, skipping the remainder
       of the pattern. However, when it is inside a subpattern that is called
       as a subroutine, only that subpattern is ended successfully. Matching
       then continues at the outer level. If (*ACCEPT) is triggered in a
       positive assertion, the assertion succeeds; in a negative assertion,
       the assertion fails.

       If (*ACCEPT) is inside capturing parentheses, the data so far is
       captured. For example:

         A((?:A|B(*ACCEPT)|C)D)

       This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is
       captured by the outer parentheses.

         (*FAIL) or (*FAIL:NAME)

       This verb causes a matching failure, forcing backtracking to occur. It
       may be abbreviated to (*F). It is equivalent to (?!) but easier to
       read. The Perl documentation notes that it is probably useful only when
       combined with (?{}) or (??{}). Those are, of course, Perl features that
       are not present in PCRE2.
       The nearest equivalent is the callout feature, as for example in this
       pattern:

         a+(?C)(*FAIL)

       A match with the string "aaaa" always fails, but the callout is taken
       before each backtrack happens (in this example, 10 times).

       (*ACCEPT:NAME) and (*FAIL:NAME) behave exactly the same as
       (*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively.

   Recording which path was taken

       There is one verb whose main purpose is to track how a match was
       arrived at, though it also has a secondary use in conjunction with
       advancing the match starting point (see (*SKIP) below).

         (*MARK:NAME) or (*:NAME)

       A name is always required with this verb. There may be as many
       instances of (*MARK) as you like in a pattern, and their names do not
       have to be unique.

       When a match succeeds, the name of the last-encountered (*MARK:NAME) on
       the matching path is passed back to the caller as described in the
       section entitled "Other information about the match" in the pcre2api
       documentation. This applies to all instances of (*MARK), including
       those inside assertions and atomic groups. (There are differences in
       those cases when (*MARK) is used in conjunction with (*SKIP) as
       described below.)

       As well as (*MARK), the (*COMMIT), (*PRUNE) and (*THEN) verbs may have
       associated NAME arguments. Whichever is last on the matching path is
       passed back. See below for more details of these other verbs.

       Here is an example of pcre2test output, where the "mark" modifier
       requests the retrieval and outputting of (*MARK) data:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
         data> XY
          0: XY
         MK: A
         XZ
          0: XZ
         MK: B

       The (*MARK) name is tagged with "MK:" in this output, and in this
       example it indicates which of the two alternatives matched.
       This is a more efficient way of obtaining this information than
       putting each alternative in its own capturing parentheses.

       If a verb with a name is encountered in a positive assertion that is
       true, the name is recorded and passed back if it is the
       last-encountered. This does not happen for negative assertions or
       failing positive assertions.

       After a partial match or a failed match, the last encountered name in
       the entire match process is returned. For example:

           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
         data> XP
         No match, mark = B

       Note that in this unanchored example the mark is retained from the
       match attempt that started at the letter "X" in the subject.
       Subsequent match attempts starting at "P" and then with an empty
       string do not get as far as the (*MARK) item, but nevertheless do not
       reset it.

       If you are interested in (*MARK) values after failed matches, you
       should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
       ensure that the match is always attempted.

   Verbs that act after backtracking

       The following verbs do nothing when they are encountered. Matching
       continues with what follows, but if there is a subsequent match
       failure, causing a backtrack to the verb, a failure is forced. That
       is, backtracking cannot pass to the left of the verb. However, when
       one of these verbs appears inside an atomic group or in a lookaround
       assertion that is true, its effect is confined to that group, because
       once the group has been matched, there is never any backtracking into
       it. Backtracking from beyond an assertion or an atomic group ignores
       the entire group, and seeks a preceding backtracking point.

       These verbs differ in exactly what kind of failure occurs when
       backtracking reaches them.
       The behaviour described below is what happens when the verb is not in
       a subroutine or an assertion. Subsequent sections cover these special
       cases.

         (*COMMIT) or (*COMMIT:NAME)

       This verb causes the whole match to fail outright if there is a later
       matching failure that causes backtracking to reach it. Even if the
       pattern is unanchored, no further attempts to find a match by
       advancing the starting point take place. If (*COMMIT) is the only
       backtracking verb that is encountered, once it has been passed
       pcre2_match() is committed to finding a match at the current starting
       point, or not at all. For example:

         a+(*COMMIT)b

       This matches "xxaab" but not "aacaab". It can be thought of as a kind
       of dynamic anchor, or "I've started, so I must finish."

       The behaviour of (*COMMIT:NAME) is not the same as
       (*MARK:NAME)(*COMMIT). It is like (*MARK:NAME) in that the name is
       remembered for passing back to the caller. However, (*SKIP:NAME)
       searches only for names set with (*MARK), ignoring those set by
       (*COMMIT), (*PRUNE) and (*THEN).

       If there is more than one backtracking verb in a pattern, a different
       one that follows (*COMMIT) may be triggered first, so merely passing
       (*COMMIT) during a match does not always guarantee that a match must
       be at this starting point.

       Note that (*COMMIT) at the start of a pattern is not the same as an
       anchor, unless PCRE2's start-of-match optimizations are turned off,
       as shown in this output from pcre2test:

           re> /(*COMMIT)abc/
         data> xyzabc
          0: abc
         data>
           re> /(*COMMIT)abc/no_start_optimize
         data> xyzabc
         No match

       For the first pattern, PCRE2 knows that any match must start with
       "a", so the optimization skips along the subject to "a" before
       applying the pattern to the first set of data. The match attempt then
       succeeds.
       The second pattern disables the optimization that skips along to the
       first character. The pattern is now applied starting at "x", and so
       the (*COMMIT) causes the match to fail without trying any other
       starting points.

         (*PRUNE) or (*PRUNE:NAME)

       This verb causes the match to fail at the current starting position
       in the subject if there is a later matching failure that causes
       backtracking to reach it. If the pattern is unanchored, the normal
       "bumpalong" advance to the next starting character then happens.
       Backtracking can occur as usual to the left of (*PRUNE), before it is
       reached, or when matching to the right of (*PRUNE), but if there is
       no match to the right, backtracking cannot cross (*PRUNE). In simple
       cases, the use of (*PRUNE) is just an alternative to an atomic group
       or possessive quantifier, but there are some uses of (*PRUNE) that
       cannot be expressed in any other way. In an anchored pattern (*PRUNE)
       has the same effect as (*COMMIT).

       The behaviour of (*PRUNE:NAME) is not the same as
       (*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is
       remembered for passing back to the caller. However, (*SKIP:NAME)
       searches only for names set with (*MARK), ignoring those set by
       (*COMMIT), (*PRUNE) or (*THEN).

         (*SKIP)

       This verb, when given without a name, is like (*PRUNE), except that
       if the pattern is unanchored, the "bumpalong" advance is not to the
       next character, but to the position in the subject where (*SKIP) was
       encountered. (*SKIP) signifies that whatever text was matched leading
       up to it cannot be part of a successful match if there is a later
       mismatch. Consider:

         a+(*SKIP)b

       If the subject is "aaaac...", after the first match attempt fails
       (starting at the first character in the string), the starting point
       skips on to start the next attempt at "c".
       Note that a possessive quantifier does not have the same effect as
       this example; although it would suppress backtracking during the
       first match attempt, the second attempt would start at the second
       character instead of skipping on to "c".

         (*SKIP:NAME)

       When (*SKIP) has an associated name, its behaviour is modified. When
       such a (*SKIP) is triggered, the previous path through the pattern is
       searched for the most recent (*MARK) that has the same name. If one
       is found, the "bumpalong" advance is to the subject position that
       corresponds to that (*MARK) instead of to where (*SKIP) was
       encountered. If no (*MARK) with a matching name is found, the (*SKIP)
       is ignored.

       The search for a (*MARK) name uses the normal backtracking mechanism,
       which means that it does not see (*MARK) settings that are inside
       atomic groups or assertions, because they are never re-entered by
       backtracking. Compare the following pcre2test examples:

           re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
         data> abc
          0: a
          1: a
         data>
           re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
         data> abc
          0: b
          1: b

       In the first example, the (*MARK) setting is in an atomic group, so
       it is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be
       ignored. This allows the second branch of the pattern to be tried at
       the first character position. In the second example, the (*MARK)
       setting is not in an atomic group. This allows (*SKIP:X) to find the
       (*MARK) when it backtracks, and this causes a new matching attempt to
       start at the second character. This time, the (*MARK) is never seen
       because "a" does not match "b", so the matcher immediately jumps to
       the second branch of the pattern.

       Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME).
       It ignores names that are set by (*COMMIT:NAME), (*PRUNE:NAME) or
       (*THEN:NAME).
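
       The difference between a+(*SKIP)b and the possessive a++b described
       above lies only in where the next match attempt starts. The following
       Python sketch is a toy model of the "bumpalong" loop for these two
       patterns alone; it does not call PCRE2, it ignores start-of-match
       optimizations, and the function name start_positions() and its
       skip_verb parameter are made up for this illustration.

```python
def start_positions(subject, skip_verb):
    """Toy model: list the start offsets tried for a+(*SKIP)b or a++b.

    skip_verb=True models a+(*SKIP)b, skip_verb=False models a++b.
    Returns (offsets_tried, match_span_or_None).
    """
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        # Consume the longest run of "a" from this start offset. Neither
        # pattern ever gives characters back: the possessive quantifier
        # forbids it, and (*SKIP) is backtracked onto before a+ is.
        pos = start
        while pos < len(subject) and subject[pos] == "a":
            pos += 1
        matched_a = pos > start
        if matched_a and pos < len(subject) and subject[pos] == "b":
            return tried, (start, pos + 1)
        if skip_verb and matched_a:
            start = pos      # (*SKIP) was passed at pos: resume there
        else:
            start += 1       # ordinary bumpalong by one character
    return tried, None


print(start_positions("aaaac", True))    # models a+(*SKIP)b
print(start_positions("aaaac", False))   # models a++b
```

       In this model the (*SKIP) version tries offsets 0, 4 and 5 on the
       subject "aaaac", whereas the possessive version tries every offset
       from 0 to 5. Both fail, but the (*SKIP) version does so with fewer
       attempts.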

         (*THEN) or (*THEN:NAME)

       This verb causes a skip to the next innermost alternative when
       backtracking reaches it. That is, it cancels any further backtracking
       within the current alternative. Its name comes from the observation
       that it can be used for a pattern-based if-then-else block:

         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

       If the COND1 pattern matches, FOO is tried (and possibly further
       items after the end of the group if FOO succeeds); on failure, the
       matcher skips to the second alternative and tries COND2, without
       backtracking into COND1. If that succeeds and BAR fails, COND3 is
       tried. If subsequently BAZ fails, there are no more alternatives, so
       there is a backtrack to whatever came before the entire group. If
       (*THEN) is not inside an alternation, it acts like (*PRUNE).

       The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
       It is like (*MARK:NAME) in that the name is remembered for passing
       back to the caller. However, (*SKIP:NAME) searches only for names set
       with (*MARK), ignoring those set by (*COMMIT), (*PRUNE) and (*THEN).

       A subpattern that does not contain a | character is just a part of
       the enclosing alternative; it is not a nested alternation with only
       one alternative. The effect of (*THEN) extends beyond such a
       subpattern to the enclosing alternative. Consider this pattern, where
       A, B, etc. are complex pattern fragments that do not contain any |
       characters at this level:

         A (B(*THEN)C) | D

       If A and B are matched, but there is a failure in C, matching does
       not backtrack into A; instead it moves to the next alternative, that
       is, D. However, if the subpattern containing (*THEN) is given an
       alternative, it behaves differently:

         A (B(*THEN)C | (*FAIL)) | D

       The effect of (*THEN) is now confined to the inner subpattern.
       After a failure in C, matching moves to (*FAIL), which causes the
       whole subpattern to fail because there are no more alternatives to
       try. In this case, matching does now backtrack into A.

       Note that a conditional subpattern is not considered as having two
       alternatives, because only one is ever used. In other words, the |
       character in a conditional subpattern has a different meaning.
       Ignoring white space, consider:

         ^.*? (?(?=a) a | b(*THEN)c )

       If the subject is "ba", this pattern does not match. Because .*? is
       ungreedy, it initially matches zero characters. The condition (?=a)
       then fails, the character "b" is matched, but "c" is not. At this
       point, matching does not backtrack to .*? as might perhaps be
       expected from the presence of the | character. The conditional
       subpattern is part of the single alternative that comprises the whole
       pattern, and so the match fails. (If there were a backtrack into .*?,
       allowing it to match "b", the match would succeed.)

       The verbs just described provide four different "strengths" of
       control when subsequent matching fails. (*THEN) is the weakest,
       carrying on the match at the next alternative. (*PRUNE) comes next,
       failing the match at the current starting position, but allowing an
       advance to the next character (for an unanchored pattern). (*SKIP) is
       similar, except that the advance may be more than one character.
       (*COMMIT) is the strongest, causing the entire match to fail.

   More than one backtracking verb

       If more than one backtracking verb is present in a pattern, the one
       that is backtracked onto first acts. For example, consider this
       pattern, where A, B, etc. are complex pattern fragments:

         (A(*COMMIT)B(*THEN)C|ABD)

       If A matches but B fails, the backtrack to (*COMMIT) causes the
       entire match to fail.
       However, if A and B match, but C fails, the backtrack to (*THEN)
       causes the next alternative (ABD) to be tried. This behaviour is
       consistent, but is not always the same as Perl's. It means that if
       two or more backtracking verbs appear in succession, all but the last
       of them have no effect. Consider this example:

         ...(*COMMIT)(*PRUNE)...

       If there is a matching failure to the right, backtracking onto
       (*PRUNE) causes it to be triggered, and its action is taken. There
       can never be a backtrack onto (*COMMIT).

   Backtracking verbs in repeated groups

       PCRE2 sometimes differs from Perl in its handling of backtracking
       verbs in repeated groups. For example, consider:

         /(a(*COMMIT)b)+ac/

       If the subject is "abac", Perl matches unless its optimizations are
       disabled, but PCRE2 always fails because the (*COMMIT) in the second
       repeat of the group acts.

   Backtracking verbs in assertions

       (*FAIL) in any assertion has its normal effect: it forces an
       immediate backtrack. The behaviour of the other backtracking verbs
       depends on whether the assertion is standalone or acting as the
       condition in a conditional subpattern.

       (*ACCEPT) in a standalone positive assertion causes the assertion to
       succeed without any further processing; captured strings and a
       (*MARK) name (if set) are retained. In a standalone negative
       assertion, (*ACCEPT) causes the assertion to fail without any further
       processing; captured substrings and any (*MARK) name are discarded.

       If the assertion is a condition, (*ACCEPT) causes the condition to be
       true for a positive assertion and false for a negative one; captured
       substrings are retained in both cases.

       The remaining verbs act only when a later failure causes a backtrack
       to reach them.
       This means that their effect is confined to the assertion, because
       lookaround assertions are atomic. A backtrack that occurs after an
       assertion is complete does not jump back into the assertion. Note in
       particular that a (*MARK) name that is set in an assertion is not
       "seen" by an instance of (*SKIP:NAME) later in the pattern.

       The effect of (*THEN) is not allowed to escape beyond an assertion.
       If there are no more branches to try, (*THEN) causes a positive
       assertion to be false, and a negative assertion to be true.

       The other backtracking verbs are not treated specially if they appear
       in a standalone positive assertion. In a conditional positive
       assertion, backtracking (from within the assertion) into (*COMMIT),
       (*SKIP), or (*PRUNE) causes the condition to be false. However, for
       both standalone and conditional negative assertions, backtracking
       into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be true,
       without considering any further alternative branches.

   Backtracking verbs in subroutines

       These behaviours occur whether or not the subpattern is called
       recursively.

       (*ACCEPT) in a subpattern called as a subroutine causes the
       subroutine match to succeed without any further processing. Matching
       then continues after the subroutine call. Perl documents this
       behaviour. Perl's treatment of the other verbs in subroutines is
       different in some cases.

       (*FAIL) in a subpattern called as a subroutine has its normal effect:
       it forces an immediate backtrack.

       (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail
       when triggered by being backtracked to in a subpattern called as a
       subroutine. There is then a backtrack at the outer level.

       (*THEN), when triggered, skips to the next alternative in the
       innermost enclosing group within the subpattern that has alternatives
       (its normal behaviour). However, if there is no such group within the
       subroutine subpattern, the subroutine match fails and there is a
       backtrack at the outer level.


SEE ALSO

       pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
       pcre2(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 04 September 2018
       Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------


PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 PERFORMANCE

       Two aspects of performance are discussed below: memory usage and
       processing time. The way you express your pattern as a regular
       expression can affect both of them.


COMPILED PATTERN MEMORY USAGE

       Patterns are compiled by PCRE2 into a reasonably efficient
       interpretive code, so that most simple patterns do not use much
       memory for storing the compiled version. However, there is one case
       where the memory usage of a compiled pattern can be unexpectedly
       large. If a parenthesized subpattern has a quantifier with a minimum
       greater than 1 and/or a limited maximum, the whole subpattern is
       repeated in the compiled code. For example, the pattern

         (abc|def){2,4}

       is compiled as if it were

         (abc|def)(abc|def)((abc|def)(abc|def)?)?

       (Technical aside: It is done this way so that backtrack points within
       each of the repetitions can be independently maintained.)
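
       The replication described above can be sketched as a purely textual
       expansion. The Python function below is only an illustration of the
       "as if it were" form; it is not how PCRE2 builds its compiled code,
       the name expand_bounded() is invented for this example, and it
       assumes that the group argument is already parenthesized.

```python
def expand_bounded(group, minimum, maximum):
    """Textually expand group{minimum,maximum} into repeated copies.

    The required copies come first, followed by a nested chain of
    optional copies of the form (G(G)?)?.
    """
    if not 0 <= minimum <= maximum:
        raise ValueError("need 0 <= minimum <= maximum")
    tail = ""
    # Build the optional part from the innermost copy outwards.
    for _ in range(maximum - minimum):
        if tail == "":
            tail = group + "?"
        else:
            tail = "(" + group + tail + ")?"
    return group * minimum + tail


print(expand_bounded("(abc|def)", 2, 4))
# The expanded text grows linearly with the maximum repeat count:
print(len(expand_bounded("(ab)", 1, 1000)))
```

       The first call reproduces the expansion shown above. The second gives
       a feel for why large repeat counts on a group make the compiled
       pattern large, since every repetition carries its own copy of the
       group.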

       For regular expressions whose quantifiers use only small numbers,
       this is not usually a problem. However, if the numbers are large, and
       particularly if such repetitions are nested, the memory usage can
       become an embarrassment. For example, the very simple pattern

         ((ab){1,1000}c){1,3}

       uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
       compiled with its default internal pointer size of two bytes, the
       size limit on a compiled pattern is 65535 code units in the 8-bit and
       16-bit libraries, and this is reached with the above pattern if the
       outer repetition is increased from 3 to 4. PCRE2 can be compiled to
       use larger internal pointers and thus handle larger compiled
       patterns, but it is better to try to rewrite your pattern to use less
       memory if you can.

       One way of reducing the memory usage for such patterns is to make use
       of PCRE2's "subroutine" facility. Re-writing the above pattern as

         ((ab)(?2){0,999}c)(?1){0,2}

       reduces the memory requirements to around 16KiB, and indeed it
       remains under 20KiB even with the outer repetition increased to 100.
       However, this kind of pattern is not always exactly equivalent,
       because any captures within subroutine calls are lost when the
       subroutine completes. If this is not a problem, this kind of
       rewriting will allow you to process patterns that PCRE2 cannot
       otherwise handle. The matching performance of the two different
       versions of the pattern is roughly the same. (This applies from
       release 10.30 - things were different in earlier releases.)


STACK AND HEAP USAGE AT RUN TIME

       From release 10.30, the interpretive (non-JIT) version of
       pcre2_match() uses very little system stack at run time.
       In earlier releases recursive function calls could use a great deal
       of stack, and this could cause problems, but this usage has been
       eliminated. Backtracking positions are now explicitly remembered in
       memory frames controlled by the code. An initial 20KiB vector of
       frames is allocated on the system stack (enough for about 100 frames
       for small patterns), but if this is insufficient, heap memory is
       used. The amount of heap memory can be limited; if the limit is set
       to zero, only the initial stack vector is used. Rewriting patterns to
       be time-efficient, as described below, may also reduce the memory
       requirements.

       In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
       function calls, but only for processing atomic groups, lookaround
       assertions, and recursion within the pattern. The original version of
       the code used to allocate quite large internal workspace vectors on
       the stack, which caused some problems for some patterns in
       environments with small stacks. From release 10.32 the code for
       pcre2_dfa_match() has been re-factored to use heap memory when
       necessary for internal workspace when recursing, though recursive
       function calls are still used.

       The "match depth" parameter can be used to limit the depth of
       function recursion, and the "match heap" parameter to limit heap
       memory in pcre2_dfa_match().


PROCESSING TIME

       Certain items in regular expression patterns are processed more
       efficiently than others. It is more efficient to use a character
       class like [aeiou] than a set of single-character alternatives such
       as (a|e|i|o|u). In general, the simplest construction that provides
       the required behaviour is usually the most efficient. Jeffrey
       Friedl's book contains a lot of useful general discussion about
       optimizing regular expressions for efficient performance.
       This document contains a few observations about PCRE2.

       Using Unicode character properties (the \p, \P, and \X escapes) is
       slow, because PCRE2 has to use a multi-stage table lookup whenever it
       needs a character's property. If you can find an alternative pattern
       that does not use character properties, it will probably be faster.

       By default, the escape sequences \b, \d, \s, and \w, and the POSIX
       character classes such as [:alpha:] do not use Unicode properties,
       partly for backwards compatibility, and partly for performance
       reasons. However, you can set the PCRE2_UCP option or start the
       pattern with (*UCP) if you want Unicode character properties to be
       used. This can double the matching time for items such as \d, when
       matched with pcre2_match(); the performance loss is less with a DFA
       matching function, and in both cases there is not much difference for
       \b.

       When a pattern begins with .* not in atomic parentheses, nor in
       parentheses that are the subject of a backreference, and the
       PCRE2_DOTALL option is set, the pattern is implicitly anchored by
       PCRE2, since it can match only at the start of a subject string. If
       the pattern has multiple top-level branches, they must all be
       anchorable. The optimization can be disabled by the
       PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the
       pattern contains (*PRUNE) or (*SKIP).

       If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
       because the dot metacharacter does not then match a newline, and if
       the subject string contains newlines, the pattern may match from the
       character immediately following one of them instead of from the very
       start. For example, the pattern

         .*second

       matches the subject "first\nand second" (where \n stands for a
       newline character), with the match starting at the seventh character.
       In order to do this, PCRE2 has to retry the match starting after
       every newline in the subject.

       If you are using such a pattern with subject strings that do not
       contain newlines, the best performance is obtained by setting
       PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
       explicit anchoring. That saves PCRE2 from having to scan along the
       subject looking for a newline to restart at.

       Beware of patterns that contain nested indefinite repeats. These can
       take a long time to run when applied to a string that does not match.
       Consider the pattern fragment

         ^(a+)*

       This can match "aaaa" in 16 different ways, and this number increases
       very rapidly as the string gets longer. (The * repeat can match 0, 1,
       2, 3, or 4 times, and for each of those cases other than 0 or 4, the
       + repeats can match different numbers of times.) When the remainder
       of the pattern is such that the entire match is going to fail, PCRE2
       has in principle to try every possible variation, and this can take
       an extremely long time, even for relatively short strings.

       An optimization catches some of the simpler cases such as

         (a+)*b

       where a literal character follows. Before embarking on the standard
       matching procedure, PCRE2 checks that there is a "b" later in the
       subject string, and if there is not, it fails the match immediately.
       However, when there is no following literal this optimization cannot
       be used. You can see the difference by comparing the behaviour of

         (a+)*\d

       with the pattern above. The pattern that ends with the literal "b"
       gives a failure almost instantly when applied to a whole line of "a"
       characters, whereas the one that ends with \d takes an appreciable
       time with strings longer than about 20 characters.
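
       The figure of 16 quoted above for ^(a+)* and "aaaa" can be reproduced
       with a short calculation. The Python sketch below is one way of
       counting, under the assumption that a "way" means a choice of how
       many characters each iteration of a+ consumes, with the group allowed
       to stop after any prefix of the subject. The function name
       ways_nested_repeat() is invented for this illustration, and nothing
       here calls PCRE2.

```python
from functools import lru_cache


def ways_nested_repeat(n):
    """Count the ways ^(a+)* can match a prefix of a string of n "a"s."""

    @lru_cache(maxsize=None)
    def splits(k):
        # Ordered ways to cut k characters into runs of length >= 1,
        # one run per iteration of the a+ group.
        if k == 0:
            return 1
        return sum(splits(k - first) for first in range(1, k + 1))

    # The * repeat may stop after any prefix length from 0 to n.
    return sum(splits(k) for k in range(n + 1))


for length in (4, 10, 20):
    print(length, ways_nested_repeat(length))
```

       Under this counting the total doubles with every extra character,
       which is consistent with the observation that the \d version becomes
       noticeably slow at around 20 characters.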

       In many cases, the solution to this kind of performance issue is to
       use an atomic group or a possessive quantifier. This can often reduce
       memory requirements as well. As another example, consider this
       pattern:

         ([^<]|<(?!inet))+

       It matches from wherever it starts until it encounters "<inet" or the
       end of the data, and is the kind of pattern that might be used when
       processing an XML file. Each iteration of the outer parentheses
       matches either one character that is not "<" or a "<" that is not
       followed by "inet". However, each time a parenthesis is processed, a
       backtracking position is passed, so this formulation uses a memory
       frame for each matched character. For a long string, a lot of memory
       is required. Consider now this rewritten pattern, which matches
       exactly the same strings:

         ([^<]++|<(?!inet))+

       This runs much faster, because sequences of characters that do not
       contain "<" are "swallowed" in one item inside the parentheses, and a
       possessive quantifier is used to stop any backtracking into the runs
       of non-"<" characters. This version also uses a lot less memory
       because entry to a new set of parentheses happens only when a "<"
       character that is not followed by "inet" is encountered (and we
       assume this is relatively rare).

       This example shows that one way of optimizing performance when
       matching long subject strings is to write repeated parenthesized
       subpatterns to match more than one character whenever possible.

   SETTING RESOURCE LIMITS

       You can set limits on the amount of processing that takes place when
       matching, and on the amount of heap memory that is used. The default
       values of the limits are very large, and unlikely ever to operate.
       They can be changed when PCRE2 is built, and they can also be set
       when pcre2_match() or pcre2_dfa_match() is called.
       For details of these interfaces, see the pcre2build documentation and
       the section entitled "The match context" in the pcre2api
       documentation.

       The pcre2test test program has a modifier called "find_limits" which,
       if applied to a subject line, causes it to find the smallest limits
       that allow a pattern to match. This is done by repeatedly matching
       with different limits.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 25 April 2018
       Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------


PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

SYNOPSIS

       #include <pcre2posix.h>

       int regcomp(regex_t *preg, const char *pattern,
            int cflags);

       int regexec(const regex_t *preg, const char *string,
            size_t nmatch, regmatch_t pmatch[], int eflags);

       size_t regerror(int errcode, const regex_t *preg,
            char *errbuf, size_t errbuf_size);

       void regfree(regex_t *preg);


DESCRIPTION

       This set of functions provides a POSIX-style API for the PCRE2
       regular expression 8-bit library. See the pcre2api documentation for
       a description of PCRE2's native API, which contains much additional
       functionality. There are no POSIX-style wrappers for PCRE2's 16-bit
       and 32-bit libraries.

       The functions described here are just wrapper functions that
       ultimately call the PCRE2 native API. Their prototypes are defined in
       the pcre2posix.h header file, and on Unix systems the library itself
       is called libpcre2-posix.a, so can be accessed by adding
       -lpcre2-posix to the command for linking an application that uses
       them.
       Because the POSIX functions call the native ones, it is also
       necessary to add -lpcre2-8.

       Those POSIX option bits that can reasonably be mapped to PCRE2 native
       options have been implemented. In addition, the option REG_EXTENDED
       is defined with the value zero. This has no effect, but since
       programs that are written to the POSIX interface often use it, this
       makes it easier to slot in PCRE2 as a replacement library. Other
       POSIX options are not even defined.

       There are also some options that are not defined by POSIX. These have
       been added at the request of users who want to make use of certain
       PCRE2-specific features via the POSIX calling interface or to add BSD
       or GNU functionality.

       When PCRE2 is called via these functions, it is only the API that is
       POSIX-like in style. The syntax and semantics of the regular
       expressions themselves are still those of Perl, subject to the
       setting of various PCRE2 options, as described below. "POSIX-like in
       style" means that the API approximates to the POSIX definition; it is
       not fully POSIX-compatible, and in multi-unit encoding domains it is
       probably even less compatible.

       The header for these functions is supplied as pcre2posix.h to avoid
       any potential clash with other POSIX libraries. It can, of course, be
       renamed or aliased as regex.h, which is the "correct" name. It
       provides two structure types, regex_t for compiled internal forms,
       and regmatch_t for returning captured substrings. It also defines
       some constants whose names start with "REG_"; these are used for
       setting options and identifying error codes.


COMPILING A PATTERN

       The function regcomp() is called to compile a pattern into an
       internal form. By default, the pattern is a C string terminated by a
       binary zero (but see REG_PEND below).
       The preg argument is a pointer to a regex_t structure that is used as
       a base for storing information about the compiled regular expression.
       (It is also used for input when REG_PEND is set.)

       The argument cflags is either zero, or contains one or more of the
       bits defined by the following macros:

         REG_DOTALL

       The PCRE2_DOTALL option is set when the regular expression is passed
       for compilation to the native function. Note that REG_DOTALL is not
       part of the POSIX standard.

         REG_ICASE

       The PCRE2_CASELESS option is set when the regular expression is
       passed for compilation to the native function.

         REG_NEWLINE

       The PCRE2_MULTILINE option is set when the regular expression is
       passed for compilation to the native function. Note that this does
       not mimic the defined POSIX behaviour for REG_NEWLINE (see the
       following section).

         REG_NOSPEC

       The PCRE2_LITERAL option is set when the regular expression is passed
       for compilation to the native function. This disables all meta
       characters in the pattern, causing it to be treated as a literal
       string. The only other options that are allowed with REG_NOSPEC are
       REG_ICASE, REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is
       not part of the POSIX standard.

         REG_NOSUB

       When a pattern that is compiled with this flag is passed to regexec()
       for matching, the nmatch and pmatch arguments are ignored, and no
       captured strings are returned. Versions of the PCRE2 library prior to
       10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this
       no longer happens because it disables the use of backreferences.

         REG_PEND

       If this option is set, the re_endp field in the preg structure (which
       has the type const char *) must be set to point to the character
       beyond the end of the pattern before calling regcomp().
       The pattern itself may now contain binary zeros, which are treated as
       data characters. Without REG_PEND, a binary zero terminates the pattern
       and the re_endp field is ignored. This is a GNU extension to the POSIX
       standard and should be used with caution in software intended to be
       portable to other systems.

         REG_UCP

       The PCRE2_UCP option is set when the regular expression is passed for
       compilation to the native function. This causes PCRE2 to use Unicode
       properties when matching \d, \w, etc., instead of just recognizing
       ASCII values. Note that REG_UCP is not part of the POSIX standard.

         REG_UNGREEDY

       The PCRE2_UNGREEDY option is set when the regular expression is passed
       for compilation to the native function. Note that REG_UNGREEDY is not
       part of the POSIX standard.

         REG_UTF

       The PCRE2_UTF option is set when the regular expression is passed for
       compilation to the native function. This causes the pattern itself and
       all data strings used for matching it to be treated as UTF-8 strings.
       Note that REG_UTF is not part of the POSIX standard.

       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE2 default
       semantics. In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way. Note that setting
       PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by the dot
       metacharacter (they are not) or by a negative class such as [^a] (they
       are).

       The yield of regcomp() is zero on success, and non-zero otherwise. The
       preg structure is filled in on success, and one other member of the
       structure (as well as re_endp) is public: re_nsub contains the number
       of capturing subpatterns in the regular expression.
       Various error codes are defined in the header file.

       NOTE: If the yield of regcomp() is non-zero, you must not attempt to
       use the contents of the preg structure. If, for example, you pass it to
       regexec(), the result is undefined and your program is likely to crash.


MATCHING NEWLINE CHARACTERS

       This area is not simple, because POSIX and Perl take different views of
       things. It is not possible to get PCRE2 to obey POSIX semantics, but
       then PCRE2 was never intended to be a POSIX engine. The following table
       lists the different possibilities for matching newline characters in
       Perl and PCRE2:

                                 Default   Change with

         . matches newline          no     PCRE2_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
         $ matches \n in middle     no     PCRE2_MULTILINE
         ^ matches \n in middle     no     PCRE2_MULTILINE

       This is the equivalent table for a POSIX-compatible pattern matcher:

                                 Default   Change with

         . matches newline          yes    REG_NEWLINE
         newline matches [^a]       yes    REG_NEWLINE
         $ matches \n at end        no     REG_NEWLINE
         $ matches \n in middle     no     REG_NEWLINE
         ^ matches \n in middle     no     REG_NEWLINE

       This behaviour is not what happens when PCRE2 is called via its POSIX
       API. By default, PCRE2's behaviour is the same as Perl's, except that
       there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
       and Perl, there is no way to stop newline from matching [^a].

       Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
       and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
       there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
       action. When using the POSIX API, passing REG_NEWLINE to PCRE2's
       regcomp() function causes PCRE2_MULTILINE to be passed to
       pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL.
       There is no way to pass PCRE2_DOLLAR_ENDONLY.


MATCHING A PATTERN

       The function regexec() is called to match a compiled pattern preg
       against a given string, which is by default terminated by a zero byte
       (but see REG_STARTEND below), subject to the options in eflags. These
       can be:

         REG_NOTBOL

       The PCRE2_NOTBOL option is set when calling the underlying PCRE2
       matching function.

         REG_NOTEMPTY

       The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
       matching function. Note that REG_NOTEMPTY is not part of the POSIX
       standard. However, setting this option can give more POSIX-like
       behaviour in some situations.

         REG_NOTEOL

       The PCRE2_NOTEOL option is set when calling the underlying PCRE2
       matching function.

         REG_STARTEND

       When this option is set, the subject string starts at string +
       pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
       point to the first character beyond the string. There may be binary
       zeros within the subject string, and indeed, using REG_STARTEND is the
       only way to pass a subject string that contains a binary zero.

       Whatever the value of pmatch[0].rm_so, the offsets of the matched
       string and any captured substrings are still given relative to the
       start of string itself. (Before PCRE2 release 10.30 these were given
       relative to string + pmatch[0].rm_so, but this differs from other
       implementations.)

       This is a BSD extension, compatible with but not specified by IEEE
       Standard 1003.2 (POSIX.2), and should be used with caution in software
       intended to be portable to other systems. Note that a non-zero rm_so
       does not imply REG_NOTBOL; REG_STARTEND affects only the location and
       length of the string, not how it is matched.
       Setting REG_STARTEND and passing pmatch as NULL are mutually exclusive;
       the error REG_INVARG is returned.

       If the pattern was compiled with the REG_NOSUB flag, no data about any
       matched strings is returned. The nmatch and pmatch arguments of
       regexec() are ignored (except possibly as input for REG_STARTEND).

       The value of nmatch may be zero, and the value of pmatch may be NULL
       (unless REG_STARTEND is set); in both these cases no data about any
       matched strings is returned.

       Otherwise, the portion of the string that was matched, and also any
       captured substrings, are returned via the pmatch argument, which points
       to an array of nmatch structures of type regmatch_t, containing the
       members rm_so and rm_eo. These contain the byte offset to the first
       character of each substring and the offset to the first character after
       the end of each substring, respectively. The 0th element of the vector
       relates to the entire portion of string that was matched; subsequent
       elements relate to the capturing subpatterns of the regular expression.
       Unused entries in the array have both structure members set to -1.

       A successful match yields a zero return; various error codes are
       defined in the header file, of which REG_NOMATCH is the "expected"
       failure code.


ERROR MESSAGES

       The regerror() function maps a non-zero errorcode from either regcomp()
       or regexec() to a printable message. If preg is not NULL, the error
       should have arisen from the use of that structure. A message terminated
       by a binary zero is placed in errbuf. If the buffer is too short, only
       the first errbuf_size - 1 characters of the error message are used. The
       yield of the function is the size of buffer needed to hold the whole
       message, including the terminating zero. This value is greater than
       errbuf_size if the message was truncated.
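
       The sizing behaviour described above can be used to fetch an error
       message of any length. The following is a minimal sketch, not part of
       PCRE2: compile_check() is a hypothetical helper that probes regerror()
       with a one-byte buffer to learn the required size, then allocates a
       buffer and fetches the whole message. It assumes that the header is
       reachable as regex.h; if it is installed only as pcre2posix.h, include
       that name instead.

```c
#include <assert.h>
#include <regex.h>   /* assumption: pcre2posix.h renamed or aliased as regex.h */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a PCRE2 function). Returns the regcomp() result.
   On failure, *msg_out receives a malloc'd copy of the full error message
   (or NULL if allocation failed); the caller frees it. On success, *msg_out
   is set to NULL. */
int compile_check(const char *pattern, int cflags, char **msg_out)
{
    regex_t re;
    int rc = regcomp(&re, pattern, cflags);

    *msg_out = NULL;
    if (rc == 0) {
        regfree(&re);   /* compiled successfully: release the memory */
        return 0;
    }

    /* A one-byte buffer holds only the terminating zero, but the return
       value is still the size needed for the whole message. */
    char probe[1];
    size_t needed = regerror(rc, &re, probe, sizeof probe);
    assert(needed >= 1);

    char *msg = malloc(needed);
    if (msg != NULL)
        (void)regerror(rc, &re, msg, needed);
    *msg_out = msg;
    return rc;
}
```

       A caller might pass a deliberately broken pattern such as "(" with
       REG_EXTENDED, print the returned message, and then free() it. The
       preg structure is passed only to regerror() after a failed regcomp(),
       never to regexec().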


MEMORY USAGE

       Compiling a regular expression causes memory to be allocated and
       associated with the preg structure. The function regfree() frees all
       such memory, after which preg may no longer be used as a compiled
       expression.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 15 June 2017
       Copyright (c) 1997-2017 University of Cambridge.
------------------------------------------------------------------------------


PCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 SAMPLE PROGRAM

       A simple, complete demonstration program to get you started with using
       PCRE2 is supplied in the file pcre2demo.c in the src directory in the
       PCRE2 distribution. A listing of this program is given in the pcre2demo
       documentation. If you do not have a copy of the PCRE2 distribution, you
       can save this listing to re-create the contents of pcre2demo.c.

       The demonstration program compiles the regular expression that is its
       first argument, and matches it against the subject string in its second
       argument. No PCRE2 options are set, and default character tables are
       used. If matching succeeds, the program outputs the portion of the
       subject that matched, together with the contents of any captured
       substrings.

       If the -g option is given on the command line, the program then goes on
       to check for further matches of the same regular expression in the same
       subject string. The logic is a little bit tricky because of the
       possibility of matching an empty string. Comments in the code explain
       what is going on.

       The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit
       library.
       It handles strings and characters that are stored in 8-bit code units.
       By default, one character corresponds to one code unit, but if the
       pattern starts with "(*UTF)", both it and the subject are treated as
       UTF-8 strings, where characters may occupy multiple code units.

       If PCRE2 is installed in the standard include and library directories
       for your operating system, you should be able to compile the
       demonstration program using a command like this:

         cc -o pcre2demo pcre2demo.c -lpcre2-8

       If PCRE2 is installed elsewhere, you may need to add additional options
       to the command line. For example, on a Unix-like system that has PCRE2
       installed in /usr/local, you can compile the demonstration program
       using a command like this:

         cc -o pcre2demo -I/usr/local/include pcre2demo.c \
            -L/usr/local/lib -lpcre2-8

       Once you have built the demonstration program, you can run simple tests
       like this:

         ./pcre2demo 'cat|dog' 'the cat sat on the mat'
         ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'

       Note that there is a much more comprehensive test program, called
       pcre2test, which supports many more facilities for testing regular
       expressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
       though not all three need be installed). The pcre2demo program is
       provided as a relatively simple coding example.

       If you try to run pcre2demo when PCRE2 is not installed in the standard
       library directory, you may get an error like this on some operating
       systems (e.g. Solaris):

         ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
         or directory

       This is caused by the way shared library support works on those
       systems. You need to add

         -R/usr/local/lib

       (for example) to the compile command to get round this problem.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 02 February 2016
       Copyright (c) 1997-2016 University of Cambridge.
------------------------------------------------------------------------------


PCRE2SERIALIZE(3)          Library Functions Manual          PCRE2SERIALIZE(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);

       If you are running an application that uses a large number of regular
       expression patterns, it may be useful to store them in a precompiled
       form instead of having to compile them every time the application is
       run. However, if you are using the just-in-time optimization feature,
       it is not possible to save and reload the JIT data, because it is
       position-dependent. The host on which the patterns are reloaded must be
       running the same version of PCRE2, with the same code unit width, and
       must also have the same endianness, pointer width and PCRE2_SIZE type.
       For example, patterns compiled on a 32-bit system using PCRE2's 16-bit
       library cannot be reloaded on a 64-bit system, nor can they be reloaded
       using the 8-bit library.

       Note that "serialization" in PCRE2 does not convert compiled patterns
       to an abstract format like Java or .NET serialization.
       The serialized output is really just a bytecode dump, which is why it
       can only be reloaded in the same environment as the one that created
       it. Hence the restrictions mentioned above. Applications that are not
       statically linked with a fixed version of PCRE2 must be prepared to
       recompile patterns from their sources, in order to be immune to PCRE2
       upgrades.


SECURITY CONCERNS

       The facility for saving and restoring compiled patterns is intended for
       use within individual applications. As such, the data supplied to
       pcre2_serialize_decode() is expected to be trusted data, not data from
       arbitrary external sources. There is only some simple consistency
       checking, not complete validation of what is being re-loaded. Corrupted
       data may cause undefined results. For example, if the length field of a
       pattern in the serialized data is corrupted, the deserializing code may
       read beyond the end of the byte stream that is passed to it.


SAVING COMPILED PATTERNS

       Before compiled patterns can be saved they must be serialized, which in
       PCRE2 means converting the pattern to a stream of bytes. A single byte
       stream may contain any number of compiled patterns, but they must all
       use the same character tables. A single copy of the tables is included
       in the byte stream (its size is 1088 bytes). For more details of
       character tables, see the section on locale support in the pcre2api
       documentation.

       The function pcre2_serialize_encode() creates a serialized byte stream
       from a list of compiled patterns. Its first two arguments specify the
       list, being a pointer to a vector of pointers to compiled patterns, and
       the length of the vector. The third and fourth arguments point to
       variables which are set to point to the created byte stream and its
       length, respectively.
       The final argument is a pointer to a general context, which can be used
       to specify custom memory management functions. If this argument is
       NULL, malloc() is used to obtain memory for the byte stream. The yield
       of the function is the number of serialized patterns, or one of the
       following negative error codes:

         PCRE2_ERROR_BADDATA      the number of patterns is zero or less
         PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
         PCRE2_ERROR_MEMORY       memory allocation failed
         PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
         PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL

       PCRE2_ERROR_BADMAGIC means either that a pattern's code has been
       corrupted, or that a slot in the vector does not point to a compiled
       pattern.

       Once a set of patterns has been serialized you can save the data in any
       appropriate manner. Here is sample code that compiles two patterns and
       writes them to a file. It assumes that the variable fd refers to a file
       that is open for output. The error checking that should be present in a
       real application has been omitted for simplicity.

         int errorcode;
         uint8_t *bytes;
         PCRE2_SIZE erroroffset;
         PCRE2_SIZE bytescount;
         pcre2_code *list_of_codes[2];
         list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         errorcode = pcre2_serialize_encode(list_of_codes, 2, &bytes,
           &bytescount, NULL);
         errorcode = fwrite(bytes, 1, bytescount, fd);

       Note that the serialized data is binary data that may contain any of
       the 256 possible byte values. On systems that make a distinction
       between binary and non-binary data, be sure that the file is opened for
       binary output.

       Serializing a set of patterns leaves the original data untouched, so
       they can still be used for matching. Their memory must eventually be
       freed in the usual way by calling pcre2_code_free(). When you have
       finished with the byte stream, it too must be freed by calling
       pcre2_serialize_free(). If this function is called with a NULL
       argument, it returns immediately without doing anything.


RE-USING PRECOMPILED PATTERNS

       In order to re-use a set of saved patterns you must first make the
       serialized byte stream available in main memory (for example, by
       reading from a file). The management of this memory block is up to the
       application. You can use the pcre2_serialize_get_number_of_codes()
       function to find out how many compiled patterns are in the serialized
       data without actually decoding the patterns:

         uint8_t *bytes = <serialized data>;
         int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);

       The pcre2_serialize_decode() function reads a byte stream and recreates
       the compiled patterns in new memory blocks, setting pointers to them in
       a vector. The first two arguments are a pointer to a suitable vector
       and its length, and the third argument points to a byte stream. The
       final argument is a pointer to a general context, which can be used to
       specify custom memory management functions for the decoded patterns.
       If this argument is NULL, malloc() and free() are used. After
       deserialization, the byte stream is no longer needed and can be
       discarded.

         pcre2_code *list_of_codes[2];
         uint8_t *bytes = <serialized data>;
         int32_t number_of_codes =
           pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);

       If the vector is not large enough for all the patterns in the byte
       stream, it is filled with those that fit, and the remainder are
       ignored.
       The yield of the function is the number of decoded patterns, or one of
       the following negative error codes:

         PCRE2_ERROR_BADDATA    second argument is zero or less
         PCRE2_ERROR_BADMAGIC   mismatch of id bytes in the data
         PCRE2_ERROR_BADMODE    mismatch of code unit size or PCRE2 version
         PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
         PCRE2_ERROR_MEMORY     memory allocation failed
         PCRE2_ERROR_NULL       first or third argument is NULL

       PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was
       compiled on a system with different endianness.

       Decoded patterns can be used for matching in the usual way, and must be
       freed by calling pcre2_code_free(). However, be aware that there is a
       potential race issue if you are using multiple patterns that were
       decoded from a single byte stream in a multithreaded application. A
       single copy of the character tables is used by all the decoded patterns
       and a reference count is used to arrange for its memory to be
       automatically freed when the last pattern is freed, but there is no
       locking on this reference count. Therefore, if you want to call
       pcre2_code_free() for these patterns in different threads, you must
       arrange your own locking, and ensure that pcre2_code_free() cannot be
       called by two threads at the same time.

       If a pattern was processed by pcre2_jit_compile() before being
       serialized, the JIT data is discarded and so is no longer available
       after a save/restore cycle. You can, however, process a restored
       pattern with pcre2_jit_compile() if you wish.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 27 June 2018
       Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------


PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

       The full syntax and semantics of the regular expressions that are
       supported by PCRE2 are described in the pcre2pattern documentation.
       This document contains a quick-reference summary of the syntax.


QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal


ESCAPED CHARACTERS

       This table applies to ASCII and Unicode environments.

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any ASCII printing character
         \e         escape (hex 1B)
         \f         form feed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \0dd       character with octal code 0dd
         \ddd       character with octal code ddd, or backreference
         \o{ddd..}  character with octal code ddd..
         \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
         \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
         \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
         \xhh       character with hex code hh
         \x{hh..}   character with hex code hh..

       Note that \0dd is always an octal code. The treatment of backslash
       followed by a non-zero digit is complicated; for details see the
       section "Non-printing characters" in the pcre2pattern documentation,
       where details of escape processing in EBCDIC environments are also
       given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
       supported in EBCDIC environments. Note that \N not followed by an
       opening curly bracket has a different meaning (see below).

       When \x is not followed by {, from zero to two hexadecimal digits are
       read, but if PCRE2_ALT_BSUX is set, \x must be followed by two
       hexadecimal digits to be recognized as a hexadecimal escape; otherwise
       it matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not
       followed by four hexadecimal digits, it matches a literal "u".


CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one code unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       \C is dangerous because it may leave the current matching point in the
       middle of a UTF-8 or UTF-16 character. The application can lock out the
       use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
       possible to build PCRE2 with the use of \C permanently disabled.

       By default, \d, \s, and \w match only ASCII characters, even in UTF-8
       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
       matching is happening, \s and \w may also match characters with code
       points in the range 128-255. If the PCRE2_UCP option is set, the
       behaviour of these escape sequences is changed to use Unicode
       properties and they match many more characters.


GENERAL CATEGORY PROPERTIES FOR \p AND \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator


PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p AND \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
                      represented by a Universal Character Name
         Xwd        Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space
       character set at release 5.18.


SCRIPT NAMES FOR \p AND \P

       Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan,
       Balinese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo,
       Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,
       Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform,
       Cypriot, Cyrillic, Deseret, Devanagari, Dogra, Duployan,
       Egyptian_Hieroglyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic,
       Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul,
       Hanifi_Rohingya, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic,
       Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,
       Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki,
       Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,
       Lydian, Mahajani, Makasar, Malayalam, Mandaic, Manichaean, Marchen,
       Masaram_Gondi, Medefaidrin, Meetei_Mayek, Mende_Kikakui,
       Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
       Multani, Myanmar, Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham,
       Ol_Chiki, Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic,
       Old_Persian, Old_Sogdian, Old_South_Arabian, Old_Turkic, Oriya, Osage,
       Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
       Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada,
       Shavian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
       Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
       Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan,
       Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.


CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE2, POSIX character set names recognize only ASCII characters by
       default, but some of them use Unicode properties if PCRE2_UCP is set.
       You can use \Q...\E inside a character class.


QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy


ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                       also after an internal newline in multiline mode
                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A          start of subject
         $           end of subject
                       also before newline at end of subject
                       also before internal newline in multiline mode
         \Z          end of subject
                       also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject


REPORTED MATCH POINT SETTING

         \K          set reported start of match

       \K is honoured in positive assertions, but ignored in negative ones.


ALTERNATION

         expr|expr|expr...


CAPTURING

         (...)           capturing group
         (?<name>...)    named capturing group (Perl)
         (?'name'...)    named capturing group (Perl)
         (?P<name>...)   named capturing group (Python)
         (?:...)         non-capturing group
         (?|...)         non-capturing group; reset group numbers for
                          capturing groups in each alternative


ATOMIC GROUPS

         (?>...)         atomic, non-capturing group


COMMENT

         (?#....)        comment (not nestable)


OPTION SETTING

       Changes of these options within a group are automatically cancelled at
       the end of the group.

         (?i)            caseless
         (?J)            allow duplicate names
         (?m)            multiline
         (?n)            no auto capture
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended: ignore white space except in classes
         (?xx)           as (?x) but also ignore space and tab in classes
         (?-...)         unset option(s)
         (?^)            unset imnsx options

       Unsetting x or xx unsets both. Several options may be set at once, and
       a mixture of setting and unsetting such as (?i-x) is allowed, but there
       may be only one hyphen.
       Setting (but not unsetting) is allowed after (?^, for example (?^in).
       An option setting may appear at the start of a non-capturing group,
       for example (?i:...).

       The following are recognized only at the very start of a pattern or
       after one of the newline or \R options with similar syntax. More than
       one of them may appear. For the first three, d is a decimal number.

         (*LIMIT_DEPTH=d) set the backtracking limit to d
         (*LIMIT_HEAP=d)  set the heap size limit to d * 1024 bytes
         (*LIMIT_MATCH=d) set the match limit to d
         (*NOTEMPTY)      set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)       disable JIT optimization
         (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
         (*UTF)          set appropriate UTF mode for the library in use
         (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
       value of the limits set by the caller of pcre2_match() or
       pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
       synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
       and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
       respectively, at compile time.


NEWLINE CONVENTION

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence
         (*NUL)          the NUL character (binary zero)


WHAT \R MATCHES

       These are recognized only at the very start of the pattern or after
       option settings with a similar syntax.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence


LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)         positive look ahead
         (?!...)         negative look ahead
         (?<=...)        positive look behind
         (?<!...)        negative look behind

       Each top-level branch of a look behind must be of a fixed length.


BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g+n            relative reference by number (PCRE2 extension)
         \g-n            relative reference by number
         \g{+n}          relative reference by number (PCRE2 extension)
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)


SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subpattern by absolute number
         (?+n)           call subpattern by relative number
         (?-n)           call subpattern by relative number
         (?&name)        call subpattern by name (Perl)
         (?P>name)       call subpattern by name (Python)
         \g<name>        call subpattern by name (Oniguruma)
         \g'name'        call subpattern by name (Oniguruma)
         \g<n>           call subpattern by absolute number (Oniguruma)
         \g'n'           call subpattern by absolute number (Oniguruma)
         \g<+n>          call subpattern by relative number (PCRE2 extension)
         \g'+n'          call subpattern by relative number (PCRE2 extension)
         \g<-n>          call subpattern by relative number (PCRE2 extension)
         \g'-n'          call subpattern by relative number (PCRE2 extension)


CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)               absolute reference condition
         (?(+n)              relative reference condition
         (?(-n)              relative reference condition
         (?(<name>)          named reference condition (Perl)
         (?('name')          named reference condition (Perl)
         (?(name)            named reference condition (PCRE2, deprecated)
         (?(R)               overall recursion condition
         (?(Rn)              specific numbered group recursion condition
         (?(R&name)          specific named group recursion condition
         (?(DEFINE)          define subpattern for reference
         (?(VERSION[>]=n.m)  test PCRE2 version
         (?(assert)          assertion condition

       Note the ambiguity of (?(R) and (?(Rn) which might be named reference
       conditions or recursion tests. Such a condition is interpreted as a
       reference condition if the relevant named group exists.


BACKTRACKING CONTROL

       All backtracking control verbs may be in the form (*VERB:NAME). For
       (*MARK) the name is mandatory, for the others it is optional. (*SKIP)
       changes its behaviour if :NAME is present. The others just set a name
       for passing back to the caller, but this is not a name that (*SKIP) can
       see. The following act as soon as they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)
         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes a
       backtrack to reach them. They all force a match failure, but they
       differ in what happens afterwards. Those that advance the
       start-of-match point do so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*SKIP)         advance to current matching position
         (*SKIP:NAME)    advance to position corresponding to an earlier
                         (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)         local failure, backtrack to next alternation

       The effect of one of these verbs in a group called as a subroutine is
       confined to the subroutine call.


CALLOUTS

         (?C)            callout (assumed number 0)
         (?Cn)           callout with numerical data n
         (?C"text")      callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
       the start and the end), and the starting delimiter { matched with the
       ending delimiter }. To encode the ending delimiter within the string,
       double it.


SEE ALSO

       pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
       pcre2(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 02 September 2018
       Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------


PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

UNICODE AND UTF SUPPORT

       When PCRE2 is built with Unicode support (which is the default), it has
       knowledge of Unicode character properties and can process text strings
       in UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
       However, by default, PCRE2 assumes that one code unit is one character.
       To process a pattern as a UTF string, where a character may require
       more than one code unit, you must call pcre2_compile() with the
       PCRE2_UTF option flag, or the pattern must start with the sequence
       (*UTF). When either of these is the case, both the pattern and any
       subject strings that are matched against it are treated as UTF strings
       instead of strings of individual one-code-unit characters. There are
       also some other changes to the way characters are handled, as
       documented below.

       If you do not need Unicode support you can build PCRE2 without it, in
       which case the library will be smaller.


UNICODE PROPERTY SUPPORT

       When PCRE2 is built with Unicode support, the escape sequences \p{..},
       \P{..}, and \X can be used. The Unicode properties that can be tested
       are limited to the general category properties such as Lu for an upper
       case letter or Nd for a decimal number, the Unicode script names such
       as Arabic or Han, and the derived properties Any and L&. Full lists are
       given in the pcre2pattern and pcre2syntax documentation. Only the short
       names for properties are supported. For example, \p{L} matches a
       letter. Its Perl synonym, \p{Letter}, is not supported. Furthermore, in
       Perl, many properties may optionally be prefixed by "Is", for
       compatibility with Perl 5.6. PCRE2 does not support this.


WIDE CHARACTERS AND UTF MODES

       Code points less than 256 can be specified in patterns by either braced
       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
       Larger values have to use braced sequences. Unbraced octal code points
       up to \777 are also recognized; larger ones can be coded using \o{...}.

       The escape sequence \N{U+<hex digits>} is recognized as another way of
       specifying a Unicode character by code point in a UTF mode.
       It is not allowed in non-UTF modes.

       In UTF modes, repeat quantifiers apply to complete UTF characters, not
       to individual code units.

       In UTF modes, the dot metacharacter matches one UTF character instead
       of a single code unit.

       The escape sequence \C can be used to match a single code unit in a UTF
       mode, but its use can lead to some strange effects because it breaks up
       multi-unit characters (see the description of \C in the pcre2pattern
       documentation).

       The use of \C is not supported by the alternative matching function
       pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a
       character may consist of more than one code unit. The use of \C in
       these modes provokes a match-time error. Also, the JIT optimization
       does not support \C in these modes. If JIT optimization is requested
       for a UTF-8 or UTF-16 pattern that contains \C, it will not succeed,
       and so when pcre2_match() is called, the matching will be carried out
       by the normal interpretive function.

       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
       characters of any code value, but, by default, the characters that
       PCRE2 recognizes as digits, spaces, or word characters remain the same
       set as in non-UTF mode, all with code points less than 256. This
       remains true even when PCRE2 is built to include Unicode support,
       because to do otherwise would slow down matching in many common cases.
       Note that this also applies to \b and \B, because they are defined in
       terms of \w and \W. If you want to test for a wider sense of, say,
       "digit", you can use explicit Unicode property tests such as \p{Nd}.
       Alternatively, if you set the PCRE2_UCP option, the way that the
       character escapes work is changed so that Unicode properties are used
       to determine which characters match.
       There are more details in the section on generic character types in
       the pcre2pattern documentation.

       Similarly, characters that match the POSIX named character classes are
       all low-valued characters, unless the PCRE2_UCP option is set.

       However, the special horizontal and vertical white space matching
       escapes (\h, \H, \v, and \V) do match all the appropriate Unicode
       characters, whether or not PCRE2_UCP is set.


CASE-EQUIVALENCE IN UTF MODES

       Case-insensitive matching in a UTF mode makes use of Unicode properties
       except for characters whose code points are less than 128 and that have
       at most two case-equivalent values. For these, a direct table lookup is
       used for speed. A few Unicode characters such as Greek sigma have more
       than two code points that are case-equivalent, and these are treated as
       such.


VALIDITY OF UTF STRINGS

       When the PCRE2_UTF option is set, the strings passed as patterns and
       subjects are (by default) checked for validity on entry to the relevant
       functions. If an invalid UTF string is passed, a negative error code
       is returned. The code unit offset to the offending character can be
       extracted from the match data block by calling pcre2_get_startchar(),
       which is used for this purpose after a UTF error.

       UTF-16 and UTF-32 strings can indicate their endianness by a special
       code known as a byte-order mark (BOM). The PCRE2 functions do not
       handle this, expecting strings to be in host byte order.

       A UTF string is checked before any other processing takes place.
       In the case of pcre2_match() and pcre2_dfa_match() calls with a
       non-zero starting offset, the check is applied only to that part of
       the subject that could be inspected during matching, and there is a
       check that the starting offset points to the first code unit of a
       character or to the end of the subject. If there are no lookbehind
       assertions in the pattern, the check starts at the starting offset.
       Otherwise, it starts at the length of the longest lookbehind before
       the starting offset, or at the start of the subject if there are not
       that many characters before the starting offset. Note that the
       sequences \b and \B are one-character lookbehinds.

       In addition to checking the format of the string, there is a check to
       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
       the surrogate area. The so-called "non-character" code points are not
       excluded because Unicode corrigendum #9 makes it clear that they should
       not be.

       Characters in the "Surrogate Area" of Unicode are reserved for use by
       UTF-16, where they are used in pairs to encode code points with values
       greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
       are available independently in the UTF-8 and UTF-32 encodings. (In
       other words, the whole surrogate thing is a fudge for UTF-16 which
       unfortunately messes up UTF-8 and UTF-32.)

       In some situations, you may already know that your strings are valid,
       and therefore want to skip these checks in order to improve
       performance, for example in the case of a long subject string that is
       being scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at
       compile time or at match time, PCRE2 assumes that the pattern or
       subject it is given (respectively) contains only valid UTF code unit
       sequences.
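
       The starting-offset rules described earlier in this section can be
       sketched in C for the 8-bit library. This is only an illustrative
       sketch, not PCRE2's own code: the function names
       offset_is_char_start() and check_start_offset() are hypothetical, and
       the second function assumes that the bytes before the offset are
       well-formed UTF-8. The first tests whether an offset is the end of the
       subject or the first code unit of a character; the second backs up a
       given number of characters (the length of the longest lookbehind) to
       find where a validity check would begin.

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: returns 1 if offset is the end of the subject or
   the first code unit of a character, and 0 otherwise. */
int offset_is_char_start(const unsigned char *subject, size_t length,
                         size_t offset)
{
    assert(subject != NULL);
    if (offset > length) return 0;      /* beyond the subject */
    if (offset == length) return 1;     /* end of subject is allowed */
    return !is_continuation(subject[offset]);
}

/* Hypothetical helper: returns the byte offset at which a validity check
   would start, found by stepping back max_lookbehind characters from
   offset, or stopping at the start of the subject if there are fewer.
   Assumes offset is a character boundary. */
size_t check_start_offset(const unsigned char *subject, size_t offset,
                          size_t max_lookbehind)
{
    assert(subject != NULL);
    while (max_lookbehind > 0 && offset > 0) {
        offset--;                       /* step onto the previous byte */
        while (offset > 0 && is_continuation(subject[offset]))
            offset--;                   /* skip back to the lead byte */
        max_lookbehind--;
    }
    return offset;
}
```

       For the four-byte subject 0x61 0xc3 0xa9 0x7a ("a", U+00E9, "z"),
       offset 2 is rejected because it falls inside the two-byte character,
       and a one-character lookbehind from offset 3 moves the start of the
       check back to offset 1.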

       Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
       for the pattern; it does not also apply to subject strings. If you want
       to disable the check for a subject string you must pass this option to
       pcre2_match() or pcre2_dfa_match().

       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
       result is undefined and your program may crash or loop indefinitely.

       Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable
       the error that is given if an escape sequence for an invalid Unicode
       code point is encountered in the pattern. If you want to allow escape
       sequences such as \x{d800} (a surrogate code point) you can set the
       PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is
       possible only in UTF-8 and UTF-32 modes, because these values are not
       representable in UTF-16.

   Errors in UTF-8 strings

       The following negative error codes are given for invalid UTF-8 strings:

         PCRE2_ERROR_UTF8_ERR1
         PCRE2_ERROR_UTF8_ERR2
         PCRE2_ERROR_UTF8_ERR3
         PCRE2_ERROR_UTF8_ERR4
         PCRE2_ERROR_UTF8_ERR5

       The string ends with a truncated UTF-8 character; the code specifies
       how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
       characters to be no longer than 4 bytes, the encoding scheme
       (originally defined by RFC 2279) allows for up to 6 bytes, and this is
       checked first; hence the possibility of 4 or 5 missing bytes.

         PCRE2_ERROR_UTF8_ERR6
         PCRE2_ERROR_UTF8_ERR7
         PCRE2_ERROR_UTF8_ERR8
         PCRE2_ERROR_UTF8_ERR9
         PCRE2_ERROR_UTF8_ERR10

       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
       the character do not have the binary value 0b10 (that is, either the
       most significant bit is 0, or the next bit is 1).

         PCRE2_ERROR_UTF8_ERR11
         PCRE2_ERROR_UTF8_ERR12

       A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
       long; these code points are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR13

       A 4-byte character has a value greater than 0x10ffff; these code points
       are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR14

       A 3-byte character has a value in the range 0xd800 to 0xdfff; this
       range of code points is reserved by RFC 3629 for use with UTF-16, and
       so is excluded from UTF-8.

         PCRE2_ERROR_UTF8_ERR15
         PCRE2_ERROR_UTF8_ERR16
         PCRE2_ERROR_UTF8_ERR17
         PCRE2_ERROR_UTF8_ERR18
         PCRE2_ERROR_UTF8_ERR19

       A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
       for a value that can be represented by fewer bytes, which is invalid.
       For example, the two bytes 0xc0, 0xae give the value 0x2e, whose
       correct coding uses just one byte.

         PCRE2_ERROR_UTF8_ERR20

       The two most significant bits of the first byte of a character have the
       binary value 0b10 (that is, the most significant bit is 1 and the
       second is 0). Such a byte can only validly occur as the second or
       subsequent byte of a multi-byte character.

         PCRE2_ERROR_UTF8_ERR21

       The first byte of a character has the value 0xfe or 0xff. These values
       can never occur in a valid UTF-8 string.
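
       The conditions above can be summarized in a short C sketch. It is not
       the library's validation code, and the order in which the tests are
       applied is an assumption made for illustration. The hypothetical
       function utf8_error_number() examines the single character at the
       start of a buffer and returns 0 if it is acceptable, or the number n
       of the corresponding PCRE2_ERROR_UTF8_ERRn condition; it does not
       return the library's actual negative error values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not library code: classifies the UTF-8 character
   at the start of s, where len (> 0) bytes remain. Returns 0 if valid,
   otherwise n for the condition documented as PCRE2_ERROR_UTF8_ERRn. */
int utf8_error_number(const unsigned char *s, size_t len)
{
    assert(s != NULL && len > 0);
    unsigned char c = s[0];

    if (c < 0x80) return 0;                 /* one-byte character */
    if ((c & 0xC0) == 0x80) return 20;      /* isolated continuation byte */
    if (c >= 0xFE) return 21;               /* 0xfe and 0xff never occur */

    /* Additional bytes announced by the lead byte (1 to 5, RFC 2279). */
    size_t extra = (c < 0xE0) ? 1 : (c < 0xF0) ? 2 : (c < 0xF8) ? 3
                 : (c < 0xFC) ? 4 : 5;

    /* Truncated character: ERR1 to ERR5 give the count of missing bytes. */
    if (len - 1 < extra) return (int)(extra - (len - 1));

    /* The 2nd to 6th bytes must be 10xxxxxx: ERR6 to ERR10. */
    for (size_t i = 1; i <= extra; i++)
        if ((s[i] & 0xC0) != 0x80) return (int)(5 + i);

    unsigned char d = s[1];
    switch (extra) {
    case 1:
        if ((c & 0x3E) == 0) return 15;               /* overlong 2-byte */
        break;
    case 2:
        if (c == 0xE0 && (d & 0x20) == 0) return 16;  /* overlong 3-byte */
        if (c == 0xED && d >= 0xA0) return 14;        /* 0xd800 to 0xdfff */
        break;
    case 3:
        if (c == 0xF0 && (d & 0x30) == 0) return 17;  /* overlong 4-byte */
        if (c > 0xF4 || (c == 0xF4 && d > 0x8F))
            return 13;                                /* above 0x10ffff */
        break;
    case 4:
        if (c == 0xF8 && (d & 0x38) == 0) return 18;  /* overlong 5-byte */
        return 11;                                    /* 5-byte form */
    default:
        if (c == 0xFC && (d & 0x3C) == 0) return 19;  /* overlong 6-byte */
        return 12;                                    /* 6-byte form */
    }
    return 0;
}
```

       With this sketch, the two bytes 0xc0, 0xae from the overlong example
       yield 15, and the three bytes 0xed, 0xa0, 0x80 (an encoding of the
       surrogate U+D800) yield 14.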

   Errors in UTF-16 strings

       The following negative error codes are given for invalid UTF-16
       strings:

         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate


   Errors in UTF-32 strings

       The following negative error codes are given for invalid UTF-32
       strings:

         PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 02 September 2018
       Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------

