PCRE2TEST(1)                General Commands Manual               PCRE2TEST(1)



NAME
       pcre2test - a program for testing Perl-compatible regular expressions.

SYNOPSIS

       pcre2test [options] [input file [output file]]

       pcre2test is a test program for the PCRE2 regular expression libraries,
       but it can also be used for experimenting with regular expressions.
       This document describes the features of the test program; for details
       of the regular expressions themselves, see the pcre2pattern
       documentation. For details of the PCRE2 library function calls and
       their options, see the pcre2api documentation.

       The input for pcre2test is a sequence of regular expression patterns
       and subject strings to be matched. There are also command lines for
       setting defaults and controlling some special actions. The output shows
       the result of each match attempt. Modifiers on external or internal
       command lines, the patterns, and the subject lines specify PCRE2
       function options, control how the subject is processed, and what output
       is produced.

       There are many obscure modifiers, some of which are specifically
       designed for use in conjunction with the test script and data files
       that are distributed as part of PCRE2. All the modifiers are documented
       here, some without much justification, but many of them are unlikely to
       be of use except when testing the libraries.


PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES

       Different versions of the PCRE2 library can be built to support
       character strings that are encoded in 8-bit, 16-bit, or 32-bit code
       units. One, two, or all three of these libraries may be simultaneously
       installed. The pcre2test program can be used to test all the libraries.
       However, its own input and output are always in 8-bit format. When
       testing the 16-bit or 32-bit libraries, patterns and subject strings
       are converted to 16-bit or 32-bit format before being passed to the
       library functions.
       Results are converted back to 8-bit code units for output.

       In the rest of this document, the names of library functions and
       structures are given in generic form, for example, pcre2_compile(). The
       actual names used in the libraries have a suffix _8, _16, or _32, as
       appropriate.


INPUT ENCODING

       Input to pcre2test is processed line by line, either by calling the C
       library's fgets() function, or via the libreadline or libedit library.
       In some Windows environments character 26 (hex 1A) causes an immediate
       end of file, and no further data is read, so this character should be
       avoided unless you really want that action.

       The input is processed using C's string functions, so must not contain
       binary zeros, even though in Unix-like environments, fgets() treats any
       bytes other than newline as data characters. An error is generated if a
       binary zero is encountered. By default subject lines are processed for
       backslash escapes, which makes it possible to include any data value in
       strings that are passed to the library for matching. For patterns,
       there is a facility for specifying some or all of the 8-bit input
       characters as hexadecimal pairs, which makes it possible to include
       binary zeros.

   Input for the 16-bit and 32-bit libraries

       When testing the 16-bit or 32-bit libraries, there is a need to be able
       to generate character code points greater than 255 in the strings that
       are passed to the library. For subject lines, backslash escapes can be
       used. In addition, when the utf modifier (see "Setting compilation
       options" below) is set, the pattern and any following subject lines are
       interpreted as UTF-8 strings and translated to UTF-16 or UTF-32 as
       appropriate.

       For non-UTF testing of wide characters, the utf8_input modifier can be
       used. This is mutually exclusive with utf, and is allowed only in
       16-bit or 32-bit mode.
       It causes the pattern and following subject lines to be treated as
       UTF-8 according to the original definition (RFC 2279), which allows for
       character values up to 0x7fffffff. Each character is placed in one
       16-bit or 32-bit code unit (in the 16-bit case, values greater than
       0xffff cause an error to occur).

       UTF-8 (in its original definition) is not capable of encoding values
       greater than 0x7fffffff, but such values can be handled by the 32-bit
       library. When testing this library in non-UTF mode with utf8_input set,
       if any character is preceded by the byte 0xff (which is an invalid byte
       in UTF-8) 0x80000000 is added to the character's value. This is the
       only way of passing such code points in a pattern string. For subject
       strings, using an escape sequence is preferable.


COMMAND LINE OPTIONS

       -8        If the 8-bit library has been built, this option causes it to
                 be used (this is the default). If the 8-bit library has not
                 been built, this option causes an error.

       -16       If the 16-bit library has been built, this option causes it
                 to be used. If only the 16-bit library has been built, this
                 is the default. If the 16-bit library has not been built,
                 this option causes an error.

       -32       If the 32-bit library has been built, this option causes it
                 to be used. If only the 32-bit library has been built, this
                 is the default. If the 32-bit library has not been built,
                 this option causes an error.

       -ac       Behave as if each pattern has the auto_callout modifier, that
                 is, insert automatic callouts into every pattern that is
                 compiled.

       -AC       As for -ac, but in addition behave as if each subject line
                 has the callout_extra modifier, that is, show additional
                 information from callouts.

       -b        Behave as if each pattern has the fullbincode modifier; the
                 full internal binary form of the pattern is output after
                 compilation.

       -C        Output the version number of the PCRE2 library, and all
                 available information about the optional features that are
                 included, and then exit with zero exit code. All other
                 options are ignored. If both -C and -LM are present,
                 whichever is first is recognized.

       -C option Output information about a specific build-time option, then
                 exit. This functionality is intended for use in scripts such
                 as RunTest. The following options output the value and set
                 the exit code as indicated:

                   ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
                                0x15 or 0x25
                                0 if used in an ASCII environment
                                exit code is always 0
                   linksize   the configured internal link size (2, 3, or 4)
                                exit code is set to the link size
                   newline    the default newline setting:
                                CR, LF, CRLF, ANYCRLF, ANY, or NUL
                                exit code is always 0
                   bsr        the default setting for what \R matches:
                                ANYCRLF or ANY
                                exit code is always 0

                 The following options output 1 for true or 0 for false, and
                 set the exit code to the same value:

                   backslash-C  \C is supported (not locked out)
                   ebcdic       compiled for an EBCDIC environment
                   jit          just-in-time support is available
                   pcre2-16     the 16-bit library was built
                   pcre2-32     the 32-bit library was built
                   pcre2-8      the 8-bit library was built
                   unicode      Unicode support is available

                 If an unknown option is given, an error message is output;
                 the exit code is 0.

       -d        Behave as if each pattern has the debug modifier; the
                 internal form and information about the compiled pattern is
                 output after compilation; -d is equivalent to -b -i.

       -dfa      Behave as if each subject line has the dfa modifier; matching
                 is done using the pcre2_dfa_match() function instead of the
                 default pcre2_match().

       -error number[,number,...]
                 Call pcre2_get_error_message() for each of the error numbers
                 in the comma-separated list, display the resulting messages
                 on the standard output, then exit with zero exit code. The
                 numbers may be positive or negative. This is a convenience
                 facility for PCRE2 maintainers.

       -help     Output a brief summary of these options and then exit.

       -i        Behave as if each pattern has the info modifier; information
                 about the compiled pattern is given after compilation.

       -jit      Behave as if each pattern line has the jit modifier; after
                 successful compilation, each pattern is passed to the
                 just-in-time compiler, if available.

       -jitfast  Behave as if each pattern line has the jitfast modifier;
                 after successful compilation, each pattern is passed to the
                 just-in-time compiler, if available, and each subject line is
                 passed directly to the JIT matcher via its "fast path".

       -jitverify
                 Behave as if each pattern line has the jitverify modifier;
                 after successful compilation, each pattern is passed to the
                 just-in-time compiler, if available, and the use of JIT for
                 matching is verified.

       -LM       List modifiers: write a list of available pattern and subject
                 modifiers to the standard output, then exit with zero exit
                 code. All other options are ignored. If both -C and -LM are
                 present, whichever is first is recognized.

       -pattern modifier-list
                 Behave as if each pattern line contains the given modifiers.

       -q        Do not output the version number of pcre2test at the start of
                 execution.

       -S size   On Unix-like systems, set the size of the run-time stack to
                 size mebibytes (units of 1024*1024 bytes).

       -subject modifier-list
                 Behave as if each subject line contains the given modifiers.

       -t        Run each compile and match many times with a timer, and
                 output the resulting times per compile or match.
                 When JIT is used, separate times are given for the initial
                 compile and the JIT compile. You can control the number of
                 iterations that are used for timing by following -t with a
                 number (as a separate item on the command line). For example,
                 "-t 1000" iterates 1000 times. The default is to iterate
                 500,000 times.

       -tm       This is like -t except that it times only the matching phase,
                 not the compile phase.

       -T -TM    These behave like -t and -tm, but in addition, at the end of
                 a run, the total times for all compiles and matches are
                 output.

       -version  Output the PCRE2 version number and then exit.


DESCRIPTION

       If pcre2test is given two filename arguments, it reads from the first
       and writes to the second. If the first name is "-", input is taken from
       the standard input. If pcre2test is given only one argument, it reads
       from that file and writes to stdout. Otherwise, it reads from stdin and
       writes to stdout.

       When pcre2test is built, a configuration option can specify that it
       should be linked with the libreadline or libedit library. When this is
       done, if the input is from a terminal, it is read using the readline()
       function. This provides line-editing and history facilities. The output
       from the -help option states whether or not readline() will be used.

       The program handles any number of tests, each of which consists of a
       set of input lines. Each set starts with a regular expression pattern,
       followed by any number of subject lines to be matched against that
       pattern. In between sets of test data, command lines that begin with #
       may appear. This file format, with some restrictions, can also be
       processed by the perltest.sh script that is distributed with PCRE2 as a
       means of checking that the behaviour of PCRE2 and Perl is the same. For
       a specification of perltest.sh, see the comments near its beginning.
       See also the #perltest command below.

       When the input is a terminal, pcre2test prompts for each line of input,
       using "re>" to prompt for regular expression patterns, and "data>" to
       prompt for subject lines. Command lines starting with # can be entered
       only in response to the "re>" prompt.

       Each subject line is matched separately and independently. If you want
       to do multi-line matches, you have to use the \n escape sequence (or \r
       or \r\n, etc., depending on the newline setting) in a single line of
       input to encode the newline sequences. There is no limit on the length
       of subject lines; the input buffer is automatically extended if it is
       too small. There are replication features that make it possible to
       generate long repetitive pattern or subject lines without having to
       supply them explicitly.

       An empty line or the end of the file signals the end of the subject
       lines for a test, at which point a new pattern or command line is
       expected if there is still input to be read.


COMMAND LINES

       In between sets of test data, a line that begins with # is interpreted
       as a command line. If the first character is followed by white space or
       an exclamation mark, the line is treated as a comment, and ignored.
       Otherwise, the following commands are recognized:

         #forbid_utf

       Subsequent patterns automatically have the PCRE2_NEVER_UTF and
       PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
       and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of
       patterns. This command also forces an error if a subsequent pattern
       contains any occurrences of \P, \p, or \X, which are still supported
       when PCRE2_UTF is not set, but which require Unicode property support
       to be included in the library.

       This is a trigger guard that is used in test files to ensure that UTF
       or Unicode property tests are not accidentally added to files that are
       used when Unicode support is not included in the library. Setting
       PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
       by the use of #pattern; the difference is that #forbid_utf cannot be
       unset, and the automatic options are not displayed in pattern
       information, to avoid cluttering up test output.

         #load <filename>

       This command is used to load a set of precompiled patterns from a file,
       as described in the section entitled "Saving and restoring compiled
       patterns" below.

         #loadtables <filename>

       This command is used to load a set of binary character tables that can
       be accessed by the tables=3 qualifier. Such tables can be created by
       the pcre2_dftables program with the -b option.

         #newline_default [<newline-list>]

       When PCRE2 is built, a default newline convention can be specified.
       This determines which characters and/or character pairs are recognized
       as indicating a newline in a pattern or subject string. The default can
       be overridden when a pattern is compiled. The standard test files
       contain tests of various newline conventions, but the majority of the
       tests expect a single linefeed to be recognized as a newline by
       default. Without special action the tests would fail when PCRE2 is
       compiled with either CR or CRLF as the default newline.

       The #newline_default command specifies a list of newline types that are
       acceptable as the default. The types must be one of CR, LF, CRLF,
       ANYCRLF, ANY, or NUL (in upper or lower case), for example:

         #newline_default LF Any anyCRLF

       If the default newline is in the list, this command has no effect.
       Otherwise, except when testing the POSIX API, a newline modifier that
       specifies the first newline convention in the list (LF in the above
       example) is added to any pattern that does not already have a newline
       modifier. If the newline list is empty, the feature is turned off. This
       command is present in a number of the standard test input files.

       When the POSIX API is being tested there is no way to override the
       default newline convention, though it is possible to set the newline
       convention from within the pattern. A warning is given if the posix or
       posix_nosub modifier is used when #newline_default would set a default
       for the non-POSIX API.

         #pattern <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent patterns. Modifiers on a pattern can change these settings.

         #perltest

       This line is used in test files that can also be processed by
       perltest.sh to confirm that Perl gives the same results as PCRE2.
       Subsequent tests are checked for the use of pcre2test features that are
       incompatible with the perltest.sh script.

       Patterns must use '/' as their delimiter, and only certain modifiers
       are supported. Comment lines, #pattern commands, and #subject commands
       that set or unset "mark" are recognized and acted on. The #perltest,
       #forbid_utf, and #newline_default commands, which are needed in the
       relevant pcre2test files, are silently ignored. All other command lines
       are ignored, but give a warning message. The #perltest command helps
       detect tests that are accidentally put in the wrong file or use the
       wrong delimiter. For more details of the perltest.sh script see the
       comments it contains.

         #pop [<modifiers>]
         #popcopy [<modifiers>]

       These commands are used to manipulate the stack of compiled patterns,
       as described in the section entitled "Saving and restoring compiled
       patterns" below.

         #save <filename>

       This command is used to save a set of compiled patterns to a file, as
       described in the section entitled "Saving and restoring compiled
       patterns" below.

         #subject <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent subject lines. Modifiers on a subject line can change these
       settings.


MODIFIER SYNTAX

       Modifier lists are used with both pattern and subject lines. Items in a
       list are separated by commas followed by optional white space. Trailing
       whitespace in a modifier list is ignored. Some modifiers may be given
       for both patterns and subject lines, whereas others are valid only for
       one or the other. Each modifier has a long name, for example
       "anchored", and some of them must be followed by an equals sign and a
       value, for example, "offset=12". Values cannot contain comma
       characters, but may contain spaces. Modifiers that do not take values
       may be preceded by a minus sign to turn off a previous setting.

       A few of the more common modifiers can also be specified as single
       letters, for example "i" for "caseless". In documentation, following
       the Perl convention, these are written with a slash ("the /i modifier")
       for clarity. Abbreviated modifiers must all be concatenated in the
       first item of a modifier list. If the first item is not recognized as a
       long modifier name, it is interpreted as a sequence of these
       abbreviations. For example:

         /abc/ig,newline=cr,jit=3

       This is a pattern line whose modifier list starts with two one-letter
       modifiers (/i and /g). The lower-case abbreviated modifiers are the
       same as used in Perl.
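
       The modifier list rules above can be illustrated with a small Python
       sketch. It is not how pcre2test itself is implemented: the function
       name parse_modifier_list and the two lookup tables are hypothetical,
       the tables hold only a handful of the names mentioned in this document,
       and the special handling of a repeated /x is not modelled.

```python
# Illustrative subsets only; pcre2test recognizes many more names.
LONG_NAMES = {"anchored", "caseless", "global", "info", "jit",
              "newline", "offset", "utf"}
ABBREVIATIONS = {"i": "caseless", "g": "global", "m": "multiline",
                 "s": "dotall", "x": "extended", "I": "info"}


def parse_modifier_list(text):
    """Split a modifier list into (name, value) pairs.

    value is True (set), False (unset with a leading minus), or the
    string that followed the equals sign.
    """
    # Trailing white space is ignored; items are separated by commas
    # that may be followed by white space.
    items = [item.lstrip() for item in text.rstrip().split(",")]
    result = []
    for index, item in enumerate(items):
        name, sep, value = item.partition("=")
        negated = name.startswith("-")
        bare = name[1:] if negated else name
        if bare in LONG_NAMES:
            if sep and negated:
                raise ValueError(f"cannot negate a modifier with a value: {item!r}")
            result.append((bare, value if sep else not negated))
        elif index == 0 and not sep and all(ch in ABBREVIATIONS for ch in item):
            # Only the first item may be a run of one-letter abbreviations.
            result.extend((ABBREVIATIONS[ch], True) for ch in item)
        else:
            raise ValueError(f"unrecognized modifier: {item!r}")
    return result


print(parse_modifier_list("ig,newline=cr,jit=3"))
```

       The first item "ig" is not a long name, so it is read as the two
       abbreviations /i and /g; the remaining items are long names with
       values.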


PATTERN SYNTAX

       A pattern line must start with one of the following characters (common
       symbols, excluding pattern meta-characters):

         / ! " ' ` - = _ : ; , % & @ ~

       This is interpreted as the pattern's delimiter. A regular expression
       may be continued over several input lines, in which case the newline
       characters are included within it. It is possible to include the
       delimiter as a literal within the pattern by escaping it with a
       backslash, for example

         /abc\/def/

       If you do this, the escape and the delimiter form part of the pattern,
       but since the delimiters are all non-alphanumeric, the inclusion of the
       backslash does not affect the pattern's interpretation. Note, however,
       that this trick does not work within \Q...\E literal bracketing because
       the backslash will itself be interpreted as a literal. If the
       terminating delimiter is immediately followed by a backslash, for
       example,

         /abc/\

       then a backslash is added to the end of the pattern. This is done to
       provide a way of testing the error condition that arises if a pattern
       finishes with a backslash, because

         /abc\/

       is interpreted as the first line of a pattern that starts with "abc/",
       causing pcre2test to read the next line as a continuation of the
       regular expression.

       A pattern can be followed by a modifier list (details below).


SUBJECT LINE SYNTAX

       Before each subject line is passed to pcre2_match(), pcre2_dfa_match(),
       or pcre2_jit_match(), leading and trailing white space is removed, and
       the line is scanned for backslash escapes, unless the subject_literal
       modifier was set for the pattern.
       The following provide a means of encoding non-printing characters in a
       visible way:

         \a         alarm (BEL, \x07)
         \b         backspace (\x08)
         \e         escape (\x1b)
         \f         form feed (\x0c)
         \n         newline (\x0a)
         \r         carriage return (\x0d)
         \t         tab (\x09)
         \v         vertical tab (\x0b)
         \nnn       octal character (up to 3 octal digits); always
                      a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
         \o{dd...}  octal character (any number of octal digits)
         \xhh       hexadecimal byte (up to 2 hex digits)
         \x{hh...}  hexadecimal character (any number of hex digits)

       The use of \x{hh...} is not dependent on the use of the utf modifier on
       the pattern. It is recognized always. There may be any number of
       hexadecimal digits inside the braces; invalid values provoke error
       messages.

       Note that \xhh specifies one byte rather than one character in UTF-8
       mode; this makes it possible to construct invalid UTF-8 sequences for
       testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
       character in UTF-8 mode, generating more than one byte if the value is
       greater than 127. When testing the 8-bit library not in UTF-8 mode,
       \x{hh} generates one byte for values less than 256, and causes an error
       for greater values.

       In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
       possible to construct invalid UTF-16 sequences for testing purposes.

       In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
       makes it possible to construct invalid UTF-32 sequences for testing
       purposes.

       There is a special backslash sequence that specifies replication of one
       or more characters:

         \[<characters>]{<count>}

       This makes it possible to test long strings without having to provide
       them as part of the file. For example:

         \[abc]{4}

       is converted to "abcabcabcabc". This feature does not support nesting.
       To include a closing square bracket in the characters, code it as \x5D.

       A backslash followed by an equals sign marks the end of the subject
       string and the start of a modifier list. For example:

         abc\=notbol,notempty

       If the subject string is empty and \= is followed by whitespace, the
       line is treated as a comment line, and is not used for matching. For
       example:

         \= This is a comment.
         abc\= This is an invalid modifier list.

       A backslash followed by any other non-alphanumeric character just
       escapes that character. A backslash followed by anything else causes an
       error. However, if the very last character in the line is a backslash
       (and there is no modifier list), it is ignored. This gives a way of
       passing an empty line as data, since a real empty line terminates the
       data input.

       If the subject_literal modifier is set for a pattern, all subject lines
       that follow are treated as literals, with no special treatment of
       backslashes. No replication is possible, and any subject modifiers must
       be set as defaults by a #subject command.


PATTERN MODIFIERS

       There are several types of modifier that can appear in pattern lines.
       Except where noted below, they may also be used in #pattern commands. A
       pattern's modifier list can add to or override default modifiers that
       were set by a previous #pattern command.

   Setting compilation options

       The following modifiers set options for pcre2_compile(). Most of them
       set bits in the options argument of that function, but those whose
       names start with PCRE2_EXTRA are additional options that are set in the
       compile context. For the main options, there are some single-letter
       abbreviations that are the same as Perl options. There is special
       handling for /x: if a second x is present, PCRE2_EXTENDED is converted
       into PCRE2_EXTENDED_MORE as in Perl.
       A third appearance adds PCRE2_EXTENDED as well, though this makes no
       difference to the way pcre2_compile() behaves. See pcre2api for a
       description of the effects of these options.

             allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
             allow_lookaround_bsk      set PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
             allow_surrogate_escapes   set PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
             alt_bsux                  set PCRE2_ALT_BSUX
             alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
             alt_verbnames             set PCRE2_ALT_VERBNAMES
             anchored                  set PCRE2_ANCHORED
             auto_callout              set PCRE2_AUTO_CALLOUT
             bad_escape_is_literal     set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
         /i  caseless                  set PCRE2_CASELESS
             dollar_endonly            set PCRE2_DOLLAR_ENDONLY
         /s  dotall                    set PCRE2_DOTALL
             dupnames                  set PCRE2_DUPNAMES
             endanchored               set PCRE2_ENDANCHORED
             escaped_cr_is_lf          set PCRE2_EXTRA_ESCAPED_CR_IS_LF
         /x  extended                  set PCRE2_EXTENDED
         /xx extended_more             set PCRE2_EXTENDED_MORE
             extra_alt_bsux            set PCRE2_EXTRA_ALT_BSUX
             firstline                 set PCRE2_FIRSTLINE
             literal                   set PCRE2_LITERAL
             match_line                set PCRE2_EXTRA_MATCH_LINE
             match_invalid_utf         set PCRE2_MATCH_INVALID_UTF
             match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
             match_word                set PCRE2_EXTRA_MATCH_WORD
         /m  multiline                 set PCRE2_MULTILINE
             never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
             never_ucp                 set PCRE2_NEVER_UCP
             never_utf                 set PCRE2_NEVER_UTF
         /n  no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
             no_auto_possess           set PCRE2_NO_AUTO_POSSESS
             no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
             no_start_optimize         set PCRE2_NO_START_OPTIMIZE
             no_utf_check              set PCRE2_NO_UTF_CHECK
             ucp                       set PCRE2_UCP
             ungreedy                  set PCRE2_UNGREEDY
             use_offset_limit          set PCRE2_USE_OFFSET_LIMIT
             utf                       set PCRE2_UTF

       As well as turning on the PCRE2_UTF option, the utf modifier causes all
       non-printing characters in output strings to be printed using the
       \x{hh...} notation.
       Otherwise, those less than 0x100 are output in hex without the curly
       brackets. Setting utf in 16-bit or 32-bit mode also causes pattern and
       subject strings to be translated to UTF-16 or UTF-32, respectively,
       before being passed to library functions.

   Setting compilation controls

       The following modifiers affect the compilation process or request
       information about the pattern. There are single-letter abbreviations
       for some that are heavily used in the test files.

             bsr=[anycrlf|unicode]     specify \R handling
         /B  bincode                   show binary code without lengths
             callout_info              show callout information
             convert=<options>         request foreign pattern conversion
             convert_glob_escape=c     set glob escape character
             convert_glob_separator=c  set glob separator character
             convert_length            set convert buffer length
             debug                     same as info,fullbincode
             framesize                 show matching frame size
             fullbincode               show binary code with lengths
         /I  info                      show info about compiled pattern
             hex                       unquoted characters are hexadecimal
             jit[=<number>]            use JIT
             jitfast                   use JIT fast path
             jitverify                 verify JIT use
             locale=<name>             use this locale
             max_pattern_length=<n>    set the maximum pattern length
             memory                    show memory used
             newline=<type>            set newline type
             null_context              compile with a NULL context
             parens_nest_limit=<n>     set maximum parentheses depth
             posix                     use the POSIX API
             posix_nosub               use the POSIX API with REG_NOSUB
             push                      push compiled pattern onto the stack
             pushcopy                  push a copy onto the stack
             stackguard=<number>       test the stackguard feature
             subject_literal           treat all subject lines as literal
             tables=[0|1|2|3]          select internal tables
             use_length                do not zero-terminate the pattern
             utf8_input                treat input as UTF-8

       The effects of these modifiers are described in the following sections.

   Newline and \R handling

       The bsr modifier specifies what \R in a pattern should match.
       If it is set to "anycrlf", \R matches CR, LF, or CRLF only. If it is
       set to "unicode", \R matches any Unicode newline sequence. The default
       can be specified when PCRE2 is built; if it is not, the default is set
       to Unicode.

       The newline modifier specifies which characters are to be interpreted
       as newlines, both in the pattern and in subject lines. The type must be
       one of CR, LF, CRLF, ANYCRLF, ANY, or NUL (in upper or lower case).

   Information about a pattern

       The debug modifier is a shorthand for info,fullbincode, requesting all
       available information.

       The bincode modifier causes a representation of the compiled code to be
       output after compilation. This information does not contain length and
       offset values, which ensures that the same output is generated for
       different internal link sizes and different code unit widths. By using
       bincode, the same regression tests can be used in different
       environments.

       The fullbincode modifier, by contrast, does include length and offset
       values. This is used in a few special tests that run only for specific
       code unit widths and link sizes, and is also useful for one-off tests.

       The info modifier requests information about the compiled pattern
       (whether it is anchored, has a fixed first character, and so on). The
       information is obtained from the pcre2_pattern_info() function.
       Here are some typical examples:

           re> /(?i)(^a|^b)/m,info
         Capture group count = 1
         Compile options: multiline
         Overall options: caseless multiline
         First code unit at start or follows newline
         Subject length lower bound = 1

           re> /(?i)abc/info
         Capture group count = 0
         Compile options: <none>
         Overall options: caseless
         First code unit = 'a' (caseless)
         Last code unit = 'c' (caseless)
         Subject length lower bound = 3

       "Compile options" are those specified by modifiers; "overall options"
       have added options that are taken or deduced from the pattern. If both
       sets of options are the same, just a single "options" line is output;
       if there are no options, the line is omitted. "First code unit" is
       where any match must start; if there is more than one they are listed
       as "starting code units". "Last code unit" is the last literal code
       unit that must be present in any match. This is not necessarily the
       last character. These lines are omitted if no starting or ending code
       units are recorded. The subject length line is omitted when
       no_start_optimize is set because the minimum length is not calculated
       when it can never be used.

       The framesize modifier shows the size, in bytes, of the storage frames
       used by pcre2_match() for handling backtracking. The size depends on
       the number of capturing parentheses in the pattern.

       The callout_info modifier requests information about all the callouts
       in the pattern. A list of them is output at the end of any other
       information that is requested. For each callout, either its number or
       string is given, followed by the item that follows it in the pattern.

   Passing a NULL context

       Normally, pcre2test passes a context block to pcre2_compile(). If the
       null_context modifier is set, however, NULL is passed.
       This is for testing that pcre2_compile() behaves correctly in this case
       (it uses default values).

   Specifying pattern characters in hexadecimal

       The hex modifier specifies that the characters of the pattern, except
       for substrings enclosed in single or double quotes, are to be
       interpreted as pairs of hexadecimal digits. This feature is provided as
       a way of creating patterns that contain binary zeros and other
       non-printing characters. White space is permitted between pairs of
       digits. For example, this pattern contains three characters:

         /ab 32 59/hex

       Parts of such a pattern are taken literally if quoted. This pattern
       contains nine characters, only two of which are specified in
       hexadecimal:

         /ab "literal" 32/hex

       Either single or double quotes may be used. There is no way of
       including the delimiter within a substring. The hex and expand
       modifiers are mutually exclusive.

   Specifying the pattern's length

       By default, patterns are passed to the compiling functions as
       zero-terminated strings. The use_length modifier causes them to be
       passed by length instead. Using a length happens automatically (whether
       or not use_length is set) when hex is set, because patterns specified
       in hexadecimal may contain binary zeros.

       If hex or use_length is used with the POSIX wrapper API (see "Using the
       POSIX wrapper API" below), the REG_PEND extension is used to pass the
       pattern's length.

   Specifying wide characters in 16-bit and 32-bit modes

       In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
       and translated to UTF-16 or UTF-32 when the utf modifier is set. For
       testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
       modifier can be used. It is mutually exclusive with utf.
       Input lines are interpreted as UTF-8 as a means of specifying wide
       characters. More details are given in "Input encoding" above.

   Generating long repetitive patterns

       Some tests use long patterns that are very repetitive. Instead of
       creating a very long input line for such a pattern, you can use a
       special repetition feature, similar to the one described for subject
       lines above. If the expand modifier is present on a pattern, parts of
       the pattern that have the form

         \[<characters>]{<count>}

       are expanded before the pattern is passed to pcre2_compile(). For
       example, \[AB]{6000} is expanded to "ABAB...", that is, "AB" repeated
       6000 times. This construction cannot be nested. An initial "\["
       sequence is recognized only if "]{" followed by decimal digits and "}"
       is found later in the pattern. If not, the characters remain in the
       pattern unaltered. The expand and hex modifiers are mutually exclusive.

       If part of an expanded pattern looks like an expansion, but is really
       part of the actual pattern, unwanted expansion can be avoided by giving
       two values in the quantifier. For example, \[AB]{6000,6000} is not
       recognized as an expansion item.

       If the info modifier is set on an expanded pattern, the result of the
       expansion is included in the information that is output.

   JIT compilation

       Just-in-time (JIT) compiling is a heavyweight optimization that can
       greatly speed up pattern matching. See the pcre2jit documentation for
       details. JIT compiling happens, optionally, after a pattern has been
       successfully compiled into an internal form. The JIT compiler converts
       this to optimized machine code. It needs to know whether the match-time
       options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used,
       because different code is generated for the different cases.
       See the partial modifier in "Subject Modifiers" below for details of
       how these options are specified for each match attempt.

       JIT compilation is requested by the jit pattern modifier, which may
       optionally be followed by an equals sign and a number in the range 0 to
       7. The three bits that make up the number specify which of the three
       JIT operating modes are to be compiled:

         1  compile JIT code for non-partial matching
         2  compile JIT code for soft partial matching
         4  compile JIT code for hard partial matching

       The possible values for the jit modifier are therefore:

         0  disable JIT
         1  normal matching only
         2  soft partial matching only
         3  normal and soft partial matching
         4  hard partial matching only
         5  normal and hard partial matching
         6  soft and hard partial matching only
         7  all three modes

       If no number is given, 7 is assumed. The phrase "partial matching"
       means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the
       PCRE2_PARTIAL_HARD option set. Note that such a call may return a
       complete match; the options enable the possibility of a partial match,
       but do not require it. Note also that if you request JIT compilation
       only for partial matching (for example, jit=2) but do not set the
       partial modifier on a subject line, that match will not use JIT code
       because none was compiled for non-partial matching.

       If JIT compilation is successful, the compiled JIT code will
       automatically be used when an appropriate type of match is run, except
       when incompatible run-time options are specified. For more details, see
       the pcre2jit documentation. See also the jitstack modifier below for a
       way of setting the size of the JIT stack.
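
       The way the three bits of the jit number combine can be sketched in a
       few lines of Python. This is only an illustrative model of the
       description above, not pcre2test or PCRE2 code; the names jit_modes and
       jit_code_available are hypothetical helpers invented for this sketch.

```python
# Bit values used by the jit=<n> modifier, as listed above.
JIT_NORMAL = 1        # non-partial matching
JIT_PARTIAL_SOFT = 2  # soft partial matching
JIT_PARTIAL_HARD = 4  # hard partial matching

MODE_NAMES = {
    JIT_NORMAL: "normal",
    JIT_PARTIAL_SOFT: "partial_soft",
    JIT_PARTIAL_HARD: "partial_hard",
}


def jit_modes(n=7):
    """Return the set of matching modes compiled for jit=<n> (default 7)."""
    if not 0 <= n <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return {name for bit, name in MODE_NAMES.items() if n & bit}


def jit_code_available(n, match_mode="normal"):
    """True if JIT code was compiled for the mode used by a match attempt."""
    return match_mode in jit_modes(n)


print(sorted(jit_modes(6)))                   # ['partial_hard', 'partial_soft']
print(jit_code_available(2))                  # False
print(jit_code_available(2, "partial_soft"))  # True
```

       The second call models the jit=2 case described above: a subject line
       without a partial modifier asks for normal matching, for which no JIT
       code was compiled.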

       If the jitfast modifier is specified, matching is done using the JIT
       "fast path" interface, pcre2_jit_match(), which skips some of the
       sanity checks that are done by pcre2_match(), and of course does not
       work when JIT is not supported. If jitfast is specified without jit,
       jit=7 is assumed.

       If the jitverify modifier is specified, information about the compiled
       pattern shows whether JIT compilation was or was not successful. If
       jitverify is specified without jit, jit=7 is assumed. If JIT
       compilation is successful when jitverify is set, the text "(JIT)" is
       added to the first output line after a match or non match when
       JIT-compiled code was actually used in the match.

   Setting a locale

       The locale modifier must specify the name of a locale, for example:

         /pattern/locale=fr_FR

       The given locale is set, pcre2_maketables() is called to build a set of
       character tables for the locale, and this is then passed to
       pcre2_compile() when compiling the regular expression. The same tables
       are used when matching the following subject lines. The locale modifier
       applies only to the pattern on which it appears, but can be given in a
       #pattern command if a default is needed. Setting a locale and alternate
       character tables are mutually exclusive.

   Showing pattern memory

       The memory modifier causes the size in bytes of the memory used to hold
       the compiled pattern to be output. This does not include the size of
       the pcre2_code block; it is just the actual compiled data. If the
       pattern is subsequently passed to the JIT compiler, the size of the JIT
       compiled code is also output. Here is an example:

           re> /a(b)c/jit,memory
         Memory allocation (code space): 21
         Memory allocation (JIT code): 1910

   Limiting nested parentheses

       The parens_nest_limit modifier sets a limit on the depth of nested
       parentheses in a pattern.
       Breaching the limit causes a compilation error. The default for the
       library is set when PCRE2 is built, but pcre2test sets its own default
       of 220, which is required for running the standard test suite.

   Limiting the pattern length

       The max_pattern_length modifier sets a limit, in code units, to the
       length of pattern that pcre2_compile() will accept. Breaching the limit
       causes a compilation error. The default is the largest number a
       PCRE2_SIZE variable can hold (essentially unlimited).

   Using the POSIX wrapper API

       The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
       the POSIX wrapper API rather than its native API. When posix_nosub is
       used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
       wrapper supports only the 8-bit library. Note that it does not imply
       POSIX matching semantics; for more detail see the pcre2posix
       documentation. The following pattern modifiers set options for the
       regcomp() function:

         caseless           REG_ICASE
         multiline          REG_NEWLINE
         dotall             REG_DOTALL     )
         ungreedy           REG_UNGREEDY   ) These options are not part of
         ucp                REG_UCP        )   the POSIX standard
         utf                REG_UTF8       )

       The regerror_buffsize modifier specifies a size for the error buffer
       that is passed to regerror() in the event of a compilation error. For
       example:

         /abc/posix,regerror_buffsize=20

       This provides a means of testing the behaviour of regerror() when the
       buffer is too small for the error message. If this modifier has not
       been set, a large buffer is used.

       The aftertext and allaftertext subject modifiers work as described
       below. All other modifiers are either ignored, with a warning message,
       or cause an error.

       The pattern is passed to regcomp() as a zero-terminated string by
       default, but if the use_length or hex modifiers are set, the REG_PEND
       extension is used to pass it by length.

   Testing the stack guard feature

       The stackguard modifier is used to test the use of
       pcre2_set_compile_recursion_guard(), a function that is provided to
       enable stack availability to be checked during compilation (see the
       pcre2api documentation for details). If the number specified by the
       modifier is greater than zero, pcre2_set_compile_recursion_guard() is
       called to set up a callback from pcre2_compile() to a local function.
       The argument it receives is the current nesting parenthesis depth; if
       this is greater than the value given by the modifier, non-zero is
       returned, causing the compilation to be aborted.

   Using alternative character tables

       The value specified for the tables modifier must be one of the digits
       0, 1, 2, or 3. It causes a specific set of built-in character tables to
       be passed to pcre2_compile(). This is used in the PCRE2 tests to check
       behaviour with different character tables. The digit specifies the
       tables as follows:

         0   do not pass any special character tables
         1   the default ASCII tables, as distributed in
               pcre2_chartables.c.dist
         2   a set of tables defining ISO 8859 characters
         3   a set of tables loaded by the #loadtables command

       In tables 2, some characters whose codes are greater than 128 are
       identified as letters, digits, spaces, etc. Tables 3 can be used only
       after a #loadtables command has loaded them from a binary file. Setting
       alternate character tables and a locale are mutually exclusive.

   Setting certain match controls

       The following modifiers are really subject modifiers, and are described
       under "Subject Modifiers" below. However, they may be included in a
       pattern's modifier list, in which case they are applied to every
       subject line that is processed with that pattern. These modifiers do
       not affect the compilation process.

           aftertext                   show text after match
           allaftertext                show text after captures
           allcaptures                 show all captures
           allvector                   show the entire ovector
           allusedtext                 show all consulted text
           altglobal                   alternative global matching
       /g  global                      global matching
           jitstack=<n>                set size of JIT stack
           mark                        show mark values
           replace=<string>            specify a replacement string
           startchar                   show starting character when relevant
           substitute_callout          use substitution callouts
           substitute_extended         use PCRE2_SUBSTITUTE_EXTENDED
           substitute_literal          use PCRE2_SUBSTITUTE_LITERAL
           substitute_matched          use PCRE2_SUBSTITUTE_MATCHED
           substitute_overflow_length  use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
           substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
           substitute_skip=<n>         skip substitution <n>
           substitute_stop=<n>         skip substitution <n> and following
           substitute_unknown_unset    use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
           substitute_unset_empty      use PCRE2_SUBSTITUTE_UNSET_EMPTY

       These modifiers may not appear in a #pattern command. If you want them
       as defaults, set them in a #subject command.

   Specifying literal subject lines

       If the subject_literal modifier is present on a pattern, all the
       subject lines that it matches are taken as literal strings, with no
       interpretation of backslashes. It is not possible to set subject
       modifiers on such lines, but any that are set as defaults by a #subject
       command are recognized.

   Saving a compiled pattern

       When a pattern with the push modifier is successfully compiled, it is
       pushed onto a stack of compiled patterns, and pcre2test expects the
       next line to contain a new pattern (or a command) instead of a subject
       line. This facility is used when saving compiled patterns to a file, as
       described in the section entitled "Saving and restoring compiled
       patterns" below.
       If pushcopy is used instead of push, a copy of the compiled pattern is
       stacked, leaving the original as current, ready to match the following
       input lines. This provides a way of testing the pcre2_code_copy()
       function. The push and pushcopy modifiers are incompatible with
       compilation modifiers such as global that act at match time. Any that
       are specified are ignored (for the stacked copy), with a warning
       message, except for replace, which causes an error. Note that
       jitverify, which is allowed, does not carry through to any subsequent
       matching that uses a stacked pattern.

   Testing foreign pattern conversion

       The experimental foreign pattern conversion functions in PCRE2 can be
       tested by setting the convert modifier. Its argument is a
       colon-separated list of options, which set the equivalent option for
       the pcre2_pattern_convert() function:

         glob                    PCRE2_CONVERT_GLOB
         glob_no_starstar        PCRE2_CONVERT_GLOB_NO_STARSTAR
         glob_no_wild_separator  PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
         posix_basic             PCRE2_CONVERT_POSIX_BASIC
         posix_extended          PCRE2_CONVERT_POSIX_EXTENDED
         unset                   Unset all options

       The "unset" value is useful for turning off a default that has been set
       by a #pattern command. When one of these options is set, the input
       pattern is passed to pcre2_pattern_convert(). If the conversion is
       successful, the result is reflected in the output and then passed to
       pcre2_compile(). The normal utf and no_utf_check options, if set, cause
       the PCRE2_CONVERT_UTF and PCRE2_CONVERT_NO_UTF_CHECK options to be
       passed to pcre2_pattern_convert().

       By default, the conversion function is allowed to allocate a buffer for
       its output. However, if the convert_length modifier is set to a value
       greater than zero, pcre2test passes a buffer of the given length. This
       makes it possible to test the length check.

       The convert_glob_escape and convert_glob_separator modifiers can be
       used to specify the escape and separator characters for glob
       processing, overriding the defaults, which are operating-system
       dependent.


SUBJECT MODIFIERS

       The modifiers that can appear in subject lines and the #subject command
       are of two types.

   Setting match options

       The following modifiers set options for pcre2_match() or
       pcre2_dfa_match(). See the pcre2api documentation for a description of
       their effects.

           anchored                  set PCRE2_ANCHORED
           endanchored               set PCRE2_ENDANCHORED
           dfa_restart               set PCRE2_DFA_RESTART
           dfa_shortest              set PCRE2_DFA_SHORTEST
           no_jit                    set PCRE2_NO_JIT
           no_utf_check              set PCRE2_NO_UTF_CHECK
           notbol                    set PCRE2_NOTBOL
           notempty                  set PCRE2_NOTEMPTY
           notempty_atstart          set PCRE2_NOTEMPTY_ATSTART
           noteol                    set PCRE2_NOTEOL
           partial_hard (or ph)      set PCRE2_PARTIAL_HARD
           partial_soft (or ps)      set PCRE2_PARTIAL_SOFT

       The partial matching modifiers are provided with abbreviations because
       they appear frequently in tests.

       If the posix or posix_nosub modifier was present on the pattern,
       causing the POSIX wrapper API to be used, the only option-setting
       modifiers that have any effect are notbol, notempty, and noteol,
       causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be
       passed to regexec(). The other modifiers are ignored, with a warning
       message.

       There is one additional modifier that can be used with the POSIX
       wrapper. It is ignored (with a warning) if used for non-POSIX matching.

         posix_startend=<n>[:<m>]

       This causes the subject string to be passed to regexec() using the
       REG_STARTEND option, which uses offsets to specify which part of the
       string is searched. If only one number is given, the end offset is
       passed as the end of the subject string.
       For more detail of REG_STARTEND, see the pcre2posix documentation. If
       the subject string contains binary zeros (coded as escapes such as
       \x{00} because pcre2test does not support actual binary zeros in its
       input), you must use posix_startend to specify its length.

   Setting match controls

       The following modifiers affect the matching process or request
       additional information. Some of them may also be specified on a pattern
       line (see above), in which case they apply to every subject line that
       is matched against that pattern, but can be overridden by modifiers on
       the subject.

           aftertext                   show text after match
           allaftertext                show text after captures
           allcaptures                 show all captures
           allvector                   show the entire ovector
           allusedtext                 show all consulted text (non-JIT only)
           altglobal                   alternative global matching
           callout_capture             show captures at callout time
           callout_data=<n>            set a value to pass via callouts
           callout_error=<n>[:<m>]     control callout error
           callout_extra               show extra callout information
           callout_fail=<n>[:<m>]      control callout failure
           callout_no_where            do not show position of a callout
           callout_none                do not supply a callout function
           copy=<number or name>       copy captured substring
           depth_limit=<n>             set a depth limit
           dfa                         use pcre2_dfa_match()
           find_limits                 find match and depth limits
           get=<number or name>        extract captured substring
           getall                      extract all captured substrings
       /g  global                      global matching
           heap_limit=<n>              set a limit on heap memory (Kbytes)
           jitstack=<n>                set size of JIT stack
           mark                        show mark values
           match_limit=<n>             set a match limit
           memory                      show heap memory usage
           null_context                match with a NULL context
           offset=<n>                  set starting offset
           offset_limit=<n>            set offset limit
           ovector=<n>                 set size of output vector
           recursion_limit=<n>         obsolete synonym for depth_limit
           replace=<string>
                                       specify a replacement string
           startchar                   show startchar when relevant
           startoffset=<n>             same as offset=<n>
           substitute_callout          use substitution callouts
           substitute_extended         use PCRE2_SUBSTITUTE_EXTENDED
           substitute_literal          use PCRE2_SUBSTITUTE_LITERAL
           substitute_matched          use PCRE2_SUBSTITUTE_MATCHED
           substitute_overflow_length  use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
           substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
           substitute_skip=<n>         skip substitution number n
           substitute_stop=<n>         skip substitution number n and greater
           substitute_unknown_unset    use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
           substitute_unset_empty      use PCRE2_SUBSTITUTE_UNSET_EMPTY
           zero_terminate              pass the subject as zero-terminated

       The effects of these modifiers are described in the following sections.
       When matching via the POSIX wrapper API, the aftertext, allaftertext,
       and ovector subject modifiers work as described below. All other
       modifiers are either ignored, with a warning message, or cause an
       error.

   Showing more text

       The aftertext modifier requests that as well as outputting the part of
       the subject string that matched the entire pattern, pcre2test should in
       addition output the remainder of the subject string. This is useful for
       tests where the subject contains multiple copies of the same substring.
       The allaftertext modifier requests the same action for captured
       substrings as well as the main matched substring. In each case the
       remainder is output on the following line with a plus character
       following the capture number.

       The allusedtext modifier requests that all the text that was consulted
       during a successful pattern match by the interpreter should be shown,
       for both full and partial matches. This feature is not supported for
       JIT matching, and if requested with JIT it is ignored (with a warning
       message).
       Setting this modifier affects the output if there is a lookbehind at
       the start of a match, or, for a complete match, a lookahead at the end,
       or if \K is used in the pattern. Characters that precede or follow the
       start and end of the actual match are indicated in the output by '<' or
       '>' characters underneath them. Here is an example:

           re> /(?<=pqr)abc(?=xyz)/
         data> 123pqrabcxyz456\=allusedtext
          0: pqrabcxyz
             <<<   >>>
         data> 123pqrabcxy\=ph,allusedtext
         Partial match: pqrabcxy
                        <<<

       The first, complete match shows that the matched string is "abc", with
       the preceding and following strings "pqr" and "xyz" having been
       consulted during the match (when processing the assertions). The
       partial match can indicate only the preceding string.

       The startchar modifier requests that the starting character for the
       match be indicated, if it is different to the start of the matched
       string. The only time when this occurs is when \K has been processed as
       part of the match. In this situation, the output for the matched string
       is displayed from the starting character instead of from the match
       point, with circumflex characters under the earlier characters. For
       example:

           re> /abc\Kxyz/
         data> abcxyz\=startchar
          0: abcxyz
             ^^^

       Unlike allusedtext, the startchar modifier can be used with JIT.
       However, these two modifiers are mutually exclusive.

   Showing the value of all capture groups

       The allcaptures modifier requests that the values of all potential
       captured parentheses be output after a match. By default, only those up
       to the highest one actually used in the match are output (corresponding
       to the return code from pcre2_match()). Groups that did not take part
       in the match are output as "<unset>".
       This modifier is not relevant for DFA matching (which does no
       capturing) and does not apply when replace is specified; it is ignored,
       with a warning message, if present.

   Showing the entire ovector, for all outcomes

       The allvector modifier requests that the entire ovector be shown,
       whatever the outcome of the match. Compare allcaptures, which shows
       only up to the maximum number of capture groups for the pattern, and
       then only for a successful complete non-DFA match. This modifier, which
       acts after any match result, and also for DFA matching, provides a
       means of checking that there are no unexpected modifications to ovector
       fields. Before each match attempt, the ovector is filled with a special
       value, and if this is found in both elements of a capturing pair,
       "<unchanged>" is output. After a successful match, this applies to all
       groups after the maximum capture group for the pattern. In other cases
       it applies to the entire ovector. After a partial match, the first two
       elements are the only ones that should be set. After a DFA match, the
       amount of ovector that is used depends on the number of matches that
       were found.

   Testing pattern callouts

       A callout function is supplied when pcre2test calls the library
       matching functions, unless callout_none is specified. Its behaviour can
       be controlled by various modifiers listed above whose names begin with
       callout_. Details are given in the section entitled "Callouts" below.
       Testing callouts from pcre2_substitute() is described separately in
       "Testing the substitution function" below.

   Finding all matches in a string

       Searching for all possible matches within a subject can be requested by
       the global or altglobal modifier. After finding a match, the matching
       function is called again to search the remainder of the subject.
       The difference between global and altglobal is that the former uses the
       start_offset argument to pcre2_match() or pcre2_dfa_match() to start
       searching at a new point within the entire string (which is what Perl
       does), whereas the latter passes over a shortened subject. This makes a
       difference to the matching process if the pattern begins with a
       lookbehind assertion (including \b or \B).

       If an empty string is matched, the next match is done with the
       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
       for another, non-empty, match at the same point in the subject. If this
       match fails, the start offset is advanced, and the normal match is
       retried. This imitates the way Perl handles such cases when using the
       /g modifier or the split() function. Normally, the start offset is
       advanced by one character, but if the newline convention recognizes
       CRLF as a newline, and the current character is CR followed by LF, an
       advance of two characters occurs.

   Testing substring extraction functions

       The copy and get modifiers can be used to test the
       pcre2_substring_copy_xxx() and pcre2_substring_get_xxx() functions.
       They can be given more than once, and each can specify a capture group
       name or number, for example:

         abcd\=copy=1,copy=3,get=G1

       If the #subject command is used to set default copy and/or get lists,
       these can be unset by specifying a negative number to cancel all
       numbered groups and an empty name to cancel all named groups.

       The getall modifier tests pcre2_substring_list_get(), which extracts
       all captured substrings.

       If the subject line is successfully matched, the substrings extracted
       by the convenience functions are output with C, G, or L after the
       string number instead of a colon. This is in addition to the normal
       full list.
       The string length (that is, the return from the extraction function) is
       given in parentheses after each substring, followed by the name when
       the extraction was by name.

   Testing the substitution function

       If the replace modifier is set, the pcre2_substitute() function is
       called instead of one of the matching functions (or after one call of
       pcre2_match() in the case of PCRE2_SUBSTITUTE_MATCHED). Note that
       replacement strings cannot contain commas, because a comma signifies
       the end of a modifier. This is not thought to be an issue in a test
       program.

       Specifying a completely empty replacement string disables this
       modifier. However, it is possible to specify an empty replacement by
       providing a buffer length, as described below, for an otherwise empty
       replacement.

       Unlike subject strings, pcre2test does not process replacement strings
       for escape sequences. In UTF mode, a replacement string is checked to
       see if it is a valid UTF-8 string. If so, it is correctly converted to
       a UTF string of the appropriate code unit width. If it is not a valid
       UTF-8 string, the individual code units are copied directly. This
       provides a means of passing an invalid UTF-8 string for testing
       purposes.

       The following modifiers set options (in addition to the normal match
       options) for pcre2_substitute():

         global                      PCRE2_SUBSTITUTE_GLOBAL
         substitute_extended         PCRE2_SUBSTITUTE_EXTENDED
         substitute_literal          PCRE2_SUBSTITUTE_LITERAL
         substitute_matched          PCRE2_SUBSTITUTE_MATCHED
         substitute_overflow_length  PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
         substitute_replacement_only PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
         substitute_unknown_unset    PCRE2_SUBSTITUTE_UNKNOWN_UNSET
         substitute_unset_empty      PCRE2_SUBSTITUTE_UNSET_EMPTY

       See the pcre2api documentation for details of these options.

       After a successful substitution, the modified string is output,
       preceded by the number of replacements. This may be zero if there were
       no matches. Here is a simple example of a substitution test:

         /abc/replace=xxx
             =abc=abc=
          1: =xxx=abc=
             =abc=abc=\=global
          2: =xxx=xxx=

       Subject and replacement strings should be kept relatively short (fewer
       than 256 characters) for substitution tests, as fixed-size buffers are
       used. To make it easy to test for buffer overflow, if the replacement
       string starts with a number in square brackets, that number is passed
       to pcre2_substitute() as the size of the output buffer, with the
       replacement string starting at the next character. Here is an example
       that tests the edge case:

         /abc/
             123abc123\=replace=[10]XYZ
          1: 123XYZ123
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory

       The default action of pcre2_substitute() is to return
       PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
       the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
       substitute_overflow_length modifier), pcre2_substitute() continues to
       go through the motions of matching and substituting (but not doing any
       callouts), in order to compute the size of buffer that is required.
       When this happens, pcre2test shows the required buffer length (which
       includes space for the trailing zero) as part of the error message. For
       example:

         /abc/substitute_overflow_length
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory: 10 code units are needed

       A replacement string is ignored with POSIX and DFA matching. Specifying
       partial matching provokes an error return ("bad option value") from
       pcre2_substitute().
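
       The buffer-size arithmetic in the examples above can be reproduced with
       a small Python model. It is a sketch under stated assumptions, not the
       pcre2test implementation: it uses Python's re module rather than PCRE2,
       treats the replacement as literal text, replaces only the first match,
       and omits the numeric error code. The function names parse_replacement
       and substitute are hypothetical.

```python
import re


def parse_replacement(spec):
    """Split an optional leading "[n]" buffer size from a replacement string."""
    m = re.match(r"\[(\d+)\]", spec)
    if m:
        return int(m.group(1)), spec[m.end():]
    return None, spec


def substitute(pattern, subject, spec, overflow_length=False):
    """Model the output-buffer check for a single literal substitution."""
    size, replacement = parse_replacement(spec)
    # A callable replacement stops Python from interpreting backslashes.
    result, count = re.subn(pattern, lambda m: replacement, subject, count=1)
    needed = len(result) + 1  # one extra code unit for the trailing zero
    if size is not None and size < needed:
        message = "Failed: no more memory"
        if overflow_length:
            message += f": {needed} code units are needed"
        return message
    return f"{count}: {result}"


print(substitute("abc", "123abc123", "[10]XYZ"))  # 1: 123XYZ123
print(substitute("abc", "123abc123", "[9]XYZ"))   # Failed: no more memory
print(substitute("abc", "123abc123", "[9]XYZ", overflow_length=True))
# Failed: no more memory: 10 code units are needed
```

       The result "123XYZ123" is nine code units long, so a buffer of 10 is
       the smallest that also holds the trailing zero, which is why [10]
       succeeds and [9] fails.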

   Testing substitute callouts

       If the substitute_callout modifier is set, a substitution callout
       function is set up. The null_context modifier must not be set, because
       the address of the callout function is passed in a match context. When
       the callout function is called (after each substitution), details of
       the input and output strings are output. For example:

         /abc/g,replace=<$0>,substitute_callout
             abcdefabcpqr
          1(1) Old 0 3 "abc" New 0 5 "<abc>"
          2(1) Old 6 9 "abc" New 8 13 "<abc>"
          2: <abc>def<abc>pqr

       The first number on each callout line is the count of matches. The
       parenthesized number is the number of pairs that are set in the ovector
       (that is, one more than the number of capturing groups that were set).
       Then are listed the offsets of the old substring, its contents, and the
       same for the replacement.

       By default, the substitution callout function returns zero, which
       accepts the replacement and causes matching to continue if /g was used.
       Two further modifiers can be used to test other return values. If
       substitute_skip is set to a value greater than zero the callout
       function returns +1 for the match of that number, and similarly
       substitute_stop returns -1. These cause the replacement to be rejected,
       and -1 causes no further matching to take place. If either of them is
       set, substitute_callout is assumed. For example:

         /abc/g,replace=<$0>,substitute_skip=1
             abcdefabcpqr
          1(1) Old 0 3 "abc" New 0 5 "<abc> SKIPPED"
          2(1) Old 6 9 "abc" New 6 11 "<abc>"
          2: abcdef<abc>pqr
             abcdefabcpqr\=substitute_stop=1
          1(1) Old 0 3 "abc" New 0 5 "<abc> STOPPED"
          1: abcdefabcpqr

       If both are set for the same number, stop takes precedence. Only a
       single skip or stop is supported, which is sufficient for testing that
       the feature works.
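
       The offsets shown on the callout lines above can be traced with a short
       Python model of a global substitution. This is an illustrative sketch
       that assumes Python's re module in place of PCRE2; it does not print
       the parenthesized ovector pair count, and the names
       substitute_with_callouts and bracket are hypothetical.

```python
import re


def substitute_with_callouts(pattern, subject, make_replacement, skip=0, stop=0):
    """Model a global substitution that reports each replacement.

    Returns (callout_lines, result). skip and stop are 1-based match
    numbers; zero disables them.
    """
    lines = []
    out = []      # pieces of the output string built so far
    out_len = 0   # length of the output built so far
    pos = 0       # end of the previous match in the subject
    for number, m in enumerate(re.finditer(pattern, subject), start=1):
        # Copy the unmatched text between the previous match and this one.
        out.append(subject[pos:m.start()])
        out_len += m.start() - pos
        new = make_replacement(m)
        if number == stop:        # stop takes precedence over skip
            status = " STOPPED"
        elif number == skip:
            status = " SKIPPED"
        else:
            status = ""
        lines.append(
            f'{number} Old {m.start()} {m.end()} "{m.group(0)}" '
            f'New {out_len} {out_len + len(new)} "{new}{status}"'
        )
        # A rejected replacement leaves the matched text unchanged.
        piece = m.group(0) if status else new
        out.append(piece)
        out_len += len(piece)
        pos = m.end()
        if number == stop:
            break
    out.append(subject[pos:])
    return lines, "".join(out)


def bracket(m):
    """Equivalent of the replacement string <$0> used above."""
    return f"<{m.group(0)}>"


lines, result = substitute_with_callouts("abc", "abcdefabcpqr", bracket, skip=1)
print("\n".join(lines))
print(result)  # abcdef<abc>pqr
```

       Because the first replacement is skipped, the output grows by only the
       three original characters, so the second replacement is reported at
       offsets 6 to 11 rather than 8 to 13.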

   Setting the JIT stack size

       The jitstack modifier provides a way of setting the maximum stack
       size that is used by the just-in-time optimization code. It is
       ignored if JIT optimization is not being used. The value is a
       number of kibibytes (units of 1024 bytes). Setting zero reverts to
       the default of 32KiB. Providing a stack that is larger than the
       default is necessary only for very complicated patterns. If
       jitstack is set non-zero on a subject line it overrides any value
       that was set on the pattern.

   Setting heap, match, and depth limits

       The heap_limit, match_limit, and depth_limit modifiers set the
       appropriate limits in the match context. These values are ignored
       when the find_limits modifier is specified.

   Finding minimum limits

       If the find_limits modifier is present on a subject line, pcre2test
       calls the relevant matching function several times, setting
       different values in the match context via pcre2_set_heap_limit(),
       pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds
       the minimum value for each parameter that allows the match to
       complete without error. If JIT is being used, only the match limit
       is relevant.

       When using this modifier, the pattern should not contain any limit
       settings such as (*LIMIT_MATCH=...) within it. If such a setting is
       present and is lower than the minimum matching value, the minimum
       value cannot be found because pcre2_set_match_limit() etc. are only
       able to reduce the value of an in-pattern limit; they cannot
       increase it.

       For non-DFA matching, the minimum depth_limit number is a measure
       of how much nested backtracking happens (that is, how deeply the
       pattern's tree is searched).
       In the case of DFA matching, depth_limit controls the depth of
       recursive calls of the internal function that is used for handling
       pattern recursion, lookaround assertions, and atomic groups.

       For non-DFA matching, the match_limit number is a measure of the
       amount of backtracking that takes place, and learning the minimum
       value can be instructive. For most simple matches, the number is
       quite small, but for patterns with very large numbers of matching
       possibilities, it can become large very quickly with increasing
       length of subject string. In the case of DFA matching, match_limit
       controls the total number of calls, both recursive and
       non-recursive, to the internal matching function, thus controlling
       the overall amount of computing resource that is used.

       For both kinds of matching, the heap_limit number, which is in
       kibibytes (units of 1024 bytes), limits the amount of heap memory
       used for matching. A value of zero disables the use of any heap
       memory; many simple pattern matches can be done without using the
       heap, so zero is not an unreasonable setting.

   Showing MARK names

       The mark modifier causes the names from backtracking control verbs
       that are returned from calls to pcre2_match() to be displayed. If a
       mark is returned for a match, non-match, or partial match,
       pcre2test shows it. For a match, it is on a line by itself, tagged
       with "MK:". Otherwise, it is added to the non-match message.

   Showing memory usage

       The memory modifier causes pcre2test to log the sizes of all heap
       memory allocation and freeing calls that occur during a call to
       pcre2_match() or pcre2_dfa_match(). These occur only when a match
       requires a bigger vector than the default for remembering
       backtracking points (pcre2_match()) or for internal workspace
       (pcre2_dfa_match()).
       In many cases there will be no heap memory used and therefore no
       additional output. No heap memory is allocated during matching with
       JIT, so in that case the memory modifier never has any effect. For
       this modifier to work, the null_context modifier must not be set on
       both the pattern and the subject, though it can be set on one or
       the other.

   Setting a starting offset

       The offset modifier sets an offset in the subject string at which
       matching starts. Its value is a number of code units, not
       characters.

   Setting an offset limit

       The offset_limit modifier sets a limit for unanchored matches. If a
       match cannot be found starting at or before this offset in the
       subject, a "no match" return is given. The data value is a number
       of code units, not characters. When this modifier is used, the
       use_offset_limit modifier must have been set for the pattern; if
       not, an error is generated.

   Setting the size of the output vector

       The ovector modifier applies only to the subject line in which it
       appears, though of course it can also be used to set a default in a
       #subject command. It specifies the number of pairs of offsets that
       are available for storing matching information. The default is 15.

       A value of zero is useful when testing the POSIX API because it
       causes regexec() to be called with a NULL capture vector. When not
       testing the POSIX API, a value of zero is used to cause
       pcre2_match_data_create_from_pattern() to be called, in order to
       create a match block of exactly the right size for the pattern. (It
       is not possible to create a match block with a zero-length ovector;
       there is always at least one pair of offsets.)

   Passing the subject as zero-terminated

       By default, the subject string is passed to a native API matching
       function with its correct length.
       In order to test the facility for passing a zero-terminated string,
       the zero_terminate modifier is provided. It causes the length to be
       passed as PCRE2_ZERO_TERMINATED. When matching via the POSIX
       interface, this modifier is ignored, with a warning.

       When testing pcre2_substitute(), this modifier also has the effect
       of passing the replacement string as zero-terminated.

   Passing a NULL context

       Normally, pcre2test passes a context block to pcre2_match(),
       pcre2_dfa_match(), pcre2_jit_match() or pcre2_substitute(). If the
       null_context modifier is set, however, NULL is passed. This is for
       testing that the matching and substitution functions behave
       correctly in this case (they use default values). This modifier
       cannot be used with the find_limits or substitute_callout
       modifiers.


THE ALTERNATIVE MATCHING FUNCTION

       By default, pcre2test uses the standard PCRE2 matching function,
       pcre2_match(), to match each subject line. PCRE2 also supports an
       alternative matching function, pcre2_dfa_match(), which operates in
       a different way, and has some restrictions. The differences between
       the two functions are described in the pcre2matching documentation.

       If the dfa modifier is set, the alternative matching function is
       used. This function finds all possible matches at a given point in
       the subject. If, however, the dfa_shortest modifier is set,
       processing stops after the first match is found. This is always the
       shortest possible match.


DEFAULT OUTPUT FROM pcre2test

       This section describes the output when the normal matching
       function, pcre2_match(), is being used.

       When a match succeeds, pcre2test outputs the list of captured
       substrings, starting with number 0 for the string that matched the
       whole pattern.
       Otherwise, it outputs "No match" when the return is
       PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
       matching substring when the return is PCRE2_ERROR_PARTIAL. (Note
       that this is the entire substring that was inspected during the
       partial match; it may include characters before the actual match
       start if a lookbehind assertion, \K, \b, or \B was involved.)

       For any other return, pcre2test outputs the PCRE2 negative error
       number and a short descriptive phrase. If the error is a failed UTF
       string check, the code unit offset of the start of the failing
       character is also output. Here is an example of an interactive
       pcre2test run.

         $ pcre2test
         PCRE2 version 10.22 2016-07-29

           re> /^abc(\d+)/
         data> abc123
          0: abc123
          1: 123
         data> xyz
         No match

       Unset capturing substrings that are not followed by one that is set
       are not shown by pcre2test unless the allcaptures modifier is
       specified. In the following example, there are two capturing
       substrings, but when the first data line is matched, the second,
       unset substring is not shown. An "internal" unset substring is
       shown as "<unset>", as for the second data line.

           re> /(a)|(b)/
         data> a
          0: a
          1: a
         data> b
          0: b
          1: <unset>
          2: b

       If the strings contain any non-printing characters, they are output
       as \xhh escapes if the value is less than 256 and UTF mode is not
       set. Otherwise they are output as \x{hh...} escapes. See below for
       the definition of non-printing characters.
       If the aftertext modifier is set, the output for substring 0 is
       followed by the rest of the subject string, identified by "0+" like
       this:

           re> /cat/aftertext
         data> cataract
          0: cat
          0+ aract

       If global matching is requested, the results of successive matching
       attempts are output in sequence, like this:

           re> /\Bi(\w\w)/g
         data> Mississippi
          0: iss
          1: ss
          0: iss
          1: ss
          0: ipp
          1: pp

       "No match" is output only if the first match attempt fails. Here is
       an example of a failure message (the offset 4 that is specified by
       the offset modifier is past the end of the subject string):

           re> /xyz/
         data> xyz\=offset=4
         Error -24 (bad offset value)

       Note that whereas patterns can be continued over several lines (a
       plain ">" prompt is used for continuations), subject lines may not.
       However, newlines can be included in a subject by means of the \n
       escape (or \r, \r\n, etc., depending on the newline sequence
       setting).


OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

       When the alternative matching function, pcre2_dfa_match(), is used,
       the output consists of a list of all the matches that start at the
       first point in the subject where there is at least one match. For
       example:

           re> /(tang|tangerine|tan)/
         data> yellow tangerine\=dfa
          0: tangerine
          1: tang
          2: tan

       Using the normal matching function on this data finds only "tang".
       The longest matching string is always given first (and numbered
       zero). After a PCRE2_ERROR_PARTIAL return, the output is "Partial
       match:", followed by the partially matching substring. Note that
       this is the entire substring that was inspected during the partial
       match; it may include characters before the actual match start if a
       lookbehind assertion, \b, or \B was involved. (\K is not supported
       for DFA matching.)
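
       The ordering of the matches in the example above (every match at
       the first matching point, longest first) can be imitated for a
       plain alternation of literal strings. This is only a Python sketch
       of the reported ordering; it is not how pcre2_dfa_match() works
       internally, and the function name dfa_style_matches is
       hypothetical.

```python
def dfa_style_matches(alternatives, subject):
    """Return every literal alternative that matches at the first position
    in the subject where at least one of them matches, longest first."""
    for start in range(len(subject)):
        found = {alt for alt in alternatives if subject.startswith(alt, start)}
        if found:
            # The longest match is listed first, as in the output above.
            return sorted(found, key=len, reverse=True)
    return []


if __name__ == "__main__":
    for number, text in enumerate(
        dfa_style_matches(["tang", "tangerine", "tan"], "yellow tangerine")
    ):
        print(f"{number:2d}: {text}")
```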

       If global matching is requested, the search for further matches
       resumes at the end of the longest match. For example:

           re> /(tang|tangerine|tan)/g
         data> yellow tangerine and tangy sultana\=dfa
          0: tangerine
          1: tang
          2: tan
          0: tang
          1: tan
          0: tan

       The alternative matching function does not support substring
       capture, so the modifiers that are concerned with captured
       substrings are not relevant.


RESTARTING AFTER A PARTIAL MATCH

       When the alternative matching function has given the
       PCRE2_ERROR_PARTIAL return, indicating that the subject partially
       matched the pattern, you can restart the match with additional
       subject data by means of the dfa_restart modifier. For example:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=ps,dfa
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       For further information about partial matching, see the
       pcre2partial documentation.


CALLOUTS

       If the pattern contains any callout requests, pcre2test's callout
       function is called during matching unless callout_none is
       specified. This works with both matching functions, and with JIT,
       though there are some differences in behaviour. The output for
       callouts with numerical arguments and those with string arguments
       is slightly different.

   Callouts with numerical arguments

       By default, the callout function displays the callout number, the
       start and current positions in the subject text at the callout
       time, and the next pattern item to be tested. For example:

         --->pqrabcdef
           0    ^  ^     \d

       This output indicates that callout number 0 occurred for a match
       attempt starting at the fourth character of the subject string,
       when the pointer was at the seventh character, and when the next
       pattern item was \d.
       Just one circumflex is output if the start and current positions
       are the same, or if the current position precedes the start
       position, which can happen if the callout is in a lookbehind
       assertion.

       Callouts numbered 255 are assumed to be automatic callouts,
       inserted as a result of the auto_callout pattern modifier. In this
       case, instead of showing the callout number, the offset in the
       pattern, preceded by a plus, is output. For example:

           re> /\d?[A-E]\*/auto_callout
         data> E*
         --->E*
          +0 ^      \d?
          +3 ^      [A-E]
          +8 ^^     \*
         +10 ^ ^
          0: E*

       If a pattern contains (*MARK) items, an additional line is output
       whenever a change of latest mark is passed to the callout function.
       For example:

           re> /a(*MARK:X)bc/auto_callout
         data> abc
         --->abc
          +0 ^       a
          +1 ^^      (*MARK:X)
         +10 ^^      b
         Latest Mark: X
         +11 ^ ^     c
         +12 ^  ^
          0: abc

       The mark changes between matching "a" and "b", but stays the same
       for the rest of the match, so nothing more is output. If, as a
       result of backtracking, the mark reverts to being unset, the text
       "<unset>" is output.

   Callouts with string arguments

       The output for a callout with a string argument is similar, except
       that instead of outputting a callout number before the position
       indicators, the callout string and its offset in the pattern string
       are output before the reflection of the subject string, and the
       subject string is reflected for each callout.
       For example:

           re> /^ab(?C'first')cd(?C"second")ef/
         data> abcdefg
         Callout (7): 'first'
         --->abcdefg
             ^ ^         c
         Callout (20): "second"
         --->abcdefg
             ^   ^       e
          0: abcdef

   Callout modifiers

       The callout function in pcre2test returns zero (carry on matching)
       by default, but you can use a callout_fail modifier in a subject
       line to change this and other parameters of the callout (see
       below).

       If the callout_capture modifier is set, the current captured groups
       are output when a callout occurs. This is useful only for non-DFA
       matching, as pcre2_dfa_match() does not support capturing, so no
       captures are ever shown.

       The normal callout output, showing the callout number or pattern
       offset (as described above) is suppressed if the callout_no_where
       modifier is set.

       When using the interpretive matching function pcre2_match() without
       JIT, setting the callout_extra modifier causes additional output
       from pcre2test's callout function to be generated. For the first
       callout in a match attempt at a new starting position in the
       subject, "New match attempt" is output. If there has been a
       backtrack since the last callout (or start of matching if this is
       the first callout), "Backtrack" is output, followed by "No other
       matching paths" if the backtrack ended the previous match attempt.
       For example:

            re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
          data> aac\=callout_extra
          New match attempt
          --->aac
           +0 ^       (
           +1 ^       a+
           +3 ^ ^     )
           +4 ^ ^     b
          Backtrack
          --->aac
           +3 ^^      )
           +4 ^^      b
          Backtrack
          No other matching paths
          New match attempt
          --->aac
           +0  ^      (
           +1  ^      a+
           +3  ^^     )
           +4  ^^     b
          Backtrack
          No other matching paths
          New match attempt
          --->aac
           +0   ^     (
           +1   ^     a+
          Backtrack
          No other matching paths
          New match attempt
          --->aac
           +0    ^    (
           +1    ^    a+
          No match

       Notice that various optimizations must be turned off if you want
       all possible matching paths to be scanned. If no_start_optimize is
       not used, there is an immediate "no match", without any callouts,
       because the starting optimization fails to find "b" in the subject,
       which it knows must be present for any match. If no_auto_possess is
       not used, the "a+" item is turned into "a++", which reduces the
       number of backtracks.

       The callout_extra modifier has no effect if used with the DFA
       matching function, or with JIT.

   Return values from callouts

       The default return from the callout function is zero, which allows
       matching to continue. The callout_fail modifier can be given one or
       two numbers. If there is only one number, 1 is returned instead of
       0 (causing matching to backtrack) when a callout of that number is
       reached. If two numbers (<n>:<m>) are given, 1 is returned when
       callout <n> is reached and there have been at least <m> callouts.
       The callout_error modifier is similar, except that
       PCRE2_ERROR_CALLOUT is returned, causing the entire matching
       process to be aborted. If both these modifiers are set for the same
       callout number, callout_error takes precedence. Note that callouts
       with string arguments are always given the number zero.
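
       The rules in the preceding paragraph can be written out as a small
       decision function. This is a Python sketch of the described
       behaviour, not code taken from pcre2test. The function name and
       argument layout are hypothetical, PCRE2_ERROR_CALLOUT is
       represented by a symbolic stand-in rather than its numeric value,
       and it is an assumption that "at least <m> callouts" counts every
       callout made so far.

```python
# Symbolic stand-in; the real numeric value is defined in pcre2.h.
PCRE2_ERROR_CALLOUT = "PCRE2_ERROR_CALLOUT"


def callout_return(number, callouts_so_far, fail=None, error=None):
    """Value the callout function would return for callout `number`.

    fail and error are (n, m) pairs modelling callout_fail=<n>:<m> and
    callout_error=<n>:<m>; use m = 0 when only one number is given.
    """
    def triggered(setting):
        if setting is None:
            return False
        n, m = setting
        return number == n and callouts_so_far >= m

    if triggered(error):        # callout_error takes precedence
        return PCRE2_ERROR_CALLOUT
    if triggered(fail):         # 1 makes the matcher backtrack
        return 1
    return 0                    # carry on matching


if __name__ == "__main__":
    print(callout_return(2, 1, fail=(2, 0)))                 # prints 1
    print(callout_return(2, 1, fail=(2, 3)))                 # prints 0
    print(callout_return(2, 3, fail=(2, 0), error=(2, 0)))   # error wins
```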

       The callout_data modifier can be given an unsigned or a negative
       number. This is set as the "user data" that is passed to the
       matching function, and passed back when the callout function is
       invoked. Any value other than zero is used as a return from
       pcre2test's callout function.

       Inserting callouts can be helpful when using pcre2test to check
       complicated regular expressions. For further information about
       callouts, see the pcre2callout documentation.


NON-PRINTING CHARACTERS

       When pcre2test is outputting text in the compiled version of a
       pattern, bytes other than 32-126 are always treated as non-printing
       characters and are therefore shown as hex escapes.

       When pcre2test is outputting text that is a matched part of a
       subject string, it behaves in the same way, unless a different
       locale has been set for the pattern (using the locale modifier). In
       this case, the isprint() function is used to distinguish printing
       and non-printing characters.


SAVING AND RESTORING COMPILED PATTERNS

       It is possible to save compiled patterns on disc or elsewhere, and
       reload them later, subject to a number of restrictions. JIT data
       cannot be saved. The host on which the patterns are reloaded must
       be running the same version of PCRE2, with the same code unit
       width, and must also have the same endianness, pointer width and
       PCRE2_SIZE type. Before compiled patterns can be saved they must be
       serialized, that is, converted to a stream of bytes. A single byte
       stream may contain any number of compiled patterns, but they must
       all use the same character tables. A single copy of the tables is
       included in the byte stream (its size is 1088 bytes).

       The functions whose names begin with pcre2_serialize_ are used for
       serializing and de-serializing.
       They are described in the pcre2serialize documentation. In this
       section we describe the features of pcre2test that can be used to
       test these functions.

       Note that "serialization" in PCRE2 does not convert compiled
       patterns to an abstract format like Java or .NET. It just makes a
       reloadable byte code stream. Hence the restrictions on reloading
       mentioned above.

       In pcre2test, when a pattern with the push modifier is successfully
       compiled, it is pushed onto a stack of compiled patterns, and
       pcre2test expects the next line to contain a new pattern (or
       command) instead of a subject line. By contrast, the pushcopy
       modifier causes a copy of the compiled pattern to be stacked,
       leaving the original available for immediate matching. By using
       push and/or pushcopy, a number of patterns can be compiled and
       retained. These modifiers are incompatible with posix, and control
       modifiers that act at match time are ignored (with a message) for
       the stacked patterns. The jitverify modifier applies only at
       compile time.

       The command

         #save <filename>

       causes all the stacked patterns to be serialized and the result
       written to the named file. Afterwards, all the stacked patterns are
       freed. The command

         #load <filename>

       reads the data in the file, and then arranges for it to be
       de-serialized, with the resulting compiled patterns added to the
       pattern stack. The pattern on the top of the stack can be retrieved
       by the #pop command, which must be followed by lines of subjects
       that are to be matched with the pattern, terminated as usual by an
       empty line or end of file. This command may be followed by a
       modifier list containing only control modifiers that act after a
       pattern has been compiled. In particular, hex, posix, posix_nosub,
       push, and pushcopy are not allowed, nor are any option-setting
       modifiers.
       The JIT modifiers are, however, permitted. Here is an example that
       saves and reloads two patterns.

         /abc/push
         /xyz/push
         #save tempfile
         #load tempfile
         #pop info
         xyz

         #pop jit,bincode
         abc

       If jitverify is used with #pop, it does not automatically imply
       jit, which is different behaviour from when it is used on a
       pattern.

       The #popcopy command is analogous to the pushcopy modifier in that
       it makes current a copy of the topmost stack pattern, leaving the
       original still on the stack.


SEE ALSO

       pcre2(3), pcre2api(3), pcre2callout(3), pcre2jit(3),
       pcre2matching(3), pcre2partial(3), pcre2pattern(3),
       pcre2serialize(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 30 August 2021
       Copyright (c) 1997-2021 University of Cambridge.