PCRE2TEST(1)                General Commands Manual               PCRE2TEST(1)



NAME
       pcre2test - a program for testing Perl-compatible regular expressions.

SYNOPSIS

       pcre2test [options] [input file [output file]]

       pcre2test is a test program for the PCRE2 regular expression libraries,
       but it can also be used for experimenting with regular expressions.
       This document describes the features of the test program; for details
       of the regular expressions themselves, see the pcre2pattern
       documentation. For details of the PCRE2 library function calls and
       their options, see the pcre2api documentation.

       The input for pcre2test is a sequence of regular expression patterns
       and subject strings to be matched. There are also command lines for
       setting defaults and controlling some special actions. The output shows
       the result of each match attempt. Modifiers on external or internal
       command lines, the patterns, and the subject lines specify PCRE2
       function options, control how the subject is processed, and what output
       is produced.

       As the original fairly simple PCRE library evolved, it acquired many
       different features, and as a result, the original pcretest program
       ended up with a lot of options in a messy, arcane syntax for testing
       all the features. The move to the new PCRE2 API provided an opportunity
       to re-implement the test program as pcre2test, with a cleaner modifier
       syntax. Nevertheless, there are still many obscure modifiers, some of
       which are specifically designed for use in conjunction with the test
       script and data files that are distributed as part of PCRE2. All the
       modifiers are documented here, some without much justification, but
       many of them are unlikely to be of use except when testing the
       libraries.


PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES

       Different versions of the PCRE2 library can be built to support
       character strings that are encoded in 8-bit, 16-bit, or 32-bit code
       units.
       One, two, or all three of these libraries may be simultaneously
       installed. The pcre2test program can be used to test all the libraries.
       However, its own input and output are always in 8-bit format. When
       testing the 16-bit or 32-bit libraries, patterns and subject strings
       are converted to 16-bit or 32-bit format before being passed to the
       library functions. Results are converted back to 8-bit code units for
       output.

       In the rest of this document, the names of library functions and
       structures are given in generic form, for example, pcre2_compile(). The
       actual names used in the libraries have a suffix _8, _16, or _32, as
       appropriate.


INPUT ENCODING

       Input to pcre2test is processed line by line, either by calling the C
       library's fgets() function, or via the libreadline library. In some
       Windows environments character 26 (hex 1A) causes an immediate end of
       file, and no further data is read, so this character should be avoided
       unless you really want that action.

       The input is processed using C's string functions, so must not contain
       binary zeros, even though in Unix-like environments, fgets() treats any
       bytes other than newline as data characters. An error is generated if a
       binary zero is encountered. By default subject lines are processed for
       backslash escapes, which makes it possible to include any data value in
       strings that are passed to the library for matching. For patterns,
       there is a facility for specifying some or all of the 8-bit input
       characters as hexadecimal pairs, which makes it possible to include
       binary zeros.

   Input for the 16-bit and 32-bit libraries

       When testing the 16-bit or 32-bit libraries, there is a need to be able
       to generate character code points greater than 255 in the strings that
       are passed to the library. For subject lines, backslash escapes can be
       used.
       In addition, when the utf modifier (see "Setting compilation options"
       below) is set, the pattern and any following subject lines are
       interpreted as UTF-8 strings and translated to UTF-16 or UTF-32 as
       appropriate.

       For non-UTF testing of wide characters, the utf8_input modifier can be
       used. This is mutually exclusive with utf, and is allowed only in
       16-bit or 32-bit mode. It causes the pattern and following subject
       lines to be treated as UTF-8 according to the original definition (RFC
       2279), which allows for character values up to 0x7fffffff. Each
       character is placed in one 16-bit or 32-bit code unit (in the 16-bit
       case, values greater than 0xffff cause an error to occur).

       UTF-8 (in its original definition) is not capable of encoding values
       greater than 0x7fffffff, but such values can be handled by the 32-bit
       library. When testing this library in non-UTF mode with utf8_input set,
       if any character is preceded by the byte 0xff (which is an invalid byte
       in UTF-8) 0x80000000 is added to the character's value. This is the
       only way of passing such code points in a pattern string. For subject
       strings, using an escape sequence is preferable.


COMMAND LINE OPTIONS

       -8        If the 8-bit library has been built, this option causes it to
                 be used (this is the default). If the 8-bit library has not
                 been built, this option causes an error.

       -16       If the 16-bit library has been built, this option causes it
                 to be used. If only the 16-bit library has been built, this
                 is the default. If the 16-bit library has not been built,
                 this option causes an error.

       -32       If the 32-bit library has been built, this option causes it
                 to be used. If only the 32-bit library has been built, this
                 is the default. If the 32-bit library has not been built,
                 this option causes an error.

       -ac       Behave as if each pattern has the auto_callout modifier, that
                 is, insert automatic callouts into every pattern that is
                 compiled.

       -AC       As for -ac, but in addition behave as if each subject line
                 has the callout_extra modifier, that is, show additional
                 information from callouts.

       -b        Behave as if each pattern has the fullbincode modifier; the
                 full internal binary form of the pattern is output after
                 compilation.

       -C        Output the version number of the PCRE2 library, and all
                 available information about the optional features that are
                 included, and then exit with zero exit code. All other
                 options are ignored. If both -C and -LM are present,
                 whichever is first is recognized.

       -C option Output information about a specific build-time option, then
                 exit. This functionality is intended for use in scripts such
                 as RunTest. The following options output the value and set
                 the exit code as indicated:

                   ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
                                0x15 or 0x25
                                0 if used in an ASCII environment
                                exit code is always 0
                   linksize   the configured internal link size (2, 3, or 4)
                                exit code is set to the link size
                   newline    the default newline setting:
                                CR, LF, CRLF, ANYCRLF, ANY, or NUL
                                exit code is always 0
                   bsr        the default setting for what \R matches:
                                ANYCRLF or ANY
                                exit code is always 0

                 The following options output 1 for true or 0 for false, and
                 set the exit code to the same value:

                   backslash-C  \C is supported (not locked out)
                   ebcdic       compiled for an EBCDIC environment
                   jit          just-in-time support is available
                   pcre2-16     the 16-bit library was built
                   pcre2-32     the 32-bit library was built
                   pcre2-8      the 8-bit library was built
                   unicode      Unicode support is available

                 If an unknown option is given, an error message is output;
                 the exit code is 0.

       -d        Behave as if each pattern has the debug modifier; the
                 internal form and information about the compiled pattern is
                 output after compilation; -d is equivalent to -b -i.

       -dfa      Behave as if each subject line has the dfa modifier; matching
                 is done using the pcre2_dfa_match() function instead of the
                 default pcre2_match().

       -error number[,number,...]
                 Call pcre2_get_error_message() for each of the error numbers
                 in the comma-separated list, display the resulting messages
                 on the standard output, then exit with zero exit code. The
                 numbers may be positive or negative. This is a convenience
                 facility for PCRE2 maintainers.

       -help     Output a brief summary of these options and then exit.

       -i        Behave as if each pattern has the info modifier; information
                 about the compiled pattern is given after compilation.

       -jit      Behave as if each pattern line has the jit modifier; after
                 successful compilation, each pattern is passed to the
                 just-in-time compiler, if available.

       -jitfast  Behave as if each pattern line has the jitfast modifier;
                 after successful compilation, each pattern is passed to the
                 just-in-time compiler, if available, and each subject line is
                 passed directly to the JIT matcher via its "fast path".

       -jitverify
                 Behave as if each pattern line has the jitverify modifier;
                 after successful compilation, each pattern is passed to the
                 just-in-time compiler, if available, and the use of JIT for
                 matching is verified.

       -LM       List modifiers: write a list of available pattern and subject
                 modifiers to the standard output, then exit with zero exit
                 code. All other options are ignored. If both -C and -LM are
                 present, whichever is first is recognized.

       -pattern modifier-list
                 Behave as if each pattern line contains the given modifiers.

       -q        Do not output the version number of pcre2test at the start of
                 execution.

       -S size   On Unix-like systems, set the size of the run-time stack to
                 size mebibytes (units of 1024*1024 bytes).

       -subject modifier-list
                 Behave as if each subject line contains the given modifiers.

       -t        Run each compile and match many times with a timer, and
                 output the resulting times per compile or match. When JIT is
                 used, separate times are given for the initial compile and
                 the JIT compile. You can control the number of iterations
                 that are used for timing by following -t with a number (as a
                 separate item on the command line). For example, "-t 1000"
                 iterates 1000 times. The default is to iterate 500,000 times.

       -tm       This is like -t except that it times only the matching phase,
                 not the compile phase.

       -T -TM    These behave like -t and -tm, but in addition, at the end of
                 a run, the total times for all compiles and matches are
                 output.

       -version  Output the PCRE2 version number and then exit.


DESCRIPTION

       If pcre2test is given two filename arguments, it reads from the first
       and writes to the second. If the first name is "-", input is taken from
       the standard input. If pcre2test is given only one argument, it reads
       from that file and writes to stdout. Otherwise, it reads from stdin and
       writes to stdout.

       When pcre2test is built, a configuration option can specify that it
       should be linked with the libreadline or libedit library. When this is
       done, if the input is from a terminal, it is read using the readline()
       function. This provides line-editing and history facilities. The output
       from the -help option states whether or not readline() will be used.

       The program handles any number of tests, each of which consists of a
       set of input lines. Each set starts with a regular expression pattern,
       followed by any number of subject lines to be matched against that
       pattern.
       In between sets of test data, command lines that begin with # may
       appear. This file format, with some restrictions, can also be processed
       by the perltest.sh script that is distributed with PCRE2 as a means of
       checking that the behaviour of PCRE2 and Perl is the same. For a
       specification of perltest.sh, see the comments near its beginning. See
       also the #perltest command below.

       When the input is a terminal, pcre2test prompts for each line of input,
       using "re>" to prompt for regular expression patterns, and "data>" to
       prompt for subject lines. Command lines starting with # can be entered
       only in response to the "re>" prompt.

       Each subject line is matched separately and independently. If you want
       to do multi-line matches, you have to use the \n escape sequence (or \r
       or \r\n, etc., depending on the newline setting) in a single line of
       input to encode the newline sequences. There is no limit on the length
       of subject lines; the input buffer is automatically extended if it is
       too small. There are replication features that make it possible to
       generate long repetitive pattern or subject lines without having to
       supply them explicitly.

       An empty line or the end of the file signals the end of the subject
       lines for a test, at which point a new pattern or command line is
       expected if there is still input to be read.


COMMAND LINES

       In between sets of test data, a line that begins with # is interpreted
       as a command line. If the first character is followed by white space or
       an exclamation mark, the line is treated as a comment, and ignored.
       Otherwise, the following commands are recognized:

         #forbid_utf

       Subsequent patterns automatically have the PCRE2_NEVER_UTF and
       PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
       and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of
       patterns.
       This command also forces an error if a subsequent pattern contains any
       occurrences of \P, \p, or \X, which are still supported when PCRE2_UTF
       is not set, but which require Unicode property support to be included
       in the library.

       This is a trigger guard that is used in test files to ensure that UTF
       or Unicode property tests are not accidentally added to files that are
       used when Unicode support is not included in the library. Setting
       PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
       by the use of #pattern; the difference is that #forbid_utf cannot be
       unset, and the automatic options are not displayed in pattern
       information, to avoid cluttering up test output.

         #load <filename>

       This command is used to load a set of precompiled patterns from a file,
       as described in the section entitled "Saving and restoring compiled
       patterns" below.

         #loadtables <filename>

       This command is used to load a set of binary character tables that can
       be accessed by the tables=3 qualifier. Such tables can be created by
       the pcre2_dftables program with the -b option.

         #newline_default [<newline-list>]

       When PCRE2 is built, a default newline convention can be specified.
       This determines which characters and/or character pairs are recognized
       as indicating a newline in a pattern or subject string. The default can
       be overridden when a pattern is compiled. The standard test files
       contain tests of various newline conventions, but the majority of the
       tests expect a single linefeed to be recognized as a newline by
       default. Without special action the tests would fail when PCRE2 is
       compiled with either CR or CRLF as the default newline.

       The #newline_default command specifies a list of newline types that are
       acceptable as the default.
       Each type must be one of CR, LF, CRLF, ANYCRLF, ANY, or NUL (in upper
       or lower case), for example:

         #newline_default LF Any anyCRLF

       If the default newline is in the list, this command has no effect.
       Otherwise, except when testing the POSIX API, a newline modifier that
       specifies the first newline convention in the list (LF in the above
       example) is added to any pattern that does not already have a newline
       modifier. If the newline list is empty, the feature is turned off. This
       command is present in a number of the standard test input files.

       When the POSIX API is being tested there is no way to override the
       default newline convention, though it is possible to set the newline
       convention from within the pattern. A warning is given if the posix or
       posix_nosub modifier is used when #newline_default would set a default
       for the non-POSIX API.

         #pattern <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent patterns. Modifiers on a pattern can change these settings.

         #perltest

       This line is used in test files that can also be processed by
       perltest.sh to confirm that Perl gives the same results as PCRE2.
       Subsequent tests are checked for the use of pcre2test features that are
       incompatible with the perltest.sh script.

       Patterns must use '/' as their delimiter, and only certain modifiers
       are supported. Comment lines, #pattern commands, and #subject commands
       that set or unset "mark" are recognized and acted on. The #perltest,
       #forbid_utf, and #newline_default commands, which are needed in the
       relevant pcre2test files, are silently ignored. All other command lines
       are ignored, but give a warning message. The #perltest command helps
       detect tests that are accidentally put in the wrong file or use the
       wrong delimiter. For more details of the perltest.sh script see the
       comments it contains.

         #pop [<modifiers>]
         #popcopy [<modifiers>]

       These commands are used to manipulate the stack of compiled patterns,
       as described in the section entitled "Saving and restoring compiled
       patterns" below.

         #save <filename>

       This command is used to save a set of compiled patterns to a file, as
       described in the section entitled "Saving and restoring compiled
       patterns" below.

         #subject <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent subject lines. Modifiers on a subject line can change these
       settings.


MODIFIER SYNTAX

       Modifier lists are used with both pattern and subject lines. Items in a
       list are separated by commas followed by optional white space. Trailing
       whitespace in a modifier list is ignored. Some modifiers may be given
       for both patterns and subject lines, whereas others are valid only for
       one or the other. Each modifier has a long name, for example
       "anchored", and some of them must be followed by an equals sign and a
       value, for example, "offset=12". Values cannot contain comma
       characters, but may contain spaces. Modifiers that do not take values
       may be preceded by a minus sign to turn off a previous setting.

       A few of the more common modifiers can also be specified as single
       letters, for example "i" for "caseless". In documentation, following
       the Perl convention, these are written with a slash ("the /i modifier")
       for clarity. Abbreviated modifiers must all be concatenated in the
       first item of a modifier list. If the first item is not recognized as a
       long modifier name, it is interpreted as a sequence of these
       abbreviations. For example:

         /abc/ig,newline=cr,jit=3

       This is a pattern line whose modifier list starts with two one-letter
       modifiers (/i and /g). The lower-case abbreviated modifiers are the
       same as used in Perl.
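
       The parsing rules above can be modelled in a few lines of Python. This
       is only an illustrative sketch, not the code that pcre2test itself
       uses. The function name parse_modifier_list and the two lookup tables
       are hypothetical; the tables hold a small sample of modifiers, and the
       long name "global" for /g is an assumption, since this section does not
       spell it out. The special treatment of a repeated /x is not modelled.

```python
# Illustrative model of the modifier-list rules; NOT pcre2test's own code.

# Sample of single-letter abbreviations (hypothetical, incomplete table).
ABBREVIATIONS = {
    "i": "caseless",
    "s": "dotall",
    "x": "extended",
    "m": "multiline",
    "n": "no_auto_capture",
    "g": "global",  # assumed long name for /g
    "B": "bincode",
    "I": "info",
}

# Sample of long modifier names (hypothetical, incomplete set).
LONG_NAMES = set(ABBREVIATIONS.values()) | {
    "anchored", "offset", "newline", "jit", "utf", "notbol", "notempty",
}


def parse_modifier_list(text):
    """Return a list of (long_name, value_or_None, enabled) tuples."""
    # Items are separated by commas followed by optional white space;
    # trailing white space in the whole list is ignored.
    items = [item.lstrip() for item in text.rstrip().split(",")]
    parsed = []
    for index, item in enumerate(items):
        name, has_value, value = item.partition("=")
        negated = name.startswith("-")
        bare_name = name[1:] if negated else name
        if bare_name in LONG_NAMES:
            # Only modifiers that take no value may be turned off with "-".
            if negated and has_value:
                raise ValueError(f"cannot negate a modifier with a value: {item!r}")
            parsed.append((bare_name, value if has_value else None, not negated))
        elif index == 0 and not has_value and not negated:
            # An unrecognized first item is a run of one-letter abbreviations.
            for letter in item:
                if letter not in ABBREVIATIONS:
                    raise ValueError(f"unknown abbreviation: {letter!r}")
                parsed.append((ABBREVIATIONS[letter], None, True))
        else:
            raise ValueError(f"unrecognized modifier: {item!r}")
    return parsed


if __name__ == "__main__":
    # The modifier list from the example pattern line /abc/ig,newline=cr,jit=3
    for entry in parse_modifier_list("ig,newline=cr,jit=3"):
        print(entry)
```

       Splitting on commas before looking at the items mirrors the rule that
       values may contain spaces but never commas, and checking for a long
       name first mirrors the rule that only an unrecognized first item is
       read as abbreviations.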


PATTERN SYNTAX

       A pattern line must start with one of the following characters (common
       symbols, excluding pattern meta-characters):

         / ! " ' ` - = _ : ; , % & @ ~

       This is interpreted as the pattern's delimiter. A regular expression
       may be continued over several input lines, in which case the newline
       characters are included within it. It is possible to include the
       delimiter within the pattern by escaping it with a backslash, for
       example

         /abc\/def/

       If you do this, the escape and the delimiter form part of the pattern,
       but since the delimiters are all non-alphanumeric, this does not affect
       its interpretation. If the terminating delimiter is immediately
       followed by a backslash, for example,

         /abc/\

       then a backslash is added to the end of the pattern. This is done to
       provide a way of testing the error condition that arises if a pattern
       finishes with a backslash, because

         /abc\/

       is interpreted as the first line of a pattern that starts with "abc/",
       causing pcre2test to read the next line as a continuation of the
       regular expression.

       A pattern can be followed by a modifier list (details below).


SUBJECT LINE SYNTAX

       Before each subject line is passed to pcre2_match() or
       pcre2_dfa_match(), leading and trailing white space is removed, and the
       line is scanned for backslash escapes, unless the subject_literal
       modifier was set for the pattern.
       The following provide a means of encoding non-printing characters in a
       visible way:

         \a         alarm (BEL, \x07)
         \b         backspace (\x08)
         \e         escape (\x1b)
         \f         form feed (\x0c)
         \n         newline (\x0a)
         \r         carriage return (\x0d)
         \t         tab (\x09)
         \v         vertical tab (\x0b)
         \nnn       octal character (up to 3 octal digits); always
                      a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
         \o{dd...}  octal character (any number of octal digits)
         \xhh       hexadecimal byte (up to 2 hex digits)
         \x{hh...}  hexadecimal character (any number of hex digits)

       The use of \x{hh...} is not dependent on the use of the utf modifier on
       the pattern. It is recognized always. There may be any number of
       hexadecimal digits inside the braces; invalid values provoke error
       messages.

       Note that \xhh specifies one byte rather than one character in UTF-8
       mode; this makes it possible to construct invalid UTF-8 sequences for
       testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
       character in UTF-8 mode, generating more than one byte if the value is
       greater than 127. When testing the 8-bit library not in UTF-8 mode,
       \x{hh} generates one byte for values less than 256, and causes an error
       for greater values.

       In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
       possible to construct invalid UTF-16 sequences for testing purposes.

       In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
       makes it possible to construct invalid UTF-32 sequences for testing
       purposes.

       There is a special backslash sequence that specifies replication of one
       or more characters:

         \[<characters>]{<count>}

       This makes it possible to test long strings without having to provide
       them as part of the file. For example:

         \[abc]{4}

       is converted to "abcabcabcabc". This feature does not support nesting.
       To include a closing square bracket in the characters, code it as \x5D.

       A backslash followed by an equals sign marks the end of the subject
       string and the start of a modifier list. For example:

         abc\=notbol,notempty

       If the subject string is empty and \= is followed by whitespace, the
       line is treated as a comment line, and is not used for matching. For
       example:

         \= This is a comment.
         abc\= This is an invalid modifier list.

       A backslash followed by any other non-alphanumeric character just
       escapes that character. A backslash followed by anything else causes an
       error. However, if the very last character in the line is a backslash
       (and there is no modifier list), it is ignored. This gives a way of
       passing an empty line as data, since a real empty line terminates the
       data input.

       If the subject_literal modifier is set for a pattern, all subject lines
       that follow are treated as literals, with no special treatment of
       backslashes. No replication is possible, and any subject modifiers must
       be set as defaults by a #subject command.


PATTERN MODIFIERS

       There are several types of modifier that can appear in pattern lines.
       Except where noted below, they may also be used in #pattern commands. A
       pattern's modifier list can add to or override default modifiers that
       were set by a previous #pattern command.

   Setting compilation options

       The following modifiers set options for pcre2_compile(). Most of them
       set bits in the options argument of that function, but those whose
       names start with PCRE2_EXTRA are additional options that are set in the
       compile context. For the main options, there are some single-letter
       abbreviations that are the same as Perl options. There is special
       handling for /x: if a second x is present, PCRE2_EXTENDED is converted
       into PCRE2_EXTENDED_MORE as in Perl.
       A third appearance adds PCRE2_EXTENDED as well, though this makes no
       difference to the way pcre2_compile() behaves. See pcre2api for a
       description of the effects of these options.

             allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
             allow_surrogate_escapes   set PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
             alt_bsux                  set PCRE2_ALT_BSUX
             alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
             alt_verbnames             set PCRE2_ALT_VERBNAMES
             anchored                  set PCRE2_ANCHORED
             auto_callout              set PCRE2_AUTO_CALLOUT
             bad_escape_is_literal     set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
         /i  caseless                  set PCRE2_CASELESS
             dollar_endonly            set PCRE2_DOLLAR_ENDONLY
         /s  dotall                    set PCRE2_DOTALL
             dupnames                  set PCRE2_DUPNAMES
             endanchored               set PCRE2_ENDANCHORED
             escaped_cr_is_lf          set PCRE2_EXTRA_ESCAPED_CR_IS_LF
         /x  extended                  set PCRE2_EXTENDED
         /xx extended_more             set PCRE2_EXTENDED_MORE
             extra_alt_bsux            set PCRE2_EXTRA_ALT_BSUX
             firstline                 set PCRE2_FIRSTLINE
             literal                   set PCRE2_LITERAL
             match_line                set PCRE2_EXTRA_MATCH_LINE
             match_invalid_utf         set PCRE2_MATCH_INVALID_UTF
             match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
             match_word                set PCRE2_EXTRA_MATCH_WORD
         /m  multiline                 set PCRE2_MULTILINE
             never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
             never_ucp                 set PCRE2_NEVER_UCP
             never_utf                 set PCRE2_NEVER_UTF
         /n  no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
             no_auto_possess           set PCRE2_NO_AUTO_POSSESS
             no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
             no_start_optimize         set PCRE2_NO_START_OPTIMIZE
             no_utf_check              set PCRE2_NO_UTF_CHECK
             ucp                       set PCRE2_UCP
             ungreedy                  set PCRE2_UNGREEDY
             use_offset_limit          set PCRE2_USE_OFFSET_LIMIT
             utf                       set PCRE2_UTF

       As well as turning on the PCRE2_UTF option, the utf modifier causes all
       non-printing characters in output strings to be printed using the
       \x{hh...} notation. Otherwise, those less than 0x100 are output in hex
       without the curly brackets.
       Setting utf in 16-bit or 32-bit mode also causes pattern and subject
       strings to be translated to UTF-16 or UTF-32, respectively, before
       being passed to library functions.

   Setting compilation controls

       The following modifiers affect the compilation process or request
       information about the pattern. There are single-letter abbreviations
       for some that are heavily used in the test files.

             bsr=[anycrlf|unicode]     specify \R handling
         /B  bincode                   show binary code without lengths
             callout_info              show callout information
             convert=<options>         request foreign pattern conversion
             convert_glob_escape=c     set glob escape character
             convert_glob_separator=c  set glob separator character
             convert_length            set convert buffer length
             debug                     same as info,fullbincode
             framesize                 show matching frame size
             fullbincode               show binary code with lengths
         /I  info                      show info about compiled pattern
             hex                       unquoted characters are hexadecimal
             jit[=<number>]            use JIT
             jitfast                   use JIT fast path
             jitverify                 verify JIT use
             locale=<name>             use this locale
             max_pattern_length=<n>    set the maximum pattern length
             memory                    show memory used
             newline=<type>            set newline type
             null_context              compile with a NULL context
             parens_nest_limit=<n>     set maximum parentheses depth
             posix                     use the POSIX API
             posix_nosub               use the POSIX API with REG_NOSUB
             push                      push compiled pattern onto the stack
             pushcopy                  push a copy onto the stack
             stackguard=<number>       test the stackguard feature
             subject_literal           treat all subject lines as literal
             tables=[0|1|2|3]          select internal tables
             use_length                do not zero-terminate the pattern
             utf8_input                treat input as UTF-8

       The effects of these modifiers are described in the following sections.

   Newline and \R handling

       The bsr modifier specifies what \R in a pattern should match. If it is
       set to "anycrlf", \R matches CR, LF, or CRLF only.
       If it is set to "unicode", \R matches any Unicode newline sequence. The
       default can be specified when PCRE2 is built; if it is not, the default
       is set to Unicode.

       The newline modifier specifies which characters are to be interpreted
       as newlines, both in the pattern and in subject lines. The type must be
       one of CR, LF, CRLF, ANYCRLF, ANY, or NUL (in upper or lower case).

   Information about a pattern

       The debug modifier is a shorthand for info,fullbincode, requesting all
       available information.

       The bincode modifier causes a representation of the compiled code to be
       output after compilation. This information does not contain length and
       offset values, which ensures that the same output is generated for
       different internal link sizes and different code unit widths. By using
       bincode, the same regression tests can be used in different
       environments.

       The fullbincode modifier, by contrast, does include length and offset
       values. This is used in a few special tests that run only for specific
       code unit widths and link sizes, and is also useful for one-off tests.

       The info modifier requests information about the compiled pattern
       (whether it is anchored, has a fixed first character, and so on). The
       information is obtained from the pcre2_pattern_info() function. Here
       are some typical examples:

           re> /(?i)(^a|^b)/m,info
         Capture group count = 1
         Compile options: multiline
         Overall options: caseless multiline
         First code unit at start or follows newline
         Subject length lower bound = 1

           re> /(?i)abc/info
         Capture group count = 0
         Compile options: <none>
         Overall options: caseless
         First code unit = 'a' (caseless)
         Last code unit = 'c' (caseless)
         Subject length lower bound = 3

       "Compile options" are those specified by modifiers; "overall options"
       have added options that are taken or deduced from the pattern.
       If both sets of options are the same, just a single "options" line is
       output; if there are no options, the line is omitted. "First code unit"
       is where any match must start; if there is more than one they are
       listed as "starting code units". "Last code unit" is the last literal
       code unit that must be present in any match. This is not necessarily
       the last character. These lines are omitted if no starting or ending
       code units are recorded. The subject length line is omitted when
       no_start_optimize is set because the minimum length is not calculated
       when it can never be used.

       The framesize modifier shows the size, in bytes, of the storage frames
       used by pcre2_match() for handling backtracking. The size depends on
       the number of capturing parentheses in the pattern.

       The callout_info modifier requests information about all the callouts
       in the pattern. A list of them is output at the end of any other
       information that is requested. For each callout, either its number or
       string is given, followed by the item that follows it in the pattern.

   Passing a NULL context

       Normally, pcre2test passes a context block to pcre2_compile(). If the
       null_context modifier is set, however, NULL is passed. This is for
       testing that pcre2_compile() behaves correctly in this case (it uses
       default values).

   Specifying pattern characters in hexadecimal

       The hex modifier specifies that the characters of the pattern, except
       for substrings enclosed in single or double quotes, are to be
       interpreted as pairs of hexadecimal digits. This feature is provided as
       a way of creating patterns that contain binary zeros and other
       non-printing characters. White space is permitted between pairs of
       digits. For example, this pattern contains three characters:

         /ab 32 59/hex

       Parts of such a pattern are taken literally if quoted.
       This pattern contains nine characters, only two of which are specified
       in hexadecimal:

         /ab "literal" 32/hex

       Either single or double quotes may be used. There is no way of
       including the quoting delimiter within a quoted substring. The hex and
       expand modifiers are mutually exclusive.

   Specifying the pattern's length

       By default, patterns are passed to the compiling functions as
       zero-terminated strings, but they can instead be passed by length. The
       use_length modifier causes this to happen. Using a length happens
       automatically (whether or not use_length is set) when hex is set,
       because patterns specified in hexadecimal may contain binary zeros.

       If hex or use_length is used with the POSIX wrapper API (see "Using
       the POSIX wrapper API" below), the REG_PEND extension is used to pass
       the pattern's length.

   Specifying wide characters in 16-bit and 32-bit modes

       In 16-bit and 32-bit modes, all input is automatically treated as
       UTF-8 and translated to UTF-16 or UTF-32 when the utf modifier is set.
       For testing the 16-bit and 32-bit libraries in non-UTF mode, the
       utf8_input modifier can be used. It is mutually exclusive with utf.
       Input lines are interpreted as UTF-8 as a means of specifying wide
       characters. More details are given in "Input encoding" above.

   Generating long repetitive patterns

       Some tests use long patterns that are very repetitive. Instead of
       creating a very long input line for such a pattern, you can use a
       special repetition feature, similar to the one described for subject
       lines above. If the expand modifier is present on a pattern, parts of
       the pattern that have the form

         \[<characters>]{<count>}

       are expanded before the pattern is passed to pcre2_compile(). For
       example, \[AB]{6000} is expanded to "AB" repeated 6000 times
       ("ABAB..."). This construction cannot be nested.
       An initial "\[" sequence is recognized only if "]{" followed by
       decimal digits and "}" is found later in the pattern. If not, the
       characters remain in the pattern unaltered. The expand and hex
       modifiers are mutually exclusive.

       If part of an expanded pattern looks like an expansion, but is really
       part of the actual pattern, unwanted expansion can be avoided by
       giving two values in the quantifier. For example, \[AB]{6000,6000} is
       not recognized as an expansion item.

       If the info modifier is set on an expanded pattern, the result of the
       expansion is included in the information that is output.

   JIT compilation

       Just-in-time (JIT) compiling is a heavyweight optimization that can
       greatly speed up pattern matching. See the pcre2jit documentation for
       details. JIT compiling happens, optionally, after a pattern has been
       successfully compiled into an internal form. The JIT compiler converts
       this to optimized machine code. It needs to know whether the
       match-time options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going
       to be used, because different code is generated for the different
       cases. See the partial modifier in "Subject Modifiers" below for
       details of how these options are specified for each match attempt.

       JIT compilation is requested by the jit pattern modifier, which may
       optionally be followed by an equals sign and a number in the range 0
       to 7.
       The three bits that make up the number specify which of the three JIT
       operating modes are to be compiled:

         1  compile JIT code for non-partial matching
         2  compile JIT code for soft partial matching
         4  compile JIT code for hard partial matching

       The possible values for the jit modifier are therefore:

         0  disable JIT
         1  normal matching only
         2  soft partial matching only
         3  normal and soft partial matching
         4  hard partial matching only
         5  normal and hard partial matching
         6  soft and hard partial matching only
         7  all three modes

       If no number is given, 7 is assumed. The phrase "partial matching"
       means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or
       the PCRE2_PARTIAL_HARD option set. Note that such a call may return a
       complete match; the options enable the possibility of a partial match,
       but do not require it. Note also that if you request JIT compilation
       only for partial matching (for example, jit=2) but do not set the
       partial modifier on a subject line, that match will not use JIT code
       because none was compiled for non-partial matching.

       If JIT compilation is successful, the compiled JIT code will
       automatically be used when an appropriate type of match is run, except
       when incompatible run-time options are specified. For more details,
       see the pcre2jit documentation. See also the jitstack modifier below
       for a way of setting the size of the JIT stack.

       If the jitfast modifier is specified, matching is done using the JIT
       "fast path" interface, pcre2_jit_match(), which skips some of the
       sanity checks that are done by pcre2_match(), and of course does not
       work when JIT is not supported. If jitfast is specified without jit,
       jit=7 is assumed.

       If the jitverify modifier is specified, information about the compiled
       pattern shows whether JIT compilation was or was not successful. If
       jitverify is specified without jit, jit=7 is assumed.
       If JIT compilation is successful when jitverify is set, the text
       "(JIT)" is added to the first output line after a match or non-match
       when JIT-compiled code was actually used in the match.

   Setting a locale

       The locale modifier must specify the name of a locale, for example:

         /pattern/locale=fr_FR

       The given locale is set, pcre2_maketables() is called to build a set
       of character tables for the locale, and this is then passed to
       pcre2_compile() when compiling the regular expression. The same tables
       are used when matching the following subject lines. The locale
       modifier applies only to the pattern on which it appears, but can be
       given in a #pattern command if a default is needed. Setting a locale
       and setting alternate character tables are mutually exclusive.

   Showing pattern memory

       The memory modifier causes the size in bytes of the memory used to
       hold the compiled pattern to be output. This does not include the size
       of the pcre2_code block; it is just the actual compiled data. If the
       pattern is subsequently passed to the JIT compiler, the size of the
       JIT compiled code is also output. Here is an example:

           re> /a(b)c/jit,memory
         Memory allocation (code space): 21
         Memory allocation (JIT code): 1910

   Limiting nested parentheses

       The parens_nest_limit modifier sets a limit on the depth of nested
       parentheses in a pattern. Breaching the limit causes a compilation
       error. The default for the library is set when PCRE2 is built, but
       pcre2test sets its own default of 220, which is required for running
       the standard test suite.

   Limiting the pattern length

       The max_pattern_length modifier sets a limit, in code units, to the
       length of pattern that pcre2_compile() will accept. Breaching the
       limit causes a compilation error. The default is the largest number a
       PCRE2_SIZE variable can hold (essentially unlimited).
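
       As a rough illustration of what "in code units" means for
       max_pattern_length, the following Python sketch counts the code units
       that one pattern occupies at each library width. It assumes the
       pattern is handled in UTF mode. The helper names code_unit_length()
       and within_limit() are hypothetical; they are not part of pcre2test or
       of the PCRE2 API.

```python
# Sketch only: counts code units for a pattern at a given library width,
# assuming UTF mode (UTF-8, UTF-16, or UTF-32 encoding of the pattern).
ENCODINGS = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}


def code_unit_length(pattern: str, width: int) -> int:
    """Return the number of code units `pattern` needs at `width` bits."""
    encoded = pattern.encode(ENCODINGS[width])
    return len(encoded) // (width // 8)


def within_limit(pattern: str, width: int, max_pattern_length: int) -> bool:
    """True if the pattern would not breach the given limit (hypothetical)."""
    return code_unit_length(pattern, width) <= max_pattern_length


# Three ASCII letters, one two-byte character, one supplementary character.
pattern = "caf\u00e9\U0001F600"
for width in (8, 16, 32):
    print(width, code_unit_length(pattern, width), within_limit(pattern, width, 6))
```

       The same pattern can therefore stay within a given limit when tested
       with the 16-bit or 32-bit library and breach it with the 8-bit
       library.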

   Using the POSIX wrapper API

       The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
       the POSIX wrapper API rather than its native API. When posix_nosub is
       used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
       wrapper supports only the 8-bit library. Note that using the wrapper
       does not imply POSIX matching semantics; for more detail see the
       pcre2posix documentation. The following pattern modifiers set options
       for the regcomp() function:

         caseless           REG_ICASE
         multiline          REG_NEWLINE
         dotall             REG_DOTALL     )
         ungreedy           REG_UNGREEDY   ) These options are not part of
         ucp                REG_UCP        )   the POSIX standard
         utf                REG_UTF8       )

       The regerror_buffsize modifier specifies a size for the error buffer
       that is passed to regerror() in the event of a compilation error. For
       example:

         /abc/posix,regerror_buffsize=20

       This provides a means of testing the behaviour of regerror() when the
       buffer is too small for the error message. If this modifier has not
       been set, a large buffer is used.

       The aftertext and allaftertext subject modifiers work as described
       below. All other modifiers are either ignored, with a warning message,
       or cause an error.

       The pattern is passed to regcomp() as a zero-terminated string by
       default, but if the use_length or hex modifiers are set, the REG_PEND
       extension is used to pass it by length.

   Testing the stack guard feature

       The stackguard modifier is used to test the use of
       pcre2_set_compile_recursion_guard(), a function that is provided to
       enable stack availability to be checked during compilation (see the
       pcre2api documentation for details). If the number specified by the
       modifier is greater than zero, pcre2_set_compile_recursion_guard() is
       called to set up a callback from pcre2_compile() to a local function.
       The argument it receives is the current parenthesis nesting depth; if
       this is greater than the value given by the modifier, non-zero is
       returned, causing the compilation to be aborted.

   Using alternative character tables

       The value specified for the tables modifier must be one of the digits
       0, 1, 2, or 3. It causes a specific set of built-in character tables
       to be passed to pcre2_compile(). This is used in the PCRE2 tests to
       check behaviour with different character tables. The digit specifies
       the tables as follows:

         0   do not pass any special character tables
         1   the default ASCII tables, as distributed in
               pcre2_chartables.c.dist
         2   a set of tables defining ISO 8859 characters
         3   a set of tables loaded by the #loadtables command

       In tables 2, some characters whose codes are greater than 128 are
       identified as letters, digits, spaces, etc. Tables 3 can be used only
       after a #loadtables command has loaded them from a binary file.
       Setting alternate character tables and setting a locale are mutually
       exclusive.

   Setting certain match controls

       The following modifiers are really subject modifiers, and are
       described under "Subject Modifiers" below. However, they may be
       included in a pattern's modifier list, in which case they are applied
       to every subject line that is processed with that pattern. These
       modifiers do not affect the compilation process.

             aftertext                   show text after match
             allaftertext                show text after captures
             allcaptures                 show all captures
             allvector                   show the entire ovector
             allusedtext                 show all consulted text
             altglobal                   alternative global matching
         /g  global                      global matching
             jitstack=<n>                set size of JIT stack
             mark                        show mark values
             replace=<string>            specify a replacement string
             startchar                   show starting character when relevant
             substitute_callout          use substitution callouts
             substitute_extended         use PCRE2_SUBSTITUTE_EXTENDED
             substitute_literal          use PCRE2_SUBSTITUTE_LITERAL
             substitute_matched          use PCRE2_SUBSTITUTE_MATCHED
             substitute_overflow_length  use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
             substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
             substitute_skip=<n>         skip substitution <n>
             substitute_stop=<n>         skip substitution <n> and following
             substitute_unknown_unset    use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
             substitute_unset_empty      use PCRE2_SUBSTITUTE_UNSET_EMPTY

       These modifiers may not appear in a #pattern command. If you want them
       as defaults, set them in a #subject command.

   Specifying literal subject lines

       If the subject_literal modifier is present on a pattern, all the
       subject lines that it matches are taken as literal strings, with no
       interpretation of backslashes. It is not possible to set subject
       modifiers on such lines, but any that are set as defaults by a
       #subject command are recognized.

   Saving a compiled pattern

       When a pattern with the push modifier is successfully compiled, it is
       pushed onto a stack of compiled patterns, and pcre2test expects the
       next line to contain a new pattern (or a command) instead of a subject
       line. This facility is used when saving compiled patterns to a file,
       as described in the section entitled "Saving and restoring compiled
       patterns" below.
       If pushcopy is used instead of push, a copy of the compiled pattern is
       stacked, leaving the original as current, ready to match the following
       input lines. This provides a way of testing the pcre2_code_copy()
       function. The push and pushcopy modifiers are incompatible with
       pattern modifiers such as global that act at match time. Any that are
       specified are ignored (for the stacked copy), with a warning message,
       except for replace, which causes an error. Note that jitverify, which
       is allowed, does not carry through to any subsequent matching that
       uses a stacked pattern.

   Testing foreign pattern conversion

       The experimental foreign pattern conversion functions in PCRE2 can be
       tested by setting the convert modifier. Its argument is a
       colon-separated list of options, which set the equivalent option for
       the pcre2_pattern_convert() function:

         glob                    PCRE2_CONVERT_GLOB
         glob_no_starstar        PCRE2_CONVERT_GLOB_NO_STARSTAR
         glob_no_wild_separator  PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
         posix_basic             PCRE2_CONVERT_POSIX_BASIC
         posix_extended          PCRE2_CONVERT_POSIX_EXTENDED
         unset                   Unset all options

       The "unset" value is useful for turning off a default that has been
       set by a #pattern command. When one of these options is set, the input
       pattern is passed to pcre2_pattern_convert(). If the conversion is
       successful, the result is reflected in the output and then passed to
       pcre2_compile(). The normal utf and no_utf_check options, if set,
       cause the PCRE2_CONVERT_UTF and PCRE2_CONVERT_NO_UTF_CHECK options to
       be passed to pcre2_pattern_convert().

       By default, the conversion function is allowed to allocate a buffer
       for its output. However, if the convert_length modifier is set to a
       value greater than zero, pcre2test passes a buffer of the given
       length. This makes it possible to test the length check.
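
       The colon-separated value of the convert modifier can be pictured as a
       simple lookup. The Python sketch below maps each name from the table
       above to the option it sets. The function name parse_convert() is
       hypothetical, and treating "unset" as clearing any names given before
       it in the same list is an assumption made for illustration; the text
       above says only that it unsets all options.

```python
# Names accepted by the convert modifier and the option each one sets,
# copied from the table in this section.
CONVERT_OPTIONS = {
    "glob": "PCRE2_CONVERT_GLOB",
    "glob_no_starstar": "PCRE2_CONVERT_GLOB_NO_STARSTAR",
    "glob_no_wild_separator": "PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR",
    "posix_basic": "PCRE2_CONVERT_POSIX_BASIC",
    "posix_extended": "PCRE2_CONVERT_POSIX_EXTENDED",
}


def parse_convert(value):
    """Return the option names selected by a convert=<value> modifier."""
    selected = []
    for name in value.split(":"):
        if name == "unset":
            # Assumption: "unset" discards everything selected so far.
            selected.clear()
        elif name in CONVERT_OPTIONS:
            option = CONVERT_OPTIONS[name]
            if option not in selected:
                selected.append(option)
        else:
            raise ValueError(f"unknown convert option: {name!r}")
    return selected


print(parse_convert("glob:glob_no_starstar"))
print(parse_convert("unset"))
```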

       The convert_glob_escape and convert_glob_separator modifiers can be
       used to specify the escape and separator characters for glob
       processing, overriding the defaults, which are operating-system
       dependent.


SUBJECT MODIFIERS

       The modifiers that can appear in subject lines and the #subject
       command are of two types.

   Setting match options

       The following modifiers set options for pcre2_match() or
       pcre2_dfa_match(). See the pcre2api documentation for a description of
       their effects.

             anchored                  set PCRE2_ANCHORED
             endanchored               set PCRE2_ENDANCHORED
             dfa_restart               set PCRE2_DFA_RESTART
             dfa_shortest              set PCRE2_DFA_SHORTEST
             no_jit                    set PCRE2_NO_JIT
             no_utf_check              set PCRE2_NO_UTF_CHECK
             notbol                    set PCRE2_NOTBOL
             notempty                  set PCRE2_NOTEMPTY
             notempty_atstart          set PCRE2_NOTEMPTY_ATSTART
             noteol                    set PCRE2_NOTEOL
             partial_hard (or ph)      set PCRE2_PARTIAL_HARD
             partial_soft (or ps)      set PCRE2_PARTIAL_SOFT

       The partial matching modifiers are provided with abbreviations because
       they appear frequently in tests.

       If the posix or posix_nosub modifier was present on the pattern,
       causing the POSIX wrapper API to be used, the only option-setting
       modifiers that have any effect are notbol, notempty, and noteol,
       causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be
       passed to regexec(). The other modifiers are ignored, with a warning
       message.

       There is one additional modifier that can be used with the POSIX
       wrapper. It is ignored (with a warning) if used for non-POSIX
       matching.

             posix_startend=<n>[:<m>]

       This causes the subject string to be passed to regexec() using the
       REG_STARTEND option, which uses offsets to specify which part of the
       string is searched. If only one number is given, the end offset is
       passed as the end of the subject string. For more detail of
       REG_STARTEND, see the pcre2posix documentation. If the subject string
       contains binary zeros (coded as escapes such as \x{00} because
       pcre2test does not support actual binary zeros in its input), you must
       use posix_startend to specify its length.

   Setting match controls

       The following modifiers affect the matching process or request
       additional information. Some of them may also be specified on a
       pattern line (see above), in which case they apply to every subject
       line that is matched against that pattern.

             aftertext                  show text after match
             allaftertext               show text after captures
             allcaptures                show all captures
             allvector                  show the entire ovector
             allusedtext                show all consulted text (non-JIT only)
             altglobal                  alternative global matching
             callout_capture            show captures at callout time
             callout_data=<n>           set a value to pass via callouts
             callout_error=<n>[:<m>]    control callout error
             callout_extra              show extra callout information
             callout_fail=<n>[:<m>]     control callout failure
             callout_no_where           do not show position of a callout
             callout_none               do not supply a callout function
             copy=<number or name>      copy captured substring
             depth_limit=<n>            set a depth limit
             dfa                        use pcre2_dfa_match()
             find_limits                find match and depth limits
             get=<number or name>       extract captured substring
             getall                     extract all captured substrings
         /g  global                     global matching
             heap_limit=<n>             set a limit on heap memory (Kbytes)
             jitstack=<n>               set size of JIT stack
             mark                       show mark values
             match_limit=<n>            set a match limit
             memory                     show heap memory usage
             null_context               match with a NULL context
             offset=<n>                 set starting offset
             offset_limit=<n>           set offset limit
             ovector=<n>                set size of output vector
             recursion_limit=<n>        obsolete synonym for depth_limit
             replace=<string>           specify a replacement string
             startchar                  show starting character when relevant
             startoffset=<n>            same as offset=<n>
             substitute_callout         use substitution callouts
             substitute_extended        use PCRE2_SUBSTITUTE_EXTENDED
             substitute_literal         use PCRE2_SUBSTITUTE_LITERAL
             substitute_matched         use PCRE2_SUBSTITUTE_MATCHED
             substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
             substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
             substitute_skip=<n>        skip substitution number n
             substitute_stop=<n>        skip substitution number n and greater
             substitute_unknown_unset   use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
             substitute_unset_empty     use PCRE2_SUBSTITUTE_UNSET_EMPTY
             zero_terminate             pass the subject as zero-terminated

       The effects of these modifiers are described in the following
       sections. When matching via the POSIX wrapper API, the aftertext,
       allaftertext, and ovector subject modifiers work as described below.
       All other modifiers are either ignored, with a warning message, or
       cause an error.

   Showing more text

       The aftertext modifier requests that as well as outputting the part of
       the subject string that matched the entire pattern, pcre2test should
       in addition output the remainder of the subject string. This is useful
       for tests where the subject contains multiple copies of the same
       substring. The allaftertext modifier requests the same action for
       captured substrings as well as the main matched substring. In each
       case the remainder is output on the following line with a plus
       character following the capture number.

       The allusedtext modifier requests that all the text that was consulted
       during a successful pattern match by the interpreter should be shown,
       for both full and partial matches. This feature is not supported for
       JIT matching, and if requested with JIT it is ignored (with a warning
       message).
       Setting this modifier affects the output if there is a lookbehind at
       the start of a match, or, for a complete match, a lookahead at the
       end, or if \K is used in the pattern. Characters that precede or
       follow the start and end of the actual match are indicated in the
       output by '<' or '>' characters underneath them. Here is an example:

           re> /(?<=pqr)abc(?=xyz)/
         data> 123pqrabcxyz456\=allusedtext
          0: pqrabcxyz
             <<<   >>>
         data> 123pqrabcxy\=ph,allusedtext
         Partial match: pqrabcxy
                        <<<

       The first, complete match shows that the matched string is "abc", with
       the preceding and following strings "pqr" and "xyz" having been
       consulted during the match (when processing the assertions). The
       partial match can indicate only the preceding string.

       The startchar modifier requests that the starting character for the
       match be indicated, if it is different from the start of the matched
       string. The only time when this occurs is when \K has been processed
       as part of the match. In this situation, the output for the matched
       string is displayed from the starting character instead of from the
       match point, with circumflex characters under the earlier characters.
       For example:

           re> /abc\Kxyz/
         data> abcxyz\=startchar
          0: abcxyz
             ^^^

       Unlike allusedtext, the startchar modifier can be used with JIT.
       However, these two modifiers are mutually exclusive.

   Showing the value of all capture groups

       The allcaptures modifier requests that the values of all potential
       captured parentheses be output after a match. By default, only those
       up to the highest one actually used in the match are output
       (corresponding to the return code from pcre2_match()). Groups that did
       not take part in the match are output as "<unset>".
       This modifier is not relevant for DFA matching (which does no
       capturing) and does not apply when replace is specified; it is
       ignored, with a warning message, if present.

   Showing the entire ovector, for all outcomes

       The allvector modifier requests that the entire ovector be shown,
       whatever the outcome of the match. Compare allcaptures, which shows
       only up to the maximum number of capture groups for the pattern, and
       then only for a successful complete non-DFA match. This modifier,
       which acts after any match result, and also for DFA matching, provides
       a means of checking that there are no unexpected modifications to
       ovector fields. Before each match attempt, the ovector is filled with
       a special value, and if this is found in both elements of a capturing
       pair, "<unchanged>" is output. After a successful match, this applies
       to all groups after the maximum capture group for the pattern. In
       other cases it applies to the entire ovector. After a partial match,
       the first two elements are the only ones that should be set. After a
       DFA match, the amount of ovector that is used depends on the number of
       matches that were found.

   Testing pattern callouts

       A callout function is supplied when pcre2test calls the library
       matching functions, unless callout_none is specified. Its behaviour
       can be controlled by various modifiers listed above whose names begin
       with callout_. Details are given in the section entitled "Callouts"
       below. Testing callouts from pcre2_substitute() is described
       separately in "Testing the substitution function" below.

   Finding all matches in a string

       Searching for all possible matches within a subject can be requested
       by the global or altglobal modifier. After finding a match, the
       matching function is called again to search the remainder of the
       subject.
       The difference between global and altglobal is that the former uses
       the start_offset argument to pcre2_match() or pcre2_dfa_match() to
       start searching at a new point within the entire string (which is what
       Perl does), whereas the latter passes over a shortened subject. This
       makes a difference to the matching process if the pattern begins with
       a lookbehind assertion (including \b or \B).

       If an empty string is matched, the next match is done with the
       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to
       search for another, non-empty, match at the same point in the subject.
       If this match fails, the start offset is advanced, and the normal
       match is retried. This imitates the way Perl handles such cases when
       using the /g modifier or the split() function. Normally, the start
       offset is advanced by one character, but if the newline convention
       recognizes CRLF as a newline, and the current character is CR followed
       by LF, an advance of two characters occurs.

   Testing substring extraction functions

       The copy and get modifiers can be used to test the
       pcre2_substring_copy_xxx() and pcre2_substring_get_xxx() functions.
       They can be given more than once, and each can specify a capture group
       name or number, for example:

          abcd\=copy=1,copy=3,get=G1

       If the #subject command is used to set default copy and/or get lists,
       these can be unset by specifying a negative number to cancel all
       numbered groups and an empty name to cancel all named groups.

       The getall modifier tests pcre2_substring_list_get(), which extracts
       all captured substrings.

       If the subject line is successfully matched, the substrings extracted
       by the convenience functions are output with C, G, or L after the
       string number instead of a colon. This is in addition to the normal
       full list.
       The string length (that is, the return from the extraction function)
       is given in parentheses after each substring, followed by the name
       when the extraction was by name.

   Testing the substitution function

       If the replace modifier is set, the pcre2_substitute() function is
       called instead of one of the matching functions (or after one call of
       pcre2_match() in the case of PCRE2_SUBSTITUTE_MATCHED). Note that
       replacement strings cannot contain commas, because a comma signifies
       the end of a modifier. This is not thought to be an issue in a test
       program.

       Unlike subject strings, pcre2test does not process replacement strings
       for escape sequences. In UTF mode, a replacement string is checked to
       see if it is a valid UTF-8 string. If so, it is correctly converted to
       a UTF string of the appropriate code unit width. If it is not a valid
       UTF-8 string, the individual code units are copied directly. This
       provides a means of passing an invalid UTF-8 string for testing
       purposes.

       The following modifiers set options (in addition to the normal match
       options) for pcre2_substitute():

         global                      PCRE2_SUBSTITUTE_GLOBAL
         substitute_extended         PCRE2_SUBSTITUTE_EXTENDED
         substitute_literal          PCRE2_SUBSTITUTE_LITERAL
         substitute_matched          PCRE2_SUBSTITUTE_MATCHED
         substitute_overflow_length  PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
         substitute_replacement_only PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
         substitute_unknown_unset    PCRE2_SUBSTITUTE_UNKNOWN_UNSET
         substitute_unset_empty      PCRE2_SUBSTITUTE_UNSET_EMPTY

       See the pcre2api documentation for details of these options.

       After a successful substitution, the modified string is output,
       preceded by the number of replacements. This may be zero if there were
       no matches.
       Here is a simple example of a substitution test:

         /abc/replace=xxx
             =abc=abc=
          1: =xxx=abc=
             =abc=abc=\=global
          2: =xxx=xxx=

       Subject and replacement strings should be kept relatively short (fewer
       than 256 characters) for substitution tests, as fixed-size buffers are
       used. To make it easy to test for buffer overflow, if the replacement
       string starts with a number in square brackets, that number is passed
       to pcre2_substitute() as the size of the output buffer, with the
       replacement string starting at the next character. Here is an example
       that tests the edge case:

         /abc/
             123abc123\=replace=[10]XYZ
          1: 123XYZ123
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory

       The default action of pcre2_substitute() is to return
       PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
       the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
       substitute_overflow_length modifier), pcre2_substitute() continues to
       go through the motions of matching and substituting (but not doing any
       callouts), in order to compute the size of buffer that is required.
       When this happens, pcre2test shows the required buffer length (which
       includes space for the trailing zero) as part of the error message.
       For example:

         /abc/substitute_overflow_length
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory: 10 code units are needed

       A replacement string is ignored with POSIX and DFA matching.
       Specifying partial matching provokes an error return ("bad option
       value") from pcre2_substitute().

   Testing substitute callouts

       If the substitute_callout modifier is set, a substitution callout
       function is set up. The null_context modifier must not be set, because
       the address of the callout function is passed in a match context.
       When the callout function is called (after each substitution), details
       of the input and output strings are output. For example:

         /abc/g,replace=<$0>,substitute_callout
             abcdefabcpqr
          1(1) Old 0 3 "abc" New 0 5 "<abc>"
          2(1) Old 6 9 "abc" New 8 13 "<abc>"
          2: <abc>def<abc>pqr

       The first number on each callout line is the count of matches. The
       parenthesized number is the number of pairs that are set in the
       ovector (that is, one more than the number of capturing groups that
       were set). These are followed by the offsets of the old substring and
       its contents, and then the same for the replacement.

       By default, the substitution callout function returns zero, which
       accepts the replacement and causes matching to continue if /g was
       used. Two further modifiers can be used to test other return values.
       If substitute_skip is set to a value greater than zero, the callout
       function returns +1 for the match of that number, and similarly
       substitute_stop returns -1. These cause the replacement to be
       rejected, and -1 causes no further matching to take place. If either
       of them is set, substitute_callout is assumed. For example:

         /abc/g,replace=<$0>,substitute_skip=1
             abcdefabcpqr
          1(1) Old 0 3 "abc" New 0 5 "<abc> SKIPPED"
          2(1) Old 6 9 "abc" New 6 11 "<abc>"
          2: abcdef<abc>pqr
             abcdefabcpqr\=substitute_stop=1
          1(1) Old 0 3 "abc" New 0 5 "<abc> STOPPED"
          1: abcdefabcpqr

       If both are set for the same number, stop takes precedence. Only a
       single skip or stop is supported, which is sufficient for testing that
       the feature works.

   Setting the JIT stack size

       The jitstack modifier provides a way of setting the maximum stack size
       that is used by the just-in-time optimization code. It is ignored if
       JIT optimization is not being used. The value is a number of kibibytes
       (units of 1024 bytes).
       Setting zero reverts to the default of 32KiB. Providing a stack that
       is larger than the default is necessary only for very complicated
       patterns. If jitstack is set non-zero on a subject line it overrides
       any value that was set on the pattern.

   Setting heap, match, and depth limits

       The heap_limit, match_limit, and depth_limit modifiers set the
       appropriate limits in the match context. These values are ignored
       when the find_limits modifier is specified.

   Finding minimum limits

       If the find_limits modifier is present on a subject line, pcre2test
       calls the relevant matching function several times, setting different
       values in the match context via pcre2_set_heap_limit(),
       pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds
       the minimum value for each parameter that allows the match to
       complete without error. If JIT is being used, only the match limit is
       relevant.

       When using this modifier, the pattern should not contain any limit
       settings such as (*LIMIT_MATCH=...) within it. If such a setting is
       present and is lower than the minimum matching value, the minimum
       value cannot be found because pcre2_set_match_limit() etc. are only
       able to reduce the value of an in-pattern limit; they cannot increase
       it.

       For non-DFA matching, the minimum depth_limit number is a measure of
       how much nested backtracking happens (that is, how deeply the
       pattern's tree is searched). In the case of DFA matching, depth_limit
       controls the depth of recursive calls of the internal function that
       is used for handling pattern recursion, lookaround assertions, and
       atomic groups.

       For non-DFA matching, the match_limit number is a measure of the
       amount of backtracking that takes place, and learning the minimum
       value can be instructive.
       For most simple matches, the number is quite small, but for patterns
       with very large numbers of matching possibilities, it can become
       large very quickly with increasing length of subject string. In the
       case of DFA matching, match_limit controls the total number of calls,
       both recursive and non-recursive, to the internal matching function,
       thus controlling the overall amount of computing resource that is
       used.

       For both kinds of matching, the heap_limit number, which is in
       kibibytes (units of 1024 bytes), limits the amount of heap memory
       used for matching. A value of zero disables the use of any heap
       memory; many simple pattern matches can be done without using the
       heap, so zero is not an unreasonable setting.

   Showing MARK names

       The mark modifier causes the names from backtracking control verbs
       that are returned from calls to pcre2_match() to be displayed. If a
       mark is returned for a match, non-match, or partial match, pcre2test
       shows it. For a match, it is on a line by itself, tagged with "MK:".
       Otherwise, it is added to the non-match message.

   Showing memory usage

       The memory modifier causes pcre2test to log the sizes of all heap
       memory allocation and freeing calls that occur during a call to
       pcre2_match() or pcre2_dfa_match(). These occur only when a match
       requires a bigger vector than the default for remembering
       backtracking points (pcre2_match()) or for internal workspace
       (pcre2_dfa_match()). In many cases there will be no heap memory used
       and therefore no additional output. No heap memory is allocated
       during matching with JIT, so in that case the memory modifier never
       has any effect. For this modifier to work, the null_context modifier
       must not be set on both the pattern and the subject, though it can be
       set on one or the other.
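
       The search described under "Finding minimum limits" above can be
       pictured with the following Python sketch. It is not pcre2test's own
       code: run_match is a hypothetical stand-in for calling the matching
       function with a given limit set in the match context, and the
       strategy of doubling and then bisecting is an assumption about one
       reasonable way to find the minimum.

```python
def find_minimum_limit(run_match, start=64, ceiling=2**32 - 1):
    """Return the smallest limit for which run_match(limit) succeeds.

    run_match is a hypothetical callable: it returns True when the match
    completes without a limit error. Success is assumed to be monotonic,
    so any limit above a sufficient one is also sufficient.
    """
    if run_match(0):
        return 0                 # zero is a valid setting, e.g. for heap_limit
    low = 0                      # largest limit known to be too small
    high = start
    # Phase 1: double the candidate until the match completes.
    while not run_match(high):
        if high >= ceiling:
            raise ValueError("no limit up to the ceiling is sufficient")
        low = high
        high = min(high * 2, ceiling)
    # Phase 2: bisect between the failing and the succeeding value.
    while high - low > 1:
        mid = (low + high) // 2
        if run_match(mid):
            high = mid
        else:
            low = mid
    return high


# Stand-in for a real match that needs a limit of at least 1234.
def needs_1234(limit):
    return limit >= 1234


print(find_minimum_limit(needs_1234))
```

       Doubling first keeps the number of trial matches small when the
       required value is large, which matters because each trial is a full
       match attempt.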

   Setting a starting offset

       The offset modifier sets an offset in the subject string at which
       matching starts. Its value is a number of code units, not characters.

   Setting an offset limit

       The offset_limit modifier sets a limit for unanchored matches. If a
       match cannot be found starting at or before this offset in the
       subject, a "no match" return is given. The data value is a number of
       code units, not characters. When this modifier is used, the
       use_offset_limit modifier must have been set for the pattern; if not,
       an error is generated.

   Setting the size of the output vector

       The ovector modifier applies only to the subject line in which it
       appears, though of course it can also be used to set a default in a
       #subject command. It specifies the number of pairs of offsets that
       are available for storing matching information. The default is 15.

       A value of zero is useful when testing the POSIX API because it
       causes regexec() to be called with a NULL capture vector. When not
       testing the POSIX API, a value of zero is used to cause
       pcre2_match_data_create_from_pattern() to be called, in order to
       create a match block of exactly the right size for the pattern. (It
       is not possible to create a match block with a zero-length ovector;
       there is always at least one pair of offsets.)

   Passing the subject as zero-terminated

       By default, the subject string is passed to a native API matching
       function with its correct length. In order to test the facility for
       passing a zero-terminated string, the zero_terminate modifier is
       provided. It causes the length to be passed as PCRE2_ZERO_TERMINATED.
       When matching via the POSIX interface, this modifier is ignored, with
       a warning.

       When testing pcre2_substitute(), this modifier also has the effect of
       passing the replacement string as zero-terminated.
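
       Because the offset and offset_limit values described above count code
       units rather than characters, the right number for a given character
       position depends on the library width. The Python sketch below shows
       the difference; it assumes UTF mode, so that the 8-bit, 16-bit and
       32-bit libraries see UTF-8, UTF-16 and UTF-32 respectively, and the
       function name is made up for this illustration.

```python
def code_unit_offset(text, char_index, width):
    """Return the code unit offset of the character at char_index.

    width is the code unit width in bits: 8, 16 or 32. The matching UTF
    encoding is assumed for each width.
    """
    encodings = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}
    prefix = text[:char_index].encode(encodings[width])
    return len(prefix) // (width // 8)


# "abc" starts at character 7, after an accented letter and an emoji.
subject = "caf\u00e9 \U0001F600 abc"
char_index = subject.index("abc")
for width in (8, 16, 32):
    print(width, code_unit_offset(subject, char_index, width))
```

       For this subject the same character position corresponds to offset
       11 in 8-bit code units, 8 in 16-bit code units, and 7 in 32-bit code
       units.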

   Passing a NULL context

       Normally, pcre2test passes a context block to pcre2_match(),
       pcre2_dfa_match(), pcre2_jit_match() or pcre2_substitute(). If the
       null_context modifier is set, however, NULL is passed. This is for
       testing that the matching and substitution functions behave correctly
       in this case (they use default values). This modifier cannot be used
       with the find_limits or substitute_callout modifiers.


THE ALTERNATIVE MATCHING FUNCTION

       By default, pcre2test uses the standard PCRE2 matching function,
       pcre2_match(), to match each subject line. PCRE2 also supports an
       alternative matching function, pcre2_dfa_match(), which operates in a
       different way, and has some restrictions. The differences between the
       two functions are described in the pcre2matching documentation.

       If the dfa modifier is set, the alternative matching function is
       used. This function finds all possible matches at a given point in
       the subject. If, however, the dfa_shortest modifier is set,
       processing stops after the first match is found. This is always the
       shortest possible match.


DEFAULT OUTPUT FROM pcre2test

       This section describes the output when the normal matching function,
       pcre2_match(), is being used.

       When a match succeeds, pcre2test outputs the list of captured
       substrings, starting with number 0 for the string that matched the
       whole pattern. Otherwise, it outputs "No match" when the return is
       PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
       matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
       this is the entire substring that was inspected during the partial
       match; it may include characters before the actual match start if a
       lookbehind assertion, \K, \b, or \B was involved.)

       For any other return, pcre2test outputs the PCRE2 negative error
       number and a short descriptive phrase. If the error is a failed UTF
       string check, the code unit offset of the start of the failing
       character is also output. Here is an example of an interactive
       pcre2test run.

         $ pcre2test
         PCRE2 version 10.22 2016-07-29

           re> /^abc(\d+)/
         data> abc123
          0: abc123
          1: 123
         data> xyz
         No match

       Unset capturing substrings that are not followed by one that is set
       are not shown by pcre2test unless the allcaptures modifier is
       specified. In the following example, there are two capturing
       substrings, but when the first data line is matched, the second,
       unset substring is not shown. An "internal" unset substring is shown
       as "<unset>", as for the second data line.

           re> /(a)|(b)/
         data> a
          0: a
          1: a
         data> b
          0: b
          1: <unset>
          2: b

       If the strings contain any non-printing characters, they are output
       as \xhh escapes if the value is less than 256 and UTF mode is not
       set. Otherwise they are output as \x{hh...} escapes. See below for
       the definition of non-printing characters. If the aftertext modifier
       is set, the output for substring 0 is followed by the rest of the
       subject string, identified by "0+" like this:

           re> /cat/aftertext
         data> cataract
          0: cat
          0+ aract

       If global matching is requested, the results of successive matching
       attempts are output in sequence, like this:

           re> /\Bi(\w\w)/g
         data> Mississippi
          0: iss
          1: ss
          0: iss
          1: ss
          0: ipp
          1: pp

       "No match" is output only if the first match attempt fails.
       Here is an example of a failure message (the offset 4 that is
       specified by the offset modifier is past the end of the subject
       string):

           re> /xyz/
         data> xyz\=offset=4
         Error -24 (bad offset value)

       Note that whereas patterns can be continued over several lines (a
       plain ">" prompt is used for continuations), subject lines may not.
       However, newlines can be included in a subject by means of the \n
       escape (or \r, \r\n, etc., depending on the newline sequence
       setting).


OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

       When the alternative matching function, pcre2_dfa_match(), is used,
       the output consists of a list of all the matches that start at the
       first point in the subject where there is at least one match. For
       example:

           re> /(tang|tangerine|tan)/
         data> yellow tangerine\=dfa
          0: tangerine
          1: tang
          2: tan

       Using the normal matching function on this data finds only "tang".
       The longest matching string is always given first (and numbered
       zero). After a PCRE2_ERROR_PARTIAL return, the output is "Partial
       match:", followed by the partially matching substring. Note that this
       is the entire substring that was inspected during the partial match;
       it may include characters before the actual match start if a
       lookbehind assertion, \b, or \B was involved. (\K is not supported
       for DFA matching.)

       If global matching is requested, the search for further matches
       resumes at the end of the longest match. For example:

           re> /(tang|tangerine|tan)/g
         data> yellow tangerine and tangy sultana\=dfa
          0: tangerine
          1: tang
          2: tan
          0: tang
          1: tan
          0: tan

       The alternative matching function does not support substring capture,
       so the modifiers that are concerned with captured substrings are not
       relevant.
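
       The shape of the output in the examples above can be reproduced with
       a short Python sketch. It handles only a list of literal
       alternatives and mimics the ordering of the results, not the DFA
       algorithm itself; both function names are invented for this
       illustration.

```python
def matches_at_first_point(alternatives, subject, position=0):
    """Find the first point at or after position where any alternative
    matches, and return (start, matches) with the longest match first.
    Return None if nothing matches."""
    for start in range(position, len(subject) + 1):
        found = {alt for alt in alternatives if subject.startswith(alt, start)}
        if found:
            return start, sorted(found, key=len, reverse=True)
    return None


def global_matches(alternatives, subject):
    """Collect the match lists for a global search."""
    results = []
    position = 0
    while True:
        hit = matches_at_first_point(alternatives, subject, position)
        if hit is None:
            return results
        start, found = hit
        results.append(found)
        # Resume at the end of the longest match; step at least one
        # character so that an empty match cannot loop for ever.
        position = start + max(len(found[0]), 1)


words = ["tang", "tangerine", "tan"]
for found in global_matches(words, "yellow tangerine and tangy sultana"):
    for number, text in enumerate(found):
        print(f"{number:2d}: {text}")
```

       Run on the global example, the sketch prints the same three groups
       of lines as pcre2test does: three matches, then two, then one.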


RESTARTING AFTER A PARTIAL MATCH

       When the alternative matching function has given the
       PCRE2_ERROR_PARTIAL return, indicating that the subject partially
       matched the pattern, you can restart the match with additional
       subject data by means of the dfa_restart modifier. For example:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=ps,dfa
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       For further information about partial matching, see the pcre2partial
       documentation.


CALLOUTS

       If the pattern contains any callout requests, pcre2test's callout
       function is called during matching unless callout_none is specified.
       This works with both matching functions, and with JIT, though there
       are some differences in behaviour. The output for callouts with
       numerical arguments and those with string arguments is slightly
       different.

   Callouts with numerical arguments

       By default, the callout function displays the callout number, the
       start and current positions in the subject text at the callout time,
       and the next pattern item to be tested. For example:

         --->pqrabcdef
           0    ^  ^       \d

       This output indicates that callout number 0 occurred for a match
       attempt starting at the fourth character of the subject string, when
       the pointer was at the seventh character, and when the next pattern
       item was \d. Just one circumflex is output if the start and current
       positions are the same, or if the current position precedes the start
       position, which can happen if the callout is in a lookbehind
       assertion.

       Callouts numbered 255 are assumed to be automatic callouts, inserted
       as a result of the auto_callout pattern modifier. In this case,
       instead of showing the callout number, the offset in the pattern,
       preceded by a plus, is output.
       For example:

           re> /\d?[A-E]\*/auto_callout
         data> E*
         --->E*
          +0 ^      \d?
          +3 ^      [A-E]
          +8 ^^     \*
         +10 ^ ^
          0: E*

       If a pattern contains (*MARK) items, an additional line is output
       whenever a change of latest mark is passed to the callout function.
       For example:

           re> /a(*MARK:X)bc/auto_callout
         data> abc
         --->abc
          +0 ^       a
          +1 ^^      (*MARK:X)
         +10 ^^      b
         Latest Mark: X
         +11 ^ ^     c
         +12 ^  ^
          0: abc

       The mark changes between matching "a" and "b", but stays the same for
       the rest of the match, so nothing more is output. If, as a result of
       backtracking, the mark reverts to being unset, the text "<unset>" is
       output.

   Callouts with string arguments

       The output for a callout with a string argument is similar, except
       that instead of outputting a callout number before the position
       indicators, the callout string and its offset in the pattern string
       are output before the reflection of the subject string, and the
       subject string is reflected for each callout. For example:

           re> /^ab(?C'first')cd(?C"second")ef/
         data> abcdefg
         Callout (7): 'first'
         --->abcdefg
             ^ ^         c
         Callout (20): "second"
         --->abcdefg
             ^   ^       e
          0: abcdef

   Callout modifiers

       The callout function in pcre2test returns zero (carry on matching) by
       default, but you can use a callout_fail modifier in a subject line to
       change this and other parameters of the callout (see below).

       If the callout_capture modifier is set, the current captured groups
       are output when a callout occurs. This is useful only for non-DFA
       matching, as pcre2_dfa_match() does not support capturing, so no
       captures are ever shown.

       The normal callout output, showing the callout number or pattern
       offset (as described above) is suppressed if the callout_no_where
       modifier is set.
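
       The position-indicator lines in the callout examples above follow a
       simple rule: one circumflex under the start position and, if the
       current position is further on, a second one under it. The Python
       sketch below builds such a line. It is an illustration only: the
       function name is invented, the column widths are assumptions chosen
       to line up under the "--->" prefix, and every subject character is
       assumed to print in one column.

```python
def indicator_line(label, start, current, subject, next_item=""):
    """Build one position-indicator line for a callout.

    label is the text shown before the indicators (a callout number, or
    "+" followed by a pattern offset). start and current are offsets
    into subject. The spacing is an assumption, not taken from pcre2test.
    """
    carets = [" "] * (len(subject) + 1)
    carets[start] = "^"
    if current > start:          # only one circumflex if current <= start
        carets[current] = "^"
    line = f"{label:>3} " + "".join(carets) + "    " + next_item
    return line.rstrip()


subject = "E*"
print("--->" + subject)
print(indicator_line("+0", 0, 0, subject, r"\d?"))
print(indicator_line("+3", 0, 0, subject, "[A-E]"))
print(indicator_line("+8", 0, 1, subject, r"\*"))
print(indicator_line("+10", 0, 2, subject))
```

       The label is right-aligned in three columns and followed by a space,
       so the first circumflex column sits directly under the first subject
       character after "--->".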

       When using the interpretive matching function pcre2_match() without
       JIT, setting the callout_extra modifier causes additional output from
       pcre2test's callout function to be generated. For the first callout
       in a match attempt at a new starting position in the subject, "New
       match attempt" is output. If there has been a backtrack since the
       last callout (or start of matching if this is the first callout),
       "Backtrack" is output, followed by "No other matching paths" if the
       backtrack ended the previous match attempt. For example:

           re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
         data> aac\=callout_extra
         New match attempt
         --->aac
          +0 ^       (
          +1 ^       a+
          +3 ^ ^     )
          +4 ^ ^     b
         Backtrack
         --->aac
          +3 ^^      )
          +4 ^^      b
         Backtrack
         No other matching paths
         New match attempt
         --->aac
          +0  ^      (
          +1  ^      a+
          +3  ^^     )
          +4  ^^     b
         Backtrack
         No other matching paths
         New match attempt
         --->aac
          +0   ^     (
          +1   ^     a+
         Backtrack
         No other matching paths
         New match attempt
         --->aac
          +0    ^    (
          +1    ^    a+
         No match

       Notice that various optimizations must be turned off if you want all
       possible matching paths to be scanned. If no_start_optimize is not
       used, there is an immediate "no match", without any callouts, because
       the starting optimization fails to find "b" in the subject, which it
       knows must be present for any match. If no_auto_possess is not used,
       the "a+" item is turned into "a++", which reduces the number of
       backtracks.

       The callout_extra modifier has no effect if used with the DFA
       matching function, or with JIT.

   Return values from callouts

       The default return from the callout function is zero, which allows
       matching to continue. The callout_fail modifier can be given one or
       two numbers.
       If there is only one number, 1 is returned instead of 0 (causing
       matching to backtrack) when a callout of that number is reached. If
       two numbers (<n>:<m>) are given, 1 is returned when callout <n> is
       reached and there have been at least <m> callouts. The callout_error
       modifier is similar, except that PCRE2_ERROR_CALLOUT is returned,
       causing the entire matching process to be aborted. If both these
       modifiers are set for the same callout number, callout_error takes
       precedence. Note that callouts with string arguments are always given
       the number zero.

       The callout_data modifier can be given an unsigned or a negative
       number. This is set as the "user data" that is passed to the matching
       function, and passed back when the callout function is invoked. Any
       value other than zero is used as a return from pcre2test's callout
       function.

       Inserting callouts can be helpful when using pcre2test to check
       complicated regular expressions. For further information about
       callouts, see the pcre2callout documentation.


NON-PRINTING CHARACTERS

       When pcre2test is outputting text in the compiled version of a
       pattern, bytes other than 32-126 are always treated as non-printing
       characters and are therefore shown as hex escapes.

       When pcre2test is outputting text that is a matched part of a subject
       string, it behaves in the same way, unless a different locale has
       been set for the pattern (using the locale modifier). In this case,
       the isprint() function is used to distinguish printing and
       non-printing characters.


SAVING AND RESTORING COMPILED PATTERNS

       It is possible to save compiled patterns on disc or elsewhere, and
       reload them later, subject to a number of restrictions. JIT data
       cannot be saved.
       The host on which the patterns are reloaded must be running the same
       version of PCRE2, with the same code unit width, and must also have
       the same endianness, pointer width and PCRE2_SIZE type. Before
       compiled patterns can be saved they must be serialized, that is,
       converted to a stream of bytes. A single byte stream may contain any
       number of compiled patterns, but they must all use the same character
       tables. A single copy of the tables is included in the byte stream
       (its size is 1088 bytes).

       The functions whose names begin with pcre2_serialize_ are used for
       serializing and de-serializing. They are described in the
       pcre2serialize documentation. In this section we describe the
       features of pcre2test that can be used to test these functions.

       Note that "serialization" in PCRE2 does not convert compiled patterns
       to an abstract format, as serialization in Java or .NET does. It just
       makes a reloadable byte code stream. Hence the restrictions on
       reloading mentioned above.

       In pcre2test, when a pattern with the push modifier is successfully
       compiled, it is pushed onto a stack of compiled patterns, and
       pcre2test expects the next line to contain a new pattern (or command)
       instead of a subject line. By contrast, the pushcopy modifier causes
       a copy of the compiled pattern to be stacked, leaving the original
       available for immediate matching. By using push and/or pushcopy, a
       number of patterns can be compiled and retained. These modifiers are
       incompatible with posix, and control modifiers that act at match time
       are ignored (with a message) for the stacked patterns. The jitverify
       modifier applies only at compile time.

       The command

         #save <filename>

       causes all the stacked patterns to be serialized and the result
       written to the named file. Afterwards, all the stacked patterns are
       freed.
       The command

         #load <filename>

       reads the data in the file, and then arranges for it to be
       de-serialized, with the resulting compiled patterns added to the
       pattern stack. The pattern on the top of the stack can be retrieved
       by the #pop command, which must be followed by lines of subjects that
       are to be matched with the pattern, terminated as usual by an empty
       line or end of file. This command may be followed by a modifier list
       containing only control modifiers that act after a pattern has been
       compiled. In particular, hex, posix, posix_nosub, push, and pushcopy
       are not allowed, nor are any option-setting modifiers. The JIT
       modifiers are, however, permitted. Here is an example that saves and
       reloads two patterns.

         /abc/push
         /xyz/push
         #save tempfile
         #load tempfile
         #pop info
         xyz

         #pop jit,bincode
         abc

       If jitverify is used with #pop, it does not automatically imply jit,
       which is different behaviour from when it is used on a pattern.

       The #popcopy command is analogous to the pushcopy modifier in that it
       makes current a copy of the topmost stack pattern, leaving the
       original still on the stack.


SEE ALSO

       pcre2(3), pcre2api(3), pcre2callout(3), pcre2jit(3),
       pcre2matching(3), pcre2partial(3), pcre2pattern(3),
       pcre2serialize(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 14 September 2020
       Copyright (c) 1997-2020 University of Cambridge.