PCRE2TEST(1)              General Commands Manual             PCRE2TEST(1)

NAME
       pcre2test - a program for testing Perl-compatible regular expressions.

SYNOPSIS
       pcre2test [options] [input file [output file]]

       pcre2test is a test program for the PCRE2 regular expression
       libraries, but it can also be used for experimenting with regular
       expressions. This document describes the features of the test
       program; for details of the regular expressions themselves, see the
       pcre2pattern documentation. For details of the PCRE2 library function
       calls and their options, see the pcre2api documentation.

       The input for pcre2test is a sequence of regular expression patterns
       and subject strings to be matched. There are also command lines for
       setting defaults and controlling some special actions. The output
       shows the result of each match attempt. Modifiers on external or
       internal command lines, the patterns, and the subject lines specify
       PCRE2 function options, control how the subject is processed, and
       what output is produced.

       As the original fairly simple PCRE library evolved, it acquired many
       different features, and as a result, the original pcretest program
       ended up with a lot of options in a messy, arcane syntax, for testing
       all the features. The move to the new PCRE2 API provided an
       opportunity to re-implement the test program as pcre2test, with a
       cleaner modifier syntax. Nevertheless, there are still many obscure
       modifiers, some of which are specifically designed for use in
       conjunction with the test script and data files that are distributed
       as part of PCRE2. All the modifiers are documented here, some without
       much justification, but many of them are unlikely to be of use except
       when testing the libraries.

PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES

       Different versions of the PCRE2 library can be built to support
       character strings that are encoded in 8-bit, 16-bit, or 32-bit code
       units.
       One, two, or all three of these libraries may be simultaneously
       installed. The pcre2test program can be used to test all the
       libraries. However, its own input and output are always in 8-bit
       format. When testing the 16-bit or 32-bit libraries, patterns and
       subject strings are converted to 16- or 32-bit format before being
       passed to the library functions. Results are converted back to 8-bit
       code units for output.

       In the rest of this document, the names of library functions and
       structures are given in generic form, for example, pcre2_compile().
       The actual names used in the libraries have a suffix _8, _16, or _32,
       as appropriate.

INPUT ENCODING

       Input to pcre2test is processed line by line, either by calling the C
       library's fgets() function, or via the libreadline library (see
       below). The input is processed using C's string functions, so must
       not contain binary zeroes, even though in Unix-like environments,
       fgets() treats any bytes other than newline as data characters. In
       some Windows environments character 26 (hex 1A) causes an immediate
       end of file, and no further data is read.

       For maximum portability, therefore, it is safest to avoid
       non-printing characters in pcre2test input files. There is a facility
       for specifying some or all of a pattern's characters as hexadecimal
       pairs, thus making it possible to include binary zeroes in a pattern
       for testing purposes. Subject lines are processed for backslash
       escapes, which makes it possible to include any data value.

COMMAND LINE OPTIONS

       -8        If the 8-bit library has been built, this option causes it
                 to be used (this is the default). If the 8-bit library has
                 not been built, this option causes an error.

       -16       If the 16-bit library has been built, this option causes it
                 to be used. If only the 16-bit library has been built, this
                 is the default.
                 If the 16-bit library has not been built, this option
                 causes an error.

       -32       If the 32-bit library has been built, this option causes it
                 to be used. If only the 32-bit library has been built, this
                 is the default. If the 32-bit library has not been built,
                 this option causes an error.

       -b        Behave as if each pattern has the /fullbincode modifier;
                 the full internal binary form of the pattern is output
                 after compilation.

       -C        Output the version number of the PCRE2 library, and all
                 available information about the optional features that are
                 included, and then exit with zero exit code. All other
                 options are ignored.

       -C option Output information about a specific build-time option,
                 then exit. This functionality is intended for use in
                 scripts such as RunTest. The following options output the
                 value and set the exit code as indicated:

                   ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
                                0x15 or 0x25
                                0 if used in an ASCII environment
                                exit code is always 0
                   linksize   the configured internal link size (2, 3, or 4)
                                exit code is set to the link size
                   newline    the default newline setting:
                                CR, LF, CRLF, ANYCRLF, or ANY
                                exit code is always 0
                   bsr        the default setting for what \R matches:
                                ANYCRLF or ANY
                                exit code is always 0

                 The following options output 1 for true or 0 for false, and
                 set the exit code to the same value:

                   backslash-C  \C is supported (not locked out)
                   ebcdic       compiled for an EBCDIC environment
                   jit          just-in-time support is available
                   pcre2-16     the 16-bit library was built
                   pcre2-32     the 32-bit library was built
                   pcre2-8      the 8-bit library was built
                   unicode      Unicode support is available

                 If an unknown option is given, an error message is output;
                 the exit code is 0.

       -d        Behave as if each pattern has the debug modifier; the
                 internal form and information about the compiled pattern
                 are output after compilation; -d is equivalent to -b -i.

       -dfa      Behave as if each subject line has the dfa modifier;
                 matching is done using the pcre2_dfa_match() function
                 instead of the default pcre2_match().

       -error number[,number,...]
                 Call pcre2_get_error_message() for each of the error
                 numbers in the comma-separated list, display the resulting
                 messages on the standard output, then exit with zero exit
                 code. The numbers may be positive or negative. This is a
                 convenience facility for PCRE2 maintainers.

       -help     Output a brief summary of these options and then exit.

       -i        Behave as if each pattern has the /info modifier;
                 information about the compiled pattern is given after
                 compilation.

       -jit      Behave as if each pattern line has the jit modifier; after
                 successful compilation, each pattern is passed to the
                 just-in-time compiler, if available.

       -pattern modifier-list
                 Behave as if each pattern line contains the given
                 modifiers.

       -q        Do not output the version number of pcre2test at the start
                 of execution.

       -S size   On Unix-like systems, set the size of the run-time stack to
                 size megabytes.

       -subject modifier-list
                 Behave as if each subject line contains the given
                 modifiers.

       -t        Run each compile and match many times with a timer, and
                 output the resulting times per compile or match. When JIT
                 is used, separate times are given for the initial compile
                 and the JIT compile. You can control the number of
                 iterations that are used for timing by following -t with a
                 number (as a separate item on the command line). For
                 example, "-t 1000" iterates 1000 times. The default is to
                 iterate 500,000 times.

       -tm       This is like -t except that it times only the matching
                 phase, not the compile phase.

       -T -TM    These behave like -t and -tm, but in addition, at the end
                 of a run, the total times for all compiles and matches are
                 output.

       -version  Output the PCRE2 version number and then exit.

DESCRIPTION

       If pcre2test is given two filename arguments, it reads from the first
       and writes to the second. If the first name is "-", input is taken
       from the standard input. If pcre2test is given only one argument, it
       reads from that file and writes to stdout. Otherwise, it reads from
       stdin and writes to stdout.

       When pcre2test is built, a configuration option can specify that it
       should be linked with the libreadline or libedit library. When this
       is done, if the input is from a terminal, it is read using the
       readline() function. This provides line-editing and history
       facilities. The output from the -help option states whether or not
       readline() will be used.

       The program handles any number of tests, each of which consists of a
       set of input lines. Each set starts with a regular expression
       pattern, followed by any number of subject lines to be matched
       against that pattern. In between sets of test data, command lines
       that begin with # may appear. This file format, with some
       restrictions, can also be processed by the perltest.sh script that is
       distributed with PCRE2 as a means of checking that the behaviour of
       PCRE2 and Perl is the same.

       When the input is a terminal, pcre2test prompts for each line of
       input, using "re>" to prompt for regular expression patterns, and
       "data>" to prompt for subject lines. Command lines starting with #
       can be entered only in response to the "re>" prompt.

       Each subject line is matched separately and independently. If you
       want to do multi-line matches, you have to use the \n escape sequence
       (or \r or \r\n, etc., depending on the newline setting) in a single
       line of input to encode the newline sequences.
       There is no limit on the length of subject lines; the input buffer is
       automatically extended if it is too small. There are replication
       features that make it possible to generate long repetitive pattern or
       subject lines without having to supply them explicitly.

       An empty line or the end of the file signals the end of the subject
       lines for a test, at which point a new pattern or command line is
       expected if there is still input to be read.

COMMAND LINES

       In between sets of test data, a line that begins with # is
       interpreted as a command line. If the first character is followed by
       white space or an exclamation mark, the line is treated as a comment,
       and ignored. Otherwise, the following commands are recognized:

         #forbid_utf

       Subsequent patterns automatically have the PCRE2_NEVER_UTF and
       PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
       and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start
       of patterns. This command also forces an error if a subsequent
       pattern contains any occurrences of \P, \p, or \X, which are still
       supported when PCRE2_UTF is not set, but which require Unicode
       property support to be included in the library.

       This is a trigger guard that is used in test files to ensure that UTF
       or Unicode property tests are not accidentally added to files that
       are used when Unicode support is not included in the library. Setting
       PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
       by the use of #pattern; the difference is that #forbid_utf cannot be
       unset, and the automatic options are not displayed in pattern
       information, to avoid cluttering up test output.

         #load <filename>

       This command is used to load a set of precompiled patterns from a
       file, as described in the section entitled "Saving and restoring
       compiled patterns" below.

         #newline_default [<newline-list>]

       When PCRE2 is built, a default newline convention can be specified.
       This determines which characters and/or character pairs are
       recognized as indicating a newline in a pattern or subject string.
       The default can be overridden when a pattern is compiled. The
       standard test files contain tests of various newline conventions, but
       the majority of the tests expect a single linefeed to be recognized
       as a newline by default. Without special action the tests would fail
       when PCRE2 is compiled with either CR or CRLF as the default newline.

       The #newline_default command specifies a list of newline types that
       are acceptable as the default. The types must be one of CR, LF, CRLF,
       ANYCRLF, or ANY (in upper or lower case), for example:

         #newline_default LF Any anyCRLF

       If the default newline is in the list, this command has no effect.
       Otherwise, except when testing the POSIX API, a newline modifier that
       specifies the first newline convention in the list (LF in the above
       example) is added to any pattern that does not already have a newline
       modifier. If the newline list is empty, the feature is turned off.
       This command is present in a number of the standard test input files.

       When the POSIX API is being tested there is no way to override the
       default newline convention, though it is possible to set the newline
       convention from within the pattern. A warning is given if the posix
       modifier is used when #newline_default would set a default for the
       non-POSIX API.

         #pattern <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent patterns. Modifiers on a pattern can change these
       settings.

         #perltest

       The appearance of this line causes all subsequent modifier settings
       to be checked for compatibility with the perltest.sh script, which is
       used to confirm that Perl gives the same results as PCRE2. Also,
       apart from comment lines, none of the other command lines are
       permitted, because they and many of the modifiers are specific to
       pcre2test, and should not be used in test files that are also
       processed by perltest.sh. The #perltest command helps detect tests
       that are accidentally put in the wrong file.

         #pop [<modifiers>]
         #popcopy [<modifiers>]

       These commands are used to manipulate the stack of compiled patterns,
       as described in the section entitled "Saving and restoring compiled
       patterns" below.

         #save <filename>

       This command is used to save a set of compiled patterns to a file, as
       described in the section entitled "Saving and restoring compiled
       patterns" below.

         #subject <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent subject lines. Modifiers on a subject line can change
       these settings.

MODIFIER SYNTAX

       Modifier lists are used with both pattern and subject lines. Items in
       a list are separated by commas followed by optional white space.
       Trailing whitespace in a modifier list is ignored. Some modifiers may
       be given for both patterns and subject lines, whereas others are
       valid only for one or the other. Each modifier has a long name, for
       example "anchored", and some of them must be followed by an equals
       sign and a value, for example, "offset=12". Values cannot contain
       comma characters, but may contain spaces. Modifiers that do not take
       values may be preceded by a minus sign to turn off a previous
       setting.

       A few of the more common modifiers can also be specified as single
       letters, for example "i" for "caseless".
       In documentation, following the Perl convention, these are written
       with a slash ("the /i modifier") for clarity. Abbreviated modifiers
       must all be concatenated in the first item of a modifier list. If the
       first item is not recognized as a long modifier name, it is
       interpreted as a sequence of these abbreviations. For example:

         /abc/ig,newline=cr,jit=3

       This is a pattern line whose modifier list starts with two one-letter
       modifiers (/i and /g). The lower-case abbreviated modifiers are the
       same as used in Perl.

PATTERN SYNTAX

       A pattern line must start with one of the following characters
       (common symbols, excluding pattern meta-characters):

         / ! " ' ` - = _ : ; , % & @ ~

       This is interpreted as the pattern's delimiter. A regular expression
       may be continued over several input lines, in which case the newline
       characters are included within it. It is possible to include the
       delimiter within the pattern by escaping it with a backslash, for
       example

         /abc\/def/

       If you do this, the escape and the delimiter form part of the
       pattern, but since the delimiters are all non-alphanumeric, this does
       not affect its interpretation. If the terminating delimiter is
       immediately followed by a backslash, for example,

         /abc/\

       then a backslash is added to the end of the pattern. This is done to
       provide a way of testing the error condition that arises if a pattern
       finishes with a backslash, because

         /abc\/

       is interpreted as the first line of a pattern that starts with
       "abc/", causing pcre2test to read the next line as a continuation of
       the regular expression.

       A pattern can be followed by a modifier list (details below).
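
       The delimiter and modifier-list rules described above can be
       illustrated with a small Python sketch. It is not pcre2test's own
       code: the function name split_pattern_line is hypothetical, it
       handles a single input line only (no continuation lines), and it
       leaves the first modifier item unexpanded rather than deciding
       whether it is a long name or a run of one-letter abbreviations.

```python
def split_pattern_line(line):
    """Split one pattern line into (delimiter, pattern, modifier items).

    Hypothetical helper that mirrors the rules in the text; it is not
    taken from the pcre2test sources.
    """
    delimiters = "/!\"'`-=_:;,%&@~"
    if not line or line[0] not in delimiters:
        raise ValueError("pattern line must start with a delimiter")
    delim = line[0]
    i = 1
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            i += 2  # escaped character (possibly the delimiter) stays in the pattern
            continue
        if ch == delim:
            break
        i += 1
    else:
        # No closing delimiter: pcre2test would read a continuation line here.
        raise ValueError("no terminating delimiter on this line")
    pattern = line[1:i]
    rest = line[i + 1:]
    if rest.startswith("\\"):
        # A backslash right after the closing delimiter is appended to the pattern.
        pattern += "\\"
        rest = rest[1:]
    # Items are separated by commas; surrounding white space is dropped.
    modifiers = [item.strip() for item in rest.split(",")] if rest.strip() else []
    return delim, pattern, modifiers
```

       Calling split_pattern_line("/abc/ig,newline=cr,jit=3") yields the
       pattern "abc" and the three modifier items "ig", "newline=cr", and
       "jit=3"; the first item would then be read as the abbreviations /i
       and /g.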

SUBJECT LINE SYNTAX

       Before each subject line is passed to pcre2_match() or
       pcre2_dfa_match(), leading and trailing white space is removed, and
       the line is scanned for backslash escapes. The following provide a
       means of encoding non-printing characters in a visible way:

         \a         alarm (BEL, \x07)
         \b         backspace (\x08)
         \e         escape (\x1b)
         \f         form feed (\x0c)
         \n         newline (\x0a)
         \r         carriage return (\x0d)
         \t         tab (\x09)
         \v         vertical tab (\x0b)
         \nnn       octal character (up to 3 octal digits); always
                      a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
         \o{dd...}  octal character (any number of octal digits)
         \xhh       hexadecimal byte (up to 2 hex digits)
         \x{hh...}  hexadecimal character (any number of hex digits)

       The use of \x{hh...} is not dependent on the use of the utf modifier
       on the pattern. It is recognized always. There may be any number of
       hexadecimal digits inside the braces; invalid values provoke error
       messages.

       Note that \xhh specifies one byte rather than one character in UTF-8
       mode; this makes it possible to construct invalid UTF-8 sequences for
       testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
       character in UTF-8 mode, generating more than one byte if the value
       is greater than 127. When testing the 8-bit library not in UTF-8
       mode, \x{hh} generates one byte for values less than 256, and causes
       an error for greater values.

       In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes
       it possible to construct invalid UTF-16 sequences for testing
       purposes.

       In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
       makes it possible to construct invalid UTF-32 sequences for testing
       purposes.
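
       As an illustration of the escapes listed above, the following Python
       sketch decodes a subset of them for the 8-bit library when not in
       UTF-8 mode. It is an approximation written for this document, not
       pcre2test's implementation: the name decode_subject is hypothetical,
       the input is assumed to be ASCII, and the octal forms, \=, a trailing
       backslash, and escaped non-alphanumeric characters are deliberately
       left out (the sketch raises an error for them).

```python
import re

# Single-letter escapes and the byte values they stand for.
SIMPLE_ESCAPES = {
    "a": 0x07, "b": 0x08, "e": 0x1B, "f": 0x0C,  # \e is ESC, decimal 27
    "n": 0x0A, "r": 0x0D, "t": 0x09, "v": 0x0B,
}

def decode_subject(text):
    """Decode \\a..\\v, \\xhh and \\x{hh...} into bytes (8-bit, non-UTF)."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out.append(ord(text[i]))  # assumption: plain ASCII input
            i += 1
            continue
        rest = text[i + 1:]
        braced = re.match(r"x\{([0-9A-Fa-f]+)\}", rest)
        plain = re.match(r"x([0-9A-Fa-f]{1,2})", rest)
        if braced:
            value = int(braced.group(1), 16)
            if value > 0xFF:
                # Values of 256 or more are an error in this mode.
                raise ValueError("value too large for 8-bit non-UTF mode")
            out.append(value)
            i += 1 + braced.end()
        elif plain:
            out.append(int(plain.group(1), 16))  # \xhh is always one byte
            i += 1 + plain.end()
        elif rest[:1] in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[rest[0]])
            i += 2
        else:
            raise ValueError("escape not handled by this sketch")
    return bytes(out)
```

       For example, decode_subject(r"\x41\x{42}") returns the two bytes
       "AB", while decode_subject(r"\x{100}") raises an error because the
       value does not fit in one byte.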

       There is a special backslash sequence that specifies replication of
       one or more characters:

         \[<characters>]{<count>}

       This makes it possible to test long strings without having to provide
       them as part of the file. For example:

         \[abc]{4}

       is converted to "abcabcabcabc". This feature does not support
       nesting. To include a closing square bracket in the characters, code
       it as \x5D.

       A backslash followed by an equals sign marks the end of the subject
       string and the start of a modifier list. For example:

         abc\=notbol,notempty

       If the subject string is empty and \= is followed by whitespace, the
       line is treated as a comment line, and is not used for matching. For
       example:

         \= This is a comment.
         abc\= This is an invalid modifier list.

       A backslash followed by any other non-alphanumeric character just
       escapes that character. A backslash followed by anything else causes
       an error. However, if the very last character in the line is a
       backslash (and there is no modifier list), it is ignored. This gives
       a way of passing an empty line as data, since a real empty line
       terminates the data input.

PATTERN MODIFIERS

       There are several types of modifier that can appear in pattern lines.
       Except where noted below, they may also be used in #pattern commands.
       A pattern's modifier list can add to or override default modifiers
       that were set by a previous #pattern command.

   Setting compilation options

       The following modifiers set options for pcre2_compile(). The most
       common ones have single-letter abbreviations. See pcre2api for a
       description of their effects.

             allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
             alt_bsux                  set PCRE2_ALT_BSUX
             alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
             alt_verbnames             set PCRE2_ALT_VERBNAMES
             anchored                  set PCRE2_ANCHORED
             auto_callout              set PCRE2_AUTO_CALLOUT
         /i  caseless                  set PCRE2_CASELESS
             dollar_endonly            set PCRE2_DOLLAR_ENDONLY
         /s  dotall                    set PCRE2_DOTALL
             dupnames                  set PCRE2_DUPNAMES
         /x  extended                  set PCRE2_EXTENDED
             firstline                 set PCRE2_FIRSTLINE
             match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
         /m  multiline                 set PCRE2_MULTILINE
             never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
             never_ucp                 set PCRE2_NEVER_UCP
             never_utf                 set PCRE2_NEVER_UTF
             no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
             no_auto_possess           set PCRE2_NO_AUTO_POSSESS
             no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
             no_start_optimize         set PCRE2_NO_START_OPTIMIZE
             no_utf_check              set PCRE2_NO_UTF_CHECK
             ucp                       set PCRE2_UCP
             ungreedy                  set PCRE2_UNGREEDY
             use_offset_limit          set PCRE2_USE_OFFSET_LIMIT
             utf                       set PCRE2_UTF

       As well as turning on the PCRE2_UTF option, the utf modifier causes
       all non-printing characters in output strings to be printed using the
       \x{hh...} notation. Otherwise, those less than 0x100 are output in
       hex without the curly brackets.
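
       The output rule in the preceding paragraph can be pictured with a
       short Python sketch. It only restates that rule: the name show_char
       is hypothetical, and treating the ASCII range 0x20 to 0x7E as the set
       of printing characters, as well as the two-digit lower-case hex
       layout, are assumptions made here rather than statements about
       pcre2test's exact output.

```python
def show_char(code, utf=False):
    """Render one character value as it might appear in an output string."""
    if 0x20 <= code <= 0x7E:
        # Assumption: only printable ASCII is shown as-is.
        return chr(code)
    if code < 0x100 and not utf:
        # Without the utf modifier: hex without curly brackets.
        return "\\x%02x" % code
    # With the utf modifier, or for larger values: \x{hh...} notation.
    return "\\x{%02x}" % code
```

       Under these assumptions a linefeed is rendered as \x0a without the
       utf modifier and as \x{0a} with it.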

   Setting compilation controls

       The following modifiers affect the compilation process or request
       information about the pattern:

             bsr=[anycrlf|unicode]     specify \R handling
         /B  bincode                   show binary code without lengths
             callout_info              show callout information
             debug                     same as info,fullbincode
             fullbincode               show binary code with lengths
         /I  info                      show info about compiled pattern
             hex                       unquoted characters are hexadecimal
             jit[=<number>]            use JIT
             jitfast                   use JIT fast path
             jitverify                 verify JIT use
             locale=<name>             use this locale
             max_pattern_length=<n>    set the maximum pattern length
             memory                    show memory used
             newline=<type>            set newline type
             null_context              compile with a NULL context
             parens_nest_limit=<n>     set maximum parentheses depth
             posix                     use the POSIX API
             posix_nosub               use the POSIX API with REG_NOSUB
             push                      push compiled pattern onto the stack
             pushcopy                  push a copy onto the stack
             stackguard=<number>       test the stackguard feature
             tables=[0|1|2]            select internal tables

       The effects of these modifiers are described in the following
       sections.

   Newline and \R handling

       The bsr modifier specifies what \R in a pattern should match. If it
       is set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to
       "unicode", \R matches any Unicode newline sequence. The default is
       specified when PCRE2 is built, with the default default being
       Unicode.

       The newline modifier specifies which characters are to be interpreted
       as newlines, both in the pattern and in subject lines. The type must
       be one of CR, LF, CRLF, ANYCRLF, or ANY (in upper or lower case).

   Information about a pattern

       The debug modifier is a shorthand for info,fullbincode, requesting
       all available information.

       The bincode modifier causes a representation of the compiled code to
       be output after compilation.
       This information does not contain length and offset values, which
       ensures that the same output is generated for different internal link
       sizes and different code unit widths. By using bincode, the same
       regression tests can be used in different environments.

       The fullbincode modifier, by contrast, does include length and offset
       values. This is used in a few special tests that run only for
       specific code unit widths and link sizes, and is also useful for
       one-off tests.

       The info modifier requests information about the compiled pattern
       (whether it is anchored, has a fixed first character, and so on). The
       information is obtained from the pcre2_pattern_info() function. Here
       are some typical examples:

           re> /(?i)(^a|^b)/m,info
         Capturing subpattern count = 1
         Compile options: multiline
         Overall options: caseless multiline
         First code unit at start or follows newline
         Subject length lower bound = 1

           re> /(?i)abc/info
         Capturing subpattern count = 0
         Compile options: <none>
         Overall options: caseless
         First code unit = 'a' (caseless)
         Last code unit = 'c' (caseless)
         Subject length lower bound = 3

       "Compile options" are those specified by modifiers; "overall options"
       have added options that are taken or deduced from the pattern. If
       both sets of options are the same, just a single "options" line is
       output; if there are no options, the line is omitted. "First code
       unit" is where any match must start; if there is more than one they
       are listed as "starting code units". "Last code unit" is the last
       literal code unit that must be present in any match. This is not
       necessarily the last character. These lines are omitted if no
       starting or ending code units are recorded.

       The callout_info modifier requests information about all the callouts
       in the pattern.
       A list of them is output at the end of any other information that is
       requested. For each callout, either its number or string is given,
       followed by the item that follows it in the pattern.

   Passing a NULL context

       Normally, pcre2test passes a context block to pcre2_compile(). If the
       null_context modifier is set, however, NULL is passed. This is for
       testing that pcre2_compile() behaves correctly in this case (it uses
       default values).

   Specifying pattern characters in hexadecimal

       The hex modifier specifies that the characters of the pattern, except
       for substrings enclosed in single or double quotes, are to be
       interpreted as pairs of hexadecimal digits. This feature is provided
       as a way of creating patterns that contain binary zeros and other
       non-printing characters. White space is permitted between pairs of
       digits. For example, this pattern contains three characters:

         /ab 32 59/hex

       Parts of such a pattern are taken literally if quoted. This pattern
       contains nine characters, only two of which are specified in
       hexadecimal:

         /ab "literal" 32/hex

       Either single or double quotes may be used. There is no way of
       including the delimiter within a substring.

       By default, pcre2test passes patterns as zero-terminated strings to
       pcre2_compile(), giving the length as PCRE2_ZERO_TERMINATED. However,
       for patterns specified with the hex modifier, the actual length of
       the pattern is passed.

   Generating long repetitive patterns

       Some tests use long patterns that are very repetitive. Instead of
       creating a very long input line for such a pattern, you can use a
       special repetition feature, similar to the one described for subject
       lines above.
       If the expand modifier is present on a pattern, parts of the pattern
       that have the form

         \[<characters>]{<count>}

       are expanded before the pattern is passed to pcre2_compile(). For
       example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This
       construction cannot be nested. An initial "\[" sequence is recognized
       only if "]{" followed by decimal digits and "}" is found later in the
       pattern. If not, the characters remain in the pattern unaltered.

       If part of an expanded pattern looks like an expansion, but is really
       part of the actual pattern, unwanted expansion can be avoided by
       giving two values in the quantifier. For example, \[AB]{6000,6000} is
       not recognized as an expansion item.

       If the info modifier is set on an expanded pattern, the result of the
       expansion is included in the information that is output.

   JIT compilation

       Just-in-time (JIT) compiling is a heavyweight optimization that can
       greatly speed up pattern matching. See the pcre2jit documentation for
       details. JIT compiling happens, optionally, after a pattern has been
       successfully compiled into an internal form. The JIT compiler
       converts this to optimized machine code. It needs to know whether the
       match-time options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are
       going to be used, because different code is generated for the
       different cases. See the partial modifier in "Subject Modifiers"
       below for details of how these options are specified for each match
       attempt.

       JIT compilation is requested by the /jit pattern modifier, which may
       optionally be followed by an equals sign and a number in the range 0
       to 7.
       The three bits that make up the number specify which of the three JIT
       operating modes are to be compiled:

         1  compile JIT code for non-partial matching
         2  compile JIT code for soft partial matching
         4  compile JIT code for hard partial matching

       The possible values for the /jit modifier are therefore:

         0  disable JIT
         1  normal matching only
         2  soft partial matching only
         3  normal and soft partial matching
         4  hard partial matching only
         5  normal and hard partial matching
         6  soft and hard partial matching only
         7  all three modes

       If no number is given, 7 is assumed. The phrase "partial matching"
       means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or
       the PCRE2_PARTIAL_HARD option set. Note that such a call may return a
       complete match; the options enable the possibility of a partial
       match, but do not require it. Note also that if you request JIT
       compilation only for partial matching (for example, /jit=2) but do
       not set the partial modifier on a subject line, that match will not
       use JIT code because none was compiled for non-partial matching.

       If JIT compilation is successful, the compiled JIT code will
       automatically be used when an appropriate type of match is run,
       except when incompatible run-time options are specified. For more
       details, see the pcre2jit documentation. See also the jitstack
       modifier below for a way of setting the size of the JIT stack.

       If the jitfast modifier is specified, matching is done using the JIT
       "fast path" interface, pcre2_jit_match(), which skips some of the
       sanity checks that are done by pcre2_match(), and of course does not
       work when JIT is not supported. If jitfast is specified without jit,
       jit=7 is assumed.

       If the jitverify modifier is specified, information about the
       compiled pattern shows whether JIT compilation was or was not
       successful. If jitverify is specified without jit, jit=7 is assumed.
       If JIT compilation is successful when jitverify is set, the text
       "(JIT)" is added to the first output line after a match or non match
       when JIT-compiled code was actually used in the match.

   Setting a locale

       The /locale modifier must specify the name of a locale, for example:

         /pattern/locale=fr_FR

       The given locale is set, pcre2_maketables() is called to build a set
       of character tables for the locale, and this is then passed to
       pcre2_compile() when compiling the regular expression. The same
       tables are used when matching the following subject lines. The
       /locale modifier applies only to the pattern on which it appears, but
       can be given in a #pattern command if a default is needed. Setting a
       locale and alternate character tables are mutually exclusive.

   Showing pattern memory

       The /memory modifier causes the size in bytes of the memory used to
       hold the compiled pattern to be output. This does not include the
       size of the pcre2_code block; it is just the actual compiled data. If
       the pattern is subsequently passed to the JIT compiler, the size of
       the JIT compiled code is also output. Here is an example:

           re> /a(b)c/jit,memory
         Memory allocation (code space): 21
         Memory allocation (JIT code): 1910

   Limiting nested parentheses

       The parens_nest_limit modifier sets a limit on the depth of nested
       parentheses in a pattern. Breaching the limit causes a compilation
       error. The default for the library is set when PCRE2 is built, but
       pcre2test sets its own default of 220, which is required for running
       the standard test suite.

   Limiting the pattern length

       The max_pattern_length modifier sets a limit, in code units, to the
       length of pattern that pcre2_compile() will accept. Breaching the
       limit causes a compilation error. The default is the largest number a
       PCRE2_SIZE variable can hold (essentially unlimited).

   Using the POSIX wrapper API

       The /posix and posix_nosub modifiers cause pcre2test to call PCRE2
       via the POSIX wrapper API rather than its native API. When
       posix_nosub is used, the POSIX option REG_NOSUB is passed to
       regcomp(). The POSIX wrapper supports only the 8-bit library. Note
       that it does not imply POSIX matching semantics; for more detail see
       the pcre2posix documentation. The following pattern modifiers set
       options for the regcomp() function:

         caseless           REG_ICASE
         multiline          REG_NEWLINE
         dotall             REG_DOTALL     )
         ungreedy           REG_UNGREEDY   ) These options are not part of
         ucp                REG_UCP        )   the POSIX standard
         utf                REG_UTF8       )

       The regerror_buffsize modifier specifies a size for the error buffer
       that is passed to regerror() in the event of a compilation error.
       For example:

         /abc/posix,regerror_buffsize=20

       This provides a means of testing the behaviour of regerror() when
       the buffer is too small for the error message. If this modifier has
       not been set, a large buffer is used.

       The aftertext and allaftertext subject modifiers work as described
       below. All other modifiers are either ignored, with a warning
       message, or cause an error.

   Testing the stack guard feature

       The /stackguard modifier is used to test the use of
       pcre2_set_compile_recursion_guard(), a function that is provided to
       enable stack availability to be checked during compilation (see the
       pcre2api documentation for details). If the number specified by the
       modifier is greater than zero, pcre2_set_compile_recursion_guard()
       is called to set up a callback from pcre2_compile() to a local
       function. The argument it receives is the current nesting
       parenthesis depth; if this is greater than the value given by the
       modifier, non-zero is returned, causing the compilation to be
       aborted.
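
       The guard logic can be pictured with a small Python model. This is
       only a sketch under stated assumptions: it is not PCRE2 code, the
       names compile_with_guard and make_guard are hypothetical, the guard
       is assumed to be consulted each time a group is opened, and
       character classes are not handled.

```python
def make_guard(limit):
    """Return a guard that objects once the nesting depth exceeds `limit`."""
    return lambda depth: 1 if depth > limit else 0


def compile_with_guard(pattern, guard):
    """Toy model of a compile pass that consults a recursion guard.

    Walks the pattern, tracks the parenthesis nesting depth, and calls
    guard(depth) whenever a group is opened (an assumption of this
    sketch). Returns False if the guard returns non-zero, True otherwise.
    """
    depth = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2          # skip the escaped character
            continue
        if ch == "(":
            depth += 1
            if guard(depth) != 0:
                return False    # compilation aborted
        elif ch == ")":
            depth -= 1
        i += 1
    return True


print(compile_with_guard("((a)(b(c)))", make_guard(3)))
print(compile_with_guard("((a)(b(c)))", make_guard(2)))
```

       The pattern nests three deep, so the first call completes and the
       second is aborted when the innermost group is opened.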

   Using alternative character tables

       The value specified for the /tables modifier must be one of the
       digits 0, 1, or 2. It causes a specific set of built-in character
       tables to be passed to pcre2_compile(). This is used in the PCRE2
       tests to check behaviour with different character tables. The digit
       specifies the tables as follows:

         0   do not pass any special character tables
         1   the default ASCII tables, as distributed in
               pcre2_chartables.c.dist
         2   a set of tables defining ISO 8859 characters

       In table 2, some characters whose codes are greater than 128 are
       identified as letters, digits, spaces, etc. Setting alternate
       character tables and a locale are mutually exclusive.

   Setting certain match controls

       The following modifiers are really subject modifiers, and are
       described below. However, they may be included in a pattern's
       modifier list, in which case they are applied to every subject line
       that is processed with that pattern. They may not appear in #pattern
       commands. These modifiers do not affect the compilation process.

           aftertext                   show text after match
           allaftertext                show text after captures
           allcaptures                 show all captures
           allusedtext                 show all consulted text
       /g  global                      global matching
           mark                        show mark values
           replace=<string>            specify a replacement string
           startchar                   show starting character when relevant
           substitute_extended         use PCRE2_SUBSTITUTE_EXTENDED
           substitute_overflow_length  use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
           substitute_unknown_unset    use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
           substitute_unset_empty      use PCRE2_SUBSTITUTE_UNSET_EMPTY

       These modifiers may not appear in a #pattern command. If you want
       them as defaults, set them in a #subject command.

   Saving a compiled pattern

       When a pattern with the push modifier is successfully compiled, it
       is pushed onto a stack of compiled patterns, and pcre2test expects
       the next line to contain a new pattern (or a command) instead of a
       subject line. This facility is used when saving compiled patterns to
       a file, as described in the section entitled "Saving and restoring
       compiled patterns" below. If pushcopy is used instead of push, a
       copy of the compiled pattern is stacked, leaving the original as
       current, ready to match the following input lines. This provides a
       way of testing the pcre2_code_copy() function. The push and pushcopy
       modifiers are incompatible with compilation modifiers such as global
       that act at match time. Any that are specified are ignored (for the
       stacked copy), with a warning message, except for replace, which
       causes an error. Note that jitverify, which is allowed, does not
       carry through to any subsequent matching that uses a stacked
       pattern.


SUBJECT MODIFIERS

       The modifiers that can appear in subject lines and the #subject
       command are of two types.

   Setting match options

       The following modifiers set options for pcre2_match() or
       pcre2_dfa_match(). See pcre2api for a description of their effects.

           anchored                  set PCRE2_ANCHORED
           dfa_restart               set PCRE2_DFA_RESTART
           dfa_shortest              set PCRE2_DFA_SHORTEST
           no_jit                    set PCRE2_NO_JIT
           no_utf_check              set PCRE2_NO_UTF_CHECK
           notbol                    set PCRE2_NOTBOL
           notempty                  set PCRE2_NOTEMPTY
           notempty_atstart          set PCRE2_NOTEMPTY_ATSTART
           noteol                    set PCRE2_NOTEOL
           partial_hard (or ph)      set PCRE2_PARTIAL_HARD
           partial_soft (or ps)      set PCRE2_PARTIAL_SOFT

       The partial matching modifiers are provided with abbreviations
       because they appear frequently in tests.

       If the /posix modifier was present on the pattern, causing the POSIX
       wrapper API to be used, the only option-setting modifiers that have
       any effect are notbol, notempty, and noteol, causing REG_NOTBOL,
       REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to
       regexec(). The other modifiers are ignored, with a warning message.

   Setting match controls

       The following modifiers affect the matching process or request
       additional information. Some of them may also be specified on a
       pattern line (see above), in which case they apply to every subject
       line that is matched against that pattern.

           aftertext                   show text after match
           allaftertext                show text after captures
           allcaptures                 show all captures
           allusedtext                 show all consulted text (non-JIT only)
           altglobal                   alternative global matching
           callout_capture             show captures at callout time
           callout_data=<n>            set a value to pass via callouts
           callout_fail=<n>[:<m>]      control callout failure
           callout_none                do not supply a callout function
           copy=<number or name>       copy captured substring
           dfa                         use pcre2_dfa_match()
           find_limits                 find match and recursion limits
           get=<number or name>        extract captured substring
           getall                      extract all captured substrings
       /g  global                      global matching
           jitstack=<n>                set size of JIT stack
           mark                        show mark values
           match_limit=<n>             set a match limit
           memory                      show memory usage
           null_context                match with a NULL context
           offset=<n>                  set starting offset
           offset_limit=<n>            set offset limit
           ovector=<n>                 set size of output vector
           recursion_limit=<n>         set a recursion limit
           replace=<string>            specify a replacement string
           startchar                   show startchar when relevant
           startoffset=<n>             same as offset=<n>
           substitute_extended         use PCRE2_SUBSTITUTE_EXTENDED
           substitute_overflow_length  use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
           substitute_unknown_unset    use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
           substitute_unset_empty      use PCRE2_SUBSTITUTE_UNSET_EMPTY
           zero_terminate              pass the subject as zero-terminated

       The effects of these modifiers are described in the following
       sections. When matching via the POSIX wrapper API, the aftertext,
       allaftertext, and ovector subject modifiers work as described below.
       All other modifiers are either ignored, with a warning message, or
       cause an error.

   Showing more text

       The aftertext modifier requests that as well as outputting the part
       of the subject string that matched the entire pattern, pcre2test
       should in addition output the remainder of the subject string. This
       is useful for tests where the subject contains multiple copies of
       the same substring. The allaftertext modifier requests the same
       action for captured substrings as well as the main matched
       substring. In each case the remainder is output on the following
       line with a plus character following the capture number.

       The allusedtext modifier requests that all the text that was
       consulted during a successful pattern match by the interpreter
       should be shown. This feature is not supported for JIT matching, and
       if requested with JIT it is ignored (with a warning message).
       Setting this modifier affects the output if there is a lookbehind at
       the start of a match, or a lookahead at the end, or if \K is used in
       the pattern. Characters that precede or follow the start and end of
       the actual match are indicated in the output by '<' or '>'
       characters underneath them. Here is an example:

           re> /(?<=pqr)abc(?=xyz)/
         data> 123pqrabcxyz456\=allusedtext
          0: pqrabcxyz
             <<<   >>>

       This shows that the matched string is "abc", with the preceding and
       following strings "pqr" and "xyz" having been consulted during the
       match (when processing the assertions).

       The startchar modifier requests that the starting character for the
       match be indicated, if it is different to the start of the matched
       string. The only time when this occurs is when \K has been processed
       as part of the match. In this situation, the output for the matched
       string is displayed from the starting character instead of from the
       match point, with circumflex characters under the earlier
       characters. For example:

           re> /abc\Kxyz/
         data> abcxyz\=startchar
          0: abcxyz
             ^^^

       Unlike allusedtext, the startchar modifier can be used with JIT.
       However, these two modifiers are mutually exclusive.

   Showing the value of all capture groups

       The allcaptures modifier requests that the values of all potential
       captured parentheses be output after a match. By default, only those
       up to the highest one actually used in the match are output
       (corresponding to the return code from pcre2_match()). Groups that
       did not take part in the match are output as "<unset>". This
       modifier is not relevant for DFA matching (which does no capturing);
       it is ignored, with a warning message, if present.

   Testing callouts

       A callout function is supplied when pcre2test calls the library
       matching functions, unless callout_none is specified. If
       callout_capture is set, the current captured groups are output when
       a callout occurs.

       The callout_fail modifier can be given one or two numbers. If there
       is only one number, 1 is returned instead of 0 when a callout of
       that number is reached. If two numbers are given, 1 is returned when
       callout <n> is reached for the <m>th time. Note that callouts with
       string arguments are always given the number zero. See "Callouts"
       below for a description of the output when a callout is taken.

       The callout_data modifier can be given an unsigned or a negative
       number.
       This is set as the "user data" that is passed to the matching
       function, and passed back when the callout function is invoked. Any
       value other than zero is used as a return from pcre2test's callout
       function.

   Finding all matches in a string

       Searching for all possible matches within a subject can be requested
       by the global or /altglobal modifier. After finding a match, the
       matching function is called again to search the remainder of the
       subject. The difference between global and altglobal is that the
       former uses the start_offset argument to pcre2_match() or
       pcre2_dfa_match() to start searching at a new point within the
       entire string (which is what Perl does), whereas the latter passes
       over a shortened subject. This makes a difference to the matching
       process if the pattern begins with a lookbehind assertion (including
       \b or \B).

       If an empty string is matched, the next match is done with the
       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to
       search for another, non-empty, match at the same point in the
       subject. If this match fails, the start offset is advanced, and the
       normal match is retried. This imitates the way Perl handles such
       cases when using the /g modifier or the split() function. Normally,
       the start offset is advanced by one character, but if the newline
       convention recognizes CRLF as a newline, and the current character
       is CR followed by LF, an advance of two characters occurs.

   Testing substring extraction functions

       The copy and get modifiers can be used to test the
       pcre2_substring_copy_xxx() and pcre2_substring_get_xxx() functions.
       They can be given more than once, and each can specify a group name
       or number, for example:

          abcd\=copy=1,copy=3,get=G1

       If the #subject command is used to set default copy and/or get
       lists, these can be unset by specifying a negative number to cancel
       all numbered groups and an empty name to cancel all named groups.

       The getall modifier tests pcre2_substring_list_get(), which extracts
       all captured substrings.

       If the subject line is successfully matched, the substrings
       extracted by the convenience functions are output with C, G, or L
       after the string number instead of a colon. This is in addition to
       the normal full list. The string length (that is, the return from
       the extraction function) is given in parentheses after each
       substring, followed by the name when the extraction was by name.

   Testing the substitution function

       If the replace modifier is set, the pcre2_substitute() function is
       called instead of one of the matching functions. Note that
       replacement strings cannot contain commas, because a comma signifies
       the end of a modifier. This is not thought to be an issue in a test
       program.

       Unlike subject strings, pcre2test does not process replacement
       strings for escape sequences. In UTF mode, a replacement string is
       checked to see if it is a valid UTF-8 string. If so, it is correctly
       converted to a UTF string of the appropriate code unit width. If it
       is not a valid UTF-8 string, the individual code units are copied
       directly. This provides a means of passing an invalid UTF-8 string
       for testing purposes.
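
       The conversion rule for replacement strings in UTF mode can be
       sketched in Python as follows. This is a minimal model of the rule
       as described above, not pcre2test's implementation; the function
       name replacement_code_units is hypothetical, and Python's strict
       UTF-8 decoder is assumed to stand in for the validity check.

```python
def replacement_code_units(raw, width):
    """Convert the bytes of a replacement string to a list of code units.

    Valid UTF-8 is transcoded to UTF-8, UTF-16 or UTF-32 according to
    `width`; anything else is copied byte for byte, one byte per unit.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return list(raw)                 # invalid UTF-8: copy directly
    if width == 8:
        return list(raw)
    if width == 16:
        data = text.encode("utf-16-le")
        return [int.from_bytes(data[i:i + 2], "little")
                for i in range(0, len(data), 2)]
    if width == 32:
        return [ord(ch) for ch in text]
    raise ValueError("width must be 8, 16 or 32")


valid = "\u00e9".encode("utf-8")         # two bytes, one character
invalid = b"\xff\x41"                    # 0xff can never occur in UTF-8
print([hex(u) for u in replacement_code_units(valid, 16)])
print([hex(u) for u in replacement_code_units(invalid, 16)])
```

       The valid string becomes a single 16-bit code unit, whereas the
       invalid one is passed through as two separate units.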

       The following modifiers set options (in addition to the normal match
       options) for pcre2_substitute():

         global                      PCRE2_SUBSTITUTE_GLOBAL
         substitute_extended         PCRE2_SUBSTITUTE_EXTENDED
         substitute_overflow_length  PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
         substitute_unknown_unset    PCRE2_SUBSTITUTE_UNKNOWN_UNSET
         substitute_unset_empty      PCRE2_SUBSTITUTE_UNSET_EMPTY

       After a successful substitution, the modified string is output,
       preceded by the number of replacements. This may be zero if there
       were no matches. Here is a simple example of a substitution test:

         /abc/replace=xxx
             =abc=abc=
          1: =xxx=abc=
             =abc=abc=\=global
          2: =xxx=xxx=

       Subject and replacement strings should be kept relatively short
       (fewer than 256 characters) for substitution tests, as fixed-size
       buffers are used. To make it easy to test for buffer overflow, if
       the replacement string starts with a number in square brackets, that
       number is passed to pcre2_substitute() as the size of the output
       buffer, with the replacement string starting at the next character.
       Here is an example that tests the edge case:

         /abc/
             123abc123\=replace=[10]XYZ
          1: 123XYZ123
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory

       The default action of pcre2_substitute() is to return
       PCRE2_ERROR_NOMEMORY when the output buffer is too small. However,
       if the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
       substitute_overflow_length modifier), pcre2_substitute() continues
       to go through the motions of matching and substituting, in order to
       compute the size of buffer that is required. When this happens,
       pcre2test shows the required buffer length (which includes space for
       the trailing zero) as part of the error message.
       For example:

         /abc/substitute_overflow_length
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory: 10 code units are needed

       A replacement string is ignored with POSIX and DFA matching.
       Specifying partial matching provokes an error return ("bad option
       value") from pcre2_substitute().

   Setting the JIT stack size

       The jitstack modifier provides a way of setting the maximum stack
       size that is used by the just-in-time optimization code. It is
       ignored if JIT optimization is not being used. The value is a number
       of kilobytes. Providing a stack that is larger than the default 32K
       is necessary only for very complicated patterns.

   Setting match and recursion limits

       The match_limit and recursion_limit modifiers set the appropriate
       limits in the match context. These values are ignored when the
       find_limits modifier is specified.

   Finding minimum limits

       If the find_limits modifier is present, pcre2test calls
       pcre2_match() several times, setting different values in the match
       context via pcre2_set_match_limit() and pcre2_set_recursion_limit()
       until it finds the minimum values for each parameter that allow
       pcre2_match() to complete without error.

       If JIT is being used, only the match limit is relevant. If DFA
       matching is being used, neither limit is relevant, and this modifier
       is ignored (with a warning message).

       The match_limit number is a measure of the amount of backtracking
       that takes place, and learning the minimum value can be instructive.
       For most simple matches, the number is quite small, but for patterns
       with very large numbers of matching possibilities, it can become
       large very quickly with increasing length of subject string.
       The recursion_limit number is a measure of how much stack (or, if
       PCRE2 is compiled with NO_RECURSE, how much heap) memory is needed
       to complete the match attempt.

   Showing MARK names

       The mark modifier causes the names from backtracking control verbs
       that are returned from calls to pcre2_match() to be displayed. If a
       mark is returned for a match, non-match, or partial match, pcre2test
       shows it. For a match, it is on a line by itself, tagged with "MK:".
       Otherwise, it is added to the non-match message.

   Showing memory usage

       The memory modifier causes pcre2test to log all memory allocation
       and freeing calls that occur during a match operation.

   Setting a starting offset

       The offset modifier sets an offset in the subject string at which
       matching starts. Its value is a number of code units, not
       characters.

   Setting an offset limit

       The offset_limit modifier sets a limit for unanchored matches. If a
       match cannot be found starting at or before this offset in the
       subject, a "no match" return is given. The data value is a number of
       code units, not characters. When this modifier is used, the
       use_offset_limit modifier must have been set for the pattern; if
       not, an error is generated.

   Setting the size of the output vector

       The ovector modifier applies only to the subject line in which it
       appears, though of course it can also be used to set a default in a
       #subject command. It specifies the number of pairs of offsets that
       are available for storing matching information. The default is 15.

       A value of zero is useful when testing the POSIX API because it
       causes regexec() to be called with a NULL capture vector.
       When not testing the POSIX API, a value of zero is used to cause
       pcre2_match_data_create_from_pattern() to be called, in order to
       create a match block of exactly the right size for the pattern. (It
       is not possible to create a match block with a zero-length ovector;
       there is always at least one pair of offsets.)

   Passing the subject as zero-terminated

       By default, the subject string is passed to a native API matching
       function with its correct length. In order to test the facility for
       passing a zero-terminated string, the zero_terminate modifier is
       provided. It causes the length to be passed as
       PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface, this
       modifier has no effect, as there is no facility for passing a
       length.)

       When testing pcre2_substitute(), this modifier also has the effect
       of passing the replacement string as zero-terminated.

   Passing a NULL context

       Normally, pcre2test passes a context block to pcre2_match(),
       pcre2_dfa_match() or pcre2_jit_match(). If the null_context modifier
       is set, however, NULL is passed. This is for testing that the
       matching functions behave correctly in this case (they use default
       values). This modifier cannot be used with the find_limits modifier
       or when testing the substitution function.


THE ALTERNATIVE MATCHING FUNCTION

       By default, pcre2test uses the standard PCRE2 matching function,
       pcre2_match() to match each subject line. PCRE2 also supports an
       alternative matching function, pcre2_dfa_match(), which operates in
       a different way, and has some restrictions. The differences between
       the two functions are described in the pcre2matching documentation.

       If the dfa modifier is set, the alternative matching function is
       used. This function finds all possible matches at a given point in
       the subject.
       If, however, the dfa_shortest modifier is set, processing stops
       after the first match is found. This is always the shortest possible
       match.


DEFAULT OUTPUT FROM pcre2test

       This section describes the output when the normal matching function,
       pcre2_match(), is being used.

       When a match succeeds, pcre2test outputs the list of captured
       substrings, starting with number 0 for the string that matched the
       whole pattern. Otherwise, it outputs "No match" when the return is
       PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
       matching substring when the return is PCRE2_ERROR_PARTIAL. (Note
       that this is the entire substring that was inspected during the
       partial match; it may include characters before the actual match
       start if a lookbehind assertion, \K, \b, or \B was involved.)

       For any other return, pcre2test outputs the PCRE2 negative error
       number and a short descriptive phrase. If the error is a failed UTF
       string check, the code unit offset of the start of the failing
       character is also output. Here is an example of an interactive
       pcre2test run.

         $ pcre2test
         PCRE2 version 9.00 2014-05-10

           re> /^abc(\d+)/
         data> abc123
          0: abc123
          1: 123
         data> xyz
         No match

       Unset capturing substrings that are not followed by one that is set
       are not shown by pcre2test unless the allcaptures modifier is
       specified. In the following example, there are two capturing
       substrings, but when the first data line is matched, the second,
       unset substring is not shown. An "internal" unset substring is shown
       as "<unset>", as for the second data line.

           re> /(a)|(b)/
         data> a
          0: a
          1: a
         data> b
          0: b
          1: <unset>
          2: b

       If the strings contain any non-printing characters, they are output
       as \xhh escapes if the value is less than 256 and UTF mode is not
       set.
       Otherwise they are output as \x{hh...} escapes. See below for the
       definition of non-printing characters. If the /aftertext modifier is
       set, the output for substring 0 is followed by the rest of the
       subject string, identified by "0+" like this:

           re> /cat/aftertext
         data> cataract
          0: cat
          0+ aract

       If global matching is requested, the results of successive matching
       attempts are output in sequence, like this:

           re> /\Bi(\w\w)/g
         data> Mississippi
          0: iss
          1: ss
          0: iss
          1: ss
          0: ipp
          1: pp

       "No match" is output only if the first match attempt fails. Here is
       an example of a failure message (the offset 4 that is specified by
       the offset modifier is past the end of the subject string):

           re> /xyz/
         data> xyz\=offset=4
         Error -24 (bad offset value)

       Note that whereas patterns can be continued over several lines (a
       plain ">" prompt is used for continuations), subject lines may not.
       However, newlines can be included in a subject by means of the \n
       escape (or \r, \r\n, etc., depending on the newline sequence
       setting).


OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

       When the alternative matching function, pcre2_dfa_match(), is used,
       the output consists of a list of all the matches that start at the
       first point in the subject where there is at least one match. For
       example:

           re> /(tang|tangerine|tan)/
         data> yellow tangerine\=dfa
          0: tangerine
          1: tang
          2: tan

       Using the normal matching function on this data finds only "tang".
       The longest matching string is always given first (and numbered
       zero). After a PCRE2_ERROR_PARTIAL return, the output is "Partial
       match:", followed by the partially matching substring.
       Note that this is the entire substring that was inspected during the
       partial match; it may include characters before the actual match
       start if a lookbehind assertion, \b, or \B was involved. (\K is not
       supported for DFA matching.)

       If global matching is requested, the search for further matches
       resumes at the end of the longest match. For example:

           re> /(tang|tangerine|tan)/g
         data> yellow tangerine and tangy sultana\=dfa
          0: tangerine
          1: tang
          2: tan
          0: tang
          1: tan
          0: tan

       The alternative matching function does not support substring
       capture, so the modifiers that are concerned with captured
       substrings are not relevant.


RESTARTING AFTER A PARTIAL MATCH

       When the alternative matching function has given the
       PCRE2_ERROR_PARTIAL return, indicating that the subject partially
       matched the pattern, you can restart the match with additional
       subject data by means of the dfa_restart modifier. For example:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=ps,dfa
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       For further information about partial matching, see the pcre2partial
       documentation.


CALLOUTS

       If the pattern contains any callout requests, pcre2test's callout
       function is called during matching unless callout_none is specified.
       This works with both matching functions.

       The callout function in pcre2test returns zero (carry on matching)
       by default, but you can use a callout_fail modifier in a subject
       line (as described above) to change this and other parameters of the
       callout.

       Inserting callouts can be helpful when using pcre2test to check
       complicated regular expressions. For further information about
       callouts, see the pcre2callout documentation.

       The output for callouts with numerical arguments and those with
       string arguments is slightly different.

   Callouts with numerical arguments

       By default, the callout function displays the callout number, the
       start and current positions in the subject text at the callout time,
       and the next pattern item to be tested. For example:

         --->pqrabcdef
           0    ^  ^     \d

       This output indicates that callout number 0 occurred for a match
       attempt starting at the fourth character of the subject string, when
       the pointer was at the seventh character, and when the next pattern
       item was \d. Just one circumflex is output if the start and current
       positions are the same, or if the current position precedes the
       start position, which can happen if the callout is in a lookbehind
       assertion.

       Callouts numbered 255 are assumed to be automatic callouts, inserted
       as a result of the /auto_callout pattern modifier. In this case,
       instead of showing the callout number, the offset in the pattern,
       preceded by a plus, is output. For example:

           re> /\d?[A-E]\*/auto_callout
         data> E*
         --->E*
          +0 ^      \d?
          +3 ^      [A-E]
          +8 ^^     \*
         +10 ^ ^
          0: E*

       If a pattern contains (*MARK) items, an additional line is output
       whenever a change of latest mark is passed to the callout function.
       For example:

           re> /a(*MARK:X)bc/auto_callout
         data> abc
         --->abc
          +0 ^       a
          +1 ^^      (*MARK:X)
         +10 ^^      b
         Latest Mark: X
         +11 ^ ^     c
         +12 ^  ^
          0: abc

       The mark changes between matching "a" and "b", but stays the same
       for the rest of the match, so nothing more is output. If, as a
       result of backtracking, the mark reverts to being unset, the text
       "<unset>" is output.
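
       The way the circumflexes line up under the subject can be modelled
       with a short Python sketch. It is not taken from pcre2test: the
       function name callout_lines is hypothetical, the column chosen for
       the next pattern item is an arbitrary assumption, and placing the
       single circumflex at the start position when the current position
       does not follow it is also an assumption.

```python
def callout_lines(subject, number, start, current, next_item):
    """Build the two display lines for a numbered callout (toy model)."""
    header = "--->" + subject
    # The callout number is right-aligned in three columns followed by a
    # space, so column 4 lines up with the first subject character.
    line = "%3d " % number
    line += " " * start + "^"
    if current > start:
        # Second circumflex under the current position.
        line += " " * (current - start - 1) + "^"
    # Assumption: the next pattern item starts four columns past the end
    # of the subject; the real program may use a different gap.
    line = line.ljust(len(header) + 4) + next_item
    return [header, line]


for text in callout_lines("pqrabcdef", 0, 3, 6, r"\d"):
    print(text)
```

       With a start offset of 3 and a current offset of 6, the two
       circumflexes fall under the fourth and seventh subject characters,
       as in the first example above.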

   Callouts with string arguments

       The output for a callout with a string argument is similar, except
       that instead of outputting a callout number before the position
       indicators, the callout string and its offset in the pattern string
       are output before the reflection of the subject string, and the
       subject string is reflected for each callout. For example:

           re> /^ab(?C'first')cd(?C"second")ef/
         data> abcdefg
         Callout (7): 'first'
         --->abcdefg
             ^ ^         c
         Callout (20): "second"
         --->abcdefg
             ^   ^       e
          0: abcdef


NON-PRINTING CHARACTERS

       When pcre2test is outputting text in the compiled version of a
       pattern, bytes other than 32-126 are always treated as non-printing
       characters and are therefore shown as hex escapes.

       When pcre2test is outputting text that is a matched part of a
       subject string, it behaves in the same way, unless a different
       locale has been set for the pattern (using the /locale modifier). In
       this case, the isprint() function is used to distinguish printing
       and non-printing characters.


SAVING AND RESTORING COMPILED PATTERNS

       It is possible to save compiled patterns on disc or elsewhere, and
       reload them later, subject to a number of restrictions. JIT data
       cannot be saved. The host on which the patterns are reloaded must be
       running the same version of PCRE2, with the same code unit width,
       and must also have the same endianness, pointer width and PCRE2_SIZE
       type. Before compiled patterns can be saved they must be serialized,
       that is, converted to a stream of bytes. A single byte stream may
       contain any number of compiled patterns, but they must all use the
       same character tables. A single copy of the tables is included in
       the byte stream (its size is 1088 bytes).

       The functions whose names begin with pcre2_serialize_ are used for
       serializing and de-serializing. They are described in the
       pcre2serialize documentation. In this section we describe the
       features of pcre2test that can be used to test these functions.

       When a pattern with the push modifier is successfully compiled, it
       is pushed onto a stack of compiled patterns, and pcre2test expects
       the next line to contain a new pattern (or command) instead of a
       subject line. By contrast, the pushcopy modifier causes a copy of
       the compiled pattern to be stacked, leaving the original available
       for immediate matching. By using push and/or pushcopy, a number of
       patterns can be compiled and retained. These modifiers are
       incompatible with posix, and control modifiers that act at match
       time are ignored (with a message) for the stacked patterns. The
       jitverify modifier applies only at compile time.

       The command

         #save <filename>

       causes all the stacked patterns to be serialized and the result
       written to the named file. Afterwards, all the stacked patterns are
       freed. The command

         #load <filename>

       reads the data in the file, and then arranges for it to be
       de-serialized, with the resulting compiled patterns added to the
       pattern stack. The pattern on the top of the stack can be retrieved
       by the #pop command, which must be followed by lines of subjects
       that are to be matched with the pattern, terminated as usual by an
       empty line or end of file. This command may be followed by a
       modifier list containing only control modifiers that act after a
       pattern has been compiled. In particular, hex, posix, posix_nosub,
       push, and pushcopy are not allowed, nor are any option-setting
       modifiers. The JIT modifiers are, however, permitted. Here is an
       example that saves and reloads two patterns.

         /abc/push
         /xyz/push
         #save tempfile
         #load tempfile
         #pop info
         xyz

         #pop jit,bincode
         abc

       If jitverify is used with #pop, it does not automatically imply jit,
       which is different behaviour from when it is used on a pattern.

       The #popcopy command is analogous to the pushcopy modifier in that
       it makes current a copy of the topmost stack pattern, leaving the
       original still on the stack.


SEE ALSO

       pcre2(3), pcre2api(3), pcre2callout(3), pcre2jit(3),
       pcre2matching(3), pcre2partial(3), pcre2pattern(3),
       pcre2serialize(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 06 July 2016
       Copyright (c) 1997-2016 University of Cambridge.