<html>
<head>
<title>pcre2test specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2test man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
<li><a name="TOC2" href="#SEC2">PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
<li><a name="TOC3" href="#SEC3">INPUT ENCODING</a>
<li><a name="TOC4" href="#SEC4">COMMAND LINE OPTIONS</a>
<li><a name="TOC5" href="#SEC5">DESCRIPTION</a>
<li><a name="TOC6" href="#SEC6">COMMAND LINES</a>
<li><a name="TOC7" href="#SEC7">MODIFIER SYNTAX</a>
<li><a name="TOC8" href="#SEC8">PATTERN SYNTAX</a>
<li><a name="TOC9" href="#SEC9">SUBJECT LINE SYNTAX</a>
<li><a name="TOC10" href="#SEC10">PATTERN MODIFIERS</a>
<li><a name="TOC11" href="#SEC11">SUBJECT MODIFIERS</a>
<li><a name="TOC12" href="#SEC12">THE ALTERNATIVE MATCHING FUNCTION</a>
<li><a name="TOC13" href="#SEC13">DEFAULT OUTPUT FROM pcre2test</a>
<li><a name="TOC14" href="#SEC14">OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION</a>
<li><a name="TOC15" href="#SEC15">RESTARTING AFTER A PARTIAL MATCH</a>
<li><a name="TOC16" href="#SEC16">CALLOUTS</a>
<li><a name="TOC17" href="#SEC17">NON-PRINTING CHARACTERS</a>
<li><a name="TOC18" href="#SEC18">SAVING AND RESTORING COMPILED PATTERNS</a>
<li><a name="TOC19" href="#SEC19">SEE ALSO</a>
<li><a name="TOC20" href="#SEC20">AUTHOR</a>
<li><a name="TOC21" href="#SEC21">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
<P>
<b>pcre2test [options] [input file [output file]]</b>
<br>
<br>
<b>pcre2test</b> is a test program for the PCRE2 regular expression libraries,
but it can also be used for
experimenting with regular expressions. This
document describes the features of the test program; for details of the regular
expressions themselves, see the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation. For details of the PCRE2 library function calls and their
options, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<P>
The input for <b>pcre2test</b> is a sequence of regular expression patterns and
subject strings to be matched. There are also command lines for setting
defaults and controlling some special actions. The output shows the result of
each match attempt. Modifiers on external or internal command lines, the
patterns, and the subject lines specify PCRE2 function options, control how the
subject is processed, and what output is produced.
</P>
<P>
As the original fairly simple PCRE library evolved, it acquired many different
features, and as a result, the original <b>pcretest</b> program ended up with a
lot of options in a messy, arcane syntax, for testing all the features. The
move to the new PCRE2 API provided an opportunity to re-implement the test
program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
are still many obscure modifiers, some of which are specifically designed for
use in conjunction with the test script and data files that are distributed as
part of PCRE2. All the modifiers are documented here, some without much
justification, but many of them are unlikely to be of use except when testing
the libraries.
</P>
<br><a name="SEC2" href="#TOC1">PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a><br>
<P>
Different versions of the PCRE2 library can be built to support character
strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
all three of these libraries may be simultaneously installed. The
<b>pcre2test</b> program can be used to test all the libraries.
However, its own
input and output are always in 8-bit format. When testing the 16-bit or 32-bit
libraries, patterns and subject strings are converted to 16- or 32-bit format
before being passed to the library functions. Results are converted back to
8-bit code units for output.
</P>
<P>
In the rest of this document, the names of library functions and structures
are given in generic form, for example, <b>pcre2_compile()</b>. The actual
names used in the libraries have a suffix _8, _16, or _32, as appropriate.
</P>
<br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
<P>
Input to <b>pcre2test</b> is processed line by line, either by calling the C
library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
below). The input is processed using C's string functions, so must not
contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
treats any bytes other than newline as data characters. In some Windows
environments character 26 (hex 1A) causes an immediate end of file, and no
further data is read.
</P>
<P>
For maximum portability, therefore, it is safest to avoid non-printing
characters in <b>pcre2test</b> input files. There is a facility for specifying
some or all of a pattern's characters as hexadecimal pairs, thus making it
possible to include binary zeroes in a pattern for testing purposes. Subject
lines are processed for backslash escapes, which makes it possible to include
any data value.
</P>
<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
<P>
<b>-8</b>
If the 8-bit library has been built, this option causes it to be used (this is
the default). If the 8-bit library has not been built, this option causes an
error.
</P>
<P>
<b>-16</b>
If the 16-bit library has been built, this option causes it to be used. If only
the 16-bit library has been built, this is the default.
If the 16-bit library
has not been built, this option causes an error.
</P>
<P>
<b>-32</b>
If the 32-bit library has been built, this option causes it to be used. If only
the 32-bit library has been built, this is the default. If the 32-bit library
has not been built, this option causes an error.
</P>
<P>
<b>-b</b>
Behave as if each pattern has the <b>/fullbincode</b> modifier; the full
internal binary form of the pattern is output after compilation.
</P>
<P>
<b>-C</b>
Output the version number of the PCRE2 library, and all available information
about the optional features that are included, and then exit with zero exit
code. All other options are ignored.
</P>
<P>
<b>-C</b> <i>option</i>
Output information about a specific build-time option, then exit. This
functionality is intended for use in scripts such as <b>RunTest</b>. The
following options output the value and set the exit code as indicated:
<pre>
  ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
               0x15 or 0x25
               0 if used in an ASCII environment
               exit code is always 0
  linksize   the configured internal link size (2, 3, or 4)
               exit code is set to the link size
  newline    the default newline setting:
               CR, LF, CRLF, ANYCRLF, or ANY
               exit code is always 0
  bsr        the default setting for what \R matches:
               ANYCRLF or ANY
               exit code is always 0
</pre>
The following options output 1 for true or 0 for false, and set the exit code
to the same value:
<pre>
  backslash-C  \C is supported (not locked out)
  ebcdic       compiled for an EBCDIC environment
  jit          just-in-time support is available
  pcre2-16     the 16-bit library was built
  pcre2-32     the 32-bit library was built
  pcre2-8      the 8-bit library was built
  unicode      Unicode support is available
</pre>
If an unknown option is given, an error message is output; the exit code is 0.
</P>
<P>
<b>-d</b>
Behave as if each pattern has the <b>debug</b> modifier; the internal
form and information about the compiled pattern are output after compilation;
<b>-d</b> is equivalent to <b>-b -i</b>.
</P>
<P>
<b>-dfa</b>
Behave as if each subject line has the <b>dfa</b> modifier; matching is done
using the <b>pcre2_dfa_match()</b> function instead of the default
<b>pcre2_match()</b>.
</P>
<P>
<b>-error</b> <i>number[,number,...]</i>
Call <b>pcre2_get_error_message()</b> for each of the error numbers in the
comma-separated list, display the resulting messages on the standard output,
then exit with zero exit code. The numbers may be positive or negative. This is
a convenience facility for PCRE2 maintainers.
</P>
<P>
<b>-help</b>
Output a brief summary of these options and then exit.
</P>
<P>
<b>-i</b>
Behave as if each pattern has the <b>/info</b> modifier; information about the
compiled pattern is given after compilation.
</P>
<P>
<b>-jit</b>
Behave as if each pattern line has the <b>jit</b> modifier; after successful
compilation, each pattern is passed to the just-in-time compiler, if available.
</P>
<P>
<b>-pattern</b> <i>modifier-list</i>
Behave as if each pattern line contains the given modifiers.
</P>
<P>
<b>-q</b>
Do not output the version number of <b>pcre2test</b> at the start of execution.
</P>
<P>
<b>-S</b> <i>size</i>
On Unix-like systems, set the size of the run-time stack to <i>size</i>
megabytes.
</P>
<P>
<b>-subject</b> <i>modifier-list</i>
Behave as if each subject line contains the given modifiers.
</P>
<P>
<b>-t</b>
Run each compile and match many times with a timer, and output the resulting
times per compile or match. When JIT is used, separate times are given for the
initial compile and the JIT compile.
You can control the number of iterations
that are used for timing by following <b>-t</b> with a number (as a separate
item on the command line). For example, "-t 1000" iterates 1000 times. The
default is to iterate 500,000 times.
</P>
<P>
<b>-tm</b>
This is like <b>-t</b> except that it times only the matching phase, not the
compile phase.
</P>
<P>
<b>-T</b> <b>-TM</b>
These behave like <b>-t</b> and <b>-tm</b>, but in addition, at the end of a run,
the total times for all compiles and matches are output.
</P>
<P>
<b>-version</b>
Output the PCRE2 version number and then exit.
</P>
<br><a name="SEC5" href="#TOC1">DESCRIPTION</a><br>
<P>
If <b>pcre2test</b> is given two filename arguments, it reads from the first and
writes to the second. If the first name is "-", input is taken from the
standard input. If <b>pcre2test</b> is given only one argument, it reads from
that file and writes to stdout. Otherwise, it reads from stdin and writes to
stdout.
</P>
<P>
When <b>pcre2test</b> is built, a configuration option can specify that it
should be linked with the <b>libreadline</b> or <b>libedit</b> library. When this
is done, if the input is from a terminal, it is read using the <b>readline()</b>
function. This provides line-editing and history facilities. The output from
the <b>-help</b> option states whether or not <b>readline()</b> will be used.
</P>
<P>
The program handles any number of tests, each of which consists of a set of
input lines. Each set starts with a regular expression pattern, followed by any
number of subject lines to be matched against that pattern. In between sets of
test data, command lines that begin with # may appear. This file format, with
some restrictions, can also be processed by the <b>perltest.sh</b> script that
is distributed with PCRE2 as a means of checking that the behaviour of PCRE2
and Perl is the same.
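The layout described above can be sketched as a small input file. This is an
illustrative example only: the patterns and subject strings are invented here
and are not taken from the PCRE2 test suite.
<pre>
  # Digits after a fixed prefix
  /^abc(\d+)/
      abc123
      abcdef

  /colou?r/
      the color red
      the colour blue
</pre>
Each pattern is followed by its own subject lines, and the empty line separates
the first test from the second so that the next line is read as a new pattern.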
</P>
<P>
When the input is a terminal, <b>pcre2test</b> prompts for each line of input,
using "re>" to prompt for regular expression patterns, and "data>" to prompt
for subject lines. Command lines starting with # can be entered only in
response to the "re>" prompt.
</P>
<P>
Each subject line is matched separately and independently. If you want to do
multi-line matches, you have to use the \n escape sequence (or \r or \r\n,
etc., depending on the newline setting) in a single line of input to encode the
newline sequences. There is no limit on the length of subject lines; the input
buffer is automatically extended if it is too small. There are replication
features that make it possible to generate long repetitive pattern or subject
lines without having to supply them explicitly.
</P>
<P>
An empty line or the end of the file signals the end of the subject lines for a
test, at which point a new pattern or command line is expected if there is
still input to be read.
</P>
<br><a name="SEC6" href="#TOC1">COMMAND LINES</a><br>
<P>
In between sets of test data, a line that begins with # is interpreted as a
command line. If the first character is followed by white space or an
exclamation mark, the line is treated as a comment, and ignored. Otherwise, the
following commands are recognized:
<pre>
  #forbid_utf
</pre>
Subsequent patterns automatically have the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP
options set, which locks out the use of the PCRE2_UTF and PCRE2_UCP options and
the use of (*UTF) and (*UCP) at the start of patterns. This command also forces
an error if a subsequent pattern contains any occurrences of \P, \p, or \X,
which are still supported when PCRE2_UTF is not set, but which require Unicode
property support to be included in the library.
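As a hedged sketch of the effect, the following invented input would be
expected to produce an error for each of the two patterns once
<b>#forbid_utf</b> is in force: the first because (*UTF) is locked out, and the
second because it uses \p.
<pre>
  #forbid_utf

  /(*UTF)abc/

  /\p{Lu}+/
</pre>
The error messages themselves are not shown here, because their exact wording
is not given in this document.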
</P>
<P>
This is a trigger guard that is used in test files to ensure that UTF or
Unicode property tests are not accidentally added to files that are used when
Unicode support is not included in the library. Setting PCRE2_NEVER_UTF and
PCRE2_NEVER_UCP as a default can also be obtained by the use of <b>#pattern</b>;
the difference is that <b>#forbid_utf</b> cannot be unset, and the automatic
options are not displayed in pattern information, to avoid cluttering up test
output.
<pre>
  #load &lt;filename&gt;
</pre>
This command is used to load a set of precompiled patterns from a file, as
described in the section entitled "Saving and restoring compiled patterns"
<a href="#saverestore">below.</a>
<pre>
  #newline_default [&lt;newline-list&gt;]
</pre>
When PCRE2 is built, a default newline convention can be specified. This
determines which characters and/or character pairs are recognized as indicating
a newline in a pattern or subject string. The default can be overridden when a
pattern is compiled. The standard test files contain tests of various newline
conventions, but the majority of the tests expect a single linefeed to be
recognized as a newline by default. Without special action the tests would fail
when PCRE2 is compiled with either CR or CRLF as the default newline.
</P>
<P>
The #newline_default command specifies a list of newline types that are
acceptable as the default. Each type must be one of CR, LF, CRLF, ANYCRLF, or
ANY (in upper or lower case), for example:
<pre>
  #newline_default LF Any anyCRLF
</pre>
If the default newline is in the list, this command has no effect. Otherwise,
except when testing the POSIX API, a <b>newline</b> modifier that specifies the
first newline convention in the list (LF in the above example) is added to any
pattern that does not already have a <b>newline</b> modifier. If the newline
list is empty, the feature is turned off.
This command is present in a number
of the standard test input files.
</P>
<P>
When the POSIX API is being tested there is no way to override the default
newline convention, though it is possible to set the newline convention from
within the pattern. A warning is given if the <b>posix</b> modifier is used when
<b>#newline_default</b> would set a default for the non-POSIX API.
<pre>
  #pattern &lt;modifier-list&gt;
</pre>
This command sets a default modifier list that applies to all subsequent
patterns. Modifiers on a pattern can change these settings.
<pre>
  #perltest
</pre>
The appearance of this line causes all subsequent modifier settings to be
checked for compatibility with the <b>perltest.sh</b> script, which is used to
confirm that Perl gives the same results as PCRE2. Also, apart from comment
lines, none of the other command lines are permitted, because they and many
of the modifiers are specific to <b>pcre2test</b>, and should not be used in
test files that are also processed by <b>perltest.sh</b>. The <b>#perltest</b>
command helps detect tests that are accidentally put in the wrong file.
<pre>
  #pop [&lt;modifiers&gt;]
  #popcopy [&lt;modifiers&gt;]
</pre>
These commands are used to manipulate the stack of compiled patterns, as
described in the section entitled "Saving and restoring compiled patterns"
<a href="#saverestore">below.</a>
<pre>
  #save &lt;filename&gt;
</pre>
This command is used to save a set of compiled patterns to a file, as described
in the section entitled "Saving and restoring compiled patterns"
<a href="#saverestore">below.</a>
<pre>
  #subject &lt;modifier-list&gt;
</pre>
This command sets a default modifier list that applies to all subsequent
subject lines. Modifiers on a subject line can change these settings.
</P>
<br><a name="SEC7" href="#TOC1">MODIFIER SYNTAX</a><br>
<P>
Modifier lists are used with both pattern and subject lines.
Items in a list
are separated by commas followed by optional white space. Trailing white space
in a modifier list is ignored. Some modifiers may be given for both patterns
and subject lines, whereas others are valid only for one or the other. Each
modifier has a long name, for example "anchored", and some of them must be
followed by an equals sign and a value, for example, "offset=12". Values cannot
contain comma characters, but may contain spaces. Modifiers that do not take
values may be preceded by a minus sign to turn off a previous setting.
</P>
<P>
A few of the more common modifiers can also be specified as single letters, for
example "i" for "caseless". In documentation, following the Perl convention,
these are written with a slash ("the /i modifier") for clarity. Abbreviated
modifiers must all be concatenated in the first item of a modifier list. If the
first item is not recognized as a long modifier name, it is interpreted as a
sequence of these abbreviations. For example:
<pre>
  /abc/ig,newline=cr,jit=3
</pre>
This is a pattern line whose modifier list starts with two one-letter modifiers
(/i and /g). The lower-case abbreviated modifiers are the same as used in Perl.
</P>
<br><a name="SEC8" href="#TOC1">PATTERN SYNTAX</a><br>
<P>
A pattern line must start with one of the following characters (common symbols,
excluding pattern meta-characters):
<pre>
  / ! " ' ` - = _ : ; , % & @ ~
</pre>
This is interpreted as the pattern's delimiter. A regular expression may be
continued over several input lines, in which case the newline characters are
included within it. It is possible to include the delimiter within the pattern
by escaping it with a backslash, for example
<pre>
  /abc\/def/
</pre>
If you do this, the escape and the delimiter form part of the pattern, but
since the delimiters are all non-alphanumeric, this does not affect its
interpretation.
If the terminating delimiter is immediately followed by a
backslash, for example,
<pre>
  /abc/\
</pre>
then a backslash is added to the end of the pattern. This is done to provide a
way of testing the error condition that arises if a pattern finishes with a
backslash, because
<pre>
  /abc\/
</pre>
is interpreted as the first line of a pattern that starts with "abc/", causing
pcre2test to read the next line as a continuation of the regular expression.
</P>
<P>
A pattern can be followed by a modifier list (details below).
</P>
<br><a name="SEC9" href="#TOC1">SUBJECT LINE SYNTAX</a><br>
<P>
Before each subject line is passed to <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the
line is scanned for backslash escapes. The following provide a means of
encoding non-printing characters in a visible way:
<pre>
  \a         alarm (BEL, \x07)
  \b         backspace (\x08)
  \e         escape (\x1b)
  \f         form feed (\x0c)
  \n         newline (\x0a)
  \r         carriage return (\x0d)
  \t         tab (\x09)
  \v         vertical tab (\x0b)
  \nnn       octal character (up to 3 octal digits); always
               a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
  \o{dd...}  octal character (any number of octal digits)
  \xhh       hexadecimal byte (up to 2 hex digits)
  \x{hh...}  hexadecimal character (any number of hex digits)
</pre>
The use of \x{hh...} is not dependent on the use of the <b>utf</b> modifier on
the pattern. It is recognized always. There may be any number of hexadecimal
digits inside the braces; invalid values provoke error messages.
</P>
<P>
Note that \xhh specifies one byte rather than one character in UTF-8 mode;
this makes it possible to construct invalid UTF-8 sequences for testing
purposes. On the other hand, \x{hh} is interpreted as a UTF-8 character in
UTF-8 mode, generating more than one byte if the value is greater than 127.
When testing the 8-bit library not in UTF-8 mode, \x{hh} generates one byte
for values less than 256, and causes an error for greater values.
</P>
<P>
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
possible to construct invalid UTF-16 sequences for testing purposes.
</P>
<P>
In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This makes it
possible to construct invalid UTF-32 sequences for testing purposes.
</P>
<P>
There is a special backslash sequence that specifies replication of one or more
characters:
<pre>
  \[&lt;characters&gt;]{&lt;count&gt;}
</pre>
This makes it possible to test long strings without having to provide them as
part of the file. For example:
<pre>
  \[abc]{4}
</pre>
is converted to "abcabcabcabc". This feature does not support nesting. To
include a closing square bracket in the characters, code it as \x5D.
</P>
<P>
A backslash followed by an equals sign marks the end of the subject string and
the start of a modifier list. For example:
<pre>
  abc\=notbol,notempty
</pre>
If the subject string is empty and \= is followed by white space, the line is
treated as a comment line, and is not used for matching. For example:
<pre>
  \= This is a comment.
  abc\= This is an invalid modifier list.
</pre>
A backslash followed by any other non-alphanumeric character just escapes that
character. A backslash followed by anything else causes an error. However, if
the very last character in the line is a backslash (and there is no modifier
list), it is ignored. This gives a way of passing an empty line as data, since
a real empty line terminates the data input.
</P>
<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
<P>
There are several types of modifier that can appear in pattern lines. Except
where noted below, they may also be used in <b>#pattern</b> commands.
A
pattern's modifier list can add to or override default modifiers that were set
by a previous <b>#pattern</b> command.
<a name="optionmodifiers"></a></P>
<br><b>
Setting compilation options
</b><br>
<P>
The following modifiers set options for <b>pcre2_compile()</b>. The most common
ones have single-letter abbreviations. See
<a href="pcre2api.html"><b>pcre2api</b></a>
for a description of their effects.
<pre>
      allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
      alt_bsux                  set PCRE2_ALT_BSUX
      alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
      alt_verbnames             set PCRE2_ALT_VERBNAMES
      anchored                  set PCRE2_ANCHORED
      auto_callout              set PCRE2_AUTO_CALLOUT
  /i  caseless                  set PCRE2_CASELESS
      dollar_endonly            set PCRE2_DOLLAR_ENDONLY
  /s  dotall                    set PCRE2_DOTALL
      dupnames                  set PCRE2_DUPNAMES
  /x  extended                  set PCRE2_EXTENDED
      firstline                 set PCRE2_FIRSTLINE
      match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
  /m  multiline                 set PCRE2_MULTILINE
      never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
      never_ucp                 set PCRE2_NEVER_UCP
      never_utf                 set PCRE2_NEVER_UTF
      no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
      no_auto_possess           set PCRE2_NO_AUTO_POSSESS
      no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
      no_start_optimize         set PCRE2_NO_START_OPTIMIZE
      no_utf_check              set PCRE2_NO_UTF_CHECK
      ucp                       set PCRE2_UCP
      ungreedy                  set PCRE2_UNGREEDY
      use_offset_limit          set PCRE2_USE_OFFSET_LIMIT
      utf                       set PCRE2_UTF
</pre>
As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
non-printing characters in output strings to be printed using the \x{hh...}
notation. Otherwise, those less than 0x100 are output in hex without the curly
brackets.
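As an illustrative sketch (the patterns and subjects are invented for this
example), the first pattern below sets PCRE2_MULTILINE and PCRE2_CASELESS by
means of single-letter abbreviations, and the second sets PCRE2_DOTALL and
PCRE2_ANCHORED by means of long names:
<pre>
  /^second$/mi
      first\nSECOND\nthird

  /a.c/dotall,anchored
      a\nc
</pre>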
<a name="controlmodifiers"></a></P>
<br><b>
Setting compilation controls
</b><br>
<P>
The following modifiers affect the compilation process or request information
about the pattern:
<pre>
      bsr=[anycrlf|unicode]     specify \R handling
  /B  bincode                   show binary code without lengths
      callout_info              show callout information
      debug                     same as info,fullbincode
      fullbincode               show binary code with lengths
  /I  info                      show info about compiled pattern
      hex                       unquoted characters are hexadecimal
      jit[=&lt;number&gt;]            use JIT
      jitfast                   use JIT fast path
      jitverify                 verify JIT use
      locale=&lt;name&gt;             use this locale
      max_pattern_length=&lt;n&gt;    set the maximum pattern length
      memory                    show memory used
      newline=&lt;type&gt;            set newline type
      null_context              compile with a NULL context
      parens_nest_limit=&lt;n&gt;     set maximum parentheses depth
      posix                     use the POSIX API
      posix_nosub               use the POSIX API with REG_NOSUB
      push                      push compiled pattern onto the stack
      pushcopy                  push a copy onto the stack
      stackguard=&lt;number&gt;       test the stackguard feature
      tables=[0|1|2]            select internal tables
</pre>
The effects of these modifiers are described in the following sections.
</P>
<br><b>
Newline and \R handling
</b><br>
<P>
The <b>bsr</b> modifier specifies what \R in a pattern should match. If it is
set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to "unicode",
\R matches any Unicode newline sequence. The default is specified when PCRE2
is built; unless a different choice is made at build time, it is Unicode.
</P>
<P>
The <b>newline</b> modifier specifies which characters are to be interpreted as
newlines, both in the pattern and in subject lines. The type must be one of CR,
LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
</P>
<br><b>
Information about a pattern
</b><br>
<P>
The <b>debug</b> modifier is a shorthand for <b>info,fullbincode</b>, requesting
all available information.
</P>
<P>
The <b>bincode</b> modifier causes a representation of the compiled code to be
output after compilation. This information does not contain length and offset
values, which ensures that the same output is generated for different internal
link sizes and different code unit widths. By using <b>bincode</b>, the same
regression tests can be used in different environments.
</P>
<P>
The <b>fullbincode</b> modifier, by contrast, <i>does</i> include length and
offset values. This is used in a few special tests that run only for specific
code unit widths and link sizes, and is also useful for one-off tests.
</P>
<P>
The <b>info</b> modifier requests information about the compiled pattern
(whether it is anchored, has a fixed first character, and so on). The
information is obtained from the <b>pcre2_pattern_info()</b> function. Here are
some typical examples:
<pre>
    re> /(?i)(^a|^b)/m,info
  Capturing subpattern count = 1
  Compile options: multiline
  Overall options: caseless multiline
  First code unit at start or follows newline
  Subject length lower bound = 1

    re> /(?i)abc/info
  Capturing subpattern count = 0
  Compile options: &lt;none&gt;
  Overall options: caseless
  First code unit = 'a' (caseless)
  Last code unit = 'c' (caseless)
  Subject length lower bound = 3
</pre>
"Compile options" are those specified by modifiers; "overall options" have
added options that are taken or deduced from the pattern. If both sets of
options are the same, just a single "options" line is output; if there are no
options, the line is omitted. "First code unit" is where any match must start;
if there is more than one they are listed as "starting code units". "Last code
unit" is the last literal code unit that must be present in any match. This is
not necessarily the last character. These lines are omitted if no starting or
ending code units are recorded.
</P>
<P>
The <b>callout_info</b> modifier requests information about all the callouts in
the pattern. A list of them is output at the end of any other information that
is requested. For each callout, either its number or string is given, followed
by the item that follows it in the pattern.
</P>
<br><b>
Passing a NULL context
</b><br>
<P>
Normally, <b>pcre2test</b> passes a context block to <b>pcre2_compile()</b>. If
the <b>null_context</b> modifier is set, however, NULL is passed. This is for
testing that <b>pcre2_compile()</b> behaves correctly in this case (it uses
default values).
</P>
<br><b>
Specifying pattern characters in hexadecimal
</b><br>
<P>
The <b>hex</b> modifier specifies that the characters of the pattern, except for
substrings enclosed in single or double quotes, are to be interpreted as pairs
of hexadecimal digits. This feature is provided as a way of creating patterns
that contain binary zeros and other non-printing characters. White space is
permitted between pairs of digits. For example, this pattern contains three
characters:
<pre>
  /ab 32 59/hex
</pre>
Parts of such a pattern are taken literally if quoted. This pattern contains
nine characters, only two of which are specified in hexadecimal:
<pre>
  /ab "literal" 32/hex
</pre>
Either single or double quotes may be used. There is no way of including
the delimiter within a substring.
</P>
<P>
By default, <b>pcre2test</b> passes patterns as zero-terminated strings to
<b>pcre2_compile()</b>, giving the length as PCRE2_ZERO_TERMINATED. However, for
patterns specified with the <b>hex</b> modifier, the actual length of the
pattern is passed.
</P>
<br><b>
Generating long repetitive patterns
</b><br>
<P>
Some tests use long patterns that are very repetitive.
Instead of creating a
very long input line for such a pattern, you can use a special repetition
feature, similar to the one described for subject lines above. If the
<b>expand</b> modifier is present on a pattern, parts of the pattern that have
the form
<pre>
  \[&lt;characters&gt;]{&lt;count&gt;}
</pre>
are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This construction
cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
by decimal digits and "}" is found later in the pattern. If not, the characters
remain in the pattern unaltered.
</P>
<P>
If part of an expanded pattern looks like an expansion, but is really part of
the actual pattern, unwanted expansion can be avoided by giving two values in
the quantifier. For example, \[AB]{6000,6000} is not recognized as an
expansion item.
</P>
<P>
If the <b>info</b> modifier is set on an expanded pattern, the result of the
expansion is included in the information that is output.
</P>
<br><b>
JIT compilation
</b><br>
<P>
Just-in-time (JIT) compiling is a heavyweight optimization that can greatly
speed up pattern matching. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for details. JIT compiling happens, optionally, after a pattern
has been successfully compiled into an internal form. The JIT compiler converts
this to optimized machine code. It needs to know whether the match-time options
PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, because
different code is generated for the different cases. See the <b>partial</b>
modifier in "Subject Modifiers"
<a href="#subjectmodifiers">below</a>
for details of how these options are specified for each match attempt.
738</P> 739<P> 740JIT compilation is requested by the <b>/jit</b> pattern modifier, which may 741optionally be followed by an equals sign and a number in the range 0 to 7. 742The three bits that make up the number specify which of the three JIT operating 743modes are to be compiled: 744<pre> 745 1 compile JIT code for non-partial matching 746 2 compile JIT code for soft partial matching 747 4 compile JIT code for hard partial matching 748</pre> 749The possible values for the <b>/jit</b> modifier are therefore: 750<pre> 751 0 disable JIT 752 1 normal matching only 753 2 soft partial matching only 754 3 normal and soft partial matching 755 4 hard partial matching only 756 6 soft and hard partial matching only 757 7 all three modes 758</pre> 759If no number is given, 7 is assumed. The phrase "partial matching" means a call 760to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the 761PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete 762match; the options enable the possibility of a partial match, but do not 763require it. Note also that if you request JIT compilation only for partial 764matching (for example, /jit=2) but do not set the <b>partial</b> modifier on a 765subject line, that match will not use JIT code because none was compiled for 766non-partial matching. 767</P> 768<P> 769If JIT compilation is successful, the compiled JIT code will automatically be 770used when an appropriate type of match is run, except when incompatible 771run-time options are specified. For more details, see the 772<a href="pcre2jit.html"><b>pcre2jit</b></a> 773documentation. See also the <b>jitstack</b> modifier below for a way of 774setting the size of the JIT stack. 775</P> 776<P> 777If the <b>jitfast</b> modifier is specified, matching is done using the JIT 778"fast path" interface, <b>pcre2_jit_match()</b>, which skips some of the sanity 779checks that are done by <b>pcre2_match()</b>, and of course does not work when 780JIT is not supported. 
If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is 781assumed. 782</P> 783<P> 784If the <b>jitverify</b> modifier is specified, information about the compiled 785pattern shows whether JIT compilation was or was not successful. If 786<b>jitverify</b> is specified without <b>jit</b>, jit=7 is assumed. If JIT 787compilation is successful when <b>jitverify</b> is set, the text "(JIT)" is 788added to the first output line after a match or non match when JIT-compiled 789code was actually used in the match. 790</P> 791<br><b> 792Setting a locale 793</b><br> 794<P> 795The <b>/locale</b> modifier must specify the name of a locale, for example: 796<pre> 797 /pattern/locale=fr_FR 798</pre> 799The given locale is set, <b>pcre2_maketables()</b> is called to build a set of 800character tables for the locale, and this is then passed to 801<b>pcre2_compile()</b> when compiling the regular expression. The same tables 802are used when matching the following subject lines. The <b>/locale</b> modifier 803applies only to the pattern on which it appears, but can be given in a 804<b>#pattern</b> command if a default is needed. Setting a locale and alternate 805character tables are mutually exclusive. 806</P> 807<br><b> 808Showing pattern memory 809</b><br> 810<P> 811The <b>/memory</b> modifier causes the size in bytes of the memory used to hold 812the compiled pattern to be output. This does not include the size of the 813<b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is 814subsequently passed to the JIT compiler, the size of the JIT compiled code is 815also output. Here is an example: 816<pre> 817 re> /a(b)c/jit,memory 818 Memory allocation (code space): 21 819 Memory allocation (JIT code): 1910 820 821</PRE> 822</P> 823<br><b> 824Limiting nested parentheses 825</b><br> 826<P> 827The <b>parens_nest_limit</b> modifier sets a limit on the depth of nested 828parentheses in a pattern. Breaching the limit causes a compilation error. 
829The default for the library is set when PCRE2 is built, but <b>pcre2test</b> 830sets its own default of 220, which is required for running the standard test 831suite. 832</P> 833<br><b> 834Limiting the pattern length 835</b><br> 836<P> 837The <b>max_pattern_length</b> modifier sets a limit, in code units, to the 838length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit 839causes a compilation error. The default is the largest number a PCRE2_SIZE 840variable can hold (essentially unlimited). 841</P> 842<br><b> 843Using the POSIX wrapper API 844</b><br> 845<P> 846The <b>/posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call 847PCRE2 via the POSIX wrapper API rather than its native API. When 848<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to 849<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that 850it does not imply POSIX matching semantics; for more detail see the 851<a href="pcre2posix.html"><b>pcre2posix</b></a> 852documentation. The following pattern modifiers set options for the 853<b>regcomp()</b> function: 854<pre> 855 caseless REG_ICASE 856 multiline REG_NEWLINE 857 dotall REG_DOTALL ) 858 ungreedy REG_UNGREEDY ) These options are not part of 859 ucp REG_UCP ) the POSIX standard 860 utf REG_UTF8 ) 861</pre> 862The <b>regerror_buffsize</b> modifier specifies a size for the error buffer that 863is passed to <b>regerror()</b> in the event of a compilation error. For example: 864<pre> 865 /abc/posix,regerror_buffsize=20 866</pre> 867This provides a means of testing the behaviour of <b>regerror()</b> when the 868buffer is too small for the error message. If this modifier has not been set, a 869large buffer is used. 870</P> 871<P> 872The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described 873below. All other modifiers are either ignored, with a warning message, or cause 874an error. 
875</P> 876<br><b> 877Testing the stack guard feature 878</b><br> 879<P> 880The <b>/stackguard</b> modifier is used to test the use of 881<b>pcre2_set_compile_recursion_guard()</b>, a function that is provided to 882enable stack availability to be checked during compilation (see the 883<a href="pcre2api.html"><b>pcre2api</b></a> 884documentation for details). If the number specified by the modifier is greater 885than zero, <b>pcre2_set_compile_recursion_guard()</b> is called to set up 886callback from <b>pcre2_compile()</b> to a local function. The argument it 887receives is the current nesting parenthesis depth; if this is greater than the 888value given by the modifier, non-zero is returned, causing the compilation to 889be aborted. 890</P> 891<br><b> 892Using alternative character tables 893</b><br> 894<P> 895The value specified for the <b>/tables</b> modifier must be one of the digits 0, 8961, or 2. It causes a specific set of built-in character tables to be passed to 897<b>pcre2_compile()</b>. This is used in the PCRE2 tests to check behaviour with 898different character tables. The digit specifies the tables as follows: 899<pre> 900 0 do not pass any special character tables 901 1 the default ASCII tables, as distributed in 902 pcre2_chartables.c.dist 903 2 a set of tables defining ISO 8859 characters 904</pre> 905In table 2, some characters whose codes are greater than 128 are identified as 906letters, digits, spaces, etc. Setting alternate character tables and a locale 907are mutually exclusive. 908</P> 909<br><b> 910Setting certain match controls 911</b><br> 912<P> 913The following modifiers are really subject modifiers, and are described below. 914However, they may be included in a pattern's modifier list, in which case they 915are applied to every subject line that is processed with that pattern. They may 916not appear in <b>#pattern</b> commands. These modifiers do not affect the 917compilation process. 
918<pre> 919 aftertext show text after match 920 allaftertext show text after captures 921 allcaptures show all captures 922 allusedtext show all consulted text 923 /g global global matching 924 mark show mark values 925 replace=<string> specify a replacement string 926 startchar show starting character when relevant 927 substitute_extended use PCRE2_SUBSTITUTE_EXTENDED 928 substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH 929 substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET 930 substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY 931</pre> 932These modifiers may not appear in a <b>#pattern</b> command. If you want them as 933defaults, set them in a <b>#subject</b> command. 934</P> 935<br><b> 936Saving a compiled pattern 937</b><br> 938<P> 939When a pattern with the <b>push</b> modifier is successfully compiled, it is 940pushed onto a stack of compiled patterns, and <b>pcre2test</b> expects the next 941line to contain a new pattern (or a command) instead of a subject line. This 942facility is used when saving compiled patterns to a file, as described in the 943section entitled "Saving and restoring compiled patterns" 944<a href="#saverestore">below.</a> If <b>pushcopy</b> is used instead of <b>push</b>, a copy of the compiled 945pattern is stacked, leaving the original as current, ready to match the 946following input lines. This provides a way of testing the 947<b>pcre2_code_copy()</b> function. 948The <b>push</b> and <b>pushcopy</b> modifiers are incompatible with compilation 949modifiers such as <b>global</b> that act at match time. Any that are specified 950are ignored (for the stacked copy), with a warning message, except for 951<b>replace</b>, which causes an error. Note that <b>jitverify</b>, which is 952allowed, does not carry through to any subsequent matching that uses a stacked 953pattern.
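The difference between <b>push</b> and <b>pushcopy</b> can be pictured with a toy model. The Python sketch below is not how <b>pcre2test</b> is implemented (it is a C program); the class name, the dictionary standing in for a compiled pattern, and the choice to clear the current pattern after <b>push</b> are all assumptions made purely for illustration.

```python
import copy


class PatternSession:
    """Toy model of the current pattern and the stack of saved patterns."""

    def __init__(self):
        self.current = None   # pattern that following subject lines would use
        self.stack = []       # patterns saved by push or pushcopy

    def compile(self, source, modifier=None):
        compiled = {"source": source}   # stand-in for a compiled pattern
        if modifier == "push":
            # The pattern itself is stacked; a new pattern is expected next,
            # so there is nothing current to match subject lines against.
            self.stack.append(compiled)
            self.current = None
        elif modifier == "pushcopy":
            # A copy is stacked; the original stays current for matching.
            self.stack.append(copy.deepcopy(compiled))
            self.current = compiled
        else:
            self.current = compiled
        return compiled


if __name__ == "__main__":
    session = PatternSession()
    session.compile("abc", "push")
    session.compile("def", "pushcopy")
    print(len(session.stack), session.current)
```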
954<a name="subjectmodifiers"></a></P> 955<br><a name="SEC11" href="#TOC1">SUBJECT MODIFIERS</a><br> 956<P> 957The modifiers that can appear in subject lines and the <b>#subject</b> 958command are of two types. 959</P> 960<br><b> 961Setting match options 962</b><br> 963<P> 964The following modifiers set options for <b>pcre2_match()</b> or 965<b>pcre2_dfa_match()</b>. See 966<a href="pcre2api.html"><b>pcre2api</b></a> 967for a description of their effects. 968<pre> 969 anchored set PCRE2_ANCHORED 970 dfa_restart set PCRE2_DFA_RESTART 971 dfa_shortest set PCRE2_DFA_SHORTEST 972 no_jit set PCRE2_NO_JIT 973 no_utf_check set PCRE2_NO_UTF_CHECK 974 notbol set PCRE2_NOTBOL 975 notempty set PCRE2_NOTEMPTY 976 notempty_atstart set PCRE2_NOTEMPTY_ATSTART 977 noteol set PCRE2_NOTEOL 978 partial_hard (or ph) set PCRE2_PARTIAL_HARD 979 partial_soft (or ps) set PCRE2_PARTIAL_SOFT 980</pre> 981The partial matching modifiers are provided with abbreviations because they 982appear frequently in tests. 983</P> 984<P> 985If the <b>/posix</b> modifier was present on the pattern, causing the POSIX 986wrapper API to be used, the only option-setting modifiers that have any effect 987are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL, 988REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>. 989The other modifiers are ignored, with a warning message. 990</P> 991<br><b> 992Setting match controls 993</b><br> 994<P> 995The following modifiers affect the matching process or request additional 996information. Some of them may also be specified on a pattern line (see above), 997in which case they apply to every subject line that is matched against that 998pattern.
999<pre> 1000 aftertext show text after match 1001 allaftertext show text after captures 1002 allcaptures show all captures 1003 allusedtext show all consulted text (non-JIT only) 1004 altglobal alternative global matching 1005 callout_capture show captures at callout time 1006 callout_data=<n> set a value to pass via callouts 1007 callout_fail=<n>[:<m>] control callout failure 1008 callout_none do not supply a callout function 1009 copy=<number or name> copy captured substring 1010 dfa use <b>pcre2_dfa_match()</b> 1011 find_limits find match and recursion limits 1012 get=<number or name> extract captured substring 1013 getall extract all captured substrings 1014 /g global global matching 1015 jitstack=<n> set size of JIT stack 1016 mark show mark values 1017 match_limit=<n> set a match limit 1018 memory show memory usage 1019 null_context match with a NULL context 1020 offset=<n> set starting offset 1021 offset_limit=<n> set offset limit 1022 ovector=<n> set size of output vector 1023 recursion_limit=<n> set a recursion limit 1024 replace=<string> specify a replacement string 1025 startchar show startchar when relevant 1026 startoffset=<n> same as offset=<n> 1027 substitute_extended use PCRE2_SUBSTITUTE_EXTENDED 1028 substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH 1029 substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET 1030 substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY 1031 zero_terminate pass the subject as zero-terminated 1032</pre> 1033The effects of these modifiers are described in the following sections. When 1034matching via the POSIX wrapper API, the <b>aftertext</b>, <b>allaftertext</b>, 1035and <b>ovector</b> subject modifiers work as described below. All other 1036modifiers are either ignored, with a warning message, or cause an error.
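In the examples in this document, modifiers such as these are appended to a subject line after the sequence \= and are separated by commas. The Python sketch below shows one way such a line could be split into the subject text and a list of name and value pairs. It is only an illustration: the function name is hypothetical, and the real parser in <b>pcre2test</b> does more (for example, it validates names and handles abbreviations).

```python
def split_subject_line(line):
    """Split a subject line into (subject, [(name, value), ...]).

    Hypothetical helper: everything after the first backslash-equals
    sequence is treated as a comma-separated modifier list. A modifier
    without "=" gets the value None.
    """
    subject, marker, modifier_text = line.partition("\\=")
    modifiers = []
    if marker:
        for item in modifier_text.split(","):
            item = item.strip()
            if not item:
                continue
            name, equals, value = item.partition("=")
            modifiers.append((name, value if equals else None))
    return subject, modifiers


if __name__ == "__main__":
    print(split_subject_line(r"abcd\=copy=1,copy=3,get=G1"))
```

Because a comma ends a modifier, a value such as a replacement string cannot itself contain a comma in this scheme.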
1037</P> 1038<br><b> 1039Showing more text 1040</b><br> 1041<P> 1042The <b>aftertext</b> modifier requests that as well as outputting the part of 1043the subject string that matched the entire pattern, <b>pcre2test</b> should in 1044addition output the remainder of the subject string. This is useful for tests 1045where the subject contains multiple copies of the same substring. The 1046<b>allaftertext</b> modifier requests the same action for captured substrings as 1047well as the main matched substring. In each case the remainder is output on the 1048following line with a plus character following the capture number. 1049</P> 1050<P> 1051The <b>allusedtext</b> modifier requests that all the text that was consulted 1052during a successful pattern match by the interpreter should be shown. This 1053feature is not supported for JIT matching, and if requested with JIT it is 1054ignored (with a warning message). Setting this modifier affects the output if 1055there is a lookbehind at the start of a match, or a lookahead at the end, or if 1056\K is used in the pattern. Characters that precede or follow the start and end 1057of the actual match are indicated in the output by '<' or '>' characters 1058underneath them. Here is an example: 1059<pre> 1060 re> /(?<=pqr)abc(?=xyz)/ 1061 data> 123pqrabcxyz456\=allusedtext 1062 0: pqrabcxyz 1063 <<< >>> 1064</pre> 1065This shows that the matched string is "abc", with the preceding and following 1066strings "pqr" and "xyz" having been consulted during the match (when processing 1067the assertions). 1068</P> 1069<P> 1070The <b>startchar</b> modifier requests that the starting character for the match 1071be indicated, if it is different to the start of the matched string. The only 1072time when this occurs is when \K has been processed as part of the match. 
In 1073this situation, the output for the matched string is displayed from the 1074starting character instead of from the match point, with circumflex characters 1075under the earlier characters. For example: 1076<pre> 1077 re> /abc\Kxyz/ 1078 data> abcxyz\=startchar 1079 0: abcxyz 1080 ^^^ 1081</pre> 1082Unlike <b>allusedtext</b>, the <b>startchar</b> modifier can be used with JIT. 1083However, these two modifiers are mutually exclusive. 1084</P> 1085<br><b> 1086Showing the value of all capture groups 1087</b><br> 1088<P> 1089The <b>allcaptures</b> modifier requests that the values of all potential 1090captured parentheses be output after a match. By default, only those up to the 1091highest one actually used in the match are output (corresponding to the return 1092code from <b>pcre2_match()</b>). Groups that did not take part in the match 1093are output as "<unset>". This modifier is not relevant for DFA matching (which 1094does no capturing); it is ignored, with a warning message, if present. 1095</P> 1096<br><b> 1097Testing callouts 1098</b><br> 1099<P> 1100A callout function is supplied when <b>pcre2test</b> calls the library matching 1101functions, unless <b>callout_none</b> is specified. If <b>callout_capture</b> is 1102set, the current captured groups are output when a callout occurs. 1103</P> 1104<P> 1105The <b>callout_fail</b> modifier can be given one or two numbers. If there is 1106only one number, 1 is returned instead of 0 when a callout of that number is 1107reached. If two numbers are given, 1 is returned when callout <n> is reached 1108for the <m>th time. Note that callouts with string arguments are always given 1109the number zero. See "Callouts" below for a description of the output when a 1110callout is taken. 1111</P> 1112<P> 1113The <b>callout_data</b> modifier can be given an unsigned or a negative number. 1114This is set as the "user data" that is passed to the matching function, and 1115passed back when the callout function is invoked.
Any value other than zero is 1116used as a return from <b>pcre2test</b>'s callout function. 1117</P> 1118<br><b> 1119Finding all matches in a string 1120</b><br> 1121<P> 1122Searching for all possible matches within a subject can be requested by the 1123<b>global</b> or <b>/altglobal</b> modifier. After finding a match, the matching 1124function is called again to search the remainder of the subject. The difference 1125between <b>global</b> and <b>altglobal</b> is that the former uses the 1126<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b> 1127to start searching at a new point within the entire string (which is what Perl 1128does), whereas the latter passes over a shortened subject. This makes a 1129difference to the matching process if the pattern begins with a lookbehind 1130assertion (including \b or \B). 1131</P> 1132<P> 1133If an empty string is matched, the next match is done with the 1134PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for 1135another, non-empty, match at the same point in the subject. If this match 1136fails, the start offset is advanced, and the normal match is retried. This 1137imitates the way Perl handles such cases when using the <b>/g</b> modifier or 1138the <b>split()</b> function. Normally, the start offset is advanced by one 1139character, but if the newline convention recognizes CRLF as a newline, and the 1140current character is CR followed by LF, an advance of two characters occurs. 1141</P> 1142<br><b> 1143Testing substring extraction functions 1144</b><br> 1145<P> 1146The <b>copy</b> and <b>get</b> modifiers can be used to test the 1147<b>pcre2_substring_copy_xxx()</b> and <b>pcre2_substring_get_xxx()</b> functions. 
1148They can be given more than once, and each can specify a group name or number, 1149for example: 1150<pre> 1151 abcd\=copy=1,copy=3,get=G1 1152</pre> 1153If the <b>#subject</b> command is used to set default copy and/or get lists, 1154these can be unset by specifying a negative number to cancel all numbered 1155groups and an empty name to cancel all named groups. 1156</P> 1157<P> 1158The <b>getall</b> modifier tests <b>pcre2_substring_list_get()</b>, which 1159extracts all captured substrings. 1160</P> 1161<P> 1162If the subject line is successfully matched, the substrings extracted by the 1163convenience functions are output with C, G, or L after the string number 1164instead of a colon. This is in addition to the normal full list. The string 1165length (that is, the return from the extraction function) is given in 1166parentheses after each substring, followed by the name when the extraction was 1167by name. 1168</P> 1169<br><b> 1170Testing the substitution function 1171</b><br> 1172<P> 1173If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is 1174called instead of one of the matching functions. Note that replacement strings 1175cannot contain commas, because a comma signifies the end of a modifier. This is 1176not thought to be an issue in a test program. 1177</P> 1178<P> 1179Unlike subject strings, <b>pcre2test</b> does not process replacement strings 1180for escape sequences. In UTF mode, a replacement string is checked to see if it 1181is a valid UTF-8 string. If so, it is correctly converted to a UTF string of 1182the appropriate code unit width. If it is not a valid UTF-8 string, the 1183individual code units are copied directly. This provides a means of passing an 1184invalid UTF-8 string for testing purposes. 
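The rule for replacement strings in UTF mode can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used by <b>pcre2test</b>: the function name and the <i>width</i> parameter (8, 16, or 32, standing for the code unit width of the library under test) are hypothetical, and Python's own strict UTF-8 decoder stands in for the validity check.

```python
def convert_replacement(raw, width):
    """Return the list of code units for a replacement string.

    raw   -- the replacement string as bytes, as read from the input
    width -- code unit width of the library under test: 8, 16 or 32
    """
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16 or 32")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: copy the individual code units unchanged,
        # which allows an invalid string to be passed for testing.
        return list(raw)
    if width == 8:
        return list(raw)
    if width == 16:
        data = text.encode("utf-16-le")
        return [int.from_bytes(data[i:i + 2], "little")
                for i in range(0, len(data), 2)]
    # 32-bit code units: one unit per code point.
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    print(convert_replacement("\u00e9".encode("utf-8"), 16))
```

With a 16-bit width, a character outside the Basic Multilingual Plane becomes a surrogate pair, while an input such as the bytes 0xff 0x41 is not valid UTF-8 and is passed through as the two code units 0xff and 0x41.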
1185</P> 1186<P> 1187The following modifiers set options (in addition to the normal match options) 1188for <b>pcre2_substitute()</b>: 1189<pre> 1190 global PCRE2_SUBSTITUTE_GLOBAL 1191 substitute_extended PCRE2_SUBSTITUTE_EXTENDED 1192 substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH 1193 substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET 1194 substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY 1195 1196</PRE> 1197</P> 1198<P> 1199After a successful substitution, the modified string is output, preceded by the 1200number of replacements. This may be zero if there were no matches. Here is a 1201simple example of a substitution test: 1202<pre> 1203 /abc/replace=xxx 1204 =abc=abc= 1205 1: =xxx=abc= 1206 =abc=abc=\=global 1207 2: =xxx=xxx= 1208</pre> 1209Subject and replacement strings should be kept relatively short (fewer than 256 1210characters) for substitution tests, as fixed-size buffers are used. To make it 1211easy to test for buffer overflow, if the replacement string starts with a 1212number in square brackets, that number is passed to <b>pcre2_substitute()</b> as 1213the size of the output buffer, with the replacement string starting at the next 1214character. Here is an example that tests the edge case: 1215<pre> 1216 /abc/ 1217 123abc123\=replace=[10]XYZ 1218 1: 123XYZ123 1219 123abc123\=replace=[9]XYZ 1220 Failed: error -47: no more memory 1221</pre> 1222The default action of <b>pcre2_substitute()</b> is to return 1223PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if the 1224PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the 1225<b>substitute_overflow_length</b> modifier), <b>pcre2_substitute()</b> continues 1226to go through the motions of matching and substituting, in order to compute the 1227size of buffer that is required. When this happens, <b>pcre2test</b> shows the 1228required buffer length (which includes space for the trailing zero) as part of 1229the error message.
For example: 1230<pre> 1231 /abc/substitute_overflow_length 1232 123abc123\=replace=[9]XYZ 1233 Failed: error -47: no more memory: 10 code units are needed 1234</pre> 1235A replacement string is ignored with POSIX and DFA matching. Specifying partial 1236matching provokes an error return ("bad option value") from 1237<b>pcre2_substitute()</b>. 1238</P> 1239<br><b> 1240Setting the JIT stack size 1241</b><br> 1242<P> 1243The <b>jitstack</b> modifier provides a way of setting the maximum stack size 1244that is used by the just-in-time optimization code. It is ignored if JIT 1245optimization is not being used. The value is a number of kilobytes. Providing a 1246stack that is larger than the default 32K is necessary only for very 1247complicated patterns. 1248</P> 1249<br><b> 1250Setting match and recursion limits 1251</b><br> 1252<P> 1253The <b>match_limit</b> and <b>recursion_limit</b> modifiers set the appropriate 1254limits in the match context. These values are ignored when the 1255<b>find_limits</b> modifier is specified. 1256</P> 1257<br><b> 1258Finding minimum limits 1259</b><br> 1260<P> 1261If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls 1262<b>pcre2_match()</b> several times, setting different values in the match 1263context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_recursion_limit()</b> 1264until it finds the minimum values for each parameter that allow 1265<b>pcre2_match()</b> to complete without error. 1266</P> 1267<P> 1268If JIT is being used, only the match limit is relevant. If DFA matching is 1269being used, neither limit is relevant, and this modifier is ignored (with a 1270warning message). 1271</P> 1272<P> 1273The <i>match_limit</i> number is a measure of the amount of backtracking 1274that takes place, and learning the minimum value can be instructive. 
For most 1275simple matches, the number is quite small, but for patterns with very large 1276numbers of matching possibilities, it can become large very quickly with 1277increasing length of subject string. The <i>recursion_limit</i> number is 1278a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much 1279heap) memory is needed to complete the match attempt. 1280</P> 1281<br><b> 1282Showing MARK names 1283</b><br> 1284<P> 1285The <b>mark</b> modifier causes the names from backtracking control verbs that 1286are returned from calls to <b>pcre2_match()</b> to be displayed. If a mark is 1287returned for a match, non-match, or partial match, <b>pcre2test</b> shows it. 1288For a match, it is on a line by itself, tagged with "MK:". Otherwise, it 1289is added to the non-match message. 1290</P> 1291<br><b> 1292Showing memory usage 1293</b><br> 1294<P> 1295The <b>memory</b> modifier causes <b>pcre2test</b> to log all memory allocation 1296and freeing calls that occur during a match operation. 1297</P> 1298<br><b> 1299Setting a starting offset 1300</b><br> 1301<P> 1302The <b>offset</b> modifier sets an offset in the subject string at which 1303matching starts. Its value is a number of code units, not characters. 1304</P> 1305<br><b> 1306Setting an offset limit 1307</b><br> 1308<P> 1309The <b>offset_limit</b> modifier sets a limit for unanchored matches. If a match 1310cannot be found starting at or before this offset in the subject, a "no match" 1311return is given. The data value is a number of code units, not characters. When 1312this modifier is used, the <b>use_offset_limit</b> modifier must have been set 1313for the pattern; if not, an error is generated. 1314</P> 1315<br><b> 1316Setting the size of the output vector 1317</b><br> 1318<P> 1319The <b>ovector</b> modifier applies only to the subject line in which it 1320appears, though of course it can also be used to set a default in a 1321<b>#subject</b> command.
It specifies the number of pairs of offsets that are 1322available for storing matching information. The default is 15. 1323</P> 1324<P> 1325A value of zero is useful when testing the POSIX API because it causes 1326<b>regexec()</b> to be called with a NULL capture vector. When not testing the 1327POSIX API, a value of zero is used to cause 1328<b>pcre2_match_data_create_from_pattern()</b> to be called, in order to create a 1329match block of exactly the right size for the pattern. (It is not possible to 1330create a match block with a zero-length ovector; there is always at least one 1331pair of offsets.) 1332</P> 1333<br><b> 1334Passing the subject as zero-terminated 1335</b><br> 1336<P> 1337By default, the subject string is passed to a native API matching function with 1338its correct length. In order to test the facility for passing a zero-terminated 1339string, the <b>zero_terminate</b> modifier is provided. It causes the length to 1340be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface, 1341this modifier has no effect, as there is no facility for passing a length.) 1342</P> 1343<P> 1344When testing <b>pcre2_substitute()</b>, this modifier also has the effect of 1345passing the replacement string as zero-terminated. 1346</P> 1347<br><b> 1348Passing a NULL context 1349</b><br> 1350<P> 1351Normally, <b>pcre2test</b> passes a context block to <b>pcre2_match()</b>, 1352<b>pcre2_dfa_match()</b> or <b>pcre2_jit_match()</b>. If the <b>null_context</b> 1353modifier is set, however, NULL is passed. This is for testing that the matching 1354functions behave correctly in this case (they use default values). This 1355modifier cannot be used with the <b>find_limits</b> modifier or when testing the 1356substitution function. 1357</P> 1358<br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br> 1359<P> 1360By default, <b>pcre2test</b> uses the standard PCRE2 matching function, 1361<b>pcre2_match()</b> to match each subject line. 
PCRE2 also supports an 1362alternative matching function, <b>pcre2_dfa_match()</b>, which operates in a 1363different way, and has some restrictions. The differences between the two 1364functions are described in the 1365<a href="pcre2matching.html"><b>pcre2matching</b></a> 1366documentation. 1367</P> 1368<P> 1369If the <b>dfa</b> modifier is set, the alternative matching function is used. 1370This function finds all possible matches at a given point in the subject. If, 1371however, the <b>dfa_shortest</b> modifier is set, processing stops after the 1372first match is found. This is always the shortest possible match. 1373</P> 1374<br><a name="SEC13" href="#TOC1">DEFAULT OUTPUT FROM pcre2test</a><br> 1375<P> 1376This section describes the output when the normal matching function, 1377<b>pcre2_match()</b>, is being used. 1378</P> 1379<P> 1380When a match succeeds, <b>pcre2test</b> outputs the list of captured substrings, 1381starting with number 0 for the string that matched the whole pattern. 1382Otherwise, it outputs "No match" when the return is PCRE2_ERROR_NOMATCH, or 1383"Partial match:" followed by the partially matching substring when the 1384return is PCRE2_ERROR_PARTIAL. (Note that this is the 1385entire substring that was inspected during the partial match; it may include 1386characters before the actual match start if a lookbehind assertion, \K, \b, 1387or \B was involved.) 1388</P> 1389<P> 1390For any other return, <b>pcre2test</b> outputs the PCRE2 negative error number 1391and a short descriptive phrase. If the error is a failed UTF string check, the 1392code unit offset of the start of the failing character is also output. Here is 1393an example of an interactive <b>pcre2test</b> run. 
1394<pre> 1395 $ pcre2test 1396 PCRE2 version 9.00 2014-05-10 1397 1398 re> /^abc(\d+)/ 1399 data> abc123 1400 0: abc123 1401 1: 123 1402 data> xyz 1403 No match 1404</pre> 1405Unset capturing substrings that are not followed by one that is set are not 1406shown by <b>pcre2test</b> unless the <b>allcaptures</b> modifier is specified. In 1407the following example, there are two capturing substrings, but when the first 1408data line is matched, the second, unset substring is not shown. An "internal" 1409unset substring is shown as "<unset>", as for the second data line. 1410<pre> 1411 re> /(a)|(b)/ 1412 data> a 1413 0: a 1414 1: a 1415 data> b 1416 0: b 1417 1: <unset> 1418 2: b 1419</pre> 1420If the strings contain any non-printing characters, they are output as \xhh 1421escapes if the value is less than 256 and UTF mode is not set. Otherwise they 1422are output as \x{hh...} escapes. See below for the definition of non-printing 1423characters. If the <b>/aftertext</b> modifier is set, the output for substring 14240 is followed by the rest of the subject string, identified by "0+" like 1425this: 1426<pre> 1427 re> /cat/aftertext 1428 data> cataract 1429 0: cat 1430 0+ aract 1431</pre> 1432If global matching is requested, the results of successive matching attempts 1433are output in sequence, like this: 1434<pre> 1435 re> /\Bi(\w\w)/g 1436 data> Mississippi 1437 0: iss 1438 1: ss 1439 0: iss 1440 1: ss 1441 0: ipp 1442 1: pp 1443</pre> 1444"No match" is output only if the first match attempt fails. Here is an example 1445of a failure message (the offset 4 that is specified by the <b>offset</b> 1446modifier is past the end of the subject string): 1447<pre> 1448 re> /xyz/ 1449 data> xyz\=offset=4 1450 Error -24 (bad offset value) 1451</PRE> 1452</P> 1453<P> 1454Note that whereas patterns can be continued over several lines (a plain ">" 1455prompt is used for continuations), subject lines may not.
However, newlines can
be included in a subject by means of the \n escape (or \r, \r\n, etc.,
depending on the newline sequence setting).
</P>
<br><a name="SEC14" href="#TOC1">OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION</a><br>
<P>
When the alternative matching function, <b>pcre2_dfa_match()</b>, is used, the
output consists of a list of all the matches that start at the first point in
the subject where there is at least one match. For example:
<pre>
    re> /(tang|tangerine|tan)/
  data> yellow tangerine\=dfa
   0: tangerine
   1: tang
   2: tan
</pre>
Using the normal matching function on this data finds only "tang". The
longest matching string is always given first (and numbered zero). After a
PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the
partially matching substring. Note that this is the entire substring that was
inspected during the partial match; it may include characters before the actual
match start if a lookbehind assertion, \b, or \B was involved. (\K is not
supported for DFA matching.)
</P>
<P>
If global matching is requested, the search for further matches resumes
at the end of the longest match. For example:
<pre>
    re> /(tang|tangerine|tan)/g
  data> yellow tangerine and tangy sultana\=dfa
   0: tangerine
   1: tang
   2: tan
   0: tang
   1: tan
   0: tan
</pre>
The alternative matching function does not support substring capture, so the
modifiers that are concerned with captured substrings are not relevant.
</P>
<br><a name="SEC15" href="#TOC1">RESTARTING AFTER A PARTIAL MATCH</a><br>
<P>
When the alternative matching function has given the PCRE2_ERROR_PARTIAL
return, indicating that the subject partially matched the pattern, you can
restart the match with additional subject data by means of the
<b>dfa_restart</b> modifier.
For example:
<pre>
    re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data> 23ja\=P,dfa
  Partial match: 23ja
  data> n05\=dfa,dfa_restart
   0: n05
</pre>
For further information about partial matching, see the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
</P>
<br><a name="SEC16" href="#TOC1">CALLOUTS</a><br>
<P>
If the pattern contains any callout requests, <b>pcre2test</b>'s callout
function is called during matching unless <b>callout_none</b> is specified.
This works with both matching functions.
</P>
<P>
The callout function in <b>pcre2test</b> returns zero (carry on matching) by
default, but you can use a <b>callout_fail</b> modifier in a subject line (as
described above) to change this and other parameters of the callout.
</P>
<P>
Inserting callouts can be helpful when using <b>pcre2test</b> to check
complicated regular expressions. For further information about callouts, see
the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
</P>
<P>
The output for callouts with numerical arguments and those with string
arguments is slightly different.
</P>
<br><b>
Callouts with numerical arguments
</b><br>
<P>
By default, the callout function displays the callout number, the start and
current positions in the subject text at the callout time, and the next pattern
item to be tested. For example:
<pre>
  --->pqrabcdef
    0    ^  ^     \d
</pre>
This output indicates that callout number 0 occurred for a match attempt
starting at the fourth character of the subject string, when the pointer was at
the seventh character, and when the next pattern item was \d.
Just
one circumflex is output if the start and current positions are the same, or if
the current position precedes the start position, which can happen if the
callout is in a lookbehind assertion.
</P>
<P>
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
result of the <b>/auto_callout</b> pattern modifier. In this case, instead of
showing the callout number, the offset in the pattern, preceded by a plus, is
output. For example:
<pre>
    re> /\d?[A-E]\*/auto_callout
  data> E*
  --->E*
   +0 ^      \d?
   +3 ^      [A-E]
   +8 ^^     \*
  +10 ^ ^
   0: E*
</pre>
If a pattern contains (*MARK) items, an additional line is output whenever
a change of latest mark is passed to the callout function. For example:
<pre>
    re> /a(*MARK:X)bc/auto_callout
  data> abc
  --->abc
   +0 ^       a
   +1 ^^      (*MARK:X)
  +10 ^^      b
  Latest Mark: X
  +11 ^ ^     c
  +12 ^  ^
   0: abc
</pre>
The mark changes between matching "a" and "b", but stays the same for the rest
of the match, so nothing more is output. If, as a result of backtracking, the
mark reverts to being unset, the text "&lt;unset&gt;" is output.
</P>
<br><b>
Callouts with string arguments
</b><br>
<P>
The output for a callout with a string argument is similar, except that instead
of outputting a callout number before the position indicators, the callout
string and its offset in the pattern string are output before the reflection of
the subject string, and the subject string is reflected for each callout.
For
example:
<pre>
    re> /^ab(?C'first')cd(?C"second")ef/
  data> abcdefg
  Callout (7): 'first'
  --->abcdefg
      ^ ^         c
  Callout (20): "second"
  --->abcdefg
      ^   ^       e
   0: abcdef

</PRE>
</P>
<br><a name="SEC17" href="#TOC1">NON-PRINTING CHARACTERS</a><br>
<P>
When <b>pcre2test</b> is outputting text in the compiled version of a pattern,
bytes other than 32-126 are always treated as non-printing characters and are
therefore shown as hex escapes.
</P>
<P>
When <b>pcre2test</b> is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been set for
the pattern (using the <b>/locale</b> modifier). In this case, the
<b>isprint()</b> function is used to distinguish printing and non-printing
characters.
<a name="saverestore"></a></P>
<br><a name="SEC18" href="#TOC1">SAVING AND RESTORING COMPILED PATTERNS</a><br>
<P>
It is possible to save compiled patterns on disc or elsewhere, and reload them
later, subject to a number of restrictions. JIT data cannot be saved. The host
on which the patterns are reloaded must be running the same version of PCRE2,
with the same code unit width, and must also have the same endianness, pointer
width and PCRE2_SIZE type. Before compiled patterns can be saved they must be
serialized, that is, converted to a stream of bytes. A single byte stream may
contain any number of compiled patterns, but they must all use the same
character tables. A single copy of the tables is included in the byte stream
(its size is 1088 bytes).
</P>
<P>
The functions whose names begin with <b>pcre2_serialize_</b> are used
for serializing and de-serializing. They are described in the
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
documentation.
In this section we describe the features of <b>pcre2test</b> that
can be used to test these functions.
</P>
<P>
When a pattern with the <b>push</b> modifier is successfully compiled, it is pushed
onto a stack of compiled patterns, and <b>pcre2test</b> expects the next line to
contain a new pattern (or command) instead of a subject line. By contrast,
the <b>pushcopy</b> modifier causes a copy of the compiled pattern to be
stacked, leaving the original available for immediate matching. By using
<b>push</b> and/or <b>pushcopy</b>, a number of patterns can be compiled and
retained. These modifiers are incompatible with <b>posix</b>, and control
modifiers that act at match time are ignored (with a message) for the stacked
patterns. The <b>jitverify</b> modifier applies only at compile time.
</P>
<P>
The command
<pre>
  #save &lt;filename&gt;
</pre>
causes all the stacked patterns to be serialized and the result written to the
named file. Afterwards, all the stacked patterns are freed. The command
<pre>
  #load &lt;filename&gt;
</pre>
reads the data in the file, and then arranges for it to be de-serialized, with
the resulting compiled patterns added to the pattern stack. The pattern on the
top of the stack can be retrieved by the #pop command, which must be followed
by lines of subjects that are to be matched with the pattern, terminated as
usual by an empty line or end of file. This command may be followed by a
modifier list containing only
<a href="#controlmodifiers">control modifiers</a>
that act after a pattern has been compiled. In particular, <b>hex</b>,
<b>posix</b>, <b>posix_nosub</b>, <b>push</b>, and <b>pushcopy</b> are not allowed,
nor are any
<a href="#optionmodifiers">option-setting modifiers.</a>
The JIT modifiers are, however, permitted. Here is an example that saves and
reloads two patterns.
<pre>
  /abc/push
  /xyz/push
  #save tempfile
  #load tempfile
  #pop info
  xyz

  #pop jit,bincode
  abc
</pre>
If <b>jitverify</b> is used with #pop, it does not automatically imply
<b>jit</b>, which is different behaviour from when it is used on a pattern.
</P>
<P>
The #popcopy command is analogous to the <b>pushcopy</b> modifier in that it
makes current a copy of the topmost stack pattern, leaving the original still
on the stack.
</P>
<br><a name="SEC19" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2jit</b>(3), <b>pcre2matching</b>(3), <b>pcre2partial</b>(3),
<b>pcre2pattern</b>(3), <b>pcre2serialize</b>(3).
</P>
<br><a name="SEC20" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 06 July 2016
<br>
Copyright © 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>