1<html>
2<head>
3<title>pcre2test specification</title>
4</head>
5<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6<h1>pcre2test man page</h1>
7<p>
8Return to the <a href="index.html">PCRE2 index page</a>.
9</p>
10<p>
11This page is part of the PCRE2 HTML documentation. It was generated
12automatically from the original man page. If there is any nonsense in it,
13please consult the man page, in case the conversion went wrong.
14<br>
15<ul>
16<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
17<li><a name="TOC2" href="#SEC2">PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
18<li><a name="TOC3" href="#SEC3">INPUT ENCODING</a>
19<li><a name="TOC4" href="#SEC4">COMMAND LINE OPTIONS</a>
20<li><a name="TOC5" href="#SEC5">DESCRIPTION</a>
21<li><a name="TOC6" href="#SEC6">COMMAND LINES</a>
22<li><a name="TOC7" href="#SEC7">MODIFIER SYNTAX</a>
23<li><a name="TOC8" href="#SEC8">PATTERN SYNTAX</a>
24<li><a name="TOC9" href="#SEC9">SUBJECT LINE SYNTAX</a>
25<li><a name="TOC10" href="#SEC10">PATTERN MODIFIERS</a>
26<li><a name="TOC11" href="#SEC11">SUBJECT MODIFIERS</a>
27<li><a name="TOC12" href="#SEC12">THE ALTERNATIVE MATCHING FUNCTION</a>
28<li><a name="TOC13" href="#SEC13">DEFAULT OUTPUT FROM pcre2test</a>
29<li><a name="TOC14" href="#SEC14">OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION</a>
30<li><a name="TOC15" href="#SEC15">RESTARTING AFTER A PARTIAL MATCH</a>
31<li><a name="TOC16" href="#SEC16">CALLOUTS</a>
32<li><a name="TOC17" href="#SEC17">NON-PRINTING CHARACTERS</a>
33<li><a name="TOC18" href="#SEC18">SAVING AND RESTORING COMPILED PATTERNS</a>
34<li><a name="TOC19" href="#SEC19">SEE ALSO</a>
35<li><a name="TOC20" href="#SEC20">AUTHOR</a>
36<li><a name="TOC21" href="#SEC21">REVISION</a>
37</ul>
38<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
39<P>
40<b>pcre2test [options] [input file [output file]]</b>
41<br>
42<br>
43<b>pcre2test</b> is a test program for the PCRE2 regular expression libraries,
44but it can also be used for experimenting with regular expressions. This
45document describes the features of the test program; for details of the regular
46expressions themselves, see the
47<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
48documentation. For details of the PCRE2 library function calls and their
49options, see the
50<a href="pcre2api.html"><b>pcre2api</b></a>
51documentation.
52</P>
53<P>
54The input for <b>pcre2test</b> is a sequence of regular expression patterns and
55subject strings to be matched. There are also command lines for setting
56defaults and controlling some special actions. The output shows the result of
57each match attempt. Modifiers on external or internal command lines, the
58patterns, and the subject lines specify PCRE2 function options, control how the
59subject is processed, and what output is produced.
60</P>
61<P>
62As the original fairly simple PCRE library evolved, it acquired many different
63features, and as a result, the original <b>pcretest</b> program ended up with a
64lot of options in a messy, arcane syntax, for testing all the features. The
65move to the new PCRE2 API provided an opportunity to re-implement the test
66program as <b>pcre2test</b>, with a cleaner modifier syntax. Nevertheless, there
67are still many obscure modifiers, some of which are specifically designed for
68use in conjunction with the test script and data files that are distributed as
69part of PCRE2. All the modifiers are documented here, some without much
70justification, but many of them are unlikely to be of use except when testing
71the libraries.
72</P>
73<br><a name="SEC2" href="#TOC1">PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a><br>
74<P>
75Different versions of the PCRE2 library can be built to support character
76strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or
77all three of these libraries may be simultaneously installed. The
78<b>pcre2test</b> program can be used to test all the libraries. However, its own
79input and output are always in 8-bit format. When testing the 16-bit or 32-bit
80libraries, patterns and subject strings are converted to 16- or 32-bit format
81before being passed to the library functions. Results are converted back to
828-bit code units for output.
83</P>
84<P>
85In the rest of this document, the names of library functions and structures
are given in generic form, for example, <b>pcre2_compile()</b>. The actual
87names used in the libraries have a suffix _8, _16, or _32, as appropriate.
88</P>
89<br><a name="SEC3" href="#TOC1">INPUT ENCODING</a><br>
90<P>
91Input to <b>pcre2test</b> is processed line by line, either by calling the C
92library's <b>fgets()</b> function, or via the <b>libreadline</b> library (see
below). The input is processed using C's string functions, so must not
94contain binary zeroes, even though in Unix-like environments, <b>fgets()</b>
95treats any bytes other than newline as data characters. In some Windows
96environments character 26 (hex 1A) causes an immediate end of file, and no
97further data is read.
98</P>
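<P>
As a rough illustration of the two hazards named above, the following Python
sketch scans the raw bytes of an input file for binary zeros and for character
26 (hex 1A). The function name <b>find_unsafe_bytes</b> is hypothetical; it is
not part of PCRE2 or of <b>pcre2test</b>.
</P>

```python
def find_unsafe_bytes(data):
    """Return (offset, value) pairs for bytes that can upset line-based input."""
    # 0x00 ends a C string early; 0x1A is treated as end of file in some
    # Windows environments.
    unsafe = {0x00, 0x1A}
    return [(offset, value) for offset, value in enumerate(data) if value in unsafe]
```

<P>
A test file could be checked with something like
find_unsafe_bytes(open(name, "rb").read()) before it is passed to the program.
</P>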
99<P>
100For maximum portability, therefore, it is safest to avoid non-printing
101characters in <b>pcre2test</b> input files. There is a facility for specifying
102some or all of a pattern's characters as hexadecimal pairs, thus making it
103possible to include binary zeroes in a pattern for testing purposes. Subject
104lines are processed for backslash escapes, which makes it possible to include
105any data value.
106</P>
107<br><a name="SEC4" href="#TOC1">COMMAND LINE OPTIONS</a><br>
108<P>
109<b>-8</b>
110If the 8-bit library has been built, this option causes it to be used (this is
111the default). If the 8-bit library has not been built, this option causes an
112error.
113</P>
114<P>
115<b>-16</b>
116If the 16-bit library has been built, this option causes it to be used. If only
117the 16-bit library has been built, this is the default. If the 16-bit library
118has not been built, this option causes an error.
119</P>
120<P>
121<b>-32</b>
122If the 32-bit library has been built, this option causes it to be used. If only
123the 32-bit library has been built, this is the default. If the 32-bit library
124has not been built, this option causes an error.
125</P>
126<P>
127<b>-b</b>
128Behave as if each pattern has the <b>/fullbincode</b> modifier; the full
129internal binary form of the pattern is output after compilation.
130</P>
131<P>
132<b>-C</b>
133Output the version number of the PCRE2 library, and all available information
134about the optional features that are included, and then exit with zero exit
135code. All other options are ignored.
136</P>
137<P>
138<b>-C</b> <i>option</i>
139Output information about a specific build-time option, then exit. This
140functionality is intended for use in scripts such as <b>RunTest</b>. The
141following options output the value and set the exit code as indicated:
142<pre>
143  ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
144               0x15 or 0x25
145               0 if used in an ASCII environment
146               exit code is always 0
147  linksize   the configured internal link size (2, 3, or 4)
148               exit code is set to the link size
149  newline    the default newline setting:
150               CR, LF, CRLF, ANYCRLF, or ANY
151               exit code is always 0
152  bsr        the default setting for what \R matches:
153               ANYCRLF or ANY
154               exit code is always 0
155</pre>
156The following options output 1 for true or 0 for false, and set the exit code
157to the same value:
158<pre>
159  backslash-C  \C is supported (not locked out)
160  ebcdic       compiled for an EBCDIC environment
161  jit          just-in-time support is available
162  pcre2-16     the 16-bit library was built
163  pcre2-32     the 32-bit library was built
164  pcre2-8      the 8-bit library was built
165  unicode      Unicode support is available
166</pre>
167If an unknown option is given, an error message is output; the exit code is 0.
168</P>
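<P>
A script might use the true/false options along the following lines. This is
only a sketch: it assumes that a program called <b>pcre2test</b> can be found on
the search path, and the helper name <b>has_feature</b> and its <i>run</i>
parameter (which allows a substitute for <b>subprocess.run</b> to be supplied)
are invented for the example.
</P>

```python
import subprocess

def has_feature(option, run=subprocess.run, program="pcre2test"):
    """Return True if `pcre2test -C <option>` reports the feature as present.

    For the true/false options (jit, unicode, pcre2-16, ...) the program
    outputs 1 or 0 and sets the exit code to the same value.
    """
    result = run([program, "-C", option], capture_output=True, text=True)
    return result.returncode == 1
```

<P>
Because an unknown option also yields an exit code of 0, this helper cannot
distinguish "not available" from "not recognized".
</P>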
169<P>
170<b>-d</b>
171Behave as if each pattern has the <b>debug</b> modifier; the internal
172form and information about the compiled pattern is output after compilation;
173<b>-d</b> is equivalent to <b>-b -i</b>.
174</P>
175<P>
176<b>-dfa</b>
177Behave as if each subject line has the <b>dfa</b> modifier; matching is done
178using the <b>pcre2_dfa_match()</b> function instead of the default
179<b>pcre2_match()</b>.
180</P>
181<P>
182<b>-error</b> <i>number[,number,...]</i>
183Call <b>pcre2_get_error_message()</b> for each of the error numbers in the
184comma-separated list, display the resulting messages on the standard output,
185then exit with zero exit code. The numbers may be positive or negative. This is
186a convenience facility for PCRE2 maintainers.
187</P>
188<P>
189<b>-help</b>
Output a brief summary of these options and then exit.
191</P>
192<P>
193<b>-i</b>
194Behave as if each pattern has the <b>/info</b> modifier; information about the
195compiled pattern is given after compilation.
196</P>
197<P>
198<b>-jit</b>
199Behave as if each pattern line has the <b>jit</b> modifier; after successful
200compilation, each pattern is passed to the just-in-time compiler, if available.
201</P>
202<P>
<b>-pattern</b> <i>modifier-list</i>
204Behave as if each pattern line contains the given modifiers.
205</P>
206<P>
207<b>-q</b>
208Do not output the version number of <b>pcre2test</b> at the start of execution.
209</P>
210<P>
211<b>-S</b> <i>size</i>
212On Unix-like systems, set the size of the run-time stack to <i>size</i>
213megabytes.
214</P>
215<P>
216<b>-subject</b> <i>modifier-list</i>
217Behave as if each subject line contains the given modifiers.
218</P>
219<P>
220<b>-t</b>
221Run each compile and match many times with a timer, and output the resulting
222times per compile or match. When JIT is used, separate times are given for the
223initial compile and the JIT compile. You can control the number of iterations
224that are used for timing by following <b>-t</b> with a number (as a separate
225item on the command line). For example, "-t 1000" iterates 1000 times. The
226default is to iterate 500,000 times.
227</P>
228<P>
229<b>-tm</b>
230This is like <b>-t</b> except that it times only the matching phase, not the
231compile phase.
232</P>
233<P>
234<b>-T</b> <b>-TM</b>
235These behave like <b>-t</b> and <b>-tm</b>, but in addition, at the end of a run,
236the total times for all compiles and matches are output.
237</P>
238<P>
239<b>-version</b>
240Output the PCRE2 version number and then exit.
241</P>
242<br><a name="SEC5" href="#TOC1">DESCRIPTION</a><br>
243<P>
244If <b>pcre2test</b> is given two filename arguments, it reads from the first and
245writes to the second. If the first name is "-", input is taken from the
246standard input. If <b>pcre2test</b> is given only one argument, it reads from
247that file and writes to stdout. Otherwise, it reads from stdin and writes to
248stdout.
249</P>
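<P>
The rules for filename arguments can be modelled by the short Python sketch
below. The function name <b>resolve_streams</b> is hypothetical, None stands for
the corresponding standard stream, and treating "-" as standard input when it
is the only filename is an assumption of the sketch.
</P>

```python
def resolve_streams(filenames):
    """Map zero, one, or two filename arguments to (input, output) names."""
    infile = filenames[0] if len(filenames) >= 1 else None
    outfile = filenames[1] if len(filenames) >= 2 else None
    if infile == "-":
        infile = None            # "-" selects the standard input
    return infile, outfile
```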
250<P>
251When <b>pcre2test</b> is built, a configuration option can specify that it
252should be linked with the <b>libreadline</b> or <b>libedit</b> library. When this
253is done, if the input is from a terminal, it is read using the <b>readline()</b>
254function. This provides line-editing and history facilities. The output from
255the <b>-help</b> option states whether or not <b>readline()</b> will be used.
256</P>
257<P>
258The program handles any number of tests, each of which consists of a set of
259input lines. Each set starts with a regular expression pattern, followed by any
260number of subject lines to be matched against that pattern. In between sets of
261test data, command lines that begin with # may appear. This file format, with
262some restrictions, can also be processed by the <b>perltest.sh</b> script that
263is distributed with PCRE2 as a means of checking that the behaviour of PCRE2
264and Perl is the same.
265</P>
266<P>
267When the input is a terminal, <b>pcre2test</b> prompts for each line of input,
268using "re&#62;" to prompt for regular expression patterns, and "data&#62;" to prompt
269for subject lines. Command lines starting with # can be entered only in
270response to the "re&#62;" prompt.
271</P>
272<P>
273Each subject line is matched separately and independently. If you want to do
274multi-line matches, you have to use the \n escape sequence (or \r or \r\n,
275etc., depending on the newline setting) in a single line of input to encode the
276newline sequences. There is no limit on the length of subject lines; the input
277buffer is automatically extended if it is too small. There are replication
features that make it possible to generate long repetitive pattern or subject
279lines without having to supply them explicitly.
280</P>
281<P>
282An empty line or the end of the file signals the end of the subject lines for a
283test, at which point a new pattern or command line is expected if there is
284still input to be read.
285</P>
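<P>
The overall shape of an input file can be pictured with the following Python
sketch, which groups lines into tests. It is a simplification and not the
program's own logic: it assumes that every pattern fits on one line and that
empty lines are skipped while a pattern is expected. The name
<b>split_tests</b> is hypothetical.
</P>

```python
def split_tests(lines):
    """Group input lines into (pattern_line, [subject_lines]) tuples.

    Lines that start with '#' between tests are returned separately as
    command (or comment) lines.
    """
    tests, commands = [], []
    pattern, subjects = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if pattern is None:
            if line == "":
                continue                      # nothing to do between tests
            if line.startswith("#"):
                commands.append(line)         # command or comment line
            else:
                pattern, subjects = line, []  # a new test starts here
        elif line == "":
            tests.append((pattern, subjects)) # an empty line ends the test
            pattern = None
        else:
            subjects.append(line)
    if pattern is not None:                   # end of file also ends the test
        tests.append((pattern, subjects))
    return tests, commands
```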
286<br><a name="SEC6" href="#TOC1">COMMAND LINES</a><br>
287<P>
288In between sets of test data, a line that begins with # is interpreted as a
289command line. If the first character is followed by white space or an
290exclamation mark, the line is treated as a comment, and ignored. Otherwise, the
291following commands are recognized:
292<pre>
293  #forbid_utf
294</pre>
295Subsequent patterns automatically have the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP
296options set, which locks out the use of the PCRE2_UTF and PCRE2_UCP options and
297the use of (*UTF) and (*UCP) at the start of patterns. This command also forces
298an error if a subsequent pattern contains any occurrences of \P, \p, or \X,
299which are still supported when PCRE2_UTF is not set, but which require Unicode
300property support to be included in the library.
301</P>
302<P>
303This is a trigger guard that is used in test files to ensure that UTF or
304Unicode property tests are not accidentally added to files that are used when
305Unicode support is not included in the library. Setting PCRE2_NEVER_UTF and
306PCRE2_NEVER_UCP as a default can also be obtained by the use of <b>#pattern</b>;
307the difference is that <b>#forbid_utf</b> cannot be unset, and the automatic
308options are not displayed in pattern information, to avoid cluttering up test
309output.
310<pre>
311  #load &#60;filename&#62;
312</pre>
313This command is used to load a set of precompiled patterns from a file, as
314described in the section entitled "Saving and restoring compiled patterns"
315<a href="#saverestore">below.</a>
316<pre>
317  #newline_default [&#60;newline-list&#62;]
318</pre>
319When PCRE2 is built, a default newline convention can be specified. This
320determines which characters and/or character pairs are recognized as indicating
321a newline in a pattern or subject string. The default can be overridden when a
322pattern is compiled. The standard test files contain tests of various newline
323conventions, but the majority of the tests expect a single linefeed to be
324recognized as a newline by default. Without special action the tests would fail
325when PCRE2 is compiled with either CR or CRLF as the default newline.
326</P>
327<P>
328The #newline_default command specifies a list of newline types that are
329acceptable as the default. The types must be one of CR, LF, CRLF, ANYCRLF, or
330ANY (in upper or lower case), for example:
331<pre>
332  #newline_default LF Any anyCRLF
333</pre>
334If the default newline is in the list, this command has no effect. Otherwise,
335except when testing the POSIX API, a <b>newline</b> modifier that specifies the
336first newline convention in the list (LF in the above example) is added to any
337pattern that does not already have a <b>newline</b> modifier. If the newline
338list is empty, the feature is turned off. This command is present in a number
339of the standard test input files.
340</P>
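<P>
The decision described above can be summarized in a few lines of Python. This
is a sketch of the documented behaviour, not the code used by <b>pcre2test</b>;
the function name and its parameters are hypothetical, and pattern modifiers
are represented simply as a list of strings.
</P>

```python
VALID_NEWLINES = ("CR", "LF", "CRLF", "ANYCRLF", "ANY")

def implied_newline_modifier(build_default, acceptable, pattern_modifiers, posix=False):
    """Return the newline=... modifier to add to a pattern, or None."""
    accepted = [name.upper() for name in acceptable]
    for name in accepted:
        if name not in VALID_NEWLINES:
            raise ValueError("unknown newline type: " + name)
    if not accepted:                        # an empty list turns the feature off
        return None
    if build_default.upper() in accepted:   # the build default is acceptable
        return None
    if posix:                               # no override when testing the POSIX API
        return None
    if any(m.startswith("newline=") for m in pattern_modifiers):
        return None                         # the pattern already sets a newline
    return "newline=" + accepted[0].lower() # first convention in the list
```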
341<P>
342When the POSIX API is being tested there is no way to override the default
343newline convention, though it is possible to set the newline convention from
344within the pattern. A warning is given if the <b>posix</b> modifier is used when
345<b>#newline_default</b> would set a default for the non-POSIX API.
346<pre>
347  #pattern &#60;modifier-list&#62;
348</pre>
349This command sets a default modifier list that applies to all subsequent
350patterns. Modifiers on a pattern can change these settings.
351<pre>
352  #perltest
353</pre>
354The appearance of this line causes all subsequent modifier settings to be
355checked for compatibility with the <b>perltest.sh</b> script, which is used to
356confirm that Perl gives the same results as PCRE2. Also, apart from comment
357lines, none of the other command lines are permitted, because they and many
358of the modifiers are specific to <b>pcre2test</b>, and should not be used in
359test files that are also processed by <b>perltest.sh</b>. The <b>#perltest</b>
360command helps detect tests that are accidentally put in the wrong file.
361<pre>
362  #pop [&#60;modifiers&#62;]
363  #popcopy [&#60;modifiers&#62;]
364</pre>
365These commands are used to manipulate the stack of compiled patterns, as
366described in the section entitled "Saving and restoring compiled patterns"
367<a href="#saverestore">below.</a>
368<pre>
369  #save &#60;filename&#62;
370</pre>
371This command is used to save a set of compiled patterns to a file, as described
372in the section entitled "Saving and restoring compiled patterns"
373<a href="#saverestore">below.</a>
374<pre>
375  #subject &#60;modifier-list&#62;
376</pre>
377This command sets a default modifier list that applies to all subsequent
378subject lines. Modifiers on a subject line can change these settings.
379</P>
380<br><a name="SEC7" href="#TOC1">MODIFIER SYNTAX</a><br>
381<P>
382Modifier lists are used with both pattern and subject lines. Items in a list
383are separated by commas followed by optional white space. Trailing whitespace
384in a modifier list is ignored. Some modifiers may be given for both patterns
385and subject lines, whereas others are valid only for one or the other. Each
386modifier has a long name, for example "anchored", and some of them must be
387followed by an equals sign and a value, for example, "offset=12". Values cannot
388contain comma characters, but may contain spaces. Modifiers that do not take
389values may be preceded by a minus sign to turn off a previous setting.
390</P>
391<P>
392A few of the more common modifiers can also be specified as single letters, for
393example "i" for "caseless". In documentation, following the Perl convention,
394these are written with a slash ("the /i modifier") for clarity. Abbreviated
395modifiers must all be concatenated in the first item of a modifier list. If the
396first item is not recognized as a long modifier name, it is interpreted as a
397sequence of these abbreviations. For example:
398<pre>
399  /abc/ig,newline=cr,jit=3
400</pre>
401This is a pattern line whose modifier list starts with two one-letter modifiers
402(/i and /g). The lower-case abbreviated modifiers are the same as used in Perl.
403</P>
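<P>
The following Python sketch shows one way the modifier list rules could be
applied. The two name tables are a small hypothetical subset chosen for the
example (in particular, the long name "global" for /g is an assumption here),
and <b>parse_modifiers</b> is not a function provided by PCRE2.
</P>

```python
# Hypothetical subset of long names and single-letter abbreviations.
LONG_NAMES = {"anchored", "caseless", "dotall", "extended", "multiline",
              "global", "info", "newline", "jit", "offset", "utf"}
ABBREVIATIONS = {"i": "caseless", "s": "dotall", "x": "extended",
                 "m": "multiline", "g": "global", "I": "info"}

def parse_modifiers(text):
    """Parse a modifier list into (name, value) pairs.

    value is a string for name=value items, True for a plain modifier,
    and False for a modifier switched off with a leading minus sign.
    """
    # Items are separated by commas; white space after a comma and at the
    # end of the list is ignored.
    items = [item.lstrip() for item in text.rstrip().split(",")]
    result = []
    for index, item in enumerate(items):
        name, sep, value = item.partition("=")
        negated = name.startswith("-")
        bare = name[1:] if negated else name
        if bare in LONG_NAMES:
            result.append((bare, value if sep else not negated))
        elif index == 0 and not sep and not negated:
            # Not a long name: read the first item as abbreviations.
            for letter in item:
                result.append((ABBREVIATIONS[letter], True))
        else:
            raise ValueError("unknown modifier: " + item)
    return result
```

<P>
With these tables, the list "ig,newline=cr,jit=3" from the example above yields
caseless, global, newline=cr, and jit=3.
</P>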
404<br><a name="SEC8" href="#TOC1">PATTERN SYNTAX</a><br>
405<P>
406A pattern line must start with one of the following characters (common symbols,
407excluding pattern meta-characters):
408<pre>
409  / ! " ' ` - = _ : ; , % & @ ~
410</pre>
411This is interpreted as the pattern's delimiter. A regular expression may be
412continued over several input lines, in which case the newline characters are
413included within it. It is possible to include the delimiter within the pattern
414by escaping it with a backslash, for example
415<pre>
416  /abc\/def/
417</pre>
418If you do this, the escape and the delimiter form part of the pattern, but
419since the delimiters are all non-alphanumeric, this does not affect its
420interpretation. If the terminating delimiter is immediately followed by a
421backslash, for example,
422<pre>
423  /abc/\
424</pre>
425then a backslash is added to the end of the pattern. This is done to provide a
426way of testing the error condition that arises if a pattern finishes with a
427backslash, because
428<pre>
429  /abc\/
430</pre>
431is interpreted as the first line of a pattern that starts with "abc/", causing
432pcre2test to read the next line as a continuation of the regular expression.
433</P>
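<P>
The delimiter rules can be sketched in Python as follows. The function name
<b>split_pattern_line</b> is hypothetical, and the sketch looks at a single
input line only: it returns None when no closing delimiter is found, which
corresponds to the pattern continuing on the next line.
</P>

```python
DELIMITERS = "/!\"'`-=_:;,%&@~"

def split_pattern_line(line):
    """Split one pattern line into (pattern, modifier_text), or return None."""
    delimiter = line[0]
    if delimiter not in DELIMITERS:
        raise ValueError("invalid pattern delimiter: " + delimiter)
    i = 1
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            i += 2                  # the escape and the next character both stay
            continue
        if line[i] == delimiter:
            pattern, rest = line[1:i], line[i + 1:]
            if rest.startswith("\\"):
                pattern += "\\"     # closing delimiter followed by a backslash
                rest = rest[1:]
            return pattern, rest
        i += 1
    return None                     # no closing delimiter on this line
```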
434<P>
435A pattern can be followed by a modifier list (details below).
436</P>
437<br><a name="SEC9" href="#TOC1">SUBJECT LINE SYNTAX</a><br>
438<P>
439Before each subject line is passed to <b>pcre2_match()</b> or
440<b>pcre2_dfa_match()</b>, leading and trailing white space is removed, and the
441line is scanned for backslash escapes. The following provide a means of
442encoding non-printing characters in a visible way:
<pre>
  \a         alarm (BEL, \x07)
  \b         backspace (\x08)
  \e         escape (\x1b)
  \f         form feed (\x0c)
  \n         newline (\x0a)
  \r         carriage return (\x0d)
  \t         tab (\x09)
  \v         vertical tab (\x0b)
  \nnn       octal character (up to 3 octal digits); always
               a byte unless &#62; 255 in UTF-8 or 16-bit or 32-bit mode
  \o{dd...}  octal character (any number of octal digits)
  \xhh       hexadecimal byte (up to 2 hex digits)
  \x{hh...}  hexadecimal character (any number of hex digits)
</pre>
458The use of \x{hh...} is not dependent on the use of the <b>utf</b> modifier on
459the pattern. It is recognized always. There may be any number of hexadecimal
460digits inside the braces; invalid values provoke error messages.
461</P>
462<P>
463Note that \xhh specifies one byte rather than one character in UTF-8 mode;
464this makes it possible to construct invalid UTF-8 sequences for testing
465purposes. On the other hand, \x{hh} is interpreted as a UTF-8 character in
466UTF-8 mode, generating more than one byte if the value is greater than 127.
467When testing the 8-bit library not in UTF-8 mode, \x{hh} generates one byte
468for values less than 256, and causes an error for greater values.
469</P>
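<P>
The difference between the two hexadecimal forms, as seen by the 8-bit
library, can be illustrated by the Python sketch below. The function name
<b>encode_hex_escape</b> and its parameters are hypothetical; <i>braced</i>
selects the \x{hh...} form and <i>utf8</i> stands for UTF-8 mode.
</P>

```python
def encode_hex_escape(value, braced, utf8):
    # Bytes contributed by one hexadecimal escape in the 8-bit library.
    if braced and utf8:
        # A character: more than one byte when the value is above 127.
        return chr(value).encode("utf-8")
    if value > 0xFF:
        raise ValueError("value too large for a single byte")
    return bytes([value])           # exactly one byte
```

<P>
For the value E9, the unbraced form gives the single byte E9 even in UTF-8
mode (an invalid UTF-8 sequence on its own), whereas the braced form gives the
two bytes C3 A9.
</P>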
470<P>
471In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
472possible to construct invalid UTF-16 sequences for testing purposes.
473</P>
474<P>
475In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This makes it
476possible to construct invalid UTF-32 sequences for testing purposes.
477</P>
478<P>
479There is a special backslash sequence that specifies replication of one or more
480characters:
481<pre>
482  \[&#60;characters&#62;]{&#60;count&#62;}
483</pre>
484This makes it possible to test long strings without having to provide them as
485part of the file. For example:
486<pre>
487  \[abc]{4}
488</pre>
489is converted to "abcabcabcabc". This feature does not support nesting. To
490include a closing square bracket in the characters, code it as \x5D.
491</P>
492<P>
493A backslash followed by an equals sign marks the end of the subject string and
494the start of a modifier list. For example:
495<pre>
496  abc\=notbol,notempty
497</pre>
498If the subject string is empty and \= is followed by whitespace, the line is
499treated as a comment line, and is not used for matching. For example:
500<pre>
501  \= This is a comment.
502  abc\= This is an invalid modifier list.
503</pre>
504A backslash followed by any other non-alphanumeric character just escapes that
505character. A backslash followed by anything else causes an error. However, if
506the very last character in the line is a backslash (and there is no modifier
507list), it is ignored. This gives a way of passing an empty line as data, since
508a real empty line terminates the data input.
509</P>
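<P>
The way a subject line divides into data and modifiers can be sketched in
Python as shown below. Both function names are hypothetical, and the sketch
only locates the separator; it does not decode the other escape sequences in
the subject.
</P>

```python
def split_subject_line(line):
    # Returns (subject, modifier_text); modifier_text is None when the line
    # has no backslash-equals separator.
    line = line.strip()                  # leading/trailing white space is removed
    i = 0
    while i < len(line):
        if line[i] != "\\":
            i += 1
        elif i + 1 == len(line):
            return line[:i], None        # a lone final backslash is ignored
        elif line[i + 1] == "=":
            return line[:i], line[i + 2:]
        else:
            i += 2                       # ordinary escape: skip both characters
    return line, None


def is_comment_line(line):
    # An empty subject with white space after the separator marks a comment.
    subject, modifiers = split_subject_line(line)
    return subject == "" and modifiers is not None and modifiers[:1].isspace()
```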
510<br><a name="SEC10" href="#TOC1">PATTERN MODIFIERS</a><br>
511<P>
512There are several types of modifier that can appear in pattern lines. Except
513where noted below, they may also be used in <b>#pattern</b> commands. A
514pattern's modifier list can add to or override default modifiers that were set
515by a previous <b>#pattern</b> command.
516<a name="optionmodifiers"></a></P>
517<br><b>
518Setting compilation options
519</b><br>
520<P>
521The following modifiers set options for <b>pcre2_compile()</b>. The most common
522ones have single-letter abbreviations. See
523<a href="pcre2api.html"><b>pcre2api</b></a>
524for a description of their effects.
525<pre>
526      allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
527      alt_bsux                  set PCRE2_ALT_BSUX
528      alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
529      alt_verbnames             set PCRE2_ALT_VERBNAMES
530      anchored                  set PCRE2_ANCHORED
531      auto_callout              set PCRE2_AUTO_CALLOUT
532  /i  caseless                  set PCRE2_CASELESS
533      dollar_endonly            set PCRE2_DOLLAR_ENDONLY
534  /s  dotall                    set PCRE2_DOTALL
535      dupnames                  set PCRE2_DUPNAMES
536  /x  extended                  set PCRE2_EXTENDED
537      firstline                 set PCRE2_FIRSTLINE
538      match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
539  /m  multiline                 set PCRE2_MULTILINE
540      never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
541      never_ucp                 set PCRE2_NEVER_UCP
542      never_utf                 set PCRE2_NEVER_UTF
543      no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
544      no_auto_possess           set PCRE2_NO_AUTO_POSSESS
545      no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
546      no_start_optimize         set PCRE2_NO_START_OPTIMIZE
547      no_utf_check              set PCRE2_NO_UTF_CHECK
548      ucp                       set PCRE2_UCP
549      ungreedy                  set PCRE2_UNGREEDY
550      use_offset_limit          set PCRE2_USE_OFFSET_LIMIT
551      utf                       set PCRE2_UTF
552</pre>
553As well as turning on the PCRE2_UTF option, the <b>utf</b> modifier causes all
554non-printing characters in output strings to be printed using the \x{hh...}
555notation. Otherwise, those less than 0x100 are output in hex without the curly
556brackets.
557<a name="controlmodifiers"></a></P>
558<br><b>
559Setting compilation controls
560</b><br>
561<P>
562The following modifiers affect the compilation process or request information
563about the pattern:
564<pre>
565      bsr=[anycrlf|unicode]     specify \R handling
566  /B  bincode                   show binary code without lengths
567      callout_info              show callout information
568      debug                     same as info,fullbincode
569      fullbincode               show binary code with lengths
570  /I  info                      show info about compiled pattern
571      hex                       unquoted characters are hexadecimal
572      jit[=&#60;number&#62;]            use JIT
573      jitfast                   use JIT fast path
574      jitverify                 verify JIT use
575      locale=&#60;name&#62;             use this locale
576      max_pattern_length=&#60;n&#62;    set the maximum pattern length
577      memory                    show memory used
578      newline=&#60;type&#62;            set newline type
579      null_context              compile with a NULL context
580      parens_nest_limit=&#60;n&#62;     set maximum parentheses depth
581      posix                     use the POSIX API
582      posix_nosub               use the POSIX API with REG_NOSUB
583      push                      push compiled pattern onto the stack
584      pushcopy                  push a copy onto the stack
585      stackguard=&#60;number&#62;       test the stackguard feature
586      tables=[0|1|2]            select internal tables
587</pre>
588The effects of these modifiers are described in the following sections.
589</P>
590<br><b>
591Newline and \R handling
592</b><br>
593<P>
594The <b>bsr</b> modifier specifies what \R in a pattern should match. If it is
595set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to "unicode",
\R matches any Unicode newline sequence. The default is specified when PCRE2
is built; unless it is changed at build time, the default is Unicode.
598</P>
599<P>
600The <b>newline</b> modifier specifies which characters are to be interpreted as
601newlines, both in the pattern and in subject lines. The type must be one of CR,
602LF, CRLF, ANYCRLF, or ANY (in upper or lower case).
603</P>
604<br><b>
605Information about a pattern
606</b><br>
607<P>
608The <b>debug</b> modifier is a shorthand for <b>info,fullbincode</b>, requesting
609all available information.
610</P>
611<P>
612The <b>bincode</b> modifier causes a representation of the compiled code to be
613output after compilation. This information does not contain length and offset
614values, which ensures that the same output is generated for different internal
615link sizes and different code unit widths. By using <b>bincode</b>, the same
616regression tests can be used in different environments.
617</P>
618<P>
619The <b>fullbincode</b> modifier, by contrast, <i>does</i> include length and
620offset values. This is used in a few special tests that run only for specific
621code unit widths and link sizes, and is also useful for one-off tests.
622</P>
623<P>
624The <b>info</b> modifier requests information about the compiled pattern
625(whether it is anchored, has a fixed first character, and so on). The
626information is obtained from the <b>pcre2_pattern_info()</b> function. Here are
627some typical examples:
628<pre>
629    re&#62; /(?i)(^a|^b)/m,info
630  Capturing subpattern count = 1
631  Compile options: multiline
632  Overall options: caseless multiline
633  First code unit at start or follows newline
634  Subject length lower bound = 1
635
636    re&#62; /(?i)abc/info
637  Capturing subpattern count = 0
638  Compile options: &#60;none&#62;
639  Overall options: caseless
640  First code unit = 'a' (caseless)
641  Last code unit = 'c' (caseless)
642  Subject length lower bound = 3
643</pre>
644"Compile options" are those specified by modifiers; "overall options" have
645added options that are taken or deduced from the pattern. If both sets of
646options are the same, just a single "options" line is output; if there are no
647options, the line is omitted. "First code unit" is where any match must start;
648if there is more than one they are listed as "starting code units". "Last code
649unit" is the last literal code unit that must be present in any match. This is
650not necessarily the last character. These lines are omitted if no starting or
651ending code units are recorded.
652</P>
653<P>
654The <b>callout_info</b> modifier requests information about all the callouts in
655the pattern. A list of them is output at the end of any other information that
656is requested. For each callout, either its number or string is given, followed
657by the item that follows it in the pattern.
658</P>
659<br><b>
660Passing a NULL context
661</b><br>
662<P>
663Normally, <b>pcre2test</b> passes a context block to <b>pcre2_compile()</b>. If
664the <b>null_context</b> modifier is set, however, NULL is passed. This is for
665testing that <b>pcre2_compile()</b> behaves correctly in this case (it uses
666default values).
667</P>
668<br><b>
669Specifying pattern characters in hexadecimal
670</b><br>
671<P>
672The <b>hex</b> modifier specifies that the characters of the pattern, except for
673substrings enclosed in single or double quotes, are to be interpreted as pairs
674of hexadecimal digits. This feature is provided as a way of creating patterns
675that contain binary zeros and other non-printing characters. White space is
676permitted between pairs of digits. For example, this pattern contains three
677characters:
678<pre>
679  /ab 32 59/hex
680</pre>
681Parts of such a pattern are taken literally if quoted. This pattern contains
682nine characters, only two of which are specified in hexadecimal:
683<pre>
684  /ab "literal" 32/hex
685</pre>
686Either single or double quotes may be used. There is no way of including
687the delimiter within a substring.
688</P>
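<P>
The decoding rules above can be sketched in a few lines of Python. This is only
an illustration of the description, not code taken from <b>pcre2test</b>; the
function name decode_hex_pattern and its error handling are assumptions made
for the example.
</P>

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_hex_pattern(body: str) -> bytes:
    """Decode the text between the delimiters of a pattern that has the hex modifier."""
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1  # white space is allowed between pairs of digits
        elif ch in "'\"":
            end = body.find(ch, i + 1)  # matching quote of the same kind
            if end == -1:
                raise ValueError("unterminated quoted substring")
            out += body[i + 1:end].encode("latin-1")  # quoted text is literal
            i = end + 1
        else:
            pair = body[i:i + 2]
            if len(pair) != 2 or any(c not in HEX_DIGITS for c in pair):
                raise ValueError(f"expected two hexadecimal digits at offset {i}")
            out.append(int(pair, 16))
            i += 2
    return bytes(out)


print(decode_hex_pattern('ab 32 59'))                # three code units
print(len(decode_hex_pattern('ab "literal" 32')))    # nine code units
```

<P>
Because a quoted substring ends at the next quote of the same kind, the sketch
shares the limitation noted above: the delimiter cannot appear inside the
substring.
</P>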
689<P>
690By default, <b>pcre2test</b> passes patterns as zero-terminated strings to
691<b>pcre2_compile()</b>, giving the length as PCRE2_ZERO_TERMINATED. However, for
692patterns specified with the <b>hex</b> modifier, the actual length of the
693pattern is passed.
694</P>
695<br><b>
696Generating long repetitive patterns
697</b><br>
698<P>
699Some tests use long patterns that are very repetitive. Instead of creating a
700very long input line for such a pattern, you can use a special repetition
701feature, similar to the one described for subject lines above. If the
702<b>expand</b> modifier is present on a pattern, parts of the pattern that have
703the form
704<pre>
705  \[&#60;characters&#62;]{&#60;count&#62;}
706</pre>
707are expanded before the pattern is passed to <b>pcre2_compile()</b>. For
708example, \[AB]{6000} is expanded to "AB" repeated 6000 times ("ABAB..."). This construction
709cannot be nested. An initial "\[" sequence is recognized only if "]{" followed
710by decimal digits and "}" is found later in the pattern. If not, the characters
711remain in the pattern unaltered.
712</P>
713<P>
714If part of an expanded pattern looks like an expansion, but is really part of
715the actual pattern, unwanted expansion can be avoided by giving two values in
716the quantifier. For example, \[AB]{6000,6000} is not recognized as an
717expansion item.
718</P>
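<P>
As a rough model of the expansion rule, the Python sketch below rewrites
\[&#60;characters&#62;]{&#60;count&#62;} items with a regular expression. It is an
approximation written for this page, not the code used by <b>pcre2test</b>; the
name expand_pattern is hypothetical, and details such as how a zero count is
treated are not modelled.
</P>

```python
import re

# "\[" ... "]{" digits "}" ; the lazy ".+?" stops at the first "]{n}" that fits.
_EXPANSION = re.compile(r"\\\[(.+?)\]\{(\d+)\}")


def expand_pattern(pattern: str) -> str:
    """Replace each \\[chars]{count} item by chars repeated count times."""
    return _EXPANSION.sub(lambda m: m.group(1) * int(m.group(2)), pattern)


print(expand_pattern(r"x\[AB]{3}y"))
print(len(expand_pattern(r"\[AB]{6000}")))
print(expand_pattern(r"\[AB]{3,3}"))   # two values: left unaltered
```

<P>
A quantifier with two values does not fit the "]{digits}" shape, so the text is
passed through unchanged, which matches the way unwanted expansion is avoided
above.
</P>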
719<P>
720If the <b>info</b> modifier is set on an expanded pattern, the result of the
721expansion is included in the information that is output.
722</P>
723<br><b>
724JIT compilation
725</b><br>
726<P>
727Just-in-time (JIT) compiling is a heavyweight optimization that can greatly
728speed up pattern matching. See the
729<a href="pcre2jit.html"><b>pcre2jit</b></a>
730documentation for details. JIT compiling happens, optionally, after a pattern
731has been successfully compiled into an internal form. The JIT compiler converts
732this to optimized machine code. It needs to know whether the match-time options
733PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, because
734different code is generated for the different cases. See the <b>partial</b>
735modifier in "Subject Modifiers"
736<a href="#subjectmodifiers">below</a>
737for details of how these options are specified for each match attempt.
738</P>
739<P>
740JIT compilation is requested by the <b>/jit</b> pattern modifier, which may
741optionally be followed by an equals sign and a number in the range 0 to 7.
742The three bits that make up the number specify which of the three JIT operating
743modes are to be compiled:
744<pre>
745  1  compile JIT code for non-partial matching
746  2  compile JIT code for soft partial matching
747  4  compile JIT code for hard partial matching
748</pre>
749The possible values for the <b>/jit</b> modifier are therefore:
750<pre>
751  0  disable JIT
752  1  normal matching only
753  2  soft partial matching only
754  3  normal and soft partial matching
755  4  hard partial matching only
756  6  soft and hard partial matching only
757  7  all three modes
758</pre>
759If no number is given, 7 is assumed. The phrase "partial matching" means a call
760to <b>pcre2_match()</b> with either the PCRE2_PARTIAL_SOFT or the
761PCRE2_PARTIAL_HARD option set. Note that such a call may return a complete
762match; the options enable the possibility of a partial match, but do not
763require it. Note also that if you request JIT compilation only for partial
764matching (for example, /jit=2) but do not set the <b>partial</b> modifier on a
765subject line, that match will not use JIT code because none was compiled for
766non-partial matching.
767</P>
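<P>
The value given to the <b>jit</b> modifier is a three-bit mask, so the table
above can be reproduced mechanically. The Python sketch below does this. The
helper names are invented for the example, and the second helper ignores the
incompatible run-time options that can also prevent JIT code from being used.
Note that the mask also allows the value 5 (normal and hard partial matching),
which follows from the bit definitions.
</P>

```python
from typing import List, Optional

# Bit values as listed above.
JIT_MODES = {1: "non-partial", 2: "soft partial", 4: "hard partial"}


def jit_modes(value: int = 7) -> List[str]:
    """Return the matching modes for which jit=value requests JIT code."""
    if not 0 <= value <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return [name for bit, name in JIT_MODES.items() if value & bit]


def jit_code_available(jit_value: int, partial: Optional[str] = None) -> bool:
    """partial is None, "soft" or "hard", mirroring the subject modifiers."""
    needed = {None: 1, "soft": 2, "hard": 4}[partial]
    return bool(jit_value & needed)


for value in range(8):
    print(value, jit_modes(value))

# jit=2 with no partial modifier on the subject line: no JIT code to use
print(jit_code_available(2))
```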
768<P>
769If JIT compilation is successful, the compiled JIT code will automatically be
770used when an appropriate type of match is run, except when incompatible
771run-time options are specified. For more details, see the
772<a href="pcre2jit.html"><b>pcre2jit</b></a>
773documentation. See also the <b>jitstack</b> modifier below for a way of
774setting the size of the JIT stack.
775</P>
776<P>
777If the <b>jitfast</b> modifier is specified, matching is done using the JIT
778"fast path" interface, <b>pcre2_jit_match()</b>, which skips some of the sanity
779checks that are done by <b>pcre2_match()</b>, and of course does not work when
780JIT is not supported. If <b>jitfast</b> is specified without <b>jit</b>, jit=7 is
781assumed.
782</P>
783<P>
784If the <b>jitverify</b> modifier is specified, information about the compiled
785pattern shows whether JIT compilation was or was not successful. If
786<b>jitverify</b> is specified without <b>jit</b>, jit=7 is assumed. If JIT
787compilation is successful when <b>jitverify</b> is set, the text "(JIT)" is
788added to the first output line after a match or non-match when JIT-compiled
789code was actually used in the match.
790</P>
791<br><b>
792Setting a locale
793</b><br>
794<P>
795The <b>/locale</b> modifier must specify the name of a locale, for example:
796<pre>
797  /pattern/locale=fr_FR
798</pre>
799The given locale is set, <b>pcre2_maketables()</b> is called to build a set of
800character tables for the locale, and this is then passed to
801<b>pcre2_compile()</b> when compiling the regular expression. The same tables
802are used when matching the following subject lines. The <b>/locale</b> modifier
803applies only to the pattern on which it appears, but can be given in a
804<b>#pattern</b> command if a default is needed. Setting a locale and alternate
805character tables are mutually exclusive.
806</P>
807<br><b>
808Showing pattern memory
809</b><br>
810<P>
811The <b>/memory</b> modifier causes the size in bytes of the memory used to hold
812the compiled pattern to be output. This does not include the size of the
813<b>pcre2_code</b> block; it is just the actual compiled data. If the pattern is
814subsequently passed to the JIT compiler, the size of the JIT compiled code is
815also output. Here is an example:
816<pre>
817    re&#62; /a(b)c/jit,memory
818  Memory allocation (code space): 21
819  Memory allocation (JIT code): 1910
820
821</PRE>
822</P>
823<br><b>
824Limiting nested parentheses
825</b><br>
826<P>
827The <b>parens_nest_limit</b> modifier sets a limit on the depth of nested
828parentheses in a pattern. Breaching the limit causes a compilation error.
829The default for the library is set when PCRE2 is built, but <b>pcre2test</b>
830sets its own default of 220, which is required for running the standard test
831suite.
832</P>
833<br><b>
834Limiting the pattern length
835</b><br>
836<P>
837The <b>max_pattern_length</b> modifier sets a limit, in code units, to the
838length of pattern that <b>pcre2_compile()</b> will accept. Breaching the limit
839causes a compilation error. The default is the largest number a PCRE2_SIZE
840variable can hold (essentially unlimited).
841</P>
842<br><b>
843Using the POSIX wrapper API
844</b><br>
845<P>
846The <b>/posix</b> and <b>posix_nosub</b> modifiers cause <b>pcre2test</b> to call
847PCRE2 via the POSIX wrapper API rather than its native API. When
848<b>posix_nosub</b> is used, the POSIX option REG_NOSUB is passed to
849<b>regcomp()</b>. The POSIX wrapper supports only the 8-bit library. Note that
850it does not imply POSIX matching semantics; for more detail see the
851<a href="pcre2posix.html"><b>pcre2posix</b></a>
852documentation. The following pattern modifiers set options for the
853<b>regcomp()</b> function:
854<pre>
855  caseless           REG_ICASE
856  multiline          REG_NEWLINE
857  dotall             REG_DOTALL     )
858  ungreedy           REG_UNGREEDY   ) These options are not part of
859  ucp                REG_UCP        )   the POSIX standard
860  utf                REG_UTF8       )
861</pre>
862The <b>regerror_buffsize</b> modifier specifies a size for the error buffer that
863is passed to <b>regerror()</b> in the event of a compilation error. For example:
864<pre>
865  /abc/posix,regerror_buffsize=20
866</pre>
867This provides a means of testing the behaviour of <b>regerror()</b> when the
868buffer is too small for the error message. If this modifier has not been set, a
869large buffer is used.
870</P>
871<P>
872The <b>aftertext</b> and <b>allaftertext</b> subject modifiers work as described
873below. All other modifiers are either ignored, with a warning message, or cause
874an error.
875</P>
876<br><b>
877Testing the stack guard feature
878</b><br>
879<P>
880The <b>/stackguard</b> modifier is used to test the use of
881<b>pcre2_set_compile_recursion_guard()</b>, a function that is provided to
882enable stack availability to be checked during compilation (see the
883<a href="pcre2api.html"><b>pcre2api</b></a>
884documentation for details). If the number specified by the modifier is greater
885than zero, <b>pcre2_set_compile_recursion_guard()</b> is called to set up
886a callback from <b>pcre2_compile()</b> to a local function. The argument it
887receives is the current parenthesis nesting depth; if this is greater than the
888value given by the modifier, non-zero is returned, causing the compilation to
889be aborted.
890</P>
891<br><b>
892Using alternative character tables
893</b><br>
894<P>
895The value specified for the <b>/tables</b> modifier must be one of the digits 0,
8961, or 2. It causes a specific set of built-in character tables to be passed to
897<b>pcre2_compile()</b>. This is used in the PCRE2 tests to check behaviour with
898different character tables. The digit specifies the tables as follows:
899<pre>
900  0   do not pass any special character tables
901  1   the default ASCII tables, as distributed in
902        pcre2_chartables.c.dist
903  2   a set of tables defining ISO 8859 characters
904</pre>
905In table 2, some characters whose codes are greater than 128 are identified as
906letters, digits, spaces, etc. Setting alternate character tables and a locale
907are mutually exclusive.
908</P>
909<br><b>
910Setting certain match controls
911</b><br>
912<P>
913The following modifiers are really subject modifiers, and are described below.
914However, they may be included in a pattern's modifier list, in which case they
915are applied to every subject line that is processed with that pattern. They may
916not appear in <b>#pattern</b> commands. These modifiers do not affect the
917compilation process.
918<pre>
919      aftertext                  show text after match
920      allaftertext               show text after captures
921      allcaptures                show all captures
922      allusedtext                show all consulted text
923  /g  global                     global matching
924      mark                       show mark values
925      replace=&#60;string&#62;           specify a replacement string
926      startchar                  show starting character when relevant
927      substitute_extended        use PCRE2_SUBSTITUTE_EXTENDED
928      substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
929      substitute_unknown_unset   use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
930      substitute_unset_empty     use PCRE2_SUBSTITUTE_UNSET_EMPTY
931</pre>
932To make any of these modifiers a default for subsequent subject lines, set it in
933a <b>#subject</b> command.
934</P>
935<br><b>
936Saving a compiled pattern
937</b><br>
938<P>
939When a pattern with the <b>push</b> modifier is successfully compiled, it is
940pushed onto a stack of compiled patterns, and <b>pcre2test</b> expects the next
941line to contain a new pattern (or a command) instead of a subject line. This
942facility is used when saving compiled patterns to a file, as described in the
943section entitled "Saving and restoring compiled patterns"
944<a href="#saverestore">below.</a> If <b>pushcopy</b> is used instead of <b>push</b>, a copy of the compiled
945pattern is stacked, leaving the original as current, ready to match the
946following input lines. This provides a way of testing the
947<b>pcre2_code_copy()</b> function.
948The <b>push</b> and <b>pushcopy</b> modifiers are incompatible with pattern
949modifiers such as <b>global</b> that act at match time. Any that are specified
950are ignored (for the stacked copy), with a warning message, except for
951<b>replace</b>, which causes an error. Note that <b>jitverify</b>, which is
952allowed, does not carry through to any subsequent matching that uses a stacked
953pattern.
954<a name="subjectmodifiers"></a></P>
955<br><a name="SEC11" href="#TOC1">SUBJECT MODIFIERS</a><br>
956<P>
957The modifiers that can appear in subject lines and the <b>#subject</b>
958command are of two types.
959</P>
960<br><b>
961Setting match options
962</b><br>
963<P>
964The following modifiers set options for <b>pcre2_match()</b> or
965<b>pcre2_dfa_match()</b>. See
966<a href="pcre2api.html"><b>pcre2api</b></a>
967for a description of their effects.
968<pre>
969      anchored                  set PCRE2_ANCHORED
970      dfa_restart               set PCRE2_DFA_RESTART
971      dfa_shortest              set PCRE2_DFA_SHORTEST
972      no_jit                    set PCRE2_NO_JIT
973      no_utf_check              set PCRE2_NO_UTF_CHECK
974      notbol                    set PCRE2_NOTBOL
975      notempty                  set PCRE2_NOTEMPTY
976      notempty_atstart          set PCRE2_NOTEMPTY_ATSTART
977      noteol                    set PCRE2_NOTEOL
978      partial_hard (or ph)      set PCRE2_PARTIAL_HARD
979      partial_soft (or ps)      set PCRE2_PARTIAL_SOFT
980</pre>
981The partial matching modifiers are provided with abbreviations because they
982appear frequently in tests.
983</P>
984<P>
985If the <b>/posix</b> modifier was present on the pattern, causing the POSIX
986wrapper API to be used, the only option-setting modifiers that have any effect
987are <b>notbol</b>, <b>notempty</b>, and <b>noteol</b>, causing REG_NOTBOL,
988REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to <b>regexec()</b>.
989The other modifiers are ignored, with a warning message.
990</P>
991<br><b>
992Setting match controls
993</b><br>
994<P>
995The following modifiers affect the matching process or request additional
996information. Some of them may also be specified on a pattern line (see above),
997in which case they apply to every subject line that is matched against that
998pattern.
999<pre>
1000      aftertext                  show text after match
1001      allaftertext               show text after captures
1002      allcaptures                show all captures
1003      allusedtext                show all consulted text (non-JIT only)
1004      altglobal                  alternative global matching
1005      callout_capture            show captures at callout time
1006      callout_data=&#60;n&#62;           set a value to pass via callouts
1007      callout_fail=&#60;n&#62;[:&#60;m&#62;]     control callout failure
1008      callout_none               do not supply a callout function
1009      copy=&#60;number or name&#62;      copy captured substring
1010      dfa                        use <b>pcre2_dfa_match()</b>
1011      find_limits                find match and recursion limits
1012      get=&#60;number or name&#62;       extract captured substring
1013      getall                     extract all captured substrings
1014  /g  global                     global matching
1015      jitstack=&#60;n&#62;               set size of JIT stack
1016      mark                       show mark values
1017      match_limit=&#60;n&#62;            set a match limit
1018      memory                     show memory usage
1019      null_context               match with a NULL context
1020      offset=&#60;n&#62;                 set starting offset
1021      offset_limit=&#60;n&#62;           set offset limit
1022      ovector=&#60;n&#62;                set size of output vector
1023      recursion_limit=&#60;n&#62;        set a recursion limit
1024      replace=&#60;string&#62;           specify a replacement string
1025      startchar                  show startchar when relevant
1026      startoffset=&#60;n&#62;            same as offset=&#60;n&#62;
1027      substitute_extended        use PCRE2_SUBSTITUTE_EXTENDED
1028      substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
1029      substitute_unknown_unset   use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
1030      substitute_unset_empty     use PCRE2_SUBSTITUTE_UNSET_EMPTY
1031      zero_terminate             pass the subject as zero-terminated
1032</pre>
1033The effects of these modifiers are described in the following sections. When
1034matching via the POSIX wrapper API, the <b>aftertext</b>, <b>allaftertext</b>,
1035and <b>ovector</b> subject modifiers work as described below. All other
1036modifiers are either ignored, with a warning message, or cause an error.
1037</P>
1038<br><b>
1039Showing more text
1040</b><br>
1041<P>
1042The <b>aftertext</b> modifier requests that as well as outputting the part of
1043the subject string that matched the entire pattern, <b>pcre2test</b> should in
1044addition output the remainder of the subject string. This is useful for tests
1045where the subject contains multiple copies of the same substring. The
1046<b>allaftertext</b> modifier requests the same action for captured substrings as
1047well as the main matched substring. In each case the remainder is output on the
1048following line with a plus character following the capture number.
1049</P>
1050<P>
1051The <b>allusedtext</b> modifier requests that all the text that was consulted
1052during a successful pattern match by the interpreter should be shown. This
1053feature is not supported for JIT matching, and if requested with JIT it is
1054ignored (with a warning message). Setting this modifier affects the output if
1055there is a lookbehind at the start of a match, or a lookahead at the end, or if
1056\K is used in the pattern. Characters that precede or follow the start and end
1057of the actual match are indicated in the output by '&#60;' or '&#62;' characters
1058underneath them. Here is an example:
1059<pre>
1060    re&#62; /(?&#60;=pqr)abc(?=xyz)/
1061  data&#62; 123pqrabcxyz456\=allusedtext
1062   0: pqrabcxyz
1063      &#60;&#60;&#60;   &#62;&#62;&#62;
1064</pre>
1065This shows that the matched string is "abc", with the preceding and following
1066strings "pqr" and "xyz" having been consulted during the match (when processing
1067the assertions).
1068</P>
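<P>
The marker line in the example above can be reproduced from four offsets: where
the consulted text starts and ends, and where the actual match starts and ends.
The Python sketch below assumes those offsets are already known. The function
and parameter names are hypothetical, and the way <b>pcre2test</b> obtains the
offsets is not shown.
</P>

```python
def allusedtext_lines(subject, used_start, match_start, match_end, used_end):
    """Build the two output lines for a match shown with allusedtext."""
    shown = subject[used_start:used_end]
    markers = (
        "<" * (match_start - used_start)    # consulted before the match
        + " " * (match_end - match_start)   # the match itself is unmarked
        + ">" * (used_end - match_end)      # consulted after the match
    )
    return [" 0: " + shown, ("    " + markers).rstrip()]


for line in allusedtext_lines("123pqrabcxyz456", 3, 6, 9, 12):
    print(line)
```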
1069<P>
1070The <b>startchar</b> modifier requests that the starting character for the match
1071be indicated, if it is different to the start of the matched string. The only
1072time when this occurs is when \K has been processed as part of the match. In
1073this situation, the output for the matched string is displayed from the
1074starting character instead of from the match point, with circumflex characters
1075under the earlier characters. For example:
1076<pre>
1077    re&#62; /abc\Kxyz/
1078  data&#62; abcxyz\=startchar
1079   0: abcxyz
1080      ^^^
1081</pre>
1082Unlike <b>allusedtext</b>, the <b>startchar</b> modifier can be used with JIT.
1083However, these two modifiers are mutually exclusive.
1084</P>
1085<br><b>
1086Showing the value of all capture groups
1087</b><br>
1088<P>
1089The <b>allcaptures</b> modifier requests that the values of all potential
1090captured parentheses be output after a match. By default, only those up to the
1091highest one actually used in the match are output (corresponding to the return
1092code from <b>pcre2_match()</b>). Groups that did not take part in the match
1093are output as "&#60;unset&#62;". This modifier is not relevant for DFA matching (which
1094does no capturing); it is ignored, with a warning message, if present.
1095</P>
1096<br><b>
1097Testing callouts
1098</b><br>
1099<P>
1100A callout function is supplied when <b>pcre2test</b> calls the library matching
1101functions, unless <b>callout_none</b> is specified. If <b>callout_capture</b> is
1102set, the current captured groups are output when a callout occurs.
1103</P>
1104<P>
1105The <b>callout_fail</b> modifier can be given one or two numbers. If there is
1106only one number, 1 is returned instead of 0 when a callout of that number is
1107reached. If two numbers are given, 1 is returned when callout &#60;n&#62; is reached
1108for the &#60;m&#62;th time. Note that callouts with string arguments are always given
1109the number zero. See "Callouts" below for a description of the output when a
1110callout is taken.
1111</P>
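<P>
The effect of <b>callout_fail</b> can be modelled as a small counting function.
The Python sketch below follows the wording of the paragraph above; it is not
the callout function from <b>pcre2test</b>, and the names used are invented.
The text only says what happens on the &#60;m&#62;th visit, so the behaviour on
later visits (the sketch keeps returning 1) is an assumption.
</P>

```python
from collections import Counter


def make_callout(fail_number=None, fail_count=1):
    """Return a callout function modelling callout_fail=fail_number:fail_count."""
    seen = Counter()

    def callout(number):
        # Callouts with string arguments would arrive here with number 0.
        seen[number] += 1
        if number != fail_number:
            return 0
        # Assumption: once the count is reached, later visits also return 1.
        return 1 if seen[number] >= fail_count else 0

    return callout


always = make_callout(2)          # like callout_fail=2
third = make_callout(2, 3)        # like callout_fail=2:3
print([always(n) for n in (1, 2, 2)])
print([third(2) for _ in range(3)])
```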
1112<P>
1113The <b>callout_data</b> modifier can be given an unsigned or a negative number.
1114This is set as the "user data" that is passed to the matching function, and
1115passed back when the callout function is invoked. Any value other than zero is
1116used as a return from <b>pcre2test</b>'s callout function.
1117</P>
1118<br><b>
1119Finding all matches in a string
1120</b><br>
1121<P>
1122Searching for all possible matches within a subject can be requested by the
1123<b>global</b> or <b>altglobal</b> modifier. After finding a match, the matching
1124function is called again to search the remainder of the subject. The difference
1125between <b>global</b> and <b>altglobal</b> is that the former uses the
1126<i>start_offset</i> argument to <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
1127to start searching at a new point within the entire string (which is what Perl
1128does), whereas the latter passes over a shortened subject. This makes a
1129difference to the matching process if the pattern begins with a lookbehind
1130assertion (including \b or \B).
1131</P>
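<P>
The difference between the two modifiers can be demonstrated with any regex
engine that accepts a starting offset. The sketch below uses Python's re module
as a stand-in for PCRE2, so it only illustrates the idea; the function names
are invented, and both loops simply stop at an empty match instead of handling
it.
</P>

```python
import re


def global_matches(pattern, subject):
    """Like global: keep the whole subject and move the start offset."""
    rx, pos, found = re.compile(pattern), 0, []
    while True:
        m = rx.search(subject, pos)   # lookbehind can still see text before pos
        if m is None or m.end() == m.start():
            return found
        found.append((m.start(), m.group()))
        pos = m.end()


def altglobal_matches(pattern, subject):
    """Like altglobal: pass only the remainder of the subject each time."""
    rx, base, found = re.compile(pattern), 0, []
    while True:
        m = rx.search(subject[base:])  # earlier text is no longer visible
        if m is None or m.end() == m.start():
            return found
        found.append((base + m.start(), m.group()))
        base += m.end()


print(global_matches(r"a|(?<=a)b", "ab"))
print(altglobal_matches(r"a|(?<=a)b", "ab"))
```

<P>
With the whole subject available, the lookbehind can see the "a" that was
consumed by the first match, so "b" is found as well; with a shortened subject
it cannot.
</P>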
1132<P>
1133If an empty string is matched, the next match is done with the
1134PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search for
1135another, non-empty, match at the same point in the subject. If this match
1136fails, the start offset is advanced, and the normal match is retried. This
1137imitates the way Perl handles such cases when using the <b>/g</b> modifier or
1138the <b>split()</b> function. Normally, the start offset is advanced by one
1139character, but if the newline convention recognizes CRLF as a newline, and the
1140current character is CR followed by LF, an advance of two characters occurs.
1141</P>
1142<br><b>
1143Testing substring extraction functions
1144</b><br>
1145<P>
1146The <b>copy</b> and <b>get</b> modifiers can be used to test the
1147<b>pcre2_substring_copy_xxx()</b> and <b>pcre2_substring_get_xxx()</b> functions.
1148They can be given more than once, and each can specify a group name or number,
1149for example:
1150<pre>
1151   abcd\=copy=1,copy=3,get=G1
1152</pre>
1153If the <b>#subject</b> command is used to set default copy and/or get lists,
1154these can be unset by specifying a negative number to cancel all numbered
1155groups and an empty name to cancel all named groups.
1156</P>
1157<P>
1158The <b>getall</b> modifier tests <b>pcre2_substring_list_get()</b>, which
1159extracts all captured substrings.
1160</P>
1161<P>
1162If the subject line is successfully matched, the substrings extracted by the
1163convenience functions are output with C, G, or L after the string number
1164instead of a colon. This is in addition to the normal full list. The string
1165length (that is, the return from the extraction function) is given in
1166parentheses after each substring, followed by the name when the extraction was
1167by name.
1168</P>
1169<br><b>
1170Testing the substitution function
1171</b><br>
1172<P>
1173If the <b>replace</b> modifier is set, the <b>pcre2_substitute()</b> function is
1174called instead of one of the matching functions. Note that replacement strings
1175cannot contain commas, because a comma signifies the end of a modifier. This is
1176not thought to be an issue in a test program.
1177</P>
1178<P>
1179Unlike subject strings, <b>pcre2test</b> does not process replacement strings
1180for escape sequences. In UTF mode, a replacement string is checked to see if it
1181is a valid UTF-8 string. If so, it is correctly converted to a UTF string of
1182the appropriate code unit width. If it is not a valid UTF-8 string, the
1183individual code units are copied directly. This provides a means of passing an
1184invalid UTF-8 string for testing purposes.
1185</P>
1186<P>
1187The following modifiers set options (in addition to the normal match options)
1188for <b>pcre2_substitute()</b>:
1189<pre>
1190  global                      PCRE2_SUBSTITUTE_GLOBAL
1191  substitute_extended         PCRE2_SUBSTITUTE_EXTENDED
1192  substitute_overflow_length  PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
1193  substitute_unknown_unset    PCRE2_SUBSTITUTE_UNKNOWN_UNSET
1194  substitute_unset_empty      PCRE2_SUBSTITUTE_UNSET_EMPTY
1195
1196</PRE>
1197</P>
1198<P>
1199After a successful substitution, the modified string is output, preceded by the
1200number of replacements. This may be zero if there were no matches. Here is a
1201simple example of a substitution test:
1202<pre>
1203  /abc/replace=xxx
1204      =abc=abc=
1205   1: =xxx=abc=
1206      =abc=abc=\=global
1207   2: =xxx=xxx=
1208</pre>
1209Subject and replacement strings should be kept relatively short (fewer than 256
1210characters) for substitution tests, as fixed-size buffers are used. To make it
1211easy to test for buffer overflow, if the replacement string starts with a
1212number in square brackets, that number is passed to <b>pcre2_substitute()</b> as
1213the size of the output buffer, with the replacement string starting at the next
1214character. Here is an example that tests the edge case:
1215<pre>
1216  /abc/
1217      123abc123\=replace=[10]XYZ
1218   1: 123XYZ123
1219      123abc123\=replace=[9]XYZ
1220  Failed: error -47: no more memory
1221</pre>
1222The default action of <b>pcre2_substitute()</b> is to return
1223PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if the
1224PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
1225<b>substitute_overflow_length</b> modifier), <b>pcre2_substitute()</b> continues
1226to go through the motions of matching and substituting, in order to compute the
1227size of buffer that is required. When this happens, <b>pcre2test</b> shows the
1228required buffer length (which includes space for the trailing zero) as part of
1229the error message. For example:
1230<pre>
1231  /abc/substitute_overflow_length
1232      123abc123\=replace=[9]XYZ
1233  Failed: error -47: no more memory: 10 code units are needed
1234</pre>
1235A replacement string is ignored with POSIX and DFA matching. Specifying partial
1236matching provokes an error return ("bad option value") from
1237<b>pcre2_substitute()</b>.
1238</P>
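<P>
The buffer-size arithmetic in the examples above can be checked with a small
model. The Python sketch below uses the re module in place of
<b>pcre2_substitute()</b> and treats the replacement as literal text. The
names, the NoMemory exception (standing in for PCRE2_ERROR_NOMEMORY) and the
default buffer size of 256 are all assumptions made for the illustration.
</P>

```python
import re


class NoMemory(Exception):
    """Stands in for the PCRE2_ERROR_NOMEMORY return."""


def split_size_prefix(replacement, default_size=256):
    """Split an optional leading "[n]" buffer size from the replacement text."""
    m = re.match(r"\[(\d+)\](.*)", replacement, re.S)
    if m:
        return int(m.group(1)), m.group(2)
    return default_size, replacement   # default size is hypothetical


def substitute(pattern, subject, replacement, global_=False, overflow_length=False):
    size, text = split_size_prefix(replacement)
    result, count = re.subn(pattern, lambda _m: text, subject,
                            count=0 if global_ else 1)
    needed = len(result) + 1           # one extra code unit for the trailing zero
    if needed > size:
        detail = f": {needed} code units are needed" if overflow_length else ""
        raise NoMemory("no more memory" + detail)
    return count, result


print(substitute("abc", "=abc=abc=", "xxx"))
print(substitute("abc", "=abc=abc=", "xxx", global_=True))
print(substitute("abc", "123abc123", "[10]XYZ"))
```

<P>
The result "123XYZ123" has nine code units, so ten are needed once the trailing
zero is counted; this is why a size of 10 succeeds and a size of 9 fails.
</P>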
1239<br><b>
1240Setting the JIT stack size
1241</b><br>
1242<P>
1243The <b>jitstack</b> modifier provides a way of setting the maximum stack size
1244that is used by the just-in-time optimization code. It is ignored if JIT
1245optimization is not being used. The value is a number of kilobytes. Providing a
1246stack that is larger than the default 32K is necessary only for very
1247complicated patterns.
1248</P>
1249<br><b>
1250Setting match and recursion limits
1251</b><br>
1252<P>
1253The <b>match_limit</b> and <b>recursion_limit</b> modifiers set the appropriate
1254limits in the match context. These values are ignored when the
1255<b>find_limits</b> modifier is specified.
1256</P>
1257<br><b>
1258Finding minimum limits
1259</b><br>
1260<P>
1261If the <b>find_limits</b> modifier is present, <b>pcre2test</b> calls
1262<b>pcre2_match()</b> several times, setting different values in the match
1263context via <b>pcre2_set_match_limit()</b> and <b>pcre2_set_recursion_limit()</b>
1264until it finds the minimum values for each parameter that allow
1265<b>pcre2_match()</b> to complete without error.
1266</P>
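<P>
One way to find such a minimum is a binary search over the limit value. The
Python sketch below shows the idea. The search strategy, the function name and
the upper bound are assumptions for the example, not a description of how
<b>pcre2test</b> performs the search, and the match_completes argument stands
in for a call to <b>pcre2_match()</b> with the given limit.
</P>

```python
def find_minimum_limit(match_completes, upper=2**32 - 1):
    """Smallest limit in 1..upper for which match_completes(limit) is true.

    Assumes that a match which completes with some limit also completes with
    every larger limit, and that it completes when the limit is upper.
    """
    low, high = 1, upper
    while low != high:
        mid = (low + high) // 2
        if match_completes(mid):
            high = mid        # mid is enough; try something smaller
        else:
            low = mid + 1     # mid is too small
    return low


# Stand-in for a match that needs a limit of at least 37 to complete.
print(find_minimum_limit(lambda limit: limit >= 37))
```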
1267<P>
1268If JIT is being used, only the match limit is relevant. If DFA matching is
1269being used, neither limit is relevant, and this modifier is ignored (with a
1270warning message).
1271</P>
1272<P>
1273The <i>match_limit</i> number is a measure of the amount of backtracking
1274that takes place, and learning the minimum value can be instructive. For most
1275simple matches, the number is quite small, but for patterns with very large
1276numbers of matching possibilities, it can become large very quickly with
1277increasing length of subject string. The <i>recursion_limit</i> number is
1278a measure of how much stack (or, if PCRE2 is compiled with NO_RECURSE, how much
1279heap) memory is needed to complete the match attempt.
1280</P>
1281<br><b>
1282Showing MARK names
1283</b><br>
1284<P>
1285The <b>mark</b> modifier causes the names from backtracking control verbs that
1286are returned from calls to <b>pcre2_match()</b> to be displayed. If a mark is
1287returned for a match, non-match, or partial match, <b>pcre2test</b> shows it.
1288For a match, it is on a line by itself, tagged with "MK:". Otherwise, it
1289is added to the non-match message.
1290</P>
1291<br><b>
1292Showing memory usage
1293</b><br>
1294<P>
1295The <b>memory</b> modifier causes <b>pcre2test</b> to log all memory allocation
1296and freeing calls that occur during a match operation.
1297</P>
1298<br><b>
1299Setting a starting offset
1300</b><br>
1301<P>
1302The <b>offset</b> modifier sets an offset in the subject string at which
1303matching starts. Its value is a number of code units, not characters.
1304</P>
1305<br><b>
1306Setting an offset limit
1307</b><br>
1308<P>
1309The <b>offset_limit</b> modifier sets a limit for unanchored matches. If a match
1310cannot be found starting at or before this offset in the subject, a "no match"
1311return is given. The data value is a number of code units, not characters. When
1312this modifier is used, the <b>use_offset_limit</b> modifier must have been set
1313for the pattern; if not, an error is generated.
1314</P>
1315<br><b>
1316Setting the size of the output vector
1317</b><br>
1318<P>
1319The <b>ovector</b> modifier applies only to the subject line in which it
1320appears, though of course it can also be used to set a default in a
1321<b>#subject</b> command. It specifies the number of pairs of offsets that are
1322available for storing matching information. The default is 15.
1323</P>
1324<P>
1325A value of zero is useful when testing the POSIX API because it causes
1326<b>regexec()</b> to be called with a NULL capture vector. When not testing the
1327POSIX API, a value of zero is used to cause
1328<b>pcre2_match_data_create_from_pattern()</b> to be called, in order to create a
1329match block of exactly the right size for the pattern. (It is not possible to
1330create a match block with a zero-length ovector; there is always at least one
1331pair of offsets.)
1332</P>
1333<br><b>
1334Passing the subject as zero-terminated
1335</b><br>
1336<P>
1337By default, the subject string is passed to a native API matching function with
1338its correct length. In order to test the facility for passing a zero-terminated
1339string, the <b>zero_terminate</b> modifier is provided. It causes the length to
1340be passed as PCRE2_ZERO_TERMINATED. (When matching via the POSIX interface,
1341this modifier has no effect, as there is no facility for passing a length.)
1342</P>
1343<P>
1344When testing <b>pcre2_substitute()</b>, this modifier also has the effect of
1345passing the replacement string as zero-terminated.
1346</P>
1347<br><b>
1348Passing a NULL context
1349</b><br>
1350<P>
1351Normally, <b>pcre2test</b> passes a context block to <b>pcre2_match()</b>,
1352<b>pcre2_dfa_match()</b> or <b>pcre2_jit_match()</b>. If the <b>null_context</b>
1353modifier is set, however, NULL is passed. This is for testing that the matching
1354functions behave correctly in this case (they use default values). This
1355modifier cannot be used with the <b>find_limits</b> modifier or when testing the
1356substitution function.
1357</P>
<br><a name="SEC12" href="#TOC1">THE ALTERNATIVE MATCHING FUNCTION</a><br>
<P>
By default, <b>pcre2test</b> uses the standard PCRE2 matching function,
<b>pcre2_match()</b>, to match each subject line. PCRE2 also supports an
alternative matching function, <b>pcre2_dfa_match()</b>, which operates in a
different way, and has some restrictions. The differences between the two
functions are described in the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
documentation.
</P>
<P>
If the <b>dfa</b> modifier is set, the alternative matching function is used.
This function finds all possible matches at a given point in the subject. If,
however, the <b>dfa_shortest</b> modifier is set, processing stops after the
first match is found. This is always the shortest possible match.
</P>
<br><a name="SEC13" href="#TOC1">DEFAULT OUTPUT FROM pcre2test</a><br>
<P>
This section describes the output when the normal matching function,
<b>pcre2_match()</b>, is being used.
</P>
<P>
When a match succeeds, <b>pcre2test</b> outputs the list of captured substrings,
starting with number 0 for the string that matched the whole pattern.
Otherwise, it outputs "No match" when the return is PCRE2_ERROR_NOMATCH, or
"Partial match:" followed by the partially matching substring when the
return is PCRE2_ERROR_PARTIAL. (Note that this is the
entire substring that was inspected during the partial match; it may include
characters before the actual match start if a lookbehind assertion, \K, \b,
or \B was involved.)
</P>
<P>
For any other return, <b>pcre2test</b> outputs the PCRE2 negative error number
and a short descriptive phrase. If the error is a failed UTF string check, the
code unit offset of the start of the failing character is also output. Here is
an example of an interactive <b>pcre2test</b> run.
<pre>
  $ pcre2test
  PCRE2 version 9.00 2014-05-10

    re&#62; /^abc(\d+)/
  data&#62; abc123
   0: abc123
   1: 123
  data&#62; xyz
  No match
</pre>
Unset capturing substrings that are not followed by one that is set are not
shown by <b>pcre2test</b> unless the <b>allcaptures</b> modifier is specified. In
the following example, there are two capturing substrings, but when the first
data line is matched, the second, unset substring is not shown. An "internal"
unset substring is shown as "&#60;unset&#62;", as for the second data line.
<pre>
    re&#62; /(a)|(b)/
  data&#62; a
   0: a
   1: a
  data&#62; b
   0: b
   1: &#60;unset&#62;
   2: b
</pre>
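<P>
The rule for trailing and "internal" unset substrings can be sketched in
Python. The standard <b>re</b> module stands in for PCRE2 here, and
<b>format_captures()</b> is a hypothetical helper, not part of
<b>pcre2test</b>; the way the <b>allcaptures</b> case is rendered is likewise
an assumption.
</P>

```python
import re


def format_captures(match, allcaptures=False):
    """List captured substrings, hiding trailing unset groups by default.

    Hypothetical helper; Python's re module stands in for PCRE2.
    """
    groups = [match.group(i) for i in range(match.re.groups + 1)]
    # Highest-numbered group that took part in the match (group 0 always did).
    last_set = max(i for i, g in enumerate(groups) if g is not None)
    shown = groups if allcaptures else groups[: last_set + 1]
    return [f"{i:2d}: {'<unset>' if g is None else g}" for i, g in enumerate(shown)]


pattern = re.compile(r"(a)|(b)")
for subject in ("a", "b"):
    print("\n".join(format_captures(pattern.search(subject))))
```

<P>
For the two subjects above, this reproduces the lines shown in the preceding
example: group 2 is omitted for "a", and group 1 appears as an internal unset
substring for "b".
</P>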
If the strings contain any non-printing characters, they are output as \xhh
escapes if the value is less than 256 and UTF mode is not set. Otherwise they
are output as \x{hh...} escapes. See below for the definition of non-printing
characters. If the <b>/aftertext</b> modifier is set, the output for substring
0 is followed by the rest of the subject string, identified by "0+" like
this:
<pre>
    re&#62; /cat/aftertext
  data&#62; cataract
   0: cat
   0+ aract
</pre>
If global matching is requested, the results of successive matching attempts
are output in sequence, like this:
<pre>
    re&#62; /\Bi(\w\w)/g
  data&#62; Mississippi
   0: iss
   1: ss
   0: iss
   1: ss
   0: ipp
   1: pp
</pre>
"No match" is output only if the first match attempt fails. Here is an example
of a failure message (the offset 4 that is specified by the <b>offset</b>
modifier is past the end of the subject string):
<pre>
    re&#62; /xyz/
  data&#62; xyz\=offset=4
  Error -24 (bad offset value)
</PRE>
</P>
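<P>
The global matching example above can be approximated with Python's standard
<b>re</b> module, used here only as a stand-in for PCRE2. The helper
<b>global_matches()</b> is hypothetical, and Python's handling of empty
matches during iteration may differ from what <b>pcre2test</b> does.
</P>

```python
import re


def global_matches(pattern, subject):
    """Collect the capture lists of successive matches, one after another.

    Hypothetical helper; returns ["No match"] only when nothing matched at all.
    """
    lines = []
    for match in re.finditer(pattern, subject):
        for i in range(match.re.groups + 1):
            lines.append(f"{i:2d}: {match.group(i)}")
    return lines or ["No match"]


print("\n".join(global_matches(r"\Bi(\w\w)", "Mississippi")))
```

<P>
Each iteration resumes where the previous match ended, which is why "iss" is
found twice before "ipp".
</P>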
<P>
Note that whereas patterns can be continued over several lines (a plain "&#62;"
prompt is used for continuations), subject lines may not. However, newlines can
be included in a subject by means of the \n escape (or \r, \r\n, etc.,
depending on the newline sequence setting).
</P>
<br><a name="SEC14" href="#TOC1">OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION</a><br>
<P>
When the alternative matching function, <b>pcre2_dfa_match()</b>, is used, the
output consists of a list of all the matches that start at the first point in
the subject where there is at least one match. For example:
<pre>
    re&#62; /(tang|tangerine|tan)/
  data&#62; yellow tangerine\=dfa
   0: tangerine
   1: tang
   2: tan
</pre>
Using the normal matching function on this data finds only "tang". The
longest matching string is always given first (and numbered zero). After a
PCRE2_ERROR_PARTIAL return, the output is "Partial match:", followed by the
partially matching substring. Note that this is the entire substring that was
inspected during the partial match; it may include characters before the actual
match start if a lookbehind assertion, \b, or \B was involved. (\K is not
supported for DFA matching.)
</P>
<P>
If global matching is requested, the search for further matches resumes
at the end of the longest match. For example:
<pre>
    re&#62; /(tang|tangerine|tan)/g
  data&#62; yellow tangerine and tangy sultana\=dfa
   0: tangerine
   1: tang
   2: tan
   0: tang
   1: tan
   0: tan
</pre>
The alternative matching function does not support substring capture, so the
modifiers that are concerned with captured substrings are not relevant.
</P>
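<P>
The shape of this output can be imitated with a brute-force Python sketch.
This is not how <b>pcre2_dfa_match()</b> works internally: the standard
<b>re</b> module is used as a stand-in, every possible end point is simply
tried with <b>fullmatch()</b>, and both helper names are hypothetical. Because
the end of each candidate is treated as the end of the string, patterns that
use anchors or lookaheads may behave differently from PCRE2.
</P>

```python
import re


def all_matches_at_first_point(pattern, subject, start=0):
    """Return (position, matches) for the first position with any match.

    Matches are listed longest first. Brute-force illustration only.
    """
    compiled = re.compile(pattern)
    for pos in range(start, len(subject) + 1):
        # Try every candidate end point, from the longest to the shortest.
        ends = [end for end in range(len(subject), pos - 1, -1)
                if compiled.fullmatch(subject, pos, end)]
        if ends:
            return pos, [subject[pos:end] for end in ends]
    return None


def global_dfa_style(pattern, subject):
    """Repeat the search, resuming at the end of the longest match."""
    results, pos = [], 0
    while pos <= len(subject):
        found = all_matches_at_first_point(pattern, subject, pos)
        if found is None:
            break
        start, matches = found
        results.append(matches)
        # Step at least one character so an empty match cannot loop forever.
        pos = start + max(len(matches[0]), 1)
    return results


for group in global_dfa_style(r"(tang|tangerine|tan)",
                              "yellow tangerine and tangy sultana"):
    for number, text in enumerate(group):
        print(f"{number:2d}: {text}")
```

<P>
For the two subjects used in this section, the sketch lists the same strings
in the same order as the examples above.
</P>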
<br><a name="SEC15" href="#TOC1">RESTARTING AFTER A PARTIAL MATCH</a><br>
<P>
When the alternative matching function has given the PCRE2_ERROR_PARTIAL
return, indicating that the subject partially matched the pattern, you can
restart the match with additional subject data by means of the
<b>dfa_restart</b> modifier. For example:
<pre>
    re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data&#62; 23ja\=P,dfa
  Partial match: 23ja
  data&#62; n05\=dfa,dfa_restart
   0: n05
</pre>
For further information about partial matching, see the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
</P>
<br><a name="SEC16" href="#TOC1">CALLOUTS</a><br>
<P>
If the pattern contains any callout requests, <b>pcre2test</b>'s callout
function is called during matching unless <b>callout_none</b> is specified.
This works with both matching functions.
</P>
<P>
The callout function in <b>pcre2test</b> returns zero (carry on matching) by
default, but you can use a <b>callout_fail</b> modifier in a subject line (as
described above) to change this and other parameters of the callout.
</P>
<P>
Inserting callouts can be helpful when using <b>pcre2test</b> to check
complicated regular expressions. For further information about callouts, see
the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
</P>
<P>
The output for callouts with numerical arguments and those with string
arguments is slightly different.
</P>
<br><b>
Callouts with numerical arguments
</b><br>
<P>
By default, the callout function displays the callout number, the start and
current positions in the subject text at the callout time, and the next pattern
item to be tested. For example:
<pre>
  ---&#62;pqrabcdef
    0    ^  ^     \d
</pre>
This output indicates that callout number 0 occurred for a match attempt
starting at the fourth character of the subject string, when the pointer was at
the seventh character, and when the next pattern item was \d. Just
one circumflex is output if the start and current positions are the same, or if
the current position precedes the start position, which can happen if the
callout is in a lookbehind assertion.
</P>
<P>
Callouts numbered 255 are assumed to be automatic callouts, inserted as a
result of the <b>/auto_callout</b> pattern modifier. In this case, instead of
showing the callout number, the offset in the pattern, preceded by a plus, is
output. For example:
<pre>
    re&#62; /\d?[A-E]\*/auto_callout
  data&#62; E*
  ---&#62;E*
   +0 ^      \d?
   +3 ^      [A-E]
   +8 ^^     \*
  +10 ^ ^
   0: E*
</pre>
If a pattern contains (*MARK) items, an additional line is output whenever
a change of latest mark is passed to the callout function. For example:
<pre>
    re&#62; /a(*MARK:X)bc/auto_callout
  data&#62; abc
  ---&#62;abc
   +0 ^       a
   +1 ^^      (*MARK:X)
  +10 ^^      b
  Latest Mark: X
  +11 ^ ^     c
  +12 ^  ^
   0: abc
</pre>
The mark changes between matching "a" and "b", but stays the same for the rest
of the match, so nothing more is output. If, as a result of backtracking, the
mark reverts to being unset, the text "&#60;unset&#62;" is output.
</P>
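<P>
The position indicators in these examples can be reconstructed from the start
and current offsets. The Python sketch below is an illustration inferred from
the examples, not code taken from <b>pcre2test</b>. The function names are
hypothetical, and the trailing padding and next pattern item are left out.
It assumes that every subject character occupies one output column, and that
the single circumflex is placed at the start position when the current
position precedes it.
</P>

```python
def numbered_label(callout_number):
    """Label for an ordinary numbered callout, right-aligned in 3 columns."""
    return f"{callout_number:3d}"


def auto_label(pattern_offset):
    """Label for an automatic callout: pattern offset preceded by a plus."""
    return f"{pattern_offset:+3d}"


def callout_indicator(label, start, current):
    """Build the label and circumflex part of a callout line.

    Hypothetical helper. One circumflex is produced when the positions
    coincide or when current precedes start (column choice is an assumption).
    """
    if current <= start:
        carets = " " * start + "^"
    else:
        carets = " " * start + "^" + " " * (current - start - 1) + "^"
    return f"{label} {carets}"


print("--->pqrabcdef")
print(callout_indicator(numbered_label(0), 3, 6))
```

<P>
The label plus one separating space is as wide as the "---&#62;" prefix, so
the circumflexes line up under the corresponding subject characters.
</P>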
<br><b>
Callouts with string arguments
</b><br>
<P>
The output for a callout with a string argument is similar, except that instead
of outputting a callout number before the position indicators, the callout
string and its offset in the pattern string are output before the reflection of
the subject string, and the subject string is reflected for each callout. For
example:
<pre>
    re&#62; /^ab(?C'first')cd(?C"second")ef/
  data&#62; abcdefg
  Callout (7): 'first'
  ---&#62;abcdefg
      ^ ^         c
  Callout (20): "second"
  ---&#62;abcdefg
      ^   ^       e
   0: abcdef

</PRE>
</P>
<br><a name="SEC17" href="#TOC1">NON-PRINTING CHARACTERS</a><br>
<P>
When <b>pcre2test</b> is outputting text in the compiled version of a pattern,
bytes other than 32-126 are always treated as non-printing characters and are
therefore shown as hex escapes.
</P>
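<P>
The default rule (values 32-126 are printing), combined with the \xhh and
\x{hh...} forms described under the default output, can be sketched in Python
as follows. The function name is hypothetical, lower-case hex digits are an
assumption, and locale-dependent behaviour is not modelled.
</P>

```python
def escape_for_output(text, utf=False):
    """Replace non-printing characters with hex escapes.

    Hypothetical helper: characters 32-126 are copied unchanged; other values
    become \\xhh when below 256 outside UTF mode, and \\x{hh...} otherwise.
    """
    pieces = []
    for ch in text:
        value = ord(ch)
        if 32 <= value <= 126:
            pieces.append(ch)
        elif value < 256 and not utf:
            pieces.append(f"\\x{value:02x}")
        else:
            pieces.append(f"\\x{{{value:02x}}}")
    return "".join(pieces)


print(escape_for_output("a\tb"))
print(escape_for_output("caf\u00e9", utf=True))
```

<P>
Keeping the UTF setting as an explicit parameter makes it easy to compare the
two escape forms for the same character.
</P>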
<P>
When <b>pcre2test</b> is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been set for
the pattern (using the <b>/locale</b> modifier). In this case, the
<b>isprint()</b> function is used to distinguish printing and non-printing
characters.
<a name="saverestore"></a></P>
<br><a name="SEC18" href="#TOC1">SAVING AND RESTORING COMPILED PATTERNS</a><br>
<P>
It is possible to save compiled patterns on disc or elsewhere, and reload them
later, subject to a number of restrictions. JIT data cannot be saved. The host
on which the patterns are reloaded must be running the same version of PCRE2,
with the same code unit width, and must also have the same endianness, pointer
width and PCRE2_SIZE type. Before compiled patterns can be saved they must be
serialized, that is, converted to a stream of bytes. A single byte stream may
contain any number of compiled patterns, but they must all use the same
character tables. A single copy of the tables is included in the byte stream
(its size is 1088 bytes).
</P>
<P>
The functions whose names begin with <b>pcre2_serialize_</b> are used
for serializing and de-serializing. They are described in the
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
documentation. In this section we describe the features of <b>pcre2test</b> that
can be used to test these functions.
</P>
<P>
When a pattern with the <b>push</b> modifier is successfully compiled, it is
pushed onto a stack of compiled patterns, and <b>pcre2test</b> expects the next
line to contain a new pattern (or command) instead of a subject line. By contrast,
the <b>pushcopy</b> modifier causes a copy of the compiled pattern to be
stacked, leaving the original available for immediate matching. By using
<b>push</b> and/or <b>pushcopy</b>, a number of patterns can be compiled and
retained. These modifiers are incompatible with <b>posix</b>, and control
modifiers that act at match time are ignored (with a message) for the stacked
patterns. The <b>jitverify</b> modifier applies only at compile time.
</P>
<P>
The command
<pre>
  #save &#60;filename&#62;
</pre>
causes all the stacked patterns to be serialized and the result written to the
named file. Afterwards, all the stacked patterns are freed. The command
<pre>
  #load &#60;filename&#62;
</pre>
reads the data in the file, and then arranges for it to be de-serialized, with
the resulting compiled patterns added to the pattern stack. The pattern on the
top of the stack can be retrieved by the #pop command, which must be followed
by lines of subjects that are to be matched with the pattern, terminated as
usual by an empty line or end of file. This command may be followed by a
modifier list containing only
<a href="#controlmodifiers">control modifiers</a>
that act after a pattern has been compiled. In particular, <b>hex</b>,
<b>posix</b>, <b>posix_nosub</b>, <b>push</b>, and <b>pushcopy</b> are not allowed,
nor are any
<a href="#optionmodifiers">option-setting modifiers.</a>
The JIT modifiers are, however, permitted. Here is an example that saves and
reloads two patterns.
<pre>
  /abc/push
  /xyz/push
  #save tempfile
  #load tempfile
  #pop info
  xyz

  #pop jit,bincode
  abc
</pre>
If <b>jitverify</b> is used with #pop, it does not automatically imply
<b>jit</b>, which is different behaviour from when it is used on a pattern.
</P>
<P>
The #popcopy command is analogous to the <b>pushcopy</b> modifier in that it
makes current a copy of the topmost stack pattern, leaving the original still
on the stack.
</P>
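<P>
The stack discipline behind <b>push</b>, #save, #load, #pop and #popcopy can be
modelled with a small Python class. This is a toy sketch under stated
assumptions: the class and method names are hypothetical, Python's <b>re</b>
and <b>pickle</b> modules stand in for PCRE2 and its serialization functions,
and <b>pickle</b> stores the pattern source and recompiles it on loading rather
than saving a compiled form.
</P>

```python
import os
import pickle
import re
import tempfile


class PatternStack:
    """Toy model of the compiled-pattern stack (hypothetical, not PCRE2)."""

    def __init__(self):
        self.stack = []

    def push(self, pattern):
        # Compile and stack the pattern, as the push modifier does.
        self.stack.append(re.compile(pattern))

    def save(self, filename):
        # Serialize every stacked pattern, then free the stack.
        with open(filename, "wb") as handle:
            pickle.dump(self.stack, handle)
        self.stack.clear()

    def load(self, filename):
        # De-serialize and add the patterns to the stack in saved order.
        with open(filename, "rb") as handle:
            self.stack.extend(pickle.load(handle))

    def pop(self):
        # Retrieve (and remove) the pattern on top of the stack.
        return self.stack.pop()

    def popcopy(self):
        # Python patterns are immutable, so sharing stands in for a copy.
        return self.stack[-1]


stack = PatternStack()
stack.push("abc")
stack.push("xyz")
path = os.path.join(tempfile.mkdtemp(), "tempfile")
stack.save(path)
stack.load(path)
first = stack.pop()
second = stack.pop()
print(first.pattern, second.pattern)
```

<P>
As in the example above, the first pattern popped is the last one pushed, so
the subject "xyz" is paired with the first #pop and "abc" with the second.
</P>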
<br><a name="SEC19" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2jit</b>(3), <b>pcre2matching</b>(3), <b>pcre2partial</b>(3),
<b>pcre2pattern</b>(3), <b>pcre2serialize</b>(3).
</P>
<br><a name="SEC20" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
Last updated: 06 July 2016
<br>
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>