• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1<html>
2<head>
3<title>pcre2perform specification</title>
4</head>
5<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6<h1>pcre2perform man page</h1>
7<p>
8Return to the <a href="index.html">PCRE2 index page</a>.
9</p>
10<p>
11This page is part of the PCRE2 HTML documentation. It was generated
12automatically from the original man page. If there is any nonsense in it,
13please consult the man page, in case the conversion went wrong.
14<br>
15<ul>
16<li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a>
17<li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a>
18<li><a name="TOC3" href="#SEC3">STACK USAGE AT RUN TIME</a>
19<li><a name="TOC4" href="#SEC4">PROCESSING TIME</a>
20<li><a name="TOC5" href="#SEC5">AUTHOR</a>
21<li><a name="TOC6" href="#SEC6">REVISION</a>
22</ul>
23<br><a name="SEC1" href="#TOC1">PCRE2 PERFORMANCE</a><br>
24<P>
25Two aspects of performance are discussed below: memory usage and processing
26time. The way you express your pattern as a regular expression can affect both
27of them.
28</P>
29<br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br>
30<P>
31Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
32so that most simple patterns do not use much memory. However, there is one case
33where the memory usage of a compiled pattern can be unexpectedly large. If a
34parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
35a limited maximum, the whole subpattern is repeated in the compiled code. For
36example, the pattern
37<pre>
38  (abc|def){2,4}
39</pre>
40is compiled as if it were
41<pre>
42  (abc|def)(abc|def)((abc|def)(abc|def)?)?
43</pre>
44(Technical aside: It is done this way so that backtrack points within each of
45the repetitions can be independently maintained.)
46</P>
47<P>
48For regular expressions whose quantifiers use only small numbers, this is not
49usually a problem. However, if the numbers are large, and particularly if such
50repetitions are nested, the memory usage can become an embarrassment. For
51example, the very simple pattern
52<pre>
53  ((ab){1,1000}c){1,3}
54</pre>
55uses 51K bytes when compiled using the 8-bit library. When PCRE2 is compiled
56with its default internal pointer size of two bytes, the size limit on a
57compiled pattern is 64K code units in the 8-bit and 16-bit libraries, and this
58is reached with the above pattern if the outer repetition is increased from 3
59to 4. PCRE2 can be compiled to use larger internal pointers and thus handle
60larger compiled patterns, but it is better to try to rewrite your pattern to
61use less memory if you can.
62</P>
63<P>
64One way of reducing the memory usage for such patterns is to make use of
65PCRE2's
66<a href="pcre2pattern.html#subpatternsassubroutines">"subroutine"</a>
67facility. Re-writing the above pattern as
68<pre>
69  ((ab)(?2){0,999}c)(?1){0,2}
70</pre>
71reduces the memory requirements to 18K, and indeed it remains under 20K even
72with the outer repetition increased to 100. However, this pattern is not
73exactly equivalent, because the "subroutine" calls are treated as
74<a href="pcre2pattern.html#atomicgroup">atomic groups</a>
75into which there can be no backtracking if there is a subsequent matching
76failure. Therefore, PCRE2 cannot do this kind of rewriting automatically.
77Furthermore, there is a noticeable loss of speed when executing the modified
78pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
79speed is acceptable, this kind of rewriting will allow you to process patterns
80that PCRE2 cannot otherwise handle.
81</P>
82<br><a name="SEC3" href="#TOC1">STACK USAGE AT RUN TIME</a><br>
83<P>
84When <b>pcre2_match()</b> is used for matching, certain kinds of pattern can
85cause it to use large amounts of the process stack. In some environments the
86default process stack is quite small, and if it runs out the result is often
87SIGSEGV. Rewriting your pattern can often help. The
88<a href="pcre2stack.html"><b>pcre2stack</b></a>
89documentation discusses this issue in detail.
90</P>
91<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
92<P>
93Certain items in regular expression patterns are processed more efficiently
94than others. It is more efficient to use a character class like [aeiou] than a
95set of single-character alternatives such as (a|e|i|o|u). In general, the
96simplest construction that provides the required behaviour is usually the most
97efficient. Jeffrey Friedl's book contains a lot of useful general discussion
98about optimizing regular expressions for efficient performance. This document
99contains a few observations about PCRE2.
100</P>
101<P>
102Using Unicode character properties (the \p, \P, and \X escapes) is slow,
103because PCRE2 has to use a multi-stage table lookup whenever it needs a
104character's property. If you can find an alternative pattern that does not use
105character properties, it will probably be faster.
106</P>
107<P>
108By default, the escape sequences \b, \d, \s, and \w, and the POSIX
109character classes such as [:alpha:] do not use Unicode properties, partly for
110backwards compatibility, and partly for performance reasons. However, you can
111set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode
112character properties to be used. This can double the matching time for items
113such as \d, when matched with <b>pcre2_match()</b>; the performance loss is
114less with a DFA matching function, and in both cases there is not much
115difference for \b.
116</P>
117<P>
118When a pattern begins with .* not in atomic parentheses, nor in parentheses
119that are the subject of a backreference, and the PCRE2_DOTALL option is set,
120the pattern is implicitly anchored by PCRE2, since it can match only at the
121start of a subject string. If the pattern has multiple top-level branches, they
122must all be anchorable. The optimization can be disabled by the
123PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern
124contains (*PRUNE) or (*SKIP).
125</P>
126<P>
127If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the
128dot metacharacter does not then match a newline, and if the subject string
129contains newlines, the pattern may match from the character immediately
130following one of them instead of from the very start. For example, the pattern
131<pre>
132  .*second
133</pre>
134matches the subject "first\nand second" (where \n stands for a newline
135character), with the match starting at the seventh character. In order to do
136this, PCRE2 has to retry the match starting after every newline in the subject.
137</P>
138<P>
139If you are using such a pattern with subject strings that do not contain
140newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting
141the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2
142from having to scan along the subject looking for a newline to restart at.
143</P>
144<P>
145Beware of patterns that contain nested indefinite repeats. These can take a
146long time to run when applied to a string that does not match. Consider the
147pattern fragment
148<pre>
149  ^(a+)*
150</pre>
151This can match "aaaa" in 16 different ways, and this number increases very
152rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
153times, and for each of those cases other than 0 or 4, the + repeats can match
154different numbers of times.) When the remainder of the pattern is such that the
155entire match is going to fail, PCRE2 has in principle to try every possible
156variation, and this can take an extremely long time, even for relatively short
157strings.
158</P>
159<P>
160An optimization catches some of the more simple cases such as
161<pre>
162  (a+)*b
163</pre>
164where a literal character follows. Before embarking on the standard matching
165procedure, PCRE2 checks that there is a "b" later in the subject string, and if
166there is not, it fails the match immediately. However, when there is no
167following literal this optimization cannot be used. You can see the difference
168by comparing the behaviour of
169<pre>
170  (a+)*\d
171</pre>
172with the pattern above. The former gives a failure almost instantly when
173applied to a whole line of "a" characters, whereas the latter takes an
174appreciable time with strings longer than about 20 characters.
175</P>
176<P>
177In many cases, the solution to this kind of performance issue is to use an
178atomic group or a possessive quantifier.
179</P>
180<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
181<P>
182Philip Hazel
183<br>
184University Computing Service
185<br>
186Cambridge, England.
187<br>
188</P>
189<br><a name="SEC6" href="#TOC1">REVISION</a><br>
190<P>
191Last updated: 02 January 2015
192<br>
193Copyright &copy; 1997-2015 University of Cambridge.
194<br>
195<p>
196Return to the <a href="index.html">PCRE2 index page</a>.
197</p>
198