1<html>
2<head>
3<title>pcresyntax specification</title>
4</head>
5<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6<h1>pcresyntax man page</h1>
7<p>
8Return to the <a href="index.html">PCRE index page</a>.
9</p>
10<p>
11This page is part of the PCRE HTML documentation. It was generated automatically
12from the original man page. If there is any nonsense in it, please consult the
13man page, in case the conversion went wrong.
14<br>
15<ul>
16<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
17<li><a name="TOC2" href="#SEC2">QUOTING</a>
18<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
19<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
20<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
21<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
22<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
23<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
24<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
25<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
26<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
27<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
28<li><a name="TOC13" href="#SEC13">CAPTURING</a>
29<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
30<li><a name="TOC15" href="#SEC15">COMMENT</a>
31<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
32<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
33<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
34<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
35<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
36<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
37<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
38<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
39<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
40<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
41<li><a name="TOC26" href="#SEC26">AUTHOR</a>
42<li><a name="TOC27" href="#SEC27">REVISION</a>
43</ul>
44<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
45<P>
46The full syntax and semantics of the regular expressions that are supported by
47PCRE are described in the
48<a href="pcrepattern.html"><b>pcrepattern</b></a>
49documentation. This document contains a quick-reference summary of the syntax.
50</P>
51<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
52<P>
53<pre>
54  \x         where x is non-alphanumeric is a literal x
55  \Q...\E    treat enclosed characters as literal
56</PRE>
57</P>
58<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
59<P>
60<pre>
61  \a         alarm, that is, the BEL character (hex 07)
62  \cx        "control-x", where x is any ASCII character
63  \e         escape (hex 1B)
64  \f         form feed (hex 0C)
65  \n         newline (hex 0A)
66  \r         carriage return (hex 0D)
67  \t         tab (hex 09)
68  \0dd       character with octal code 0dd
69  \ddd       character with octal code ddd, or backreference
70  \o{ddd..}  character with octal code ddd..
71  \xhh       character with hex code hh
72  \x{hhh..}  character with hex code hhh..
73</pre>
74Note that \0dd is always an octal code, and that \8 and \9 are the literal
75characters "8" and "9".
76</P>
77<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
78<P>
79<pre>
80  .          any character except newline;
81               in dotall mode, any character whatsoever
82  \C         one data unit, even in UTF mode (best avoided)
83  \d         a decimal digit
84  \D         a character that is not a decimal digit
85  \h         a horizontal white space character
86  \H         a character that is not a horizontal white space character
87  \N         a character that is not a newline
88  \p{<i>xx</i>}     a character with the <i>xx</i> property
89  \P{<i>xx</i>}     a character without the <i>xx</i> property
90  \R         a newline sequence
91  \s         a white space character
92  \S         a character that is not a white space character
93  \v         a vertical white space character
94  \V         a character that is not a vertical white space character
95  \w         a "word" character
96  \W         a "non-word" character
97  \X         a Unicode extended grapheme cluster
98</pre>
99By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
100or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
101happening, \s and \w may also match characters with code points in the range
102128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
103is changed to use Unicode properties and they match many more characters.
104</P>
105<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
106<P>
107<pre>
108  C          Other
109  Cc         Control
110  Cf         Format
111  Cn         Unassigned
112  Co         Private use
113  Cs         Surrogate
114
115  L          Letter
116  Ll         Lower case letter
117  Lm         Modifier letter
118  Lo         Other letter
119  Lt         Title case letter
120  Lu         Upper case letter
121  L&         Ll, Lu, or Lt
122
123  M          Mark
124  Mc         Spacing mark
125  Me         Enclosing mark
126  Mn         Non-spacing mark
127
128  N          Number
129  Nd         Decimal number
130  Nl         Letter number
131  No         Other number
132
133  P          Punctuation
134  Pc         Connector punctuation
135  Pd         Dash punctuation
136  Pe         Close punctuation
137  Pf         Final punctuation
138  Pi         Initial punctuation
139  Po         Other punctuation
140  Ps         Open punctuation
141
142  S          Symbol
143  Sc         Currency symbol
144  Sk         Modifier symbol
145  Sm         Mathematical symbol
146  So         Other symbol
147
148  Z          Separator
149  Zl         Line separator
150  Zp         Paragraph separator
151  Zs         Space separator
152</PRE>
153</P>
154<br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
155<P>
156<pre>
157  Xan        Alphanumeric: union of properties L and N
158  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
159  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
160  Xuc        Universally-named character: one that can be
161               represented by a Universal Character Name
162  Xwd        Perl word: property Xan or underscore
163</pre>
164Perl and POSIX space are now the same. Perl added VT to its space character set
165at release 5.18 and PCRE changed at release 8.34.
166</P>
167<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
168<P>
169Arabic,
170Armenian,
171Avestan,
172Balinese,
173Bamum,
174Bassa_Vah,
175Batak,
176Bengali,
177Bopomofo,
178Brahmi,
179Braille,
180Buginese,
181Buhid,
182Canadian_Aboriginal,
183Carian,
184Caucasian_Albanian,
185Chakma,
186Cham,
187Cherokee,
188Common,
189Coptic,
190Cuneiform,
191Cypriot,
192Cyrillic,
193Deseret,
194Devanagari,
195Duployan,
196Egyptian_Hieroglyphs,
197Elbasan,
198Ethiopic,
199Georgian,
200Glagolitic,
201Gothic,
202Grantha,
203Greek,
204Gujarati,
205Gurmukhi,
206Han,
207Hangul,
208Hanunoo,
209Hebrew,
210Hiragana,
211Imperial_Aramaic,
212Inherited,
213Inscriptional_Pahlavi,
214Inscriptional_Parthian,
215Javanese,
216Kaithi,
217Kannada,
218Katakana,
219Kayah_Li,
220Kharoshthi,
221Khmer,
222Khojki,
223Khudawadi,
224Lao,
225Latin,
226Lepcha,
227Limbu,
228Linear_A,
229Linear_B,
230Lisu,
231Lycian,
232Lydian,
233Mahajani,
234Malayalam,
235Mandaic,
236Manichaean,
237Meetei_Mayek,
238Mende_Kikakui,
239Meroitic_Cursive,
240Meroitic_Hieroglyphs,
241Miao,
242Modi,
243Mongolian,
244Mro,
245Myanmar,
246Nabataean,
247New_Tai_Lue,
248Nko,
249Ogham,
250Ol_Chiki,
251Old_Italic,
252Old_North_Arabian,
253Old_Permic,
254Old_Persian,
255Old_South_Arabian,
256Old_Turkic,
257Oriya,
258Osmanya,
259Pahawh_Hmong,
260Palmyrene,
261Pau_Cin_Hau,
262Phags_Pa,
263Phoenician,
264Psalter_Pahlavi,
265Rejang,
266Runic,
267Samaritan,
268Saurashtra,
269Sharada,
270Shavian,
271Siddham,
272Sinhala,
273Sora_Sompeng,
274Sundanese,
275Syloti_Nagri,
276Syriac,
277Tagalog,
278Tagbanwa,
279Tai_Le,
280Tai_Tham,
281Tai_Viet,
282Takri,
283Tamil,
284Telugu,
285Thaana,
286Thai,
287Tibetan,
288Tifinagh,
289Tirhuta,
290Ugaritic,
291Vai,
292Warang_Citi,
293Yi.
294</P>
295<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
296<P>
297<pre>
298  [...]       positive character class
299  [^...]      negative character class
300  [x-y]       range (can be used for hex characters)
301  [[:xxx:]]   positive POSIX named set
302  [[:^xxx:]]  negative POSIX named set
303
304  alnum       alphanumeric
305  alpha       alphabetic
306  ascii       0-127
307  blank       space or tab
308  cntrl       control character
309  digit       decimal digit
310  graph       printing, excluding space
311  lower       lower case letter
312  print       printing, including space
313  punct       printing, excluding alphanumeric
314  space       white space
315  upper       upper case letter
316  word        same as \w
317  xdigit      hexadecimal digit
318</pre>
319In PCRE, POSIX character set names recognize only ASCII characters by default,
320but some of them use Unicode properties if PCRE_UCP is set. You can use
321\Q...\E inside a character class.
322</P>
323<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
324<P>
325<pre>
326  ?           0 or 1, greedy
327  ?+          0 or 1, possessive
328  ??          0 or 1, lazy
329  *           0 or more, greedy
330  *+          0 or more, possessive
331  *?          0 or more, lazy
332  +           1 or more, greedy
333  ++          1 or more, possessive
334  +?          1 or more, lazy
335  {n}         exactly n
336  {n,m}       at least n, no more than m, greedy
337  {n,m}+      at least n, no more than m, possessive
338  {n,m}?      at least n, no more than m, lazy
339  {n,}        n or more, greedy
340  {n,}+       n or more, possessive
341  {n,}?       n or more, lazy
342</PRE>
343</P>
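<P>
The difference between greedy and lazy quantifiers in the table above can be
sketched as follows. The example uses Python's <b>re</b> module rather than
PCRE itself, on the assumption that both engines treat these particular
quantifiers the same way; the subject string is made up for illustration.
</P>

```python
import re

subject = "aaa"

# Greedy: a+ takes as many characters as it can.
greedy = re.match(r"a+", subject).group()

# Lazy: a+? takes as few characters as it can.
lazy = re.match(r"a+?", subject).group()

# Bounded repetition: {1,2} takes at most two characters, greedily.
bounded = re.match(r"a{1,2}", subject).group()

print(greedy, lazy, bounded)
```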
344<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
345<P>
346<pre>
347  \b          word boundary
348  \B          not a word boundary
349  ^           start of subject
350               also after internal newline in multiline mode
351  \A          start of subject
352  $           end of subject
353               also before newline at end of subject
354               also before internal newline in multiline mode
355  \Z          end of subject
356               also before newline at end of subject
357  \z          end of subject
358  \G          first matching position in subject
359</PRE>
360</P>
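<P>
As a rough illustration of how multiline mode changes ^ and $, the sketch
below uses Python's <b>re</b> module as a stand-in for PCRE. It is limited to
cases where the two engines are assumed to agree; escapes such as \Z are left
out because that agreement is not assumed for them. The subject strings are
invented for the example.
</P>

```python
import re

# ^ : start of subject; in multiline mode also after an internal newline.
subject = "one\ntwo"
starts_default = [m.start() for m in re.finditer(r"^", subject)]
starts_multiline = [m.start() for m in re.finditer(r"(?m)^", subject)]

# $ : end of subject, or just before a newline that ends the subject.
trailing = "one\ntwo\n"
end_default = re.search(r"two$", trailing) is not None

# Before an internal newline, $ matches only in multiline mode.
internal_default = re.search(r"one$", trailing) is not None
internal_multiline = re.search(r"(?m)one$", trailing) is not None

print(starts_default, starts_multiline)
print(end_default, internal_default, internal_multiline)
```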
361<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
362<P>
363<pre>
364  \K          reset start of match
365</pre>
366\K is honoured in positive assertions, but ignored in negative ones.
367</P>
368<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
369<P>
370<pre>
371  expr|expr|expr...
372</PRE>
373</P>
374<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
375<P>
376<pre>
377  (...)           capturing group
378  (?&#60;name&#62;...)    named capturing group (Perl)
379  (?'name'...)    named capturing group (Perl)
380  (?P&#60;name&#62;...)   named capturing group (Python)
381  (?:...)         non-capturing group
382  (?|...)         non-capturing group; reset group numbers for
383                   capturing groups in each alternative
384</PRE>
385</P>
386<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
387<P>
388<pre>
389  (?&#62;...)         atomic, non-capturing group
390</PRE>
391</P>
392<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
393<P>
394<pre>
395  (?#....)        comment (not nestable)
396</PRE>
397</P>
398<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
399<P>
400<pre>
401  (?i)            caseless
402  (?J)            allow duplicate names
403  (?m)            multiline
404  (?s)            single line (dotall)
405  (?U)            default ungreedy (lazy)
406  (?x)            extended (ignore white space)
407  (?-...)         unset option(s)
408</pre>
409The following are recognized only at the very start of a pattern or after one
410of the newline or \R options with similar syntax. More than one of them may
411appear.
412<pre>
413  (*LIMIT_MATCH=d) set the match limit to d (decimal number)
414  (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
415  (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
416  (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
417  (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
418  (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
419  (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
420  (*UTF)          set appropriate UTF mode for the library in use
421  (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
422</pre>
423Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
424limits set by the caller of pcre_exec(), not increase them.
425</P>
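<P>
The effect of three of the inline options can be sketched with Python's
<b>re</b> module, used here as a stand-in for PCRE. Each option is placed at
the very start of the pattern, which is the one position both engines are
assumed to accept; the (*...) items listed above are PCRE-specific and are not
shown.
</P>

```python
import re

# (?i) caseless: letters match regardless of case.
caseless = re.search(r"(?i)pcre", "The PCRE library") is not None

# (?s) dotall: the dot also matches a newline.
dot_default = re.match(r"a.b", "a\nb") is not None
dot_dotall = re.match(r"(?s)a.b", "a\nb") is not None

# (?x) extended: white space in the pattern is ignored.
extended = re.match(r"(?x) a \d+ b", "a42b") is not None

print(caseless, dot_default, dot_dotall, extended)
```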
426<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
427<P>
428These are recognized only at the very start of the pattern or after option
429settings with a similar syntax.
430<pre>
431  (*CR)           carriage return only
432  (*LF)           linefeed only
433  (*CRLF)         carriage return followed by linefeed
434  (*ANYCRLF)      all three of the above
435  (*ANY)          any Unicode newline sequence
436</PRE>
437</P>
438<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
439<P>
440These are recognized only at the very start of the pattern or after option
441setting with a similar syntax.
442<pre>
443  (*BSR_ANYCRLF)  CR, LF, or CRLF
444  (*BSR_UNICODE)  any Unicode newline sequence
445</PRE>
446</P>
447<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
448<P>
449<pre>
450  (?=...)         positive look ahead
451  (?!...)         negative look ahead
452  (?&#60;=...)        positive look behind
453  (?&#60;!...)        negative look behind
454</pre>
455Each top-level branch of a look behind must be of a fixed length.
456</P>
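<P>
A minimal sketch of positive and negative lookahead follows, written with
Python's <b>re</b> module on the assumption that it handles these two
assertions the same way PCRE does. The subject string and currency codes are
invented for the example.
</P>

```python
import re

subject = "price: 100 USD, 200 EUR"

# Positive lookahead: digits that are followed by " USD".
# The assertion consumes nothing, so only the digits are returned.
usd = re.findall(r"\d+(?= USD)", subject)

# Negative lookahead: digits that are not followed by " USD".
# The \b stops the engine from settling for a shorter prefix such as "10".
not_usd = re.findall(r"\d+\b(?! USD)", subject)

print(usd, not_usd)
```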
457<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
458<P>
459<pre>
460  \n              reference by number (can be ambiguous)
461  \gn             reference by number
462  \g{n}           reference by number
463  \g{-n}          relative reference by number
464  \k&#60;name&#62;        reference by name (Perl)
465  \k'name'        reference by name (Perl)
466  \g{name}        reference by name (Perl)
467  \k{name}        reference by name (.NET)
468  (?P=name)       reference by name (Python)
469</PRE>
470</P>
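<P>
The plain numbered form of a backreference can be illustrated by a pattern
that finds a doubled word. The sketch uses Python's <b>re</b> module as a
stand-in for PCRE and shows only \1; the \g and \k forms in the table are not
assumed to be available in Python patterns.
</P>

```python
import re

# (\w+) captures a word as group 1; \1 then requires the same text again.
doubled = re.compile(r"\b(\w+) \1\b")

hit = doubled.search("this is is a test")
miss = doubled.search("this is a test")

print(hit.group(0), miss)
```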
471<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
472<P>
473<pre>
474  (?R)            recurse whole pattern
475  (?n)            call subpattern by absolute number
476  (?+n)           call subpattern by relative number
477  (?-n)           call subpattern by relative number
478  (?&name)        call subpattern by name (Perl)
479  (?P&#62;name)       call subpattern by name (Python)
480  \g&#60;name&#62;        call subpattern by name (Oniguruma)
481  \g'name'        call subpattern by name (Oniguruma)
482  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
483  \g'n'           call subpattern by absolute number (Oniguruma)
484  \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
485  \g'+n'          call subpattern by relative number (PCRE extension)
486  \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
487  \g'-n'          call subpattern by relative number (PCRE extension)
488</PRE>
489</P>
490<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
491<P>
492<pre>
493  (?(condition)yes-pattern)
494  (?(condition)yes-pattern|no-pattern)
495
496  (?(n)...        absolute reference condition
497  (?(+n)...       relative reference condition
498  (?(-n)...       relative reference condition
499  (?(&#60;name&#62;)...   named reference condition (Perl)
500  (?('name')...   named reference condition (Perl)
501  (?(name)...     named reference condition (PCRE)
502  (?(R)...        overall recursion condition
503  (?(Rn)...       specific group recursion condition
504  (?(R&name)...   specific recursion condition
505  (?(DEFINE)...   define subpattern for reference
506  (?(assert)...   assertion condition
507</PRE>
508</P>
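<P>
The absolute reference condition (?(n)...) can be sketched as below, again
using Python's <b>re</b> module on the assumption that it treats this form the
same way PCRE does. The pattern accepts a lower-case word that is either bare
or fully parenthesized; the test strings are invented for the example.
</P>

```python
import re

# Group 1 optionally matches an opening parenthesis.
# (?(1)\)) then demands a closing parenthesis only if group 1 matched.
pattern = re.compile(r"(\()?[a-z]+(?(1)\))")

results = {s: pattern.fullmatch(s) is not None
           for s in ["abc", "(abc)", "(abc", "abc)"]}

print(results)
```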
509<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
510<P>
511The following act as soon as they are reached:
512<pre>
513  (*ACCEPT)       force successful match
514  (*FAIL)         force backtrack; synonym (*F)
515  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
516</pre>
517The following act only when a subsequent match failure causes a backtrack to
518reach them. They all force a match failure, but they differ in what happens
519afterwards. Those that advance the start-of-match point do so only if the
520pattern is not anchored.
521<pre>
522  (*COMMIT)       overall failure, no advance of starting point
523  (*PRUNE)        advance to next starting character
524  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
525  (*SKIP)         advance to current matching position
526  (*SKIP:NAME)    advance to position corresponding to an earlier
527                  (*MARK:NAME); if not found, the (*SKIP) is ignored
528  (*THEN)         local failure, backtrack to next alternation
529  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
530</PRE>
531</P>
532<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
533<P>
534<pre>
535  (?C)      callout
536  (?Cn)     callout with data n
537</PRE>
538</P>
539<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
540<P>
541<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
542<b>pcrematching</b>(3), <b>pcre</b>(3).
543</P>
544<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
545<P>
546Philip Hazel
547<br>
548University Computing Service
549<br>
550Cambridge CB2 3QH, England.
551<br>
552</P>
553<br><a name="SEC27" href="#TOC1">REVISION</a><br>
554<P>
555Last updated: 08 January 2014
556<br>
557Copyright &copy; 1997-2014 University of Cambridge.
558<br>
559<p>
560Return to the <a href="index.html">PCRE index page</a>.
561</p>
562