<html>
<head>
<title>pcre2syntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2syntax man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">ESCAPED CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE2 are described in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
<P>
This table applies to ASCII and Unicode environments.
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII printing character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \0dd       character with octal code 0dd
  \ddd       character with octal code ddd, or backreference
  \o{ddd..}  character with octal code ddd..
  \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
  \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</pre>
Note that \0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section
<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation, where details of escape processing in EBCDIC environments are
also given.
</P>
<P>
When \x is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
it matches a literal "u".
</P>
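<P>
The hex, octal, and tab escapes in the table above can be tried out with a
short sketch. It assumes Python's standard re module as a stand-in: that is a
different engine from PCRE2, and it is used here only because it reads these
three escapes the same way. The brace forms \x{hhh..} and \o{ddd..} are left
out because Python's re module does not accept them.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE2 here. It is a different
# engine, but it reads \xhh, three-digit octal escapes, and \t the same way.
hex_escape = re.compile(r"\x41")      # hex 41 is "A"
octal_escape = re.compile(r"\101")    # octal 101 is also "A"
tab_escape = re.compile(r"a\tb")      # \t is a tab (hex 09)

hex_ok = hex_escape.fullmatch("A") is not None
octal_ok = octal_escape.fullmatch("A") is not None
tab_ok = tab_escape.fullmatch("a\tb") is not None
print(hex_ok, octal_ok, tab_ok)
```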
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one code unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster
</pre>
\C is dangerous because it may leave the current matching point in the middle
of a UTF-8 or UTF-16 character. The application can lock out the use of \C by
setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
with the use of \C permanently disabled.
</P>
<P>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
sequences is changed to use Unicode properties and they match many more
characters.
</P>
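<P>
The difference between ASCII-only and Unicode-property matching for \d can be
illustrated by analogy. The sketch below assumes Python's standard re module,
which is not PCRE2: its re.ASCII flag is used as a rough analogue of the PCRE2
default, and its default Unicode matching as a rough analogue of PCRE2_UCP.
The two engines are not guaranteed to agree on every character.
</P>

```python
import re

# Assumption: re.ASCII approximates PCRE2's default (ASCII-only \d), and
# Python's default Unicode matching approximates PCRE2_UCP. This is an
# analogy between two different engines, not PCRE2 itself.
arabic_indic = "\u0661\u0662\u0663"   # Arabic-Indic digits one, two, three

ascii_only = re.fullmatch(r"\d+", arabic_indic, re.ASCII)   # no match
unicode_aware = re.fullmatch(r"\d+", arabic_indic)          # match
ascii_digits = re.fullmatch(r"\d+", "123", re.ASCII)        # match

print(ascii_only is None, unicode_aware is not None, ascii_digits is not None)
```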
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
</pre>
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18.
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Ahom,
Anatolian_Hieroglyphs,
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bassa_Vah,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Caucasian_Albanian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Grantha,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hatran,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Khojki,
Khudawadi,
Lao,
Latin,
Lepcha,
Limbu,
Linear_A,
Linear_B,
Lisu,
Lycian,
Lydian,
Mahajani,
Malayalam,
Mandaic,
Manichaean,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Modi,
Mongolian,
Mro,
Multani,
Myanmar,
Nabataean,
New_Tai_Lue,
Nko,
Ogham,
Ol_Chiki,
Old_Hungarian,
Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Oriya,
Osmanya,
Pahawh_Hmong,
Palmyrene,
Pau_Cin_Hau,
Phags_Pa,
Phoenician,
Psalter_Pahlavi,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Siddham,
SignWriting,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Tirhuta,
Ugaritic,
Vai,
Warang_Citi,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE2, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE2_UCP is set. You can use
\Q...\E inside a character class.
</P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
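<P>
The greedy and lazy rows of the table can be compared with a small sketch. It
assumes Python's standard re module as a stand-in for PCRE2, since the two
engines treat these particular quantifiers alike. Possessive quantifiers are
left out because older Python versions do not support them.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE2; greedy, lazy, and
# {n,m} quantifiers behave the same way in both for this simple case.
subject = "<a><b>"

greedy = re.match(r"<.+>", subject).group()    # .+ takes as much as it can
lazy = re.match(r"<.+?>", subject).group()     # .+? takes as little as it can
bounded = re.fullmatch(r"\d{2,4}", "12345")    # five digits exceed {2,4}

print(greedy, lazy, bounded)
```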
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
                also after an internal newline in multiline mode
                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
  \A          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \Z          end of subject
                also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
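<P>
The effect of multiline mode on ^ and $, and the behaviour of \b, can be
sketched as follows. The example assumes Python's standard re module as a
stand-in for PCRE2, with re.MULTILINE playing the role of multiline mode. \Z
is deliberately not used, because Python gives it a different meaning from the
one in the table above.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE2, and re.MULTILINE
# corresponds to multiline mode. Only ^, $, and \b are exercised here.
subject = "one\ntwo\nthree"

# Without multiline mode, ^ and $ cannot match around the internal newlines.
no_multiline = re.findall(r"^\w+$", subject)
# In multiline mode, ^ and $ also match after and before internal newlines.
multiline = re.findall(r"^\w+$", subject, re.MULTILINE)
# $ also matches just before a newline at the very end of the subject.
before_final_newline = re.search(r"two$", "one two\n")
# \b keeps "cat" from matching inside "concat".
boundary = re.findall(r"\bcat\b", "cat concat cat.")

print(no_multiline, multiline, boundary)
```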
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</pre>
\K is honoured in positive assertions, but ignored in negative ones.
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&#60;name&#62;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&#60;name&#62;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
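<P>
How numbered, named, and non-capturing groups differ can be seen in a short
sketch. It assumes Python's standard re module as a stand-in for PCRE2, so
only the Python-style (?P&#60;name&#62;...) form of named group is used; the
Perl-style forms and (?|...) are not covered.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE2, so only the
# Python-style named group syntax from the table is used.
m = re.fullmatch(r"(\d+)-(?:\d+)-(?P<last>\d+)", "12-34-56")

groups = m.groups()       # the (?:...) group in the middle captures nothing
last = m.group("last")    # a named group is reachable by name
second = m.group(2)       # and also by its number

print(groups, last, second)
```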
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&#62;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following are recognized only at the very start of a pattern or after one
of the newline or \R options with similar syntax. More than one of them may
appear.
<pre>
  (*LIMIT_MATCH=d) set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
  (*NOTEMPTY)     set PCRE2_NOTEMPTY when matching
  (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
  (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
  (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
  (*NO_JIT)       disable JIT optimization
  (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
  (*UTF)          set appropriate UTF mode for the library in use
  (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre2_match(), not increase them. The application
can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
PCRE2_NEVER_UCP options, respectively, at compile time.
</P>
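<P>
The inline options (?i), (?s), and (?x) can be tried with a small sketch. It
assumes Python's standard re module as a stand-in for PCRE2, which accepts
these three letters with the same meaning when they appear at the start of the
pattern. The (*...) items in the second table are specific to PCRE2 and are
not covered.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE2; (?i), (?s), and (?x)
# placed at the start of a pattern mean the same thing in both.
caseless = re.fullmatch(r"(?i)hello", "HeLLo")        # case is ignored
dotall = re.fullmatch(r"(?s)a.b", "a\nb")             # dot matches newline
no_dotall = re.fullmatch(r"a.b", "a\nb")              # dot stops at newline
extended = re.fullmatch(r"(?x) \d+ \s* kg", "75 kg")  # pattern spaces ignored

print(caseless is not None, dotall is not None,
      no_dotall is None, extended is not None)
```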
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after option
setting with a similar syntax.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&#60;=...)        positive look behind
  (?&#60;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
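<P>
A positive look behind and a negative look ahead can be sketched as follows.
The example assumes Python's standard re module as a stand-in for PCRE2. Both
look behinds used here are of fixed length, which satisfies the rules of both
engines.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE2; both engines treat
# these fixed-length assertions the same way.
subject = "cost $40 or 30%"

# Digits that come directly after a dollar sign (positive look behind).
after_dollar = re.findall(r"(?<=\$)\d+", subject)
# Whole numbers that are not followed by a percent sign (negative look ahead).
not_percent = re.findall(r"\b\d+\b(?!%)", subject)

print(after_dollar, not_percent)
```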
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&#60;name&#62;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
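<P>
A common use of backreferences is finding a doubled word. The sketch below
assumes Python's standard re module as a stand-in for PCRE2, so it uses only
the \n and (?P=name) forms from the table; the \g and \k forms are not
accepted by Python's re module.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE2, so only the \n and
# (?P=name) backreference forms are shown.
subject = "this is is a test"

by_name = re.search(r"\b(?P<word>\w+) (?P=word)\b", subject)
by_number = re.search(r"\b(\w+) \1\b", subject)

print(by_name.group(), by_name.group("word"), by_number.group(1))
```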
<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P&#62;name)       call subpattern by name (Python)
  \g&#60;name&#62;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&#60;+n&#62;          call subpattern by relative number (PCRE2 extension)
  \g'+n'          call subpattern by relative number (PCRE2 extension)
  \g&#60;-n&#62;          call subpattern by relative number (PCRE2 extension)
  \g'-n'          call subpattern by relative number (PCRE2 extension)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)               absolute reference condition
  (?(+n)              relative reference condition
  (?(-n)              relative reference condition
  (?(&#60;name&#62;)          named reference condition (Perl)
  (?('name')          named reference condition (Perl)
  (?(name)            named reference condition (PCRE2)
  (?(R)               overall recursion condition
  (?(Rn)              specific group recursion condition
  (?(R&name)          specific recursion condition
  (?(DEFINE)          define subpattern for reference
  (?(VERSION[&#62;]=n.m)  test PCRE2 version
  (?(assert)          assertion condition
</PRE>
</P>
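<P>
The absolute reference condition (?(n)...) can be illustrated with a pattern
that requires a closing bracket only when an opening bracket was matched. The
sketch assumes Python's standard re module as a stand-in for PCRE2; the
recursion, DEFINE, VERSION, and assertion conditions are not covered.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE2; it supports the
# (?(n)yes-pattern) form of conditional shown in the table.
# Group 1 optionally matches "<"; ">" is required only if group 1 matched.
pattern = re.compile(r"^(<)?\w+(?(1)>)$")

bracketed = pattern.match("<abc>")       # both brackets present: match
bare = pattern.match("abc")              # no brackets: match
missing_close = pattern.match("<abc")    # opened but never closed: no match
stray_close = pattern.match("abc>")      # closed but never opened: no match

print(bracketed is not None, bare is not None,
      missing_close is None, stray_close is None)
```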
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act immediately they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)            callout (assumed number 0)
  (?Cn)           callout with numerical data n
  (?C"text")      callout with string data
</pre>
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 16 October 2015
<br>
Copyright &copy; 1997-2015 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
