---
layout: default
title: Regular Expressions
nav_order: 6
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Regular Expressions

## Overview

ICU's Regular Expressions package provides applications with the ability to
apply regular expression matching to Unicode string data. The regular expression
patterns and behavior are based on Perl's regular expressions. The C++
programming API for using ICU regular expressions is loosely based on the JDK
1.4 package java.util.regex, with some extensions to adapt it for use in a C++
environment. A plain C API is also provided.

The ICU Regular expression API supports operations including testing for a
pattern match, searching for a pattern match, and replacing matched text.
Capture groups allow subranges within an overall match to be identified, and to
appear within replacement text.

A Perl-inspired split() function that breaks a string into fields based on a
delimiter pattern is also included.

ICU Regular Expressions conform to version 19 of the
[Unicode Technical Standard \#18](http://www.unicode.org/reports/tr18/),
Unicode Regular Expressions, level 1, and in addition include Default Word
boundaries and Name Properties from level 2.

A detailed description of regular expression patterns and pattern matching
behavior is not included in this user guide. The best reference for this topic
is the book "Mastering Regular Expressions, 3rd Edition" by Jeffrey E. F.
Friedl, O'Reilly Media; 3rd edition (August 2006). Matching behavior can
sometimes be surprising, and this book is highly recommended for anyone doing
significant work with regular expressions.

## Using ICU Regular Expressions

The ICU C++ Regular Expression API includes two classes, `RegexPattern` and
`RegexMatcher`, that parallel the classes from the Java JDK package
java.util.regex. A `RegexPattern` represents a compiled regular expression while
`RegexMatcher` associates a `RegexPattern` and an input string to be matched, and
provides API for the various find, match and replace operations. In most cases,
however, only the class `RegexMatcher` is needed, and the existence of class
`RegexPattern` can safely be ignored.

The first step in using a regular expression is typically the creation of a
`RegexMatcher` object from the source (string) form of the regular expression.

`RegexMatcher` holds a pre-processed (compiled) pattern and a reference to an
input string to be matched, and provides API for the various find, match and
replace operations. A `RegexMatcher` can be reset and reused with new input, thus
avoiding object creation overhead when performing the same matching operation
repeatedly on different strings.

The following code will create a `RegexMatcher` from a string containing a regular
expression, and then perform a simple `find()` operation.

        #include <unicode/regex.h>
        UErrorCode status = U_ZERO_ERROR;
        ...
        RegexMatcher *matcher = new RegexMatcher("abc+", 0, status);
        if (U_FAILURE(status)) {
            // Handle any syntax errors in the regular expression here
            ...
        }
        UnicodeString stringToTest = "Find the abc in this string";
        matcher->reset(stringToTest);
        if (matcher->find()) {
            // We found a match.
            int32_t startOfMatch = matcher->start(status); // string index of start of match.
            ...
        }

Several types of matching tests are available:

| Function      | Description                                                    |
|:--------------|:---------------------------------------------------------------|
| `matches()`   | True if the pattern matches the entire string, from the start through to the last character.
| `lookingAt()` | True if the pattern matches at the start of the string. The match need not include the entire string.
| `find()`      | True if the pattern matches somewhere within the string.  Successive calls to find() will find additional matches, until the string is exhausted.

If additional text is to be checked for a match with the same pattern, there is
no need to create a new matcher object; just reuse the existing one.

        matcher->reset(anotherString);
        if (matcher->matches(status)) {
            // We have a match with the new string.
        }

Note that matching happens directly in the string supplied by the application.
This reduces the overhead when resetting a matcher to an absolute minimum – the
matcher need only store a reference to the new string – but it does mean that
the application must be careful not to modify or delete the string while the
matcher is holding a reference to the string.

After finding a match, additional information is available about the range of
the input matched, and the contents of any capture groups. Note that, for
simplicity, any error parameters have been omitted. See the [API
reference](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRegexMatcher.html) for
a complete description of the API.

| Function        | Description                                                    |
|:----------------|:---------------------------------------------------------------|
| `start()`       | Return the index of the start of the matched region in the input string.
| `end()`         | Return the index of the first character following the match.
| `group()`       | Return a UnicodeString containing the text that was matched.
| `start(n)`      | Return the index of the start of the text matched by the nth capture group.
| `end(n)`        | Return the index of the first character following the text matched by the nth capture group.
| `group(n)`      | Return a UnicodeString containing the text that was matched by the nth capture group.

## Regular Expression Metacharacters

| Character | outside of sets | \[inside sets\] |  Description |
|:----------|:----------------|:----------------|:-------------|
| \\a       | ✓               | ✓               | Match a BELL, \\u0007.
| \\A       | ✓               |                 | Match at the beginning of the input. Differs from ^ in that \\A will not match after a new line within the input.
| \\b       | ✓               |                 | Match if the current position is a word boundary. Boundaries occur at the transitions between word (\\w) and non-word (\\W) characters, with combining marks ignored. For better word boundaries, see [ICU Boundary Analysis](../boundaryanalysis/index.md).
| \\B       | ✓               |                 | Match if the current position is not a word boundary.
| \\cX      | ✓               | ✓               | Match a control-X character.
| \\d       | ✓               | ✓               | Match any character with the Unicode General Category of Nd (Number, Decimal Digit.)
| \\D       | ✓               | ✓               | Match any character that is not a decimal digit.
| \\e       | ✓               | ✓               | Match an ESCAPE, \\u001B.
| \\E       | ✓               | ✓               | Terminates a \\Q ...  \\E quoted sequence.
| \\f       | ✓               | ✓               | Match a FORM FEED, \\u000C.
| \\G       | ✓               | ✓               | Match if the current position is at the end of the previous match.
| \\h       | ✓               | ✓               | Match a Horizontal White Space character.  They are characters with Unicode General Category of Space_Separator plus the ASCII tab (\\u0009).
| \\H       | ✓               | ✓               | Match a non-Horizontal White Space character.
| \\k<name> | ✓               |                 | Named Capture Back Reference.
| \\n       | ✓               | ✓               | Match a LINE FEED, \\u000A.
| \\N{UNICODE CHARACTER NAME} | ✓  | ✓          | Match the named character.
| \\p{UNICODE PROPERTY NAME} | ✓   | ✓          | Match any character with the specified Unicode Property.
| \\P{UNICODE PROPERTY NAME} | ✓   | ✓          | Match any character not having the specified Unicode Property.
| \\Q       | ✓               | ✓               | Quotes all following characters until \\E.
| \\r       | ✓               | ✓               | Match a CARRIAGE RETURN, \\u000D.
| \\R       | ✓               |                 | Match a new line character, or the sequence CR LF. The new line characters are \\u000a, \\u000b, \\u000c, \\u000d, \\u0085, \\u2028, \\u2029.
| \\s       | ✓               | ✓               | Match a white space character. White space is defined as \[\\t\\n\\f\\r\\p{Z}\].
| \\S       | ✓               | ✓               | Match a non-white space character.
| \\t       | ✓               | ✓               | Match a HORIZONTAL TABULATION, \\u0009.
| \\uhhhh   | ✓               | ✓               | Match the character with the hex value hhhh.
| \\Uhhhhhhhh | ✓             | ✓               | Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \\U0010ffff.
| \\v       | ✓               | ✓               | Match a new line character. The new line characters are \\u000a, \\u000b, \\u000c, \\u000d, \\u0085, \\u2028, \\u2029. Does not match the new line sequence CR LF.
| \\V       | ✓               | ✓               | Match a non-new line character.
| \\w       | ✓               | ✓               | Match a word character. Word characters are \[\\p{Alphabetic}\\p{Mark}\\p{Decimal_Number}\\p{Connector_Punctuation}\\u200c\\u200d\].
| \\W       | ✓               | ✓               | Match a non-word character.
| \\x{hhhh} | ✓               | ✓               | Match the character with hex value hhhh. From one to six hex digits may be supplied.
| \\xhh     | ✓               | ✓               | Match the character with two digit hex value hh.
| \\X       | ✓               |                 | Match a [Grapheme Cluster](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries).
| \\Z       | ✓               |                 | Match if the current position is at the end of input, but before the final line terminator, if one exists.
| \\z       | ✓               |                 | Match if the current position is at the end of input.
| \\*n*     | ✓               |                 | Back Reference. Match whatever the nth capturing group matched. n must be a number ≥ 1 and ≤ the total number of capture groups in the pattern.
| \\0ooo    | ✓               | ✓               | Match an Octal character. 'ooo' is from one to three octal digits.  0377 is the largest allowed Octal character. The leading zero is required; it distinguishes Octal constants from back references.
| \[pattern\] | ✓             | ✓               | Match any one character from the set.
| .         | ✓               |                 | Match any character.
| ^         | ✓               |                 | Match at the beginning of a line.
| $         | ✓               |                 | Match at the end of a line. Line terminating characters are \\u000a, \\u000b, \\u000c, \\u000d, \\u0085, \\u2028, \\u2029 and the sequence \\u000d \\u000a.
| \\        | ✓               |                 | Quotes the following character. Characters that must be quoted to be treated as literals are \* ? + \[ ( ) { } ^ $ \| \\ .
| \\        |                 | ✓               | Quotes the following character. Characters that must be quoted to be treated as literals are \[ \] \\ Characters that may need to be quoted, depending on the context are - &

## Regular Expression Operators

| Operator      | Description
|:--------------|:---------------------------------------------------------------|
| `|`           | Alternation. A\|B matches either A or B.
| `*`           | Match 0 or more times. Match as many times as possible.
| `+`           | Match 1 or more times. Match as many times as possible.
| `?`           | Match zero or one times. Prefer one.
| `{n}`         | Match exactly n times
| `{n,}`        | Match at least n times. Match as many times as possible.
| `{n,m}`       | Match between n and m times. Match as many times as possible, but not more than m.
| `*?`          | Match 0 or more times. Match as few times as possible.
| `+?`          | Match 1 or more times. Match as few times as possible.
| `??`          | Match zero or one times. Prefer zero.
| `{n}?`        | Match exactly n times.
| `{n,}?`       | Match at least n times, but no more than required for an overall pattern match.
| `{n,m}?`      | Match between n and m times. Match as few times as possible, but not less than n.
| `*+`          | Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match).
| `++`          | Match 1 or more times. Possessive match.
| `?+`          | Match zero or one times. Possessive match.
| `{n}+`        | Match exactly n times.
| `{n,}+`       | Match at least n times. Possessive Match.
| `{n,m}+`      | Match between n and m times. Possessive Match.
| `( ...)`      | Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match.
| `(?: ...)`    | Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses.
| `(?> ...)`    | Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the "(?>".
| `(?# ...)`    | Free-format comment (?# comment ).
| `(?= ...)`    | Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position.
| `(?! ...)`    | Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position.
| `(?<= ...)`   | Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no \* or + operators.)
| `(?<! ...)`   | Negative Look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no \* or + operators.)
| `(?<name>...)` | Named capture group. The <angle brackets> are literal - they appear in the pattern.
| `(?ismwx-ismwx:...)`  | Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled.
| `(?ismwx-ismwx)`      | Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match.
201## Set Expressions (Character Classes)
202
203| Example       | Description
204|:--------------|:---------------------------------------------------------------|
205| `[abc]`                              | Match any of the characters a, b or c.
206| `[^abc]`                             | Negation - match any character except a, b or c.
207| `[A-M]`                              | Range - match any character from A to M. The characters to include are determined by Unicode code point ordering.
208| `[\u0000-\U0010ffff]`                | Range - match all characters.
209| `[\p{L}] [\p{Letter}] [\p{General_Category=Letter}]` | Characters with Unicode Category = Letter. All forms shown are equivalent.
210| `[\P{Letter}]`                       | Negated property. (Upper case \P) Match everything except Letters.
211| `[\p{numeric_value=9}]`              | Match all numbers with a numeric value of 9. Any Unicode Property may be used in set expressions.
212| `[\p{Letter}&&\p{script=cyrillic}]`  | Logical AND or intersection. Match the set of all Cyrillic letters.
213| `[\p{Letter}--\p{script=latin}]`     | Subtraction. Match all non-Latin letters.
214| `[[a-z][A-Z][0-9]]` `[a-zA-Z0-9]`    | Implicit Logical OR or Union of Sets. The examples match ASCII letters and digits. The two forms are equivalent.
215| `[:script=Greek:]`                   | Alternate POSIX-like syntax for properties. Equivalent to \\p{script=Greek}.
216
217## Case Insensitive Matching
218
219Case insensitive matching is specified by the UREGEX_CASE_INSENSITIVE flag
220during pattern compilation, or by the (?i) flag within a pattern itself. Unicode
221case insensitive matching is complicated by the fact that changing the case of a
222string may change its length. See <http://www.unicode.org/faq/casemap_charprop.html>
223for more information on Unicode casing operations.
224
225Full case-insensitive matching handles situations where the number of characters
226in equal string may differ. "fußball" compares equal "FUSSBALL", for example.
227
228Simple case insensitive matching operates one character at a time on the strings
229being compared. "fußball" does not compare equal to "FUSSBALL"
230
231For ICU regular expression matching,
232
233*   Anything from a regular expression pattern that looks like a literal string
234    (even of one character) will be matched against the text using full case
235    folding. The pattern string and the matched text may be of different
236    lengths.
237*   Any sequence that is composed by the matching engine from originally
238    separate parts of the pattern will not match with the composition boundary
239    within a case folding expansion of the text being matched.
240*   Matching of \[set expressions\] uses simple matching. A \[set\] will match
241    exactly one code point from the text.
242
243Examples:
244
245*   pattern "fussball" will match "fußball or "fussball".
246*   pattern "fu(s)(s)ball" or "fus{2}ball" will match "fussball" or "FUSSBALL"
247    but not "fußball.
248*   pattern "ß" will find occurrences of "ss" or "ß".
249*   pattern "s+" will not find "ß".
250
251With these rules, a match or capturing sub-match can never begin or end in the
252interior of an input text character that expanded when case folded.

## Flag Options

The following flags control various aspects of regular expression matching. The
flag values may be specified at the time that an expression is compiled into a
`RegexPattern` object, or they may be specified within the pattern itself using
the `(?ismwx-ismwx)` pattern options.

> :point_right: **Note**: The UREGEX_CANON_EQ option is not yet available.

| Flag (pattern) | Flag (API Constant) | Description
|:---------------|:--------------------|:-----------------|
|   | UREGEX_CANON_EQ         | If set, matching will take the canonical equivalence of characters into account. NOTE: this flag is not yet implemented.
| i | UREGEX_CASE_INSENSITIVE |  If set, matching will take place in a case-insensitive manner.
| x | UREGEX_COMMENTS         | If set, allow use of white space and #comments within patterns.
| s | UREGEX_DOTALL           | If set, a "." in a pattern will match a line terminator in the input text. By default, it will not. Note that a carriage-return / line-feed pair in text behave as a single line terminator, and will match a single "." in a RE pattern.  Line terminators are \\u000a, \\u000b, \\u000c, \\u000d, \\u0085, \\u2028, \\u2029 and the sequence \\u000d \\u000a.
| m | UREGEX_MULTILINE        | Control the behavior of "^" and "$" in a pattern. By default these will only match at the start and end, respectively, of the input text. If this flag is set, "^" and "$" will also match at the start and end of each line within the input text.
| w | UREGEX_UWORD            | Controls the behavior of \\b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.

## Using split()

ICU's split() function is similar in concept to Perl's – it will split a string
into fields, with a regular expression match defining the field delimiters and
the text between the delimiters being the field content itself.

Suppose you have a string of words separated by spaces:

        UnicodeString s = "dog cat giraffe";

This code will extract the individual words from the string:

        UErrorCode status = U_ZERO_ERROR;
        RegexMatcher m("\\s+", 0, status);
        const int maxWords = 10;
        UnicodeString words[maxWords];
        int numWords = m.split(s, words, maxWords, status);

After the split():

| Variable        |  Value       |
|:----------------|:-------------|
| `numWords`      | `3`
| `words[0]`      | `"dog"`
| `words[1]`      | `"cat"`
| `words[2]`      | `"giraffe"`
| `words[3 to 9]` | `""`

The field delimiters, the spaces from the original string, do not appear in the
output strings.

Note that, in this example, `words` is a local, or stack array of actual
UnicodeString objects. No heap allocation is involved in initializing this array
of empty strings (C++ is not Java!). Local UnicodeString arrays like this are a
very good fit for use with split(); after extracting the fields, any values that
need to be kept in some more permanent way can be copied to their ultimate
destination.

If the number of fields in a string being split exceeds the capacity of the
destination array, the last destination string will contain all of the input
string data that could not be split, including any embedded field delimiters.
This is similar to split() in Perl.

If the pattern expression contains capturing parentheses, the captured data ($1,
$2, etc.) will also be saved in the destination array, interspersed with the
fields themselves.

If, in the “dog cat giraffe” example, the pattern had been `"(\s+)"` instead of
`"\s+"`, `split()` would have produced five output strings instead of three.
`words[1]` and `words[3]` would have been the spaces.

## Find and Replace

Find and Replace operations are provided with the following functions.

| Function    | Description   |
|:------------|:--------------|
| `replaceFirst()` | Replace the first matching substring with the replacement text.  Performs the complete operation, including the `find()`.
| `replaceAll()` | Replace all matching substrings with the replacement text. Performs the complete operation, including all `find()`s.
| `appendReplacement()` | Incremental replace operation, intended to be used in a loop with `find()`.
| `appendTail()` | Final step in an incremental find & replace; appends any remaining text following the last replacement.

The replacement text for find-and-replace operations may contain references to
capture-group text from the find.

| Character | Description   |
|:----------|:--------------|
| `$n`      | The text of capture group 'n' will be substituted for `$n`. n must be >= 0 and not greater than the number of capture groups. An unescaped $ in replacement text that is not followed by a capture group specification, either a number or name, is an error.
| `${name}` | The text of named capture group will be substituted. The name must appear in the pattern.
| `\`       | Treat the following character as a literal, suppressing any special meaning. Backslash escaping in substitution text is only required for '$' and '\\', but may be used on any other character without bad effects.

**Sample code showing the use of appendReplacement()**

        #include <stdio.h>
        #include "unicode/regex.h"

        using namespace icu;  // ICU C++ classes live in the icu namespace

        int main() {
            UErrorCode status = U_ZERO_ERROR;
            RegexMatcher m(UnicodeString(" +"), 0, status);
            UnicodeString text("Here is some    text.");
            m.reset(text);

            UnicodeString result;
            UnicodeString replacement("_");
            int replacement_count = 0;

            while (m.find(status) && U_SUCCESS(status)) {
               m.appendReplacement(result, replacement, status);
               replacement_count++;
            }
            m.appendTail(result);

            char result_buf[100];
            result.extract(0, result.length(), result_buf, sizeof(result_buf));
            printf("The result of find & replace is \"%s\"\n", result_buf);
            printf("The number of replacements is %d\n", replacement_count);
        }

Running this sample produces the following:

        The result of find & replace is "Here_is_some_text."
        The number of replacements is 3

## Performance Tips

Some regular expression patterns can result in very slow match operations,
sometimes so slow that it will appear as though the match has gone into an
infinite loop. The problem is not unique to ICU - it affects any regular
expression implementation using a conventional nondeterministic finite automaton
(NFA) style matching engine. This section gives some suggestions on how to avoid
problems.

The performance problems tend to show up most commonly on failing matches - when
an input string does not match the regexp pattern. With a complex pattern
containing multiple \* or + (or similar) operators, the match engine will
tediously redistribute the input text between the different pattern terms, in a
doomed effort to find some combination that leads to a match (that doesn't
exist).

The running time for troublesome patterns is exponential with the length of the
input string. Every added character in the input doubles the (non)matching time.
It doesn't take a particularly long string for the projected running time to
exceed the age of the universe.

A simple pattern showing the problem is

  `(A+)+B`

matching against the string

  `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC`

The expression can't match - there is no 'B' in the input - but the engine is
too dumb to realize that, and will try every possible way of distributing
the input between the terms of the expression before failing.

Some suggestions:

*   Avoid, or examine carefully, any expressions with nested repeating
    quantifiers, like in the example above. They can often be recast in some
    other way. Any ambiguity in how input text could be distributed between the
    terms of the expression will cause problems.
*   Narrow every term in a pattern to match as small a set of characters as
    possible at each point. Fail as early as possible with bad input, rather
    than letting broad `.*` style terms eat intermediate input and relying on
    later terms in the expression to produce a failure.
*   Use possessive quantifiers when possible - `*+` instead of `*`, `++`
    instead of `+`.
    These operators prevent backtracking; the initial match of a `*+` quantified
    pattern is either used in its entirety as part of the complete match, or it
    is not used at all.

*   Follow or surround `*` or `+` expressions with terms that the repeated
    expression can not match. The idea is to have only one possible way to match
    the input, with no possibility of redistributing the input between adjacent
    terms of the pattern.

*   Avoid overly long and complex regular expressions. Just because it's
    possible to do something completely in one large expression doesn't mean
    that you should. Long expressions are difficult to understand and can be
    almost impossible to debug when they go wrong. It is no sin to break a
    parsing problem into pieces and to have some code involved in the
    process.

*   Set a time limit. ICU includes the ability to limit the time spent on a
    regular expression match. This is a good idea when running untested
    expressions from users of your application, or as a fail safe for servers or
    other processes that cannot afford to be hung.

Examples from actual bug reports:

The pattern

        (?:[A-Za-z0-9]+[._]?){1,}[A-Za-z0-9]+\@(?:(?:[A-Za-z0-9]+[-]?){1,}[A-Za-z0-9]+\.){1,}
                      ^^^^^^^^^^^

and the text

        abcdefghijklmnopq

cause what looks like an infinite loop.

The problem is in the region marked with `^^^^^^^^^^`. The `"[._]?"` term can be ignored, because
it need not match anything. `{1,}` is the same as `+`. So we effectively have
`(?:[A-Za-z0-9]+)+`, which is trouble.

The initial part of the expression can be recast as

`[A-Za-z0-9]+([._][A-Za-z0-9]+)*`

which matches the same thing. The nested `+` and `*` quantifiers do not cause a
problem because the `[._]` term is not optional and contains no characters that
overlap with `[A-Za-z0-9]`, leaving no ambiguity in how input characters can be
distributed among terms in the match.

A further note: this expression was intended to parse email addresses, and has a
number of other flaws. For common tasks like this there are libraries of freely
available regular expressions that have been well debugged. It's worth making a
quick search before writing a new expression.

> :construction: **TODO**: add more examples.

### Heap and Stack Usage

ICU keeps its match backtracking state on the heap. Because badly designed or
malicious patterns can result in matches that require large amounts of storage,
ICU sets a limit on heap usage by matches. The default is 8 MB; it can be
changed or removed via an API.

Because ICU does not use program recursion to maintain its backtracking state,
stack usage during matching operations is minimal, and does not increase with
complex patterns or large amounts of backtracking state. This is worth
mentioning only because excessive stack usage, resulting in crashed threads or
processes, can be a problem with some regular expression packages.

## Differences with Java Regular Expressions

*   ICU does not support UREGEX_CANON_EQ. See
    <https://unicode-org.atlassian.net/browse/ICU-9111>.
*   The behavior of \\cx (Control-X) differs from Java when x is outside the
    range A-Z. See <https://unicode-org.atlassian.net/browse/ICU-6068>.
*   Java allows quantifiers (\*, +, etc) on zero length tests. ICU does not.
    Occurrences of these in patterns are most likely unintended user errors, but
    it is an incompatibility with Java.
    <https://unicode-org.atlassian.net/browse/ICU-6080>
*   ICU recognizes all Unicode properties known to ICU, which is all of them.
    Java is restricted to just a few.
*   ICU case insensitive matching works with all Unicode characters, and, within
    string literals, does full Unicode matching (where matching strings may be
    different lengths.) Java does ASCII only by default, with Unicode aware case
    folding available as an option.
*   ICU has an extended syntax for set \[bracket\] expressions, including
    additional operators. Added for improved compatibility with the original ICU
    implementation, which was based on ICU UnicodeSet pattern syntax.
*   The property expression `\p{punct}` differs in what it matches. Java
    matches any of ```!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~```. From that list,
    ICU omits ```$+<=>^`|~```.
    ICU follows the recommendations from Unicode UTS-18,
    <http://www.unicode.org/reports/tr18/#Compatibility_Properties>. See also
    <https://unicode-org.atlassian.net/browse/ICU-20095>.