---
layout: default
title: UnicodeSet
nav_order: 5
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# UnicodeSet

## Overview

A UnicodeSet is an object that represents a set of Unicode characters or
character strings. The contents of that object can be specified either by
patterns or by building them programmatically.

Here are a few examples of sets:

| Pattern | Description |
|--------------|-------------------------------------------------------------|
| `[a-z]` | The lowercase letters a through z |
| `[abc123]` | The six characters a, b, c, 1, 2, and 3 |
| `[\p{Letter}]` | All characters with the Unicode General Category of Letter. |

### String Values

In addition to being a set of characters (that is, of Unicode code points),
a UnicodeSet may also contain string values. Conceptually, the UnicodeSet is
always a set of strings, not a set of characters, although in many common use
cases the strings are all of length one, which reduces to being a set of
characters.

This concept can be confusing when first encountered, probably because similar
set constructs from other environments
(e.g., character classes in most regular expression implementations)
can only contain characters.

Up to and including ICU 68, a UnicodeSet could not contain the empty string.
In Java, an exception was thrown. In C++, the empty string was silently ignored.

Starting with ICU 69 ([ICU-13702](https://unicode-org.atlassian.net/browse/ICU-13702)),
the empty string is supported as a set element;
however, it is ignored in matching functions such as `span(string)`.

## UnicodeSet Patterns

Patterns are a series of characters bounded by square brackets that contain
lists of characters and Unicode property sets. Lists are a sequence of
characters that may have ranges indicated by a '-' between two characters, as in
"a-z". The sequence specifies the range of all characters from the left to the
right, in Unicode order. For example, `[a c d-f m]` is equivalent to `[a c d e f m]`.
Whitespace can be freely used for clarity, so `[a c d-f m]` means the same
as `[acd-fm]`.
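
To make the expansion concrete, here is a minimal sketch in plain Java (`java.util` only; it is not ICU code). The class `ListPatternSketch` and its method `expand` are hypothetical names, and the sketch assumes a flat list containing only literal characters, '-' ranges, and ignorable whitespace. It does not handle properties, nested sets, quoting, escapes, or strings, and it uses `Character.isWhitespace` as a stand-in for ICU's own whitespace definition.

```java
import java.util.TreeSet;

public class ListPatternSketch {
    /** Expands a flat list such as "[a c d-f m]" into a sorted set of code points. */
    static TreeSet<Integer> expand(String pattern) {
        if (!pattern.startsWith("[") || !pattern.endsWith("]")) {
            throw new IllegalArgumentException("pattern must be bracketed: " + pattern);
        }
        // Whitespace is ignored, so drop it before scanning.
        int[] cps = pattern.substring(1, pattern.length() - 1)
                .codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .toArray();
        TreeSet<Integer> result = new TreeSet<>();
        int i = 0;
        while (i < cps.length) {
            int start = cps[i];
            if (start == '-') {
                throw new IllegalArgumentException("range has no start");
            }
            int end = start;
            // "x-y" stands for every code point from x through y, in Unicode order.
            if (i + 1 < cps.length && cps[i + 1] == '-') {
                if (i + 2 >= cps.length) {
                    throw new IllegalArgumentException("incomplete range");
                }
                end = cps[i + 2];
                if (end < start) {
                    throw new IllegalArgumentException("range end precedes range start");
                }
                i += 3;
            } else {
                i += 1;
            }
            for (int cp = start; cp <= end; cp++) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        StringBuilder members = new StringBuilder();
        for (int cp : expand("[a c d-f m]")) {
            members.appendCodePoint(cp);
        }
        System.out.println(members);                                          // prints acdefm
        System.out.println(expand("[a c d-f m]").equals(expand("[acd-fm]"))); // prints true
    }
}
```

Both spellings expand to the same six members, which is all that "equivalent" means here.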

Unicode property sets are specified by a Unicode property, such as `[:Letter:]`.
For a list of supported properties, see the [Properties](properties.md) chapter.
For details on the use of short vs. long property and property value names, see
the Property Values section below. The syntax for specifying the property names is an
extension of either POSIX or Perl syntax with the addition of "=value". For
example, you can match letters by using the POSIX syntax `[:Letter:]`, or by
using the Perl-style syntax `\p{Letter}`. The type can be omitted for the
Category and Script properties, but is required for other properties.

The table below shows the two kinds of syntax: POSIX and Perl style. It also
shows the negative form of each, which matches all characters that do not have
the given property value. For example, `[:^Letter:]` matches all characters that
are not in `[:Letter:]`.

|  | Positive | Negative |
|--------------------|------------------|-------------------|
| POSIX-style Syntax | `[:type=value:]` | `[:^type=value:]` |
| Perl-style Syntax  | `\p{type=value}` | `\P{type=value}`  |

These low-level lists and properties can then be freely combined with
the normal set operations (union, inverse, difference, and intersection):

|  | Example | Corresponding Method | Meaning |
|-------|-------------------------|----------------------|---------|
| A B | `[[:letter:] [:number:]]` | `A.addAll(B)` | To union two sets A and B, simply concatenate them. |
| A & B | `[[:letter:] & [a-z]]` | `A.retainAll(B)` | To intersect two sets A and B, use the '&' operator. |
| A - B | `[[:letter:] - [a-z]]` | `A.removeAll(B)` | To take the set-difference of two sets A and B, use the '-' operator. |
| [^A] | `[^a-z]` | `A.complement()` | To invert a set A, place a '^' immediately after the opening '['. Note that the complement only affects code points, not string values. In any other location, the '^' does not have a special meaning. |

### Precedence

The binary operators of union, intersection, and set-difference have equal
precedence and bind left-to-right. Thus the following are equivalent:

*   `[[:letter:] - [a-z] [:number:] & [\u0100-\u01FF]]`
*   `[[[[:letter:] - [a-z]] [:number:]] & [\u0100-\u01FF]]`

Another example is that the set `[[ace][bdf] - [abc][def]]` is **not**
the empty set, but instead the set `[def]`. That is because the syntax
corresponds to the following UnicodeSet operations:

1.  start with `[ace]`
2.  addAll `[bdf]` *-- we now have `[abcdef]`*
3.  removeAll `[abc]` *-- we now have `[def]`*
4.  addAll `[def]` *-- no effect, we still have `[def]`*

Evaluation order only matters when difference or intersection operations are
involved; a pattern that uses only unions gives the same result in any order.
To make sure that the '-' is the main operator, add brackets to group the
operations as desired, such as `[[ace][bdf] - [[abc][def]]]`.
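
The four steps above, and the grouped variant, can be replayed with ordinary `java.util` sets, whose `addAll` and `removeAll` methods happen to share their names with the UnicodeSet methods in the operations table. This is only an illustrative model under that assumption, not ICU code; `PrecedenceSketch`, `set`, `ungrouped`, and `grouped` are hypothetical names.

```java
import java.util.Set;
import java.util.TreeSet;

public class PrecedenceSketch {
    /** Builds a sorted set holding each character of the given string. */
    static Set<Character> set(String chars) {
        Set<Character> result = new TreeSet<>();
        for (char c : chars.toCharArray()) {
            result.add(c);
        }
        return result;
    }

    /** Models [[ace][bdf] - [abc][def]], applied strictly left to right. */
    static Set<Character> ungrouped() {
        Set<Character> result = set("ace"); // start with [ace]
        result.addAll(set("bdf"));          // union: now [abcdef]
        result.removeAll(set("abc"));       // difference: now [def]
        result.addAll(set("def"));          // union: no effect, still [def]
        return result;
    }

    /** Models [[ace][bdf] - [[abc][def]]], where the right-hand union is formed first. */
    static Set<Character> grouped() {
        Set<Character> left = set("ace");
        left.addAll(set("bdf"));            // [abcdef]
        Set<Character> right = set("abc");
        right.addAll(set("def"));           // [abcdef]
        left.removeAll(right);              // nothing is left over
        return left;
    }

    public static void main(String[] args) {
        System.out.println(ungrouped()); // prints [d, e, f]
        System.out.println(grouped());   // prints []
    }
}
```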

Another caveat with the '&' and '-' operators is that they operate between
**sets**. That is, they must be immediately preceded and immediately followed by
a set. For example, the pattern `[[:Lu:]-A]` is illegal, since it is
interpreted as the set `[:Lu:]` followed by the incomplete range `-A`. To specify
the set of uppercase letters except for 'A', enclose the 'A' in a set:
`[[:Lu:]-[A]]`.

### Examples

| Pattern | Description |
|------------------|-------------|
| `[a]` | The set containing 'a' |
| `[a-z]` | The set containing 'a' through 'z' and all letters in between, in Unicode order |
| `[^a-z]` | The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF |
| `[[pat1][pat2]]` | The union of sets specified by pat1 and pat2 |
| `[[pat1]& [pat2]]` | The intersection of sets specified by pat1 and pat2 |
| `[[pat1]- [pat2]]` | The asymmetric difference of sets specified by pat1 and pat2 |
| `[:Lu:]` | The set of characters belonging to the given Unicode category, as defined by `Character.getType()`; in this case, Unicode uppercase letters. The long form for this is `[:UppercaseLetter:]`. |
| `[:L:]` | The set of characters belonging to all Unicode categories starting with 'L', that is, `[[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]`. The long form for this is `[:Letter:]`. |

### String Values in Sets

String values are enclosed in {curly brackets}.

| Set expression | Description |
|------------------|-------------|
| `[abc{def}]` | A set containing four members, the single characters a, b and c, and the string “def” |
| `[{abc}{def}]` | A set containing two members, the string “abc” and the string “def”. |
| `[{a}{b}{c}]` `[abc]` | These two sets are equivalent. Each contains three items, the three individual characters a, b and c. A {string} containing a single character is equivalent to that same character specified in any other way. |
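
One way to picture these rows is to treat every member as a string, with single characters being strings of one code point. The following plain-Java sketch does that for flat patterns made only of literal characters and {strings}. It is an illustrative model, not ICU code; `StringMemberSketch` and `members` are hypothetical names, and ranges, whitespace, properties, and escapes are deliberately not handled.

```java
import java.util.Set;
import java.util.TreeSet;

public class StringMemberSketch {
    /** Lists the members of a flat pattern such as "[abc{def}]" as strings. */
    static Set<String> members(String pattern) {
        if (!pattern.startsWith("[") || !pattern.endsWith("]")) {
            throw new IllegalArgumentException("pattern must be bracketed: " + pattern);
        }
        String body = pattern.substring(1, pattern.length() - 1);
        Set<String> result = new TreeSet<>();
        int i = 0;
        while (i < body.length()) {
            if (body.charAt(i) == '{') {
                // Everything up to the closing brace is one string member.
                int close = body.indexOf('}', i);
                if (close < 0) {
                    throw new IllegalArgumentException("unclosed '{'");
                }
                result.add(body.substring(i + 1, close));
                i = close + 1;
            } else {
                // A bare character is a member of length one code point.
                int cp = body.codePointAt(i);
                result.add(new String(Character.toChars(cp)));
                i += Character.charCount(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(members("[abc{def}]"));                            // prints [a, b, c, def]
        System.out.println(members("[{abc}{def}]").size());                   // prints 2
        System.out.println(members("[{a}{b}{c}]").equals(members("[abc]")));  // prints true
    }
}
```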

### Character Quoting and Escaping in Unicode Set Patterns

#### Single Quote

Two single quotes represent a single quote, either inside or outside single
quotes.

Text within single quotes is not interpreted in any way (except for two adjacent
single quotes). It is taken as literal text (special characters become
non-special).

These quoting conventions for ICU UnicodeSets differ from those of regular
expression character set expressions. In regular expressions, single quotes have
no special meaning and are treated like any other literal character.

#### Backslash Escapes

Outside of single quotes, certain backslashed characters have special meaning:

| Escape | Meaning |
|------------|----------------------------------------|
| `\uhhhh` | Exactly 4 hex digits; h in [0-9A-Fa-f] |
| `\Uhhhhhhhh` | Exactly 8 hex digits |
| `\xhh` | 1-2 hex digits |
| `\ooo` | 1-3 octal digits; o in [0-7] |
| `\a` | U+0007 (BELL) |
| `\b` | U+0008 (BACKSPACE) |
| `\t` | U+0009 (HORIZONTAL TAB) |
| `\n` | U+000A (LINE FEED) |
| `\v` | U+000B (VERTICAL TAB) |
| `\f` | U+000C (FORM FEED) |
| `\r` | U+000D (CARRIAGE RETURN) |
| `\\` | U+005C (BACKSLASH) |

Anything else following a backslash is mapped to itself, except in an
environment where it is defined to have some special meaning. For example,
`\p{Lu}` is the set of uppercase letters in UnicodeSet.

Any character formed as the result of a backslash escape loses any special
meaning and is treated as a literal. In particular, note that `\u` and `\U`
escapes create literal characters. (In contrast, the Java compiler treats
Unicode escapes as just a way to represent arbitrary characters in an ASCII
source file, and any resulting characters are **not** tagged as literals.)
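
As a rough illustration of the table, the sketch below decodes these escapes in plain Java. It is a minimal model written for this section, not ICU's parser: `EscapeSketch`, `unescape`, and `digits` are hypothetical names, single-quote handling and context-specific escapes such as `\p` are left out, and the sketch only produces the decoded text without recording that the resulting characters are literals.

```java
public class EscapeSketch {
    /** Reads between min and max ASCII digits of the given radix at pos[0], advancing pos[0]. */
    private static int digits(String s, int[] pos, int min, int max, int radix) {
        int value = 0;
        int count = 0;
        while (count < max && pos[0] < s.length()) {
            char ch = s.charAt(pos[0]);
            int d = ch < 0x80 ? Character.digit(ch, radix) : -1;
            if (d < 0) {
                break;
            }
            value = value * radix + d;
            pos[0]++;
            count++;
        }
        if (count < min) {
            throw new IllegalArgumentException("too few digits in escape");
        }
        return value;
    }

    /** Decodes the escapes listed in the table; any other escaped character maps to itself. */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        int[] pos = {0};
        while (pos[0] < s.length()) {
            char c = s.charAt(pos[0]++);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (pos[0] >= s.length()) {
                throw new IllegalArgumentException("trailing backslash");
            }
            char e = s.charAt(pos[0]++);
            switch (e) {
                case 'u': out.appendCodePoint(digits(s, pos, 4, 4, 16)); break; // exactly 4 hex digits
                case 'U': out.appendCodePoint(digits(s, pos, 8, 8, 16)); break; // exactly 8 hex digits
                case 'x': out.appendCodePoint(digits(s, pos, 1, 2, 16)); break; // 1-2 hex digits
                case 'a': out.append((char) 0x07); break;                       // BELL
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'v': out.append((char) 0x0B); break;                       // VERTICAL TAB
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                default:
                    if (e >= '0' && e <= '7') {
                        pos[0]--; // re-read the first octal digit
                        out.appendCodePoint(digits(s, pos, 1, 3, 8));
                    } else {
                        out.append(e); // mapped to itself, including an escaped backslash
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Four escapes: 4-digit hex, 2-digit hex, 3-digit octal, and an escaped '-'.
        System.out.println(unescape("\\u0041\\x42\\103\\-")); // prints ABC-
    }
}
```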

#### Whitespace

Whitespace (as defined by our API) is ignored unless it is quoted or
backslashed.

> :point_right: **Note**: *The rules for quoting and whitespace handling are common to most ICU APIs that
process rule or expression strings, including UnicodeSet, Transliteration and
Break Iterators.*

> :point_right: **Note**: *ICU Regular Expression set expressions have a different (but similar) syntax,
and a different set of recognized backslash escapes. \[Sets\] in ICU Regular
Expressions follow the conventions from Perl and Java regular expressions rather
than the pattern syntax from ICU UnicodeSet.*

## Using a UnicodeSet

For best performance, once the set contents are complete, `freeze()` the set to
make it immutable and to speed up `contains()` and `span()` operations (for which it
builds a small additional data structure).

The most basic operation is `contains(code point)` or, if relevant,
`contains(string)`.

For splitting and partitioning strings, it is simpler and faster to use `span()`
and `spanBack()` than to iterate over code points and call `contains()`. In
Java, there is also a class UnicodeSetSpanner for somewhat higher-level
operations. See also the “Lookup” section of the [Properties](properties.md)
chapter.
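
For comparison, the hand-written loop that a span operation replaces looks roughly like the sketch below. It is plain Java and not ICU code: `SpanSketch` and `spanWhile` are hypothetical names, and an `IntPredicate` stands in for a set's `contains(code point)` test. The sketch models only sets of code points; a real span also has to account for string elements, which is part of why the built-in operation is preferable.

```java
import java.util.function.IntPredicate;

public class SpanSketch {
    /**
     * Returns the index just past the longest run of code points, beginning at
     * start, that all satisfy the given membership test.
     */
    static int spanWhile(CharSequence s, int start, IntPredicate contains) {
        int i = start;
        while (i < s.length()) {
            int cp = Character.codePointAt(s, i);
            if (!contains.test(cp)) {
                break;
            }
            i += Character.charCount(cp); // step over both halves of a surrogate pair
        }
        return i;
    }

    public static void main(String[] args) {
        IntPredicate digits = cp -> cp >= '0' && cp <= '9'; // stands in for the set [0-9]
        String text = "2024-06-01";
        int end = spanWhile(text, 0, digits);
        System.out.println(text.substring(0, end)); // prints 2024
    }
}
```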

## Programmatically Building UnicodeSets

ICU users can programmatically build a UnicodeSet by adding or removing ranges
of characters or by using the retain (intersection), remove (difference), and
add (union) operations.
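
The same style of construction can be tried out with `java.util.BitSet`, used here purely as a stand-in with one bit per code point. This is an illustrative model rather than ICU code: `BuildSketch`, `range`, and `build` are hypothetical names, and the `BitSet` methods `or`, `andNot`, and `and` play the roles of the add, remove, and retain operations described above.

```java
import java.util.BitSet;

public class BuildSketch {
    /** Returns a bit set with every index from start through endInclusive set. */
    static BitSet range(int start, int endInclusive) {
        BitSet bits = new BitSet();
        bits.set(start, endInclusive + 1); // BitSet.set(from, to) excludes "to"
        return bits;
    }

    /** Builds the lowercase ASCII consonants plus the digits 0-4, step by step. */
    static BitSet build() {
        BitSet set = range('a', 'z');      // add a range of characters
        for (char vowel : "aeiou".toCharArray()) {
            set.clear(vowel);              // remove single characters
        }
        set.or(range('0', '9'));           // add (union) another set
        set.andNot(range('5', '9'));       // remove (difference) a set
        set.and(range(0x00, 0x7F));        // retain (intersection) with ASCII
        return set;
    }

    public static void main(String[] args) {
        BitSet set = build();
        System.out.println(set.cardinality());                                      // prints 26
        System.out.println(set.get('b') + " " + set.get('e') + " " + set.get('7')); // prints true false false
    }
}
```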

## Property Values

The following property value variants are recognized:

| Format | Description | Example |
|--------|-------------|---------|
| short | omits the type (only allowed with the Category and Script properties, where the omission is unambiguous) | Lu |
| medium | uses an abbreviated type and value | gc=Lu |
| long | uses a full type and value | General_Category=Uppercase_Letter |

If the type or value is omitted, then the equals sign is also omitted. The short
style is only used for Category and Script properties because these properties
are very common and their omission is unambiguous.

In actual practice, you can mix type names and values that are omitted,
abbreviated, or full. For example, for Category=Unassigned you could use, in
addition to the forms shown in the table, `\p{gc=Unassigned}`, `\p{Category=Cn}`, or
`\p{Unassigned}`.

When these are processed, case and whitespace are ignored so you may use them
for clarity, if desired. For example, `\p{Category = Uppercase Letter}` or
`\p{Category = uppercase letter}`.
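
A simple way to think about this loose comparison is to normalize both names before comparing them. The sketch below is a plain-Java illustration, not ICU's implementation; `LooseNameSketch`, `normalize`, and `sameName` are hypothetical names. Ignoring case and whitespace comes from the text above, while also skipping '_' and '-' is an assumption borrowed from Unicode's general loose-matching rule for property names (UAX44-LM3).

```java
public class LooseNameSketch {
    /** Lowercases a name and drops whitespace, '_' and '-' so that spellings compare equal. */
    static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isWhitespace(c) || c == '_' || c == '-') {
                continue; // ignored for comparison purposes
            }
            sb.append(Character.toLowerCase(c));
        }
        return sb.toString();
    }

    /** Returns true if the two names are the same under the loose comparison. */
    static boolean sameName(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        System.out.println(sameName("Uppercase Letter", "uppercase_letter")); // prints true
        System.out.println(sameName("Lu", "Ll"));                             // prints false
    }
}
```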

For a list of supported properties, see the [Properties](properties.md) chapter.

## Getting UnicodeSet from Script

ICU can build a UnicodeSet from script codes. The example below generates a
pattern from all the scripts that are associated with a locale and then
constructs a UnicodeSet from the generated pattern.

**In C++:**

    // Needs <stdio.h>, unicode/uscript.h, unicode/unistr.h and unicode/uniset.h,
    // plus `using namespace icu;` (or the icu:: prefix on the ICU classes).
    UErrorCode err = U_ZERO_ERROR;
    const int32_t capacity = 10;
    UChar32 c = 0x3096;  // HIRAGANA LETTER SMALL KE
    UScriptCode script[capacity] = {USCRIPT_INVALID_CODE};
    // Fill script[] with the script codes associated with the locale "ja".
    int32_t num = uscript_getCode("ja", script, capacity, &err);
    printf("Number of script codes associated: %d\n", num);
    // Build a pattern of the form "[[:Xxxx:][:Yyyy:]]".
    // Adjacent sets are unioned simply by concatenating them.
    UnicodeString pattern("[", -1, US_INV);
    for (int32_t j = 0; j < num; j++) {
        const char *shortname = uscript_getShortName(script[j]);
        pattern.append(UnicodeString("[:", -1, US_INV));
        pattern.append(UnicodeString(shortname, -1, US_INV));
        pattern.append(UnicodeString(":]", -1, US_INV));
    }
    pattern.append(UnicodeString("]", -1, US_INV));
    UnicodeSet cnvSet(pattern, err);
    printf("%d\n", cnvSet.size());
    printf("%d\n", cnvSet.contains(c));

**In Java:**

    ULocale ul = new ULocale("ja");
    int[] script = UScript.getCode(ul);
    // Build a pattern of the form "[[:Xxxx:][:Yyyy:]]".
    // Adjacent sets are unioned simply by concatenating them.
    StringBuilder sb = new StringBuilder("[");
    for (int i = 0; i < script.length; i++) {
        sb.append("[:").append(UScript.getShortName(script[i])).append(":]");
    }
    sb.append("]");
    String pattern = sb.toString();
    System.out.println(pattern);
    UnicodeSet ucs = new UnicodeSet(pattern);
    System.out.println(ucs.size());
    System.out.println(ucs.contains(0x3096));
