• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1---
2layout: default
3title: String Search
4nav_order: 4
5parent: Collation
6---
7<!--
8© 2020 and later: Unicode, Inc. and others.
9License & terms of use: http://www.unicode.org/copyright.html
10-->
11
12# String Search Service
13{: .no_toc }
14
15## Contents
16{: .no_toc .text-delta }
17
181. TOC
19{:toc}
20
21---
22
23## Overview
24
25String searching, also known as string matching, is a very important subject in
26the wider domain of text processing and analysis. Many software applications use
27the basic string search algorithm in the implementations on most operating
28systems. With the popularity of Internet, the quantity of available data from
29different parts of the world has increased dramatically within a short time.
30Therefore, a string search algorithm that is language-aware has become more
31important. A bitwise match that uses the `u_strstr` (C), `UnicodeString::indexOf`
32(C++) or `String.indexOf` (Java) APIs will not yield the correct result specific
33to a particular language's requirements. The APIs will not yield the correct
34result because all the issues that are important to language-sensitive collation
35are also applicable to text searching. The following lists those issues which
36are applicable to text searching:
37
381.  Accented letters\
39    In English, accents are treated as minor variations of a letter. In French,
40    accented letters have much more significance as they can actually change the
41    meaning of a word. Very often, an accented letter is actually a distinct
42    letter. For example, letter 'å' (\\u00e5) may be just a letter 'a' with an
43    accent symbol to English speakers. However, it is actually a distinct letter
44    in Danish; in Danish searching for 'a' should generally not match 'å' and
45    vice versa. In some cases, such as in traditional German, an accented letter
46    is short-hand for something longer. In sorting, an 'ä' (\\u00e4) is treated
47    as 'ae'. Note that primary- and secondary-level distinctions for *searching*
48    may not be the same as those for sorting; in ICU, many languages provide a
49    special "search" collator with the appropriate level settings for search.
50
512.  Conjoined letters\
52    Special handling is required when a single letter is treated equivalent to
53    two distinct letters and vice versa. For example, in German, the letter 'ß'
54    (\\u00df) is treated as 'ss' in sorting. Also, in most languages, 'æ'
55    (\\u00e6) is considered equivalent to the letter 'a' followed by the letter
56    'e'. Also, the ligatures are often treated as distinct letters by
57    themselves. For example, 'ch' is treated as a distinct letter between the
58    letter 'c' and the letter 'd' in Spanish.
59
603.  Ignorable punctuation\
61    As in collation, it is important that the user is able to choose to ignore
62    punctuation symbols while the user searches for a pattern in the string. For
63    example, a user may search for "blackbird" and want to include entries such
64    as "black-bird".
65
66## ICU String Search Model
67
68The ICU string search service provides similar APIs to the other text iterating
69services. Allowing users to specify the starting position and direction within
70the text string to be searched. For more information, please see the [Boundary
71Analysis](../boundaryanalysis/index.md) chapter. The user can locate one or all
72occurrences of a pattern in a string. For a given collator, a pattern match is
73located at the offsets <start, end> in a string if the collator finds that the
74sub-string between the start and end is equal.
75
76The string search service supports two different types of canonical match
77behavior.
78
79Let S' be the sub-string of a text string S between the offsets start and end
80<start, end>.
81A pattern string P matches a text string S at the offsets <start, end> if
82
831.  option 1. P matches some canonical equivalent string of S'. Suppose the
84    collator used for searching has a tertiary collation strength, all accents
85    are non-ignorable. If the pattern "a\\u0300" is searched in the target text
86    "a\\u0325\\u0300", a match will be found, since the target text is
87    canonically equivalent to "a\\u0300\\u0325"
88
892.  option 2. P matches S' and if P starts or ends with a combining mark, there
90    exists no non-ignorable combining mark before or after S' in S respectively.
91    Following the example above, the pattern "a\\u0300" will not find a match in
92    "a\\u0325\\u0300", since there exists a non-ignorable accent '\\u0325' in
93    the middle of 'a' and '\\u0300'. Even with a target text of
94    "a\\u0300\\u0325" a match will not be found because of the non-ignorable
95    trailing accent \\u0325.
96
97One restriction is to be noted for option 1. Currently there are no composite
98characters that consists of a character with combining class greater than 0
99before a character with combining class equals to 0. However, if such a
100character exists in the future, the string search service may not work correctly
101with option 1 when such characters are encountered.
102
103Furthermore, option 1 could generate more than one "encompassing" matches. For
104example, in Danish, 'å' (\\u00e5) and 'aa' are considered equivalent. So the
105pattern "baad" will match "a--båd--man" (a--b\\u00e5d--man) at the start offset
106at 3 and the end offset 5. However, the start offset can be 1 or 2 and the end
107offset can be 6 or 7, because "-" (hyphen) is ignorable for a certain collation.
108The ICU implementation always returns the offsets of the shortest match
109sub-string. To be more exact, the string search added a "tightest" match
110condition. In other words, if the pattern matches at offsets <start, end> as
111well as offsets <start + 1, end>, the offsets <start, end> are not considered a
112match. Likewise, if the pattern matches at offsets <start, end> as well as
113offsets <start, end + 1>, the offsets <start, end + 1> are not considered a
114match. Therefore, when the option 1 is chosen in Danish collator, 'baad' will
115match in the string "a--båd--man" (a--b\\u00e5d--man) ONLY at offsets <3,5>.
116
117The default behavior is that described in option 2 above. To obtain the behavior
118described in option 1, you must set the normalization mode to ON in the collator
119used for search.
120
121> :point_right: **Note**: The "tightest match" behavior described above
122> is defined as "Minimal Match" in
123> [Section 8 Searching and Matching in UTS #10 Unicode Collation Collation Algorithm](http://www.unicode.org/reports/tr10/#Searching).
124> "Medial Match" and "Maximal Match" are not yet implemented by the ICU String Search service.
125
126The string search service also supports two varieties of “asymmetric search” as
127described in *[Section 8.2 Asymmetric Search in UTS #10 Unicode Collation
128Collation Algorithm](http://www.unicode.org/reports/tr10/#Asymmetric_Search)*.
129With asymmetric search, for example, unaccented characters are treated as
130“wildcards” that may match any character with the same primary weight, this
131behavior can be applied just to characters in the search pattern, or to
132characters in both the search pattern and the searched text. With the former
133behavior, searching with French behavior for 'e' might match 'e', 'è', 'é', 'ê',
134and so one, while search for 'é' would only match 'é'.
135
136Both a locale or collator can be used to specify the language-sensitive rules
137for searches. When a locale is specified, a collator will be created internally
138and the StringSearch instance that is created is responsible for the ownership
139of the collator. All the collation attributes will be considered during the
140string search operation. However, the users only can set the collator attributes
141using the collator APIs. Normalization is usually done within collation and the
142process is outside the scope of the string search service.
143
144As in other iterator interfaces, the string search service provides APIs to
145perform string matching for the first pattern occurrence, immediate next,
146previous match, and the last pattern occurrence. There are also options to allow
147for overlapping matching. For example, in English, if the string is "ababab" and
148the pattern is "abab", overlapping matching produces results of offsets <0, 3>
149and <2, 5>. Otherwise, the mutually exclusive matching produces the result
150offset <0, 3> only. To find a whole word match, the user can provide a
151locale-specific `BreakIterator` object to a `StringSearch` instance to correctly
152locate the word boundaries. For example, if "c" exists in the string "abc", a
153match is returned. However, the behavior can be overwritten by supplying a word
154`BreakIterator`.
155
156The minimum unit of match is aligned to an extended grapheme cluster in the ICU
157string search service implementation defined by [UAX #29 Unicode Text
158Segmentation](http://www.unicode.org/reports/tr29/). Therefore, all matches will
159begin and end on extended grapheme cluster boundaries. If the given input search
160pattern starts with non-base character, no matches will be returned.
161When there are contractions in the collation sequence and the contraction
162happens to span across the boundary of a match, it is not considered a match.
163For example, in traditional Spanish where 'ch' is a contraction, the "har"
164pattern will not match in the string "uno charo". Boundaries that are
165discontiguous contractions will yield a match result similar to those described
166above, where the end of the match returned will be one character before the
167immediate following base letter. In addition, only the first match will be
168located if a pattern contains only combining marks and the search string
169contains more than one occurrences of the pattern consecutively. For example, if
170the user searches for the pattern "´" (\\u00b4) in the string "A´´B",
171(A\\u00b4\\u00b4B) the result will be offsets <1, 2>.
172
173### Example
174
175**In C:**
176
177```c
178    char *tgtstr = "The quick brown fox jumps over the lazy dog.";
179    char *patstr = "fox";
180    UChar target[64];
181
182    UChar pattern[16];
183    int pos = 0;
184    UErrorCode status = U_ZERO_ERROR;
185    UStringSearch *search = NULL;
186
187    u_uastrcpy(target, tgtstr);
188    u_uastrcpy(pattern, patstr);
189
190
191    search = usearch_open(pattern, -1, target, -1, "en_US",
192                          NULL, &status);
193
194
195    if (U_FAILURE(status)) {
196        fprintf(stderr, "Could not create a UStringSearch.\n");
197        return;
198    }
199
200    for(pos = usearch_first(search, &status);
201        U_SUCCESS(status) && pos != USEARCH_DONE;
202        pos = usearch_next(search, &status))
203    {
204        fprintf(stdout, "Match found at position %d.\n", pos);
205    }
206
207    if (U_FAILURE(status)) {
208        fprintf(stderr, "Error searching for pattern.\n");
209    }
210```
211
212**In C++:**
213
214```c++
215    UErrorCode status = U_ZERO_ERROR;
216    UnicodeString target("Jackdaws love my big sphinx of quartz.");
217    UnicodeString pattern("sphinx");
218    StringSearch search(pattern, target, Locale::getUS(), NULL, status);
219
220
221    if (U_FAILURE(status)) {
222        fprintf(stderr, "Could not create a StringSearch object.\n");
223        return;
224    }
225
226    for(int pos = search.first(status);
227        U_SUCCESS(status) && pos != USEARCH_DONE;
228        pos = search.next(status))
229    {
230        fprintf(stdout, "Match found at position %d.\n", pos);
231    }
232
233    if (U_FAILURE(status)) {
234        fprintf(stderr, "Error searching for pattern.\n");
235    }
236```
237
238**In Java:**
239
240```java
241    StringCharacterIterator target = new StringCharacterIterator(
242                                         "Pack my box with five dozen liquor jugs.");
243    String pattern = "box";
244
245    try {
246        StringSearch search = new StringSearch(pattern, target, Locale.US);
247
248
249        for(int pos = search.first();
250            pos != StringSearch.DONE;
251            pos = search.next())
252        {
253            System.out.println("Match found for pattern at position " + pos);
254        }
255    } catch (Exception e) {
256        System.err.println("StringSearch failure: " + e.toString());
257    }
258```
259
260## Performance and Other Implications
261
262The ICU string search service is designed to be on top of the ICU collation
263service. Therefore, all the performance implications that apply to a collator
264are also applicable to the string search service. To obtain the best
265performance, use the default collator attributes described in the Performance
266and Storage Implications on Attributes section in the [Collation Service
267Architecture](architecture#performance-and-storage-implications-of-attributes)
268chapter. In addition, users need to be aware of
269the following `StringSearch` specific considerations:
270
271### Search Algorithm
272
273ICU4C releases up to 3.8 used the Boyer-Moore search algorithm in the string
274search service. There were some known issues in these previous releases.
275(See ICU tickets [ICU-5024](https://unicode-org.atlassian.net/browse/ICU-5024),
276[ICU-5382](https://unicode-org.atlassian.net/browse/ICU-5382),
277[ICU-5420](https://unicode-org.atlassian.net/browse/ICU-5420))
278
279In ICU4C 4.0, the string
280search service was updated with the simple linear search algorithm, which
281locates a match by shifting a cursor in the target text one by one, and these
282issues were fixed. In ICU4C 4.0.1, the Boyer-Moore search code was reintroduced
283as a separated API set as a technology preview. In a later release, this code was deleted.
284
285The Boyer-Moore searching
286algorithm is based on automata or combinatorial properties of strings and
287pre-processes the pattern and known to be much faster than the linear search
288when search pattern length is longer. According to performance evaluation
289between these two implementations, the Boyer-Moore search is faster than the
290linear search when the pattern text is longer than 3 or 4 characters.
291However, it is very tricky to get correct results with a collation-based Boyer-Moore search.
292
293### Change Iterating Direction
294
295The ICU string search service provides a set of very dynamic APIs that allow
296users to change the iterating direction randomly. For example, users can search
297for a particular word going forward by calling the `usearch_next` (C),
298`StringSearch::next` (C++) or `StringSearch.next` (Java) APIs and then search
299backwards at any point of the search operation by calling the `usearch_previous`
300(C), `StringSearch::previous` (C++) or `StringSearch.previous` (Java) APIs. Another
301way to change the iterating direction is by calling the `usearch_reset` (C),
302`StringSearch::previous` (C++) or `StringSearch.previous` (Java) APIs. Though the
303direction change can occur without calling the reset APIs first, this operation
304comes with a reduction in speed.
305
306> :point_right: **Note**: The backward search is not available with the
307> ICU4C Boyer-Moore search technology preview introduced in ICU4C 4.0.1
308> and only available with the linear search implementation.
309
310### Thai and Lao Character Boundaries
311
312In collation, certain Thai and Lao vowels are swapped with the next character.
313For example, the text string "A ขเ" (A \\u0e02\\u0e40) is processed internally
314in collation as
315"A เข" (A \\u0e40\\u0e02). Therefore, if the user searches for the pattern "Aเ"
316(A\\u0e40) in "A ขเ" (A \\u0e02\\u0e40) the string search service will match
317starting at offset 0. Since this normalization process is internal to collation,
318there is no notification that the swapping has happened. The return result
319offsets in this example will be <0, 2> even though the range would encompass one
320extra character.
321
322### Case Level Search
323
324Case level string search is currently done with the strength set to tertiary.
325When searching with the strength set to primary and the case level attribute
326turned on, results given may not be correct. The case level attribute is
327different from tertiary strength in that accents are ignored but case
328differences are not. Suppose you wanted to search for “A” in the text
329“ABC\\u00C5a”. The match found should be at 0 and 3 if using the case level
330attribute. However, searching with the case level attribute turned on finds
331matches at 0, 3, and 4, which includes the lower case 'a'. To ensure that case
332level differences are not ignored, string search must be done with at least
333tertiary strength.
334