string-search.md - OpenGrok cross reference for /third_party/icu/docs/userguide/collation/string-search.md

Lines Matching +full:all +full:- +full:apis
30 Therefore, a string search algorithm that is language-aware has become more
32 (C++) or `String.indexOf` (Java) APIs will not yield the correct result specific
33 to a particular language's requirements. The APIs will not yield the correct
34 result because all the issues that are important to language-sensitive collation
46     is short-hand for something longer. In sorting, an 'ä' (\\u00e4) is treated
47     as 'ae'. Note that primary- and secondary-level distinctions for *searching*
64     as "black-bird".
68 The ICU string search service provides similar APIs to the other text iterating
71 Analysis](../boundaryanalysis/index.md) chapter. The user can locate one or all
74 sub-string between the start and end is equal.
79 Let S' be the sub-string of a text string S between the offsets start and end
84     collator used for searching has a tertiary collation strength, all accents
85     are non-ignorable. If the pattern "a\\u0300" is searched in the target text
90     exists no non-ignorable combining mark before or after S' in S respectively.
92     "a\\u0325\\u0300", since there exists a non-ignorable accent '\\u0325' in
94     "a\\u0300\\u0325" a match will not be found because of the non-ignorable
105 pattern "baad" will match "a--båd--man" (a--b\\u00e5d--man) at the start offset
107 offset can be 6 or 7, because "-" (hyphen) is ignorable for a certain collation.
109 sub-string. To be more exact, the string search added a "tightest" match
115 match in the string "a--båd--man" (a--b\\u00e5d--man) ONLY at offsets <3,5>.
136 Either a locale or a collator can be used to specify the language-sensitive rules
139 of the collator. All the collation attributes will be considered during the
141 using the collator APIs. Normalization is usually done within collation and the
144 As in other iterator interfaces, the string search service provides APIs to
151 locale-specific `BreakIterator` object to a `StringSearch` instance to correctly
158 Segmentation](http://www.unicode.org/reports/tr29/). Therefore, all matches will
160 pattern starts with a non-base character, no matches will be returned.
191     search = usearch_open(pattern, -1, target, -1, "en_US", 
263 service. Therefore, all the performance implications that apply to a collator
267 Architecture](architecture#performance-and-storage-implications-of-attributes)
273 ICU4C releases up to 3.8 used the Boyer-Moore search algorithm in the string
275 (See ICU tickets [ICU-5024](https://unicode-org.atlassian.net/browse/ICU-5024),
276 [ICU-5382](https://unicode-org.atlassian.net/browse/ICU-5382),
277 [ICU-5420](https://unicode-org.atlassian.net/browse/ICU-5420))
282 issues were fixed. In ICU4C 4.0.1, the Boyer-Moore search code was reintroduced
285 The Boyer-Moore searching
287 pre-processes the pattern and is known to be much faster than the linear search
289 between these two implementations, the Boyer-Moore search is faster than the
291 However, it is very tricky to get correct results with a collation-based Boyer-Moore search.
295 The ICU string search service provides a set of very dynamic APIs that allow
298 `StringSearch::next` (C++) or `StringSearch.next` (Java) APIs and then search
300 (C), `StringSearch::previous` (C++) or `StringSearch.previous` (Java) APIs. Another
302 `StringSearch::previous` (C++) or `StringSearch.previous` (Java) APIs. Though the
303 direction change can occur without calling the reset APIs first, this operation
307 > ICU4C Boyer-Moore search technology preview introduced in ICU4C 4.0.1
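
The matched lines above describe a match as a sub-string between a start and end offset that the collator considers equal to the pattern, and note that `String.indexOf` cannot give language-aware results. The following is a minimal sketch of that idea only. It assumes the JDK's `java.text.Collator` as a stand-in rather than the ICU `StringSearch` service, and the class name `CollatorSearchSketch` and helper `findAll` are hypothetical names introduced for illustration. It is a brute-force scan, not the linear or Boyer-Moore implementation discussed in the text.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CollatorSearchSketch {

    // Hypothetical helper: returns every {start, end} pair whose sub-string
    // the given collator treats as equal to the pattern. Brute force, O(n^2)
    // sub-strings, intended only to illustrate the match definition.
    static List<int[]> findAll(String text, String pattern, Collator collator) {
        List<int[]> matches = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                if (collator.compare(text.substring(start, end), pattern) == 0) {
                    matches.add(new int[] {start, end});
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Primary strength ignores accent and case differences.
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);

        String text = "my r\u00e9sum\u00e9";   // "my résumé"
        String pattern = "resume";

        // Code-unit search: the accented text does not contain "resume".
        System.out.println("indexOf: " + text.indexOf(pattern));

        // Collation-based search: report each range the collator accepts.
        for (int[] m : findAll(text, pattern, collator)) {
            System.out.println("match: <" + m[0] + "," + m[1] + ">");
        }
    }
}
```

Depending on which characters the collator treats as ignorable, a scan like this may report more than one range around the same word, which is the situation the "tightest" match rule in the text is meant to resolve.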