• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1 // © 2016 and later: Unicode, Inc. and others.
2 // License & terms of use: http://www.unicode.org/copyright.html
3 /*
4  *******************************************************************************
5  * Copyright (C) 1996-2016, International Business Machines Corporation and
6  * others. All Rights Reserved.
7  *******************************************************************************
8  */
9 package com.ibm.icu.text;
10 
11 import java.text.CharacterIterator;
12 import java.text.StringCharacterIterator;
13 import java.util.Locale;
14 
15 import com.ibm.icu.util.ICUException;
16 import com.ibm.icu.util.ULocale;
17 
18 // Java porting note:
19 //
20 //        The ICU4C implementation contains dead code in many places.
21 //      While porting the ICU4C linear search implementation, this dead code
22 //      was not fully ported. The code blocks tagged by "// *** Boyer-Moore ***"
23 //      are those dead code blocks, still available in ICU4C.
24 
25 //        The ICU4C implementation does not seem to handle UCharacterIterator pointing
26 //      to a fragment of text properly. ICU4J uses CharacterIterator to navigate through
27 //      the input text. We need to carefully review the code ported from ICU4C
28 //      assuming the start index is 0.
29 
30 //        ICU4C implementation initializes pattern.CE and pattern.PCE. It looks like
31 //      CE is no longer used, except in a few places checking CELength. It looks like this
32 //      is a leftover from already-disabled Boyer-Moore search code. This Java implementation
33 //      preserves the code, but we should clean this up later.
34 
35 /**
36  *
37  * <tt>StringSearch</tt> is a {@link SearchIterator} that provides
38  * language-sensitive text searching based on the comparison rules defined
39  * in a {@link RuleBasedCollator} object.
40  * StringSearch ensures that language eccentricity can be
41  * handled, e.g. for the German collator, characters &szlig; and SS will be matched
42  * if case is chosen to be ignored.
43  * See the <a href="https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/main/design/collation/ICU_collation_design.htm">
44  * "ICU Collation Design Document"</a> for more information.
45  * <p>
46  * There are 2 match options for selection:<br>
47  * Let S' be the sub-string of a text string S between the offsets start and
48  * end [start, end].
49  * <br>
50  * A pattern string P matches a text string S at the offsets [start, end]
51  * if
52  * <pre>
53  * option 1. Some canonical equivalent of P matches some canonical equivalent
54  *           of S'
55  * option 2. P matches S' and if P starts or ends with a combining mark,
56  *           there exists no non-ignorable combining mark before or after S?
57  *           in S respectively.
58  * </pre>
59  * Option 2. is the default.
60  * <p>
61  * This search has APIs similar to that of other text iteration mechanisms
62  * such as the break iterators in {@link BreakIterator}. Using these
63  * APIs, it is easy to scan through text looking for all occurrences of
64  * a given pattern. This search iterator allows changing of direction by
65  * calling a {@link #reset} followed by a {@link #next} or {@link #previous}.
66  * Though a direction change can occur without calling {@link #reset} first,
67  * this operation comes with some speed penalty.
68  * Match results in the forward direction will match the result matches in
69  * the backwards direction in the reverse order
70  * <p>
71  * {@link SearchIterator} provides APIs to specify the starting position
72  * within the text string to be searched, e.g. {@link SearchIterator#setIndex setIndex},
73  * {@link SearchIterator#preceding preceding} and {@link SearchIterator#following following}.
74  * Since the starting position will be set as it is specified, please take note that
75  * there are some danger points at which the search may render incorrect
76  * results:
77  * <ul>
78  * <li> In the midst of a substring that requires normalization.
79  * <li> If the following match is to be found, the position should not be the
80  *      second character which requires swapping with the preceding
81  *      character. Vice versa, if the preceding match is to be found, the
82  *      position to search from should not be the first character which
83  *      requires swapping with the next character. E.g certain Thai and
84  *      Lao characters require swapping.
85  * <li> If a following pattern match is to be found, any position within a
86  *      contracting sequence except the first will fail. Vice versa if a
87  *      preceding pattern match is to be found, an invalid starting point
88  *      would be any character within a contracting sequence except the last.
89  * </ul>
90  * <p>
91  * A {@link BreakIterator} can be used if only matches at logical breaks are desired.
92  * Using a {@link BreakIterator} will only give you results that exactly matches the
93  * boundaries given by the {@link BreakIterator}. For instance the pattern "e" will
94  * not be found in the string "\u00e9" if a character break iterator is used.
95  * <p>
96  * Options are provided to handle overlapping matches.
97  * E.g. In English, overlapping matches produces the result 0 and 2
98  * for the pattern "abab" in the text "ababab", where mutually
99  * exclusive matches only produces the result of 0.
100  * <p>
101  * Options are also provided to implement "asymmetric search" as described in
102  * <a href="http://www.unicode.org/reports/tr10/#Asymmetric_Search">
103  * UTS #10 Unicode Collation Algorithm</a>, specifically the ElementComparisonType
104  * values.
105  * <p>
106  * Though collator attributes will be taken into consideration while
107  * performing matches, there are no APIs here for setting and getting the
108  * attributes. These attributes can be set by getting the collator
109  * from {@link #getCollator} and using the APIs in {@link RuleBasedCollator}.
110  * Lastly to update <tt>StringSearch</tt> to the new collator attributes,
111  * {@link #reset} has to be called.
112  * <p>
113  * Restriction: <br>
114  * Currently there are no composite characters that consists of a
115  * character with combining class &gt; 0 before a character with combining
116  * class == 0. However, if such a character exists in the future,
117  * <tt>StringSearch</tt> does not guarantee the results for option 1.
118  * <p>
119  * Consult the {@link SearchIterator} documentation for information on
120  * and examples of how to use instances of this class to implement text
121  * searching.
122  * <p>
123  * Note, <tt>StringSearch</tt> is not to be subclassed.
124  * </p>
125  * @see SearchIterator
126  * @see RuleBasedCollator
127  * @author Laura Werner, synwee
128  * @stable ICU 2.0
129  */
130 // internal notes: all methods do not guarantee the correct status of the
131 // characteriterator. the caller has to maintain the original index position
132 // if necessary. methods could change the index position as it deems fit
133 public final class StringSearch extends SearchIterator {
134 
135     private Pattern pattern_;
136     private RuleBasedCollator collator_;
137 
138     // positions within the collation element iterator is used to determine
139     // if we are at the start of the text.
140     private CollationElementIterator textIter_;
141     private CollationPCE textProcessedIter_;
142 
143     // utility collation element, used throughout program for temporary
144     // iteration.
145     private CollationElementIterator utilIter_;
146 
147     private Normalizer2 nfd_;
148 
149     private int strength_;
150     int ceMask_;
151     int variableTop_;
152 
153     private boolean toShift_;
154 
155     // *** Boyer-Moore ***
156     // private char[] canonicalPrefixAccents_;
157     // private char[] canonicalSuffixAccents_;
158 
159     /**
160      * Initializes the iterator to use the language-specific rules defined in
161      * the argument collator to search for argument pattern in the argument
162      * target text. The argument <code>breakiter</code> is used to define logical matches.
163      * See super class documentation for more details on the use of the target
164      * text and {@link BreakIterator}.
165      * @param pattern text to look for.
166      * @param target target text to search for pattern.
167      * @param collator {@link RuleBasedCollator} that defines the language rules
168      * @param breakiter A {@link BreakIterator} that is used to determine the
169      *                boundaries of a logical match. This argument can be null.
170      * @throws IllegalArgumentException thrown when argument target is null,
171      *            or of length 0
172      * @see BreakIterator
173      * @see RuleBasedCollator
174      * @stable ICU 2.0
175      */
StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator, BreakIterator breakiter)176     public StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator,
177             BreakIterator breakiter) {
178 
179         // This implementation is ported from ICU4C usearch_open()
180 
181         super(target, breakiter);
182 
183         // string search does not really work when numeric collation is turned on
184         if (collator.getNumericCollation()) {
185             throw new UnsupportedOperationException("Numeric collation is not supported by StringSearch");
186         }
187 
188         collator_ = collator;
189         strength_ = collator.getStrength();
190         ceMask_ = getMask(strength_);
191         toShift_ = collator.isAlternateHandlingShifted();
192         variableTop_ = collator.getVariableTop();
193 
194         nfd_ = Normalizer2.getNFDInstance();
195 
196         pattern_ = new Pattern(pattern);
197 
198         search_.setMatchedLength(0);
199         search_.matchedIndex_ = DONE;
200 
201         utilIter_ = null;
202         textIter_ = new CollationElementIterator(target, collator);
203 
204         textProcessedIter_ = null;
205 
206         // This is done by super class constructor
207         /*
208         search_.isOverlap_ = false;
209         search_.isCanonicalMatch_ = false;
210         search_.elementComparisonType_ = ElementComparisonType.STANDARD_ELEMENT_COMPARISON;
211         search_.isForwardSearching_ = true;
212         search_.reset_ = true;
213          */
214         ULocale collLocale = collator.getLocale(ULocale.VALID_LOCALE);
215         search_.internalBreakIter_ = BreakIterator.getCharacterInstance(collLocale == null ? ULocale.ROOT : collLocale);
216         search_.internalBreakIter_.setText((CharacterIterator)target.clone());  // We need to create a clone
217 
218         initialize();
219     }
220 
221     /**
222      * Initializes the iterator to use the language-specific rules defined in
223      * the argument collator to search for argument pattern in the argument
224      * target text. No {@link BreakIterator}s are set to test for logical matches.
225      * @param pattern text to look for.
226      * @param target target text to search for pattern.
227      * @param collator {@link RuleBasedCollator} that defines the language rules
228      * @throws IllegalArgumentException thrown when argument target is null,
229      *            or of length 0
230      * @see RuleBasedCollator
231      * @stable ICU 2.0
232      */
StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator)233     public StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator) {
234         this(pattern, target, collator, null);
235     }
236 
237     /**
238      * Initializes the iterator to use the language-specific rules and
239      * break iterator rules defined in the argument locale to search for
240      * argument pattern in the argument target text.
241      * @param pattern text to look for.
242      * @param target target text to search for pattern.
243      * @param locale locale to use for language and break iterator rules
244      * @throws IllegalArgumentException thrown when argument target is null,
245      *            or of length 0. ClassCastException thrown if the collator for
246      *            the specified locale is not a RuleBasedCollator.
247      * @stable ICU 2.0
248      */
StringSearch(String pattern, CharacterIterator target, Locale locale)249     public StringSearch(String pattern, CharacterIterator target, Locale locale) {
250         this(pattern, target, ULocale.forLocale(locale));
251     }
252 
253     /**
254      * Initializes the iterator to use the language-specific rules and
255      * break iterator rules defined in the argument locale to search for
256      * argument pattern in the argument target text.
257      * See super class documentation for more details on the use of the target
258      * text and {@link BreakIterator}.
259      * @param pattern text to look for.
260      * @param target target text to search for pattern.
261      * @param locale locale to use for language and break iterator rules
262      * @throws IllegalArgumentException thrown when argument target is null,
263      *            or of length 0. ClassCastException thrown if the collator for
264      *            the specified locale is not a RuleBasedCollator.
265      * @see BreakIterator
266      * @see RuleBasedCollator
267      * @see SearchIterator
268      * @stable ICU 3.2
269      */
StringSearch(String pattern, CharacterIterator target, ULocale locale)270     public StringSearch(String pattern, CharacterIterator target, ULocale locale) {
271         this(pattern, target, (RuleBasedCollator) Collator.getInstance(locale), null);
272     }
273 
274     /**
275      * Initializes the iterator to use the language-specific rules and
276      * break iterator rules defined in the default locale to search for
277      * argument pattern in the argument target text.
278      * @param pattern text to look for.
279      * @param target target text to search for pattern.
280      * @throws IllegalArgumentException thrown when argument target is null,
281      *            or of length 0. ClassCastException thrown if the collator for
282      *            the default locale is not a RuleBasedCollator.
283      * @stable ICU 2.0
284      */
StringSearch(String pattern, String target)285     public StringSearch(String pattern, String target) {
286         this(pattern, new StringCharacterIterator(target),
287                 (RuleBasedCollator) Collator.getInstance(), null);
288     }
289 
290     /**
291      * Gets the {@link RuleBasedCollator} used for the language rules.
292      * <p>
293      * Since <tt>StringSearch</tt> depends on the returned {@link RuleBasedCollator}, any
294      * changes to the {@link RuleBasedCollator} result should follow with a call to
295      * either {@link #reset()} or {@link #setCollator(RuleBasedCollator)} to ensure the correct
296      * search behavior.
297      * </p>
298      * @return {@link RuleBasedCollator} used by this <tt>StringSearch</tt>
299      * @see RuleBasedCollator
300      * @see #setCollator
301      * @stable ICU 2.0
302      */
getCollator()303     public RuleBasedCollator getCollator() {
304         return collator_;
305     }
306 
307     /**
308      * Sets the {@link RuleBasedCollator} to be used for language-specific searching.
309      * <p>
310      * The iterator's position will not be changed by this method.
311      * @param collator to use for this <tt>StringSearch</tt>
312      * @throws IllegalArgumentException thrown when collator is null
313      * @see #getCollator
314      * @stable ICU 2.0
315      */
setCollator(RuleBasedCollator collator)316     public void setCollator(RuleBasedCollator collator) {
317         if (collator == null) {
318             throw new IllegalArgumentException("Collator can not be null");
319         }
320         collator_ = collator;
321         ceMask_ = getMask(collator_.getStrength());
322 
323         ULocale collLocale = collator.getLocale(ULocale.VALID_LOCALE);
324         search_.internalBreakIter_ = BreakIterator.getCharacterInstance(collLocale == null ? ULocale.ROOT : collLocale);
325         search_.internalBreakIter_.setText((CharacterIterator)search_.text().clone());  // We need to create a clone
326 
327         toShift_ = collator.isAlternateHandlingShifted();
328         variableTop_ = collator.getVariableTop();
329         textIter_ = new CollationElementIterator(pattern_.text_, collator);
330         utilIter_ = new CollationElementIterator(pattern_.text_, collator);
331 
332         // initialize() _after_ setting the iterators for the new collator.
333         initialize();
334     }
335 
336     /**
337      * Returns the pattern for which <tt>StringSearch</tt> is searching for.
338      * @return the pattern searched for
339      * @stable ICU 2.0
340      */
getPattern()341     public String getPattern() {
342         return pattern_.text_;
343     }
344 
345     /**
346      * Set the pattern to search for.
347      * The iterator's position will not be changed by this method.
348      * @param pattern for searching
349      * @see #getPattern
350      * @exception IllegalArgumentException thrown if pattern is null or of
351      *               length 0
352      * @stable ICU 2.0
353      */
setPattern(String pattern)354     public void setPattern(String pattern) {
355         if (pattern == null || pattern.length() <= 0) {
356             throw new IllegalArgumentException(
357                     "Pattern to search for can not be null or of length 0");
358         }
359         pattern_.text_ = pattern;
360         initialize();
361     }
362 
363     /**
364      * Determines whether canonical matches (option 1, as described in the
365      * class documentation) is set.
366      * See setCanonical(boolean) for more information.
367      * @see #setCanonical
368      * @return true if canonical matches is set, false otherwise
369      * @stable ICU 2.8
370      */
371     //TODO: hoist this to SearchIterator
isCanonical()372     public boolean isCanonical() {
373         return search_.isCanonicalMatch_;
374     }
375 
376     /**
377      * Set the canonical match mode. See class documentation for details.
378      * The default setting for this property is false.
379      * @param allowCanonical flag indicator if canonical matches are allowed
380      * @see #isCanonical
381      * @stable ICU 2.8
382      */
383     //TODO: hoist this to SearchIterator
setCanonical(boolean allowCanonical)384     public void setCanonical(boolean allowCanonical) {
385         search_.isCanonicalMatch_ = allowCanonical;
386     }
387 
388     /**
389      * {@inheritDoc}
390      * @stable ICU 2.8
391      */
392     @Override
setTarget(CharacterIterator text)393     public void setTarget(CharacterIterator text) {
394         super.setTarget(text);
395         textIter_.setText(text);
396     }
397 
398     /**
399      * {@inheritDoc}
400      * @stable ICU 2.8
401      */
402     @Override
getIndex()403     public int getIndex() {
404         int result = textIter_.getOffset();
405         if (isOutOfBounds(search_.beginIndex(), search_.endIndex(), result)) {
406             return DONE;
407         }
408         return result;
409     }
410 
411     /**
412      * {@inheritDoc}
413      * @stable ICU 2.8
414      */
415     @Override
setIndex(int position)416     public void setIndex(int position) {
417         // Java porting note: This method is equivalent to setOffset() in ICU4C.
418         // ICU4C SearchIterator::setOffset() is a pure virtual method, while
419         // ICU4J SearchIterator.setIndex() is not abstract method.
420 
421         super.setIndex(position);
422         textIter_.setOffset(position);
423     }
424 
425     /**
426      * {@inheritDoc}
427      * @stable ICU 2.8
428      */
429     @Override
reset()430     public void reset() {
431         // reset is setting the attributes that are already in
432         // string search, hence all attributes in the collator should
433         // be retrieved without any problems
434 
435         boolean sameCollAttribute = true;
436         int ceMask;
437         boolean shift;
438         int varTop;
439 
440         // **** hack to deal w/ how processed CEs encode quaternary ****
441         int newStrength = collator_.getStrength();
442         if ((strength_ < Collator.QUATERNARY && newStrength >= Collator.QUATERNARY)
443                 || (strength_ >= Collator.QUATERNARY && newStrength < Collator.QUATERNARY)) {
444             sameCollAttribute = false;
445         }
446 
447         strength_ = collator_.getStrength();
448         ceMask = getMask(strength_);
449         if (ceMask_ != ceMask) {
450             ceMask_ = ceMask;
451             sameCollAttribute = false;
452         }
453 
454         shift = collator_.isAlternateHandlingShifted();
455         if (toShift_ != shift) {
456             toShift_ = shift;
457             sameCollAttribute = false;
458         }
459 
460         varTop = collator_.getVariableTop();
461         if (variableTop_ != varTop) {
462             variableTop_ = varTop;
463             sameCollAttribute = false;
464         }
465 
466         if (!sameCollAttribute) {
467             initialize();
468         }
469 
470         textIter_.setText(search_.text());
471 
472         search_.setMatchedLength(0);
473         search_.matchedIndex_ = DONE;
474         search_.isOverlap_ = false;
475         search_.isCanonicalMatch_ = false;
476         search_.elementComparisonType_ = ElementComparisonType.STANDARD_ELEMENT_COMPARISON;
477         search_.isForwardSearching_ = true;
478         search_.reset_ = true;
479     }
480 
481     /**
482      * {@inheritDoc}
483      * @stable ICU 2.8
484      */
485     @Override
handleNext(int position)486     protected int handleNext(int position) {
487         if (pattern_.CELength_ == 0) {
488             search_.matchedIndex_ = search_.matchedIndex_ == DONE ?
489                                     getIndex() : search_.matchedIndex_ + 1;
490             search_.setMatchedLength(0);
491             textIter_.setOffset(search_.matchedIndex_);
492             if (search_.matchedIndex_ == search_.endIndex()) {
493                 search_.matchedIndex_ = DONE;
494             }
495         } else {
496             if (search_.matchedLength() <= 0) {
497                 // the flipping direction issue has already been handled
498                 // in next()
499                 // for boundary check purposes. this will ensure that the
500                 // next match will not preceed the current offset
501                 // note search_.matchedIndex_ will always be set to something
502                 // in the code
503                 search_.matchedIndex_ = position - 1;
504             }
505 
506             textIter_.setOffset(position);
507 
508             // ICU4C comment:
509             // if strsrch_->breakIter is always the same as m_breakiterator_
510             // then we don't need to check the match boundaries here because
511             // usearch_handleNextXXX will already have done it.
512             if (search_.isCanonicalMatch_) {
513                 // *could* actually use exact here 'cause no extra accents allowed...
514                 handleNextCanonical();
515             } else {
516                 handleNextExact();
517             }
518 
519             if (search_.matchedIndex_ == DONE) {
520                 textIter_.setOffset(search_.endIndex());
521             } else {
522                 textIter_.setOffset(search_.matchedIndex_);
523             }
524 
525             return search_.matchedIndex_;
526         }
527 
528         return DONE;
529     }
530 
531     /**
532      * {@inheritDoc}
533      * @stable ICU 2.8
534      */
535     @Override
handlePrevious(int position)536     protected int handlePrevious(int position) {
537         if (pattern_.CELength_ == 0) {
538             search_.matchedIndex_ =
539                     search_.matchedIndex_ == DONE ? getIndex() : search_.matchedIndex_;
540             if (search_.matchedIndex_ == search_.beginIndex()) {
541                 setMatchNotFound();
542             } else {
543                 search_.matchedIndex_--;
544                 textIter_.setOffset(search_.matchedIndex_);
545                 search_.setMatchedLength(0);
546             }
547         } else {
548             textIter_.setOffset(position);
549 
550             if (search_.isCanonicalMatch_) {
551                 // *could* use exact match here since extra accents *not* allowed!
552                 handlePreviousCanonical();
553             } else {
554                 handlePreviousExact();
555             }
556         }
557 
558         return search_.matchedIndex_;
559     }
560 
561     // ------------------ Internal implementation code ---------------------------
562 
563     private static final int INITIAL_ARRAY_SIZE_ = 256;
564 
565     // *** Boyer-Moore ***
566     // private static final Normalizer2Impl nfcImpl_ = Norm2AllModes.getNFCInstance().impl;
567     // private static final int LAST_BYTE_MASK_ = 0xff;
568     // private static final int SECOND_LAST_BYTE_SHIFT_ = 8;
569 
570     private static final int PRIMARYORDERMASK = 0xffff0000;
571     private static final int SECONDARYORDERMASK = 0x0000ff00;
572     private static final int TERTIARYORDERMASK = 0x000000ff;
573 
574     /**
575      * Getting the mask for collation strength
576      * @param strength collation strength
577      * @return collation element mask
578      */
getMask(int strength)579     private static int getMask(int strength) {
580         switch (strength) {
581         case Collator.PRIMARY:
582             return PRIMARYORDERMASK;
583         case Collator.SECONDARY:
584             return SECONDARYORDERMASK | PRIMARYORDERMASK;
585         default:
586             return TERTIARYORDERMASK | SECONDARYORDERMASK | PRIMARYORDERMASK;
587         }
588     }
589 
590 
591     // *** Boyer-Moore ***
592     /*
593     private final char getFCD(String str, int offset) {
594         char ch = str.charAt(offset);
595         if (ch < 0x180) {
596             return (char) nfcImpl_.getFCD16FromBelow180(ch);
597         } else if (nfcImpl_.singleLeadMightHaveNonZeroFCD16(ch)) {
598             if (!Character.isHighSurrogate(ch)) {
599                 return (char) nfcImpl_.getFCD16FromNormData(ch);
600             } else {
601                 char c2;
602                 if (++offset < str.length() && Character.isLowSurrogate(c2 = str.charAt(offset))) {
603                     return (char) nfcImpl_.getFCD16FromNormData(Character.toCodePoint(ch, c2));
604                 }
605             }
606         }
607         return 0;
608     }
609 
610     private final char getFCD(int c) {
611         return (char)nfcImpl_.getFCD16(c);
612     }
613     */
614 
615     /**
616      * Getting the modified collation elements taking into account the collation
617      * attributes.
618      *
619      * @param sourcece
620      * @return the modified collation element
621      */
getCE(int sourcece)622     private int getCE(int sourcece) {
623         // note for tertiary we can't use the collator->tertiaryMask, that
624         // is a preprocessed mask that takes into account case options. since
625         // we are only concerned with exact matches, we don't need that.
626         sourcece &= ceMask_;
627 
628         if (toShift_) {
629             // alternate handling here, since only the 16 most significant digits
630             // is only used, we can safely do a compare without masking
631             // if the ce is a variable, we mask and get only the primary values
632             // no shifting to quartenary is required since all primary values
633             // less than variabletop will need to be masked off anyway.
634             if (variableTop_ > sourcece) {
635                 if (strength_ >= Collator.QUATERNARY) {
636                     sourcece &= PRIMARYORDERMASK;
637                 } else {
638                     sourcece = CollationElementIterator.IGNORABLE;
639                 }
640             }
641         } else if (strength_ >= Collator.QUATERNARY && sourcece == CollationElementIterator.IGNORABLE) {
642             sourcece = 0xFFFF;
643         }
644 
645         return sourcece;
646     }
647 
648     /**
649      * Direct port of ICU4C static int32_t * addTouint32_tArray(...) in usearch.cpp
650      * (except not taking destination buffer size and status param).
651      * This is used for appending a PCE to Pattern.PCE_ buffer. We probably should
652      * implement this in Pattern class.
653      *
654      * @param destination target array
655      * @param offset destination offset to add value
656      * @param value to be added
657      * @param increments incremental size expected
658      * @return new destination array, destination if there was no new allocation
659      */
addToIntArray(int[] destination, int offset, int value, int increments)660     private static int[] addToIntArray(int[] destination, int offset, int value, int increments) {
661         int newlength = destination.length;
662         if (offset + 1 == newlength) {
663             newlength += increments;
664             int temp[] = new int[newlength];
665             System.arraycopy(destination, 0, temp, 0, offset);
666             destination = temp;
667         }
668         destination[offset] = value;
669         return destination;
670     }
671 
672     /**
673      * Direct port of ICU4C static int64_t * addTouint64_tArray(...) in usearch.cpp.
674      * This is used for appending a PCE to Pattern.PCE_ buffer. We probably should
675      * implement this in Pattern class.
676      *
677      * @param destination target array
678      * @param offset destination offset to add value
679      * @param destinationlength target array size
680      * @param value to be added
681      * @param increments incremental size expected
682      * @return new destination array, destination if there was no new allocation
683      */
addToLongArray(long[] destination, int offset, int destinationlength, long value, int increments)684     private static long[] addToLongArray(long[] destination, int offset, int destinationlength,
685             long value, int increments) {
686         int newlength = destinationlength;
687         if (offset + 1 == newlength) {
688             newlength += increments;
689             long temp[] = new long[newlength];
690             System.arraycopy(destination, 0, temp, 0, offset);
691             destination = temp;
692         }
693         destination[offset] = value;
694         return destination;
695     }
696 
697     /**
698      * Initializing the ce table for a pattern.
699      * Stores non-ignorable collation keys.
700      * Table size will be estimated by the size of the pattern text. Table
701      * expansion will be perform as we go along. Adding 1 to ensure that the table
702      * size definitely increases.
703      * @return total number of expansions
704      */
705     // TODO: We probably do not need Pattern CE table.
initializePatternCETable()706     private int initializePatternCETable() {
707         int[] cetable = new int[INITIAL_ARRAY_SIZE_];
708         int patternlength = pattern_.text_.length();
709         CollationElementIterator coleiter = utilIter_;
710 
711         if (coleiter == null) {
712             coleiter = new CollationElementIterator(pattern_.text_, collator_);
713             utilIter_ = coleiter;
714         } else {
715             coleiter.setText(pattern_.text_);
716         }
717 
718         int offset = 0;
719         int result = 0;
720         int ce;
721 
722         while ((ce = coleiter.next()) != CollationElementIterator.NULLORDER) {
723             int newce = getCE(ce);
724             if (newce != CollationElementIterator.IGNORABLE /* 0 */) {
725                 int[] temp = addToIntArray(cetable, offset, newce,
726                         patternlength - coleiter.getOffset() + 1);
727                 offset++;
728                 cetable = temp;
729             }
730             result += (coleiter.getMaxExpansion(ce) - 1);
731         }
732 
733         cetable[offset] = 0;
734         pattern_.CE_ = cetable;
735         pattern_.CELength_ = offset;
736 
737         return result;
738     }
739 
740     /**
741      * Initializing the pce table for a pattern.
742      * Stores non-ignorable collation keys.
743      * Table size will be estimated by the size of the pattern text. Table
744      * expansion will be perform as we go along. Adding 1 to ensure that the table
745      * size definitely increases.
746      * @return total number of expansions
747      */
initializePatternPCETable()748     private int initializePatternPCETable() {
749         long[] pcetable = new long[INITIAL_ARRAY_SIZE_];
750         int pcetablesize = pcetable.length;
751         int patternlength = pattern_.text_.length();
752         CollationElementIterator coleiter = utilIter_;
753 
754         if (coleiter == null) {
755             coleiter = new CollationElementIterator(pattern_.text_, collator_);
756             utilIter_ = coleiter;
757         } else {
758             coleiter.setText(pattern_.text_);
759         }
760 
761         int offset = 0;
762         int result = 0;
763         long pce;
764 
765         CollationPCE iter = new CollationPCE(coleiter);
766 
767         // ** Should processed CEs be signed or unsigned?
768         // ** (the rest of the code in this file seems to play fast-and-loose with
769         // ** whether a CE is signed or unsigned. For example, look at routine above this one.)
770         while ((pce = iter.nextProcessed(null)) != CollationPCE.PROCESSED_NULLORDER) {
771             long[] temp = addToLongArray(pcetable, offset, pcetablesize, pce, patternlength - coleiter.getOffset() + 1);
772             offset++;
773             pcetable = temp;
774         }
775 
776         pcetable[offset] = 0;
777         pattern_.PCE_ = pcetable;
778         pattern_.PCELength_ = offset;
779 
780         return result;
781     }
782 
783     // TODO: This method only triggers initializePatternCETable(), which is probably no
784     //      longer needed.
initializePattern()785     private int initializePattern() {
786         // Since the strength is primary, accents are ignored in the pattern.
787 
788         // *** Boyer-Moore ***
789         /*
790         if (strength_ == Collator.PRIMARY) {
791             pattern_.hasPrefixAccents_ = false;
792             pattern_.hasSuffixAccents_ = false;
793         } else {
794             pattern_.hasPrefixAccents_ = (getFCD(pattern_.text_, 0) >>> SECOND_LAST_BYTE_SHIFT_) != 0;
795             pattern_.hasSuffixAccents_ = (getFCD(pattern_.text_.codePointBefore(pattern_.text_.length())) & LAST_BYTE_MASK_) != 0;
796         }
797         */
798 
799         pattern_.PCE_ = null;
800 
801         // since intializePattern is an internal method status is a success.
802         return initializePatternCETable();
803     }
804 
805     // *** Boyer-Moore ***
806     /*
807      private final void setShiftTable(char shift[],
808                                          char backshift[],
809                                          int cetable[], int cesize,
810                                          int expansionsize,
811                                          int defaultforward,
812                                          int defaultbackward) {
813          // No implementation
814      }
815      */
816 
817     // TODO: This method only triggers initializePattern(), which is probably no
818     //      longer needed.
initialize()819     private void initialize() {
820         /* int expandlength = */ initializePattern();
821 
822         // *** Boyer-Moore ***
823         /*
824         if (pattern_.CELength_ > 0) {
825             int cesize = pattern_.CELength_;
826             int minlength = cesize > expandlength ? cesize - expandlength : 1;
827             pattern_.defaultShiftSize_ = minlength;
828             setShiftTable(pattern_.shift_, pattern_.backShift_, pattern_.CE_, cesize,
829                     expandlength, minlength, minlength);
830             return;
831         }
832         return pattern_.defaultShiftSize_;
833         */
834     }
835 
836     /**
837      * @internal
838      * @deprecated This API is ICU internal only.
839      */
840     @Override
841     @Deprecated
setMatchNotFound()842     protected void setMatchNotFound() {
843         super.setMatchNotFound();
844         // SearchIterator#setMatchNotFound() does following:
845         //      search_.matchedIndex_ = DONE;
846         //      search_.setMatchedLength(0);
847         if (search_.isForwardSearching_) {
848             textIter_.setOffset(search_.text().getEndIndex());
849         } else {
850             textIter_.setOffset(0);
851         }
852     }
853 
854     /**
855      * Checks if the offset runs out of the text string range
856      * @param textstart offset of the first character in the range
857      * @param textlimit limit offset of the text string range
858      * @param offset to test
859      * @return true if offset is out of bounds, false otherwise
860      */
isOutOfBounds(int textstart, int textlimit, int offset)861     private static final boolean isOutOfBounds(int textstart, int textlimit, int offset) {
862         return offset < textstart || offset > textlimit;
863     }
864 
865     /**
866      * Checks for identical match
867      * @param start offset of possible match
868      * @param end offset of possible match
869      * @return true if identical match is found
870      */
checkIdentical(int start, int end)871     private boolean checkIdentical(int start, int end) {
872         if (strength_ != Collator.IDENTICAL) {
873             return true;
874         }
875         // Note: We could use Normalizer::compare() or similar, but for short strings
876         // which may not be in FCD it might be faster to just NFD them.
877         String textstr = getString(targetText, start, end - start);
878         if (Normalizer.quickCheck(textstr, Normalizer.NFD, 0) == Normalizer.NO) {
879             textstr = Normalizer.decompose(textstr, false);
880         }
881         String patternstr = pattern_.text_;
882         if (Normalizer.quickCheck(patternstr, Normalizer.NFD, 0) == Normalizer.NO) {
883             patternstr = Normalizer.decompose(patternstr, false);
884         }
885         return textstr.equals(patternstr);
886     }
887 
initTextProcessedIter()888     private boolean initTextProcessedIter() {
889         if (textProcessedIter_ == null) {
890             textProcessedIter_ = new CollationPCE(textIter_);
891         } else {
892             textProcessedIter_.init(textIter_);
893         }
894         return true;
895     }
896 
897     /*
898      * Find the next break boundary after startIndex. If the UStringSearch object
899      * has an external break iterator, use that. Otherwise use the internal character
900      * break iterator.
901      */
nextBoundaryAfter(int startIndex)902     private int nextBoundaryAfter(int startIndex) {
903         BreakIterator breakiterator = search_.breakIter();
904 
905         if (breakiterator == null) {
906             breakiterator = search_.internalBreakIter_;
907         }
908 
909         if (breakiterator != null) {
910             return breakiterator.following(startIndex);
911         }
912 
913         return startIndex;
914     }
915 
916     /*
917      * Returns true if index is on a break boundary. If the UStringSearch
918      * has an external break iterator, test using that, otherwise test
919      * using the internal character break iterator.
920      */
isBreakBoundary(int index)921     private boolean isBreakBoundary(int index) {
922         BreakIterator breakiterator = search_.breakIter();
923 
924         if (breakiterator == null) {
925             breakiterator = search_.internalBreakIter_;
926         }
927 
928         return (breakiterator != null && breakiterator.isBoundary(index));
929     }
930 
931 
932     // Java porting note: Followings are corresponding to UCompareCEsResult enum
933     private static final int CE_MATCH = -1;
934     private static final int CE_NO_MATCH = 0;
935     private static final int CE_SKIP_TARG = 1;
936     private static final int CE_SKIP_PATN = 2;
937 
938     private static int CE_LEVEL2_BASE = 0x00000005;
939     private static int CE_LEVEL3_BASE = 0x00050000;
940 
compareCE64s(long targCE, long patCE, ElementComparisonType compareType)941     private static int compareCE64s(long targCE, long patCE, ElementComparisonType compareType) {
942         if (targCE == patCE) {
943             return CE_MATCH;
944         }
945         if (compareType == ElementComparisonType.STANDARD_ELEMENT_COMPARISON) {
946             return CE_NO_MATCH;
947         }
948 
949         long targCEshifted = targCE >>> 32;
950         long patCEshifted = patCE >>> 32;
951         long mask;
952 
953         mask = 0xFFFF0000L;
954         int targLev1 = (int)(targCEshifted & mask);
955         int patLev1 = (int)(patCEshifted & mask);
956         if (targLev1 != patLev1) {
957             if (targLev1 == 0) {
958                 return CE_SKIP_TARG;
959             }
960             if (patLev1 == 0
961                     && compareType == ElementComparisonType.ANY_BASE_WEIGHT_IS_WILDCARD) {
962                 return CE_SKIP_PATN;
963             }
964             return CE_NO_MATCH;
965         }
966 
967         mask = 0x0000FFFFL;
968         int targLev2 = (int)(targCEshifted & mask);
969         int patLev2 = (int)(patCEshifted & mask);
970         if (targLev2 != patLev2) {
971             if (targLev2 == 0) {
972                 return CE_SKIP_TARG;
973             }
974             if (patLev2 == 0
975                     && compareType == ElementComparisonType.ANY_BASE_WEIGHT_IS_WILDCARD) {
976                 return CE_SKIP_PATN;
977             }
978             return (patLev2 == CE_LEVEL2_BASE ||
979                     (compareType == ElementComparisonType.ANY_BASE_WEIGHT_IS_WILDCARD &&
980                         targLev2 == CE_LEVEL2_BASE)) ? CE_MATCH : CE_NO_MATCH;
981         }
982 
983         mask = 0xFFFF0000L;
984         int targLev3 = (int)(targCE & mask);
985         int patLev3 = (int)(patCE & mask);
986         if (targLev3 != patLev3) {
987             return (patLev3 == CE_LEVEL3_BASE ||
988                     (compareType == ElementComparisonType.ANY_BASE_WEIGHT_IS_WILDCARD &&
989                         targLev3 == CE_LEVEL3_BASE) )? CE_MATCH: CE_NO_MATCH;
990         }
991 
992         return CE_MATCH;
993     }
994 
995     /**
996      * An object used for receiving matched index in search() and
997      * searchBackwards().
998      */
999     private static class Match {
1000         int start_ = -1;
1001         int limit_ = -1;
1002     }
1003 
search(int startIdx, Match m)1004     private boolean search(int startIdx, Match m) {
1005         // Input parameter sanity check.
1006         if (pattern_.CELength_ == 0
1007                 || startIdx < search_.beginIndex()
1008                 || startIdx > search_.endIndex()) {
1009             throw new IllegalArgumentException("search(" + startIdx + ", m) - expected position to be between " +
1010                     search_.beginIndex() + " and " + search_.endIndex());
1011         }
1012 
1013         if (pattern_.PCE_ == null) {
1014             initializePatternPCETable();
1015         }
1016 
1017         textIter_.setOffset(startIdx);
1018         CEBuffer ceb = new CEBuffer(this);
1019 
1020         int targetIx = 0;
1021         CEI targetCEI = null;
1022         int patIx;
1023         boolean found;
1024 
1025         int mStart = -1;
1026         int mLimit = -1;
1027         int minLimit;
1028         int maxLimit;
1029 
1030         // Outer loop moves over match starting positions in the
1031         //      target CE space.
1032         // Here we see the target as a sequence of collation elements, resulting from the following:
1033         // 1. Target characters were decomposed, and (if appropriate) other compressions and expansions are applied
1034         //    (for example, digraphs such as IJ may be broken into two characters).
1035         // 2. An int64_t CE weight is determined for each resulting unit (high 16 bits are primary strength, next
1036         //    16 bits are secondary, next 16 (the high 16 bits of the low 32-bit half) are tertiary. Any of these
1037         //    fields that are for strengths below that of the collator are set to 0. If this makes the int64_t
1038         //    CE weight 0 (as for a combining diacritic with secondary weight when the collator strength is primary),
1039         //    then the CE is deleted, so the following code sees only CEs that are relevant.
1040         // For each CE, the lowIndex and highIndex correspond to where this CE begins and ends in the original text.
1041         // If lowIndex==highIndex, either the CE resulted from an expansion/decomposition of one of the original text
1042         // characters, or the CE marks the limit of the target text (in which case the CE weight is UCOL_PROCESSED_NULLORDER).
1043         for (targetIx = 0; ; targetIx++) {
1044             found = true;
1045             // Inner loop checks for a match beginning at each
1046             // position from the outer loop.
1047             int targetIxOffset = 0;
1048             long patCE = 0;
1049             // For targetIx > 0, this ceb.get gets a CE that is as far back in the ring buffer
1050             // (compared to the last CE fetched for the previous targetIx value) as we need to go
1051             // for this targetIx value, so if it is non-NULL then other ceb.get calls should be OK.
1052             CEI firstCEI = ceb.get(targetIx);
1053             if (firstCEI == null) {
1054                 throw new ICUException("CEBuffer.get(" + targetIx + ") returned null.");
1055             }
1056 
1057             for (patIx = 0; patIx < pattern_.PCELength_; patIx++) {
1058                 patCE = pattern_.PCE_[patIx];
1059                 targetCEI = ceb.get(targetIx + patIx + targetIxOffset);
1060                 // Compare CE from target string with CE from the pattern.
1061                 // Note that the target CE will be UCOL_PROCESSED_NULLORDER if we reach the end of input,
1062                 // which will fail the compare, below.
1063                 int ceMatch = compareCE64s(targetCEI.ce_, patCE, search_.elementComparisonType_);
1064                 if (ceMatch == CE_NO_MATCH) {
1065                     found = false;
1066                     break;
1067                 } else if (ceMatch > CE_NO_MATCH) {
1068                     if (ceMatch == CE_SKIP_TARG) {
1069                         // redo with same patCE, next targCE
1070                         patIx--;
1071                         targetIxOffset++;
1072                     } else { // ceMatch == CE_SKIP_PATN
1073                         // redo with same targCE, next patCE
1074                         targetIxOffset--;
1075                     }
1076                 }
1077             }
1078             targetIxOffset += pattern_.PCELength_; // this is now the offset in target CE space to end of the match so far
1079 
1080             if (!found && ((targetCEI == null) || (targetCEI.ce_ != CollationPCE.PROCESSED_NULLORDER))) {
1081                 // No match at this targetIx.  Try again at the next.
1082                 continue;
1083             }
1084 
1085             if (!found) {
1086                 // No match at all, we have run off the end of the target text.
1087                 break;
1088             }
1089 
1090             // We have found a match in CE space.
1091             // Now determine the bounds in string index space.
1092             // There still is a chance of match failure if the CE range not correspond to
1093             // an acceptable character range.
1094             //
1095             CEI lastCEI = ceb.get(targetIx + targetIxOffset -1);
1096 
1097             mStart = firstCEI.lowIndex_;
1098             minLimit = lastCEI.lowIndex_;
1099 
1100             // Look at the CE following the match.  If it is UCOL_NULLORDER the match
1101             // extended to the end of input, and the match is good.
1102 
1103             // Look at the high and low indices of the CE following the match. If
1104             // they are the same it means one of two things:
1105             //    1. The match extended to the last CE from the target text, which is OK, or
1106             //    2. The last CE that was part of the match is in an expansion that extends
1107             //       to the first CE after the match. In this case, we reject the match.
1108             CEI nextCEI = null;
1109             if (search_.elementComparisonType_ == ElementComparisonType.STANDARD_ELEMENT_COMPARISON) {
1110                 nextCEI = ceb.get(targetIx + targetIxOffset);
1111                 maxLimit = nextCEI.lowIndex_;
1112                 if (nextCEI.lowIndex_ == nextCEI.highIndex_ && nextCEI.ce_ != CollationPCE.PROCESSED_NULLORDER) {
1113                     found = false;
1114                 }
1115             } else {
1116                 for (;; ++targetIxOffset) {
1117                     nextCEI = ceb.get(targetIx + targetIxOffset);
1118                     maxLimit = nextCEI.lowIndex_;
1119                     // If we are at the end of the target too, match succeeds
1120                     if (nextCEI.ce_ == CollationPCE.PROCESSED_NULLORDER) {
1121                         break;
1122                     }
1123                     // As long as the next CE has primary weight of 0,
1124                     // it is part of the last target element matched by the pattern;
1125                     // make sure it can be part of a match with the last patCE
1126                     if ((((nextCEI.ce_) >>> 32) & 0xFFFF0000L) == 0) {
1127                         int ceMatch = compareCE64s(nextCEI.ce_, patCE, search_.elementComparisonType_);
1128                         if (ceMatch == CE_NO_MATCH || ceMatch == CE_SKIP_PATN ) {
1129                             found = false;
1130                             break;
1131                         }
1132                     // If lowIndex == highIndex, this target CE is part of an expansion of the last matched
1133                     // target element, but it has non-zero primary weight => match fails
1134                     } else if ( nextCEI.lowIndex_ == nextCEI.highIndex_ ) {
1135                         found = false;
1136                         break;
1137                     // Else the target CE is not part of an expansion of the last matched element, match succeeds
1138                     } else {
1139                         break;
1140                     }
1141                 }
1142             }
1143 
1144             // Check for the start of the match being within a combining sequence.
1145             // This can happen if the pattern itself begins with a combining char, and
1146             // the match found combining marks in the target text that were attached
1147             // to something else.
1148             // This type of match should be rejected for not completely consuming a
1149             // combining sequence.
1150             if (!isBreakBoundary(mStart)) {
1151                 found = false;
1152             }
1153 
1154             // Check for the start of the match being within an Collation Element Expansion,
1155             // meaning that the first char of the match is only partially matched.
1156             // With expansions, the first CE will report the index of the source
1157             // character, and all subsequent (expansions) CEs will report the source index of the
1158             // _following_ character.
1159             int secondIx = firstCEI.highIndex_;
1160             if (mStart == secondIx) {
1161                 found = false;
1162             }
1163 
1164             // Allow matches to end in the middle of a grapheme cluster if the following
1165             // conditions are met; this is needed to make prefix search work properly in
1166             // Indic, see #11750
1167             // * the default breakIter is being used
1168             // * the next collation element after this combining sequence
1169             //   - has non-zero primary weight
1170             //   - corresponds to a separate character following the one at end of the current match
1171             //   (the second of these conditions, and perhaps both, may be redundant given the
1172             //   subsequent check for normalization boundary; however they are likely much faster
1173             //   tests in any case)
1174             // * the match limit is a normalization boundary
1175             boolean allowMidclusterMatch =
1176                             breakIterator == null &&
1177                             (((nextCEI.ce_) >>> 32) & 0xFFFF0000L) != 0 &&
1178                             maxLimit >= lastCEI.highIndex_ && nextCEI.highIndex_ > maxLimit &&
1179                             (nfd_.hasBoundaryBefore(codePointAt(targetText, maxLimit)) ||
1180                                     nfd_.hasBoundaryAfter(codePointBefore(targetText, maxLimit)));
1181 
1182             // If those conditions are met, then:
1183             // * do NOT advance the candidate match limit (mLimit) to a break boundary; however
1184             //   the match limit may be backed off to a previous break boundary. This handles
1185             //   cases in which mLimit includes target characters that are ignorable with current
1186             //   settings (such as space) and which extend beyond the pattern match.
1187             // * do NOT require that end of the combining sequence not extend beyond the match in CE space
1188             // * do NOT require that match limit be on a breakIter boundary
1189 
1190             // Advance the match end position to the first acceptable match boundary.
1191             // This advances the index over any combining characters.
1192             mLimit = maxLimit;
1193             if (minLimit < maxLimit) {
1194                 // When the last CE's low index is same with its high index, the CE is likely
1195                 // a part of expansion. In this case, the index is located just after the
1196                 // character corresponding to the CEs compared above. If the index is right
1197                 // at the break boundary, move the position to the next boundary will result
1198                 // incorrect match length when there are ignorable characters exist between
1199                 // the position and the next character produces CE(s). See ticket#8482.
1200                 if (minLimit == lastCEI.highIndex_ && isBreakBoundary(minLimit)) {
1201                     mLimit = minLimit;
1202                 } else {
1203                     int nba = nextBoundaryAfter(minLimit);
1204                     // Note that we can have nba < maxLimit && nba >= minLImit, in which
1205                     // case we want to set mLimit to nba regardless of allowMidclusterMatch
1206                     // (i.e. we back off mLimit to the previous breakIterator boundary).
1207                     if (nba >= lastCEI.highIndex_ && (!allowMidclusterMatch || nba < maxLimit)) {
1208                         mLimit = nba;
1209                     }
1210                 }
1211             }
1212 
1213             if (!allowMidclusterMatch) {
1214                 // If advancing to the end of a combining sequence in character indexing space
1215                 // advanced us beyond the end of the match in CE space, reject this match.
1216                 if (mLimit > maxLimit) {
1217                     found = false;
1218                 }
1219 
1220                 if (!isBreakBoundary(mLimit)) {
1221                     found = false;
1222                 }
1223             }
1224 
1225             if (!checkIdentical(mStart, mLimit)) {
1226                 found = false;
1227             }
1228 
1229             if (found) {
1230                 break;
1231             }
1232         }
1233 
1234         // All Done.  Store back the match bounds to the caller.
1235         //
1236         if (found == false) {
1237             mLimit = -1;
1238             mStart = -1;
1239         }
1240 
1241         if (m != null) {
1242             m.start_ = mStart;
1243             m.limit_ = mLimit;
1244         }
1245 
1246         return found;
1247     }
1248 
codePointAt(CharacterIterator iter, int index)1249     private static int codePointAt(CharacterIterator iter, int index) {
1250         int currentIterIndex = iter.getIndex();
1251         char codeUnit = iter.setIndex(index);
1252         int cp = codeUnit;
1253         if (Character.isHighSurrogate(codeUnit)) {
1254             char nextUnit = iter.next();
1255             if (Character.isLowSurrogate(nextUnit)) {
1256                 cp = Character.toCodePoint(codeUnit, nextUnit);
1257             }
1258         }
1259         iter.setIndex(currentIterIndex);  // restore iter position
1260         return cp;
1261     }
1262 
codePointBefore(CharacterIterator iter, int index)1263     private static int codePointBefore(CharacterIterator iter, int index) {
1264         int currentIterIndex = iter.getIndex();
1265         iter.setIndex(index);
1266         char codeUnit = iter.previous();
1267         int cp = codeUnit;
1268         if (Character.isLowSurrogate(codeUnit)) {
1269             char prevUnit = iter.previous();
1270             if (Character.isHighSurrogate(prevUnit)) {
1271                 cp = Character.toCodePoint(prevUnit, codeUnit);
1272             }
1273         }
1274         iter.setIndex(currentIterIndex);  // restore iter position
1275         return cp;
1276     }
1277 
searchBackwards(int startIdx, Match m)1278     private boolean searchBackwards(int startIdx, Match m) {
1279         //ICU4C_TODO comment:  reject search patterns beginning with a combining char.
1280 
1281         // Input parameter sanity check.
1282         if (pattern_.CELength_ == 0
1283                 || startIdx < search_.beginIndex()
1284                 || startIdx > search_.endIndex()) {
1285             throw new IllegalArgumentException("searchBackwards(" + startIdx + ", m) - expected position to be between " +
1286                     search_.beginIndex() + " and " + search_.endIndex());
1287         }
1288 
1289         if (pattern_.PCE_ == null) {
1290             initializePatternPCETable();
1291         }
1292 
1293         CEBuffer ceb = new CEBuffer(this);
1294         int targetIx = 0;
1295 
1296         /*
1297          * Pre-load the buffer with the CE's for the grapheme
1298          * after our starting position so that we're sure that
1299          * we can look at the CE following the match when we
1300          * check the match boundaries.
1301          *
1302          * This will also pre-fetch the first CE that we'll
1303          * consider for the match.
1304          */
1305         if (startIdx < search_.endIndex()) {
1306             BreakIterator bi = search_.internalBreakIter_;
1307             int next = bi.following(startIdx);
1308 
1309             textIter_.setOffset(next);
1310 
1311             for (targetIx = 0; ; targetIx++) {
1312                 if (ceb.getPrevious(targetIx).lowIndex_ < startIdx) {
1313                     break;
1314                 }
1315             }
1316         } else {
1317             textIter_.setOffset(startIdx);
1318         }
1319 
1320         CEI targetCEI = null;
1321         int patIx;
1322         boolean found;
1323 
1324         int limitIx = targetIx;
1325         int mStart = -1;
1326         int mLimit = -1;
1327         int minLimit;
1328         int maxLimit;
1329 
1330         // Outer loop moves over match starting positions in the
1331         //      target CE space.
1332         // Here, targetIx values increase toward the beginning of the base text (i.e. we get the text CEs in reverse order).
1333         // But  patIx is 0 at the beginning of the pattern and increases toward the end.
1334         // So this loop performs a comparison starting with the end of pattern, and prcessd toward the beginning of the pattern
1335         // and the beginning of the base text.
1336         for (targetIx = limitIx; ; targetIx++) {
1337             found = true;
1338             // For targetIx > limitIx, this ceb.getPrevious gets a CE that is as far back in the ring buffer
1339             // (compared to the last CE fetched for the previous targetIx value) as we need to go
1340             // for this targetIx value, so if it is non-NULL then other ceb.getPrevious calls should be OK.
1341             CEI lastCEI = ceb.getPrevious(targetIx);
1342             if (lastCEI == null) {
1343                 throw new ICUException("CEBuffer.getPrevious(" + targetIx + ") returned null.");
1344             }
1345             // Inner loop checks for a match beginning at each
1346             // position from the outer loop.
1347             int targetIxOffset = 0;
1348             for (patIx = pattern_.PCELength_ - 1; patIx >= 0; patIx--) {
1349                 long patCE = pattern_.PCE_[patIx];
1350 
1351                 targetCEI = ceb.getPrevious(targetIx + pattern_.PCELength_ - 1 - patIx + targetIxOffset);
1352                 // Compare CE from target string with CE from the pattern.
1353                 // Note that the target CE will be UCOL_NULLORDER if we reach the end of input,
1354                 // which will fail the compare, below.
1355                 int ceMatch = compareCE64s(targetCEI.ce_, patCE, search_.elementComparisonType_);
1356                 if (ceMatch == CE_NO_MATCH) {
1357                     found = false;
1358                     break;
1359                 } else if (ceMatch > CE_NO_MATCH) {
1360                     if (ceMatch == CE_SKIP_TARG) {
1361                         // redo with same patCE, next targCE
1362                         patIx++;
1363                         targetIxOffset++;
1364                     } else { // ceMatch == CE_SKIP_PATN
1365                         // redo with same targCE, next patCE
1366                         targetIxOffset--;
1367                     }
1368                 }
1369             }
1370 
1371             if (!found && ((targetCEI == null) || (targetCEI.ce_ != CollationPCE.PROCESSED_NULLORDER))) {
1372                 // No match at this targetIx.  Try again at the next.
1373                 continue;
1374             }
1375 
1376             if (!found) {
1377                 // No match at all, we have run off the end of the target text.
1378                 break;
1379             }
1380 
1381             // We have found a match in CE space.
1382             // Now determine the bounds in string index space.
1383             // There still is a chance of match failure if the CE range not correspond to
1384             // an acceptable character range.
1385             //
1386             CEI firstCEI = ceb.getPrevious(targetIx + pattern_.PCELength_ - 1 + targetIxOffset);
1387             mStart = firstCEI.lowIndex_;
1388 
1389             // Check for the start of the match being within a combining sequence.
1390             // This can happen if the pattern itself begins with a combining char, and
1391             // the match found combining marks in the target text that were attached
1392             // to something else.
1393             // This type of match should be rejected for not completely consuming a
1394             // combining sequence.
1395             if (!isBreakBoundary(mStart)) {
1396                 found = false;
1397             }
1398 
1399             // Look at the high index of the first CE in the match. If it's the same as the
1400             // low index, the first CE in the match is in the middle of an expansion.
1401             if (mStart == firstCEI.highIndex_) {
1402                 found = false;
1403             }
1404 
1405             minLimit = lastCEI.lowIndex_;
1406 
1407             if (targetIx > 0) {
1408                 // Look at the CE following the match.  If it is UCOL_NULLORDER the match
1409                 // extended to the end of input, and the match is good.
1410 
1411                 // Look at the high and low indices of the CE following the match. If
1412                 // they are the same it means one of two things:
1413                 //    1. The match extended to the last CE from the target text, which is OK, or
1414                 //    2. The last CE that was part of the match is in an expansion that extends
1415                 //       to the first CE after the match. In this case, we reject the match.
1416                 CEI nextCEI  = ceb.getPrevious(targetIx - 1);
1417 
1418                 if (nextCEI.lowIndex_ == nextCEI.highIndex_ && nextCEI.ce_ != CollationPCE.PROCESSED_NULLORDER) {
1419                     found = false;
1420                 }
1421 
1422                 mLimit = maxLimit = nextCEI.lowIndex_;
1423 
1424                 // Allow matches to end in the middle of a grapheme cluster if the following
1425                 // conditions are met; this is needed to make prefix search work properly in
1426                 // Indic, see #11750
1427                 // * the default breakIter is being used
1428                 // * the next collation element after this combining sequence
1429                 //   - has non-zero primary weight
1430                 //   - corresponds to a separate character following the one at end of the current match
1431                 //   (the second of these conditions, and perhaps both, may be redundant given the
1432                 //   subsequent check for normalization boundary; however they are likely much faster
1433                 //   tests in any case)
1434                 // * the match limit is a normalization boundary
1435                 boolean allowMidclusterMatch =
1436                                 breakIterator == null &&
1437                                 (((nextCEI.ce_) >>> 32) & 0xFFFF0000L) != 0 &&
1438                                 maxLimit >= lastCEI.highIndex_ && nextCEI.highIndex_ > maxLimit &&
1439                                 (nfd_.hasBoundaryBefore(codePointAt(targetText, maxLimit)) ||
1440                                         nfd_.hasBoundaryAfter(codePointBefore(targetText, maxLimit)));
1441 
1442                 // If those conditions are met, then:
1443                 // * do NOT advance the candidate match limit (mLimit) to a break boundary; however
1444                 //   the match limit may be backed off to a previous break boundary. This handles
1445                 //   cases in which mLimit includes target characters that are ignorable with current
1446                 //   settings (such as space) and which extend beyond the pattern match.
1447                 // * do NOT require that end of the combining sequence not extend beyond the match in CE space
1448                 // * do NOT require that match limit be on a breakIter boundary
1449 
1450                 // Advance the match end position to the first acceptable match boundary.
1451                 // This advances the index over any combining characters.
1452                 if (minLimit < maxLimit) {
1453                     int nba = nextBoundaryAfter(minLimit);
1454                     // Note that we can have nba < maxLimit && nba >= minLImit, in which
1455                     // case we want to set mLimit to nba regardless of allowMidclusterMatch
1456                     // (i.e. we back off mLimit to the previous breakIterator boundary).
1457                     if (nba >= lastCEI.highIndex_ && (!allowMidclusterMatch || nba < maxLimit)) {
1458                         mLimit = nba;
1459                     }
1460                 }
1461 
1462                 if (!allowMidclusterMatch) {
1463                     // If advancing to the end of a combining sequence in character indexing space
1464                     // advanced us beyond the end of the match in CE space, reject this match.
1465                     if (mLimit > maxLimit) {
1466                         found = false;
1467                     }
1468 
1469                     // Make sure the end of the match is on a break boundary
1470                     if (!isBreakBoundary(mLimit)) {
1471                         found = false;
1472                     }
1473                 }
1474 
1475             } else {
1476                 // No non-ignorable CEs after this point.
1477                 // The maximum position is detected by boundary after
1478                 // the last non-ignorable CE. Combining sequence
1479                 // across the start index will be truncated.
1480                 int nba = nextBoundaryAfter(minLimit);
1481                 mLimit = maxLimit = (nba > 0) && (startIdx > nba) ? nba : startIdx;
1482             }
1483 
1484             if (!checkIdentical(mStart, mLimit)) {
1485                 found = false;
1486             }
1487 
1488             if (found) {
1489                 break;
1490             }
1491         }
1492 
1493         // All Done.  Store back the match bounds to the caller.
1494         //
1495         if (found == false) {
1496             mLimit = -1;
1497             mStart = -1;
1498         }
1499 
1500         if (m != null) {
1501             m.start_ = mStart;
1502             m.limit_ = mLimit;
1503         }
1504 
1505         return found;
1506     }
1507 
1508     // Java porting note:
1509     //
1510     // ICU4C usearch_handleNextExact() is identical to usearch_handleNextCanonical()
1511     // for the linear search implementation. The differences are addressed in search().
1512     //
handleNextExact()1513     private boolean handleNextExact() {
1514         return handleNextCommonImpl();
1515     }
1516 
handleNextCanonical()1517     private boolean handleNextCanonical() {
1518         return handleNextCommonImpl();
1519     }
1520 
handleNextCommonImpl()1521     private boolean handleNextCommonImpl() {
1522         int textOffset = textIter_.getOffset();
1523         Match match = new Match();
1524 
1525         if (search(textOffset, match)) {
1526             search_.matchedIndex_ = match.start_;
1527             search_.setMatchedLength(match.limit_ - match.start_);
1528             return true;
1529         } else {
1530             setMatchNotFound();
1531             return false;
1532         }
1533     }
1534 
1535     // Java porting note:
1536     //
1537     // ICU4C usearch_handlePreviousExact() is identical to usearch_handlePreviousCanonical()
1538     // for the linear search implementation. The differences are addressed in searchBackwards().
1539     //
handlePreviousExact()1540     private boolean handlePreviousExact() {
1541         return handlePreviousCommonImpl();
1542     }
1543 
handlePreviousCanonical()1544     private boolean handlePreviousCanonical() {
1545         return handlePreviousCommonImpl();
1546     }
1547 
handlePreviousCommonImpl()1548     private boolean handlePreviousCommonImpl() {
1549         int textOffset;
1550 
1551         if (search_.isOverlap_) {
1552             if (search_.matchedIndex_ != DONE) {
1553                 textOffset = search_.matchedIndex_ + search_.matchedLength() - 1;
1554             } else {
1555                 // move the start position at the end of possible match
1556                 initializePatternPCETable();
1557                 if (!initTextProcessedIter()) {
1558                     setMatchNotFound();
1559                     return false;
1560                 }
1561                 for (int nPCEs = 0; nPCEs < pattern_.PCELength_ - 1; nPCEs++) {
1562                     long pce = textProcessedIter_.nextProcessed(null);
1563                     if (pce == CollationPCE.PROCESSED_NULLORDER) {
1564                         // at the end of the text
1565                         break;
1566                     }
1567                 }
1568                 textOffset = textIter_.getOffset();
1569             }
1570         } else {
1571             textOffset = textIter_.getOffset();
1572         }
1573 
1574         Match match = new Match();
1575         if (searchBackwards(textOffset, match)) {
1576             search_.matchedIndex_ = match.start_;
1577             search_.setMatchedLength(match.limit_ - match.start_);
1578             return true;
1579         } else {
1580             setMatchNotFound();
1581             return false;
1582         }
1583     }
1584 
1585     /**
1586      * Gets a substring out of a CharacterIterator
1587      *
1588      * Java porting note: Not available in ICU4C
1589      *
1590      * @param text CharacterIterator
1591      * @param start start offset
1592      * @param length of substring
1593      * @return substring from text starting at start and length length
1594      */
getString(CharacterIterator text, int start, int length)1595     private static final String getString(CharacterIterator text, int start, int length) {
1596         StringBuilder result = new StringBuilder(length);
1597         int offset = text.getIndex();
1598         text.setIndex(start);
1599         for (int i = 0; i < length; i++) {
1600             result.append(text.current());
1601             text.next();
1602         }
1603         text.setIndex(offset);
1604         return result.toString();
1605     }
1606 
1607     /**
1608      * Java port of ICU4C struct UPattern (usrchimp.h)
1609      */
1610     private static final class Pattern {
1611         /** Pattern string */
1612         String text_;
1613 
1614         long[] PCE_;
1615         int PCELength_ = 0;
1616 
1617         // TODO: We probably do not need CE_ / CELength_
1618         @SuppressWarnings("unused")
1619         int[] CE_;
1620         int CELength_ = 0;
1621 
1622         // *** Boyer-Moore ***
1623         // boolean hasPrefixAccents_ = false;
1624         // boolean hasSuffixAccents_ = false;
1625         // int defaultShiftSize_;
1626         // char[] shift_;
1627         // char[] backShift_;
1628 
Pattern(String pattern)1629         protected Pattern(String pattern) {
1630             text_ = pattern;
1631         }
1632     }
1633 
1634     /**
1635      * Java port of ICU4C UCollationPCE (usrchimp.h)
1636      */
1637     private static class CollationPCE {
1638         public static final long PROCESSED_NULLORDER = -1;
1639 
1640         private static final int DEFAULT_BUFFER_SIZE = 16;
1641         private static final int BUFFER_GROW = 8;
1642 
1643         // Note: PRIMARYORDERMASK is also duplicated in StringSearch class
1644         private static final int PRIMARYORDERMASK = 0xffff0000;
1645         private static final int CONTINUATION_MARKER = 0xc0;
1646 
1647         private PCEBuffer pceBuffer_ = new PCEBuffer();
1648         private CollationElementIterator cei_;
1649         private int strength_;
1650         private boolean toShift_;
1651         private boolean isShifted_;
1652         private int variableTop_;
1653 
CollationPCE(CollationElementIterator iter)1654         public CollationPCE(CollationElementIterator iter) {
1655             init(iter);
1656         }
1657 
init(CollationElementIterator iter)1658         public void init(CollationElementIterator iter) {
1659             cei_ = iter;
1660             init(iter.getRuleBasedCollator());
1661         }
1662 
init(RuleBasedCollator coll)1663         private void init(RuleBasedCollator coll) {
1664             strength_ = coll.getStrength();
1665             toShift_ = coll.isAlternateHandlingShifted();
1666             isShifted_ = false;
1667             variableTop_ = coll.getVariableTop();
1668         }
1669 
1670         @SuppressWarnings("fallthrough")
processCE(int ce)1671         private long processCE(int ce) {
1672             long primary = 0, secondary = 0, tertiary = 0, quaternary = 0;
1673 
1674             // This is clean, but somewhat slow...
1675             // We could apply the mask to ce and then
1676             // just get all three orders...
1677             switch (strength_) {
1678             default:
1679                 tertiary = CollationElementIterator.tertiaryOrder(ce);
1680                 /* note fall-through */
1681 
1682             case Collator.SECONDARY:
1683                 secondary = CollationElementIterator.secondaryOrder(ce);
1684                 /* note fall-through */
1685 
1686             case Collator.PRIMARY:
1687                 primary = CollationElementIterator.primaryOrder(ce);
1688             }
1689 
1690             // **** This should probably handle continuations too. ****
1691             // **** That means that we need 24 bits for the primary ****
1692             // **** instead of the 16 that we're currently using. ****
1693             // **** So we can lay out the 64 bits as: 24.12.12.16. ****
1694             // **** Another complication with continuations is that ****
1695             // **** the *second* CE is marked as a continuation, so ****
1696             // **** we always have to peek ahead to know how long ****
1697             // **** the primary is... ****
1698             if ((toShift_ && variableTop_ > ce && primary != 0) || (isShifted_ && primary == 0)) {
1699 
1700                 if (primary == 0) {
1701                     return CollationElementIterator.IGNORABLE;
1702                 }
1703 
1704                 if (strength_ >= Collator.QUATERNARY) {
1705                     quaternary = primary;
1706                 }
1707 
1708                 primary = secondary = tertiary = 0;
1709                 isShifted_ = true;
1710             } else {
1711                 if (strength_ >= Collator.QUATERNARY) {
1712                     quaternary = 0xFFFF;
1713                 }
1714 
1715                 isShifted_ = false;
1716             }
1717 
1718             return primary << 48 | secondary << 32 | tertiary << 16 | quaternary;
1719         }
1720 
1721         /**
1722          * Get the processed ordering priority of the next collation element in the text.
1723          * A single character may contain more than one collation element.
1724          *
1725          * Note: This is equivalent to
1726          * UCollationPCE::nextProcessed(int32_t *ixLow, int32_t *ixHigh, UErrorCode *status);
1727          *
1728          * @param range receiving the iterator index before/after fetching the CE.
1729          * @return The next collation elements ordering, otherwise returns PROCESSED_NULLORDER
1730          *         if an error has occurred or if the end of string has been reached
1731          */
nextProcessed(Range range)1732         public long nextProcessed(Range range) {
1733             long result = CollationElementIterator.IGNORABLE;
1734             int low = 0, high = 0;
1735 
1736             pceBuffer_.reset();
1737 
1738             do {
1739                 low = cei_.getOffset();
1740                 int ce = cei_.next();
1741                 high = cei_.getOffset();
1742 
1743                 if (ce == CollationElementIterator.NULLORDER) {
1744                      result = PROCESSED_NULLORDER;
1745                      break;
1746                 }
1747 
1748                 result = processCE(ce);
1749             } while (result == CollationElementIterator.IGNORABLE);
1750 
1751             if (range != null) {
1752                 range.ixLow_ = low;
1753                 range.ixHigh_ = high;
1754             }
1755 
1756             return result;
1757         }
1758 
1759         /**
1760          * Get the processed ordering priority of the previous collation element in the text.
1761          * A single character may contain more than one collation element.
1762          *
1763          * Note: This is equivalent to
1764          * UCollationPCE::previousProcessed(int32_t *ixLow, int32_t *ixHigh, UErrorCode *status);
1765          *
1766          * @param range receiving the iterator index before/after fetching the CE.
1767          * @return The previous collation elements ordering, otherwise returns
1768          *         PROCESSED_NULLORDER if an error has occurred or if the start of
1769          *         string has been reached.
1770          */
previousProcessed(Range range)1771         public long previousProcessed(Range range) {
1772             long result = CollationElementIterator.IGNORABLE;
1773             int low = 0, high = 0;
1774 
1775             // pceBuffer_.reset();
1776 
1777             while (pceBuffer_.empty()) {
1778                 // buffer raw CEs up to non-ignorable primary
1779                 RCEBuffer rceb = new RCEBuffer();
1780                 int ce;
1781 
1782                 boolean finish = false;
1783 
1784                 // **** do we need to reset rceb, or will it always be empty at this point ****
1785                 do {
1786                     high = cei_.getOffset();
1787                     ce = cei_.previous();
1788                     low = cei_.getOffset();
1789 
1790                     if (ce == CollationElementIterator.NULLORDER) {
1791                         if (!rceb.empty()) {
1792                             break;
1793                         }
1794 
1795                         finish = true;
1796                         break;
1797                     }
1798 
1799                     rceb.put(ce, low, high);
1800                 } while ((ce & PRIMARYORDERMASK) == 0 || isContinuation(ce));
1801 
1802                 if (finish) {
1803                     break;
1804                 }
1805 
1806                 // process the raw CEs
1807                 while (!rceb.empty()) {
1808                     RCEI rcei = rceb.get();
1809 
1810                     result = processCE(rcei.ce_);
1811 
1812                     if (result != CollationElementIterator.IGNORABLE) {
1813                         pceBuffer_.put(result, rcei.low_, rcei.high_);
1814                     }
1815                 }
1816             }
1817 
1818             if (pceBuffer_.empty()) {
1819                 // **** Is -1 the right value for ixLow, ixHigh? ****
1820                 if (range != null) {
1821                     range.ixLow_ = -1;
1822                     range.ixHigh_ = -1;
1823                 }
1824                 return CollationElementIterator.NULLORDER;
1825             }
1826 
1827             PCEI pcei = pceBuffer_.get();
1828 
1829             if (range != null) {
1830                 range.ixLow_ = pcei.low_;
1831                 range.ixHigh_ = pcei.high_;
1832             }
1833 
1834             return pcei.ce_;
1835         }
1836 
isContinuation(int ce)1837         private static boolean isContinuation(int ce) {
1838             return ((ce & CONTINUATION_MARKER) == CONTINUATION_MARKER);
1839         }
1840 
1841         public static final class Range {
1842             int ixLow_;
1843             int ixHigh_;
1844         }
1845 
1846         /** Processed collation element buffer stuff ported from ICU4C ucoleitr.cpp */
1847         private static final class PCEI {
1848             long ce_;
1849             int low_;
1850             int high_;
1851         }
1852 
1853         private static final class PCEBuffer {
1854             private PCEI[] buffer_ = new PCEI[DEFAULT_BUFFER_SIZE];
1855             private int bufferIndex_ = 0;
1856 
reset()1857             void reset() {
1858                 bufferIndex_ = 0;
1859             }
1860 
empty()1861             boolean empty() {
1862                 return bufferIndex_ <= 0;
1863             }
1864 
put(long ce, int ixLow, int ixHigh)1865             void put(long ce, int ixLow, int ixHigh)
1866             {
1867                 if (bufferIndex_ >= buffer_.length) {
1868                     PCEI[] newBuffer = new PCEI[buffer_.length + BUFFER_GROW];
1869                     System.arraycopy(buffer_, 0, newBuffer, 0, buffer_.length);
1870                     buffer_ = newBuffer;
1871                 }
1872                 buffer_[bufferIndex_] = new PCEI();
1873                 buffer_[bufferIndex_].ce_ = ce;
1874                 buffer_[bufferIndex_].low_ = ixLow;
1875                 buffer_[bufferIndex_].high_ = ixHigh;
1876 
1877                 bufferIndex_ += 1;
1878             }
1879 
get()1880             PCEI get() {
1881                 if (bufferIndex_ > 0) {
1882                     return buffer_[--bufferIndex_];
1883                 }
1884                 return null;
1885             }
1886         }
1887 
1888         /** Raw collation element buffer stuff ported from ICU4C ucoleitr.cpp */
1889         private static final class RCEI {
1890             int ce_;
1891             int low_;
1892             int high_;
1893         }
1894 
1895         private static final class RCEBuffer {
1896             private RCEI[] buffer_ = new RCEI[DEFAULT_BUFFER_SIZE];
1897             private int bufferIndex_ = 0;
1898 
empty()1899             boolean empty() {
1900                 return bufferIndex_ <= 0;
1901             }
1902 
put(int ce, int ixLow, int ixHigh)1903             void put(int ce, int ixLow, int ixHigh) {
1904                 if (bufferIndex_ >= buffer_.length) {
1905                     RCEI[] newBuffer = new RCEI[buffer_.length + BUFFER_GROW];
1906                     System.arraycopy(buffer_, 0, newBuffer, 0, buffer_.length);
1907                     buffer_ = newBuffer;
1908                 }
1909                 buffer_[bufferIndex_] = new RCEI();
1910                 buffer_[bufferIndex_].ce_ = ce;
1911                 buffer_[bufferIndex_].low_ = ixLow;
1912                 buffer_[bufferIndex_].high_ = ixHigh;
1913 
1914                 bufferIndex_ += 1;
1915             }
1916 
get()1917             RCEI get() {
1918                 if (bufferIndex_ > 0) {
1919                     return buffer_[--bufferIndex_];
1920                 }
1921                 return null;
1922             }
1923         }
1924     }
1925 
1926     /**
1927      * Java port of ICU4C CEI (usearch.cpp)
1928      *
1929      * CEI  Collation Element + source text index.
1930      *      These structs are kept in the circular buffer.
1931      */
1932     private static class CEI {
1933         long ce_;
1934         int lowIndex_;
1935         int highIndex_;
1936     }
1937 
1938     /**
1939      * CEBuffer A circular buffer of CEs from the text being searched
1940      */
1941     private static class CEBuffer {
1942         // Java porting note: ICU4C uses the size for stack buffer
1943         // static final int DEFAULT_CEBUFFER_SIZE = 96;
1944 
1945         static final int CEBUFFER_EXTRA = 32;
1946         static final int MAX_TARGET_IGNORABLES_PER_PAT_JAMO_L = 8;
1947         static final int MAX_TARGET_IGNORABLES_PER_PAT_OTHER = 3;
1948 
1949         CEI[] buf_;
1950         int bufSize_;
1951         int firstIx_;
1952         int limitIx_;
1953 
1954         // Java porting note: No references in ICU4C implementation
1955         // CollationElementIterator ceIter_;
1956 
1957         StringSearch strSearch_;
1958 
CEBuffer(StringSearch ss)1959         CEBuffer(StringSearch ss) {
1960             strSearch_ = ss;
1961             bufSize_ = ss.pattern_.PCELength_ + CEBUFFER_EXTRA;
1962             if (ss.search_.elementComparisonType_ != ElementComparisonType.STANDARD_ELEMENT_COMPARISON) {
1963                 String patText = ss.pattern_.text_;
1964                 if (patText != null) {
1965                     for (int i = 0; i < patText.length(); i++) {
1966                         char c = patText.charAt(i);
1967                         if (MIGHT_BE_JAMO_L(c)) {
1968                             bufSize_ += MAX_TARGET_IGNORABLES_PER_PAT_JAMO_L;
1969                         } else {
1970                             // No check for surrogates, we might allocate slightly more buffer than necessary.
1971                             bufSize_ += MAX_TARGET_IGNORABLES_PER_PAT_OTHER;
1972                         }
1973                     }
1974                 }
1975             }
1976 
1977             // Not used - see above
1978             // ceIter_ = ss.textIter_;
1979 
1980             firstIx_ = 0;
1981             limitIx_ = 0;
1982 
1983             if (!ss.initTextProcessedIter()) {
1984                 return;
1985             }
1986 
1987             buf_ = new CEI[bufSize_];
1988         }
1989 
1990         // Get the CE with the specified index.
1991         //   Index must be in the range
1992         //             n-history_size < index < n+1
1993         //   where n is the largest index to have been fetched by some previous call to this function.
1994         //   The CE value will be UCOL__PROCESSED_NULLORDER at end of input.
1995         //
get(int index)1996         CEI get(int index) {
1997             int i = index % bufSize_;
1998 
1999             if (index >= firstIx_ && index < limitIx_) {
2000                 // The request was for an entry already in our buffer.
2001                 // Just return it.
2002                 return buf_[i];
2003             }
2004 
2005             // Caller is requesting a new, never accessed before, CE.
2006             // Verify that it is the next one in sequence, which is all
2007             // that is allowed.
2008             if (index != limitIx_) {
2009                 assert(false);
2010                 return null;
2011             }
2012 
2013             // Manage the circular CE buffer indexing
2014             limitIx_++;
2015 
2016             if (limitIx_ - firstIx_ >= bufSize_) {
2017                 // The buffer is full, knock out the lowest-indexed entry.
2018                 firstIx_++;
2019             }
2020 
2021             CollationPCE.Range range = new CollationPCE.Range();
2022             if (buf_[i] == null) {
2023                 buf_[i] = new CEI();
2024             }
2025             buf_[i].ce_ = strSearch_.textProcessedIter_.nextProcessed(range);
2026             buf_[i].lowIndex_ = range.ixLow_;
2027             buf_[i].highIndex_ = range.ixHigh_;
2028 
2029             return buf_[i];
2030         }
2031 
2032         // Get the CE with the specified index.
2033         //   Index must be in the range
2034         //             n-history_size < index < n+1
2035         //   where n is the largest index to have been fetched by some previous call to this function.
2036         //   The CE value will be UCOL__PROCESSED_NULLORDER at end of input.
2037         //
getPrevious(int index)2038         CEI getPrevious(int index) {
2039             int i = index % bufSize_;
2040 
2041             if (index >= firstIx_ && index < limitIx_) {
2042                 // The request was for an entry already in our buffer.
2043                 // Just return it.
2044                 return buf_[i];
2045             }
2046 
2047             // Caller is requesting a new, never accessed before, CE.
2048             // Verify that it is the next one in sequence, which is all
2049             // that is allowed.
2050             if (index != limitIx_) {
2051                 assert(false);
2052                 return null;
2053             }
2054 
2055             // Manage the circular CE buffer indexing
2056             limitIx_++;
2057 
2058             if (limitIx_ - firstIx_ >= bufSize_) {
2059                 // The buffer is full, knock out the lowest-indexed entry.
2060                 firstIx_++;
2061             }
2062 
2063             CollationPCE.Range range = new CollationPCE.Range();
2064             if (buf_[i] == null) {
2065                 buf_[i] = new CEI();
2066             }
2067             buf_[i].ce_ = strSearch_.textProcessedIter_.previousProcessed(range);
2068             buf_[i].lowIndex_ = range.ixLow_;
2069             buf_[i].highIndex_ = range.ixHigh_;
2070 
2071             return buf_[i];
2072         }
2073 
MIGHT_BE_JAMO_L(char c)2074         static boolean MIGHT_BE_JAMO_L(char c) {
2075             return (c >= 0x1100 && c <= 0x115E)
2076                     || (c >= 0x3131 && c <= 0x314E)
2077                     || (c >= 0x3165 && c <= 0x3186);
2078         }
2079     }
2080 }
2081