1 // © 2016 and later: Unicode, Inc. and others. 2 // License & terms of use: http://www.unicode.org/copyright.html 3 /* 4 ******************************************************************************* 5 * Copyright (C) 1996-2016, International Business Machines Corporation and 6 * others. All Rights Reserved. 7 ******************************************************************************* 8 */ 9 package com.ibm.icu.text; 10 11 import java.text.CharacterIterator; 12 import java.text.StringCharacterIterator; 13 import java.util.Locale; 14 15 import com.ibm.icu.util.ICUException; 16 import com.ibm.icu.util.ULocale; 17 18 // Java porting note: 19 // 20 // The ICU4C implementation contains dead code in many places. 21 // While porting the ICU4C linear search implementation, this dead code 22 // was not fully ported. The code blocks tagged by "// *** Boyer-Moore ***" 23 // are those dead code blocks, still available in ICU4C. 24 25 // The ICU4C implementation does not seem to handle UCharacterIterator pointing 26 // to a fragment of text properly. ICU4J uses CharacterIterator to navigate through 27 // the input text. We need to carefully review the code ported from ICU4C 28 // assuming the start index is 0. 29 30 // ICU4C implementation initializes pattern.CE and pattern.PCE. It looks like 31 // CE is no longer used, except in a few places checking CELength. It looks like this 32 // is a leftover from already-disabled Boyer-Moore search code. This Java implementation 33 // preserves the code, but we should clean this up later. 34 35 /** 36 * 37 * <tt>StringSearch</tt> is a {@link SearchIterator} that provides 38 * language-sensitive text searching based on the comparison rules defined 39 * in a {@link RuleBasedCollator} object. 40 * StringSearch ensures that language eccentricity can be 41 * handled, e.g. for the German collator, characters ß and SS will be matched 42 * if case is chosen to be ignored. 43 * See the <a href="https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/main/design/collation/ICU_collation_design.htm"> 44 * "ICU Collation Design Document"</a> for more information. 45 * <p> 46 * There are 2 match options for selection:<br> 47 * Let S' be the sub-string of a text string S between the offsets start and 48 * end [start, end]. 49 * <br> 50 * A pattern string P matches a text string S at the offsets [start, end] 51 * if 52 * <pre> 53 * option 1. Some canonical equivalent of P matches some canonical equivalent 54 * of S' 55 * option 2. P matches S' and if P starts or ends with a combining mark, 56 * there exists no non-ignorable combining mark before or after S? 57 * in S respectively. 58 * </pre> 59 * Option 2. is the default. 60 * <p> 61 * This search has APIs similar to that of other text iteration mechanisms 62 * such as the break iterators in {@link BreakIterator}. Using these 63 * APIs, it is easy to scan through text looking for all occurrences of 64 * a given pattern. This search iterator allows changing of direction by 65 * calling a {@link #reset} followed by a {@link #next} or {@link #previous}. 66 * Though a direction change can occur without calling {@link #reset} first, 67 * this operation comes with some speed penalty. 68 * Match results in the forward direction will match the result matches in 69 * the backwards direction in the reverse order 70 * <p> 71 * {@link SearchIterator} provides APIs to specify the starting position 72 * within the text string to be searched, e.g. {@link SearchIterator#setIndex setIndex}, 73 * {@link SearchIterator#preceding preceding} and {@link SearchIterator#following following}. 74 * Since the starting position will be set as it is specified, please take note that 75 * there are some danger points at which the search may render incorrect 76 * results: 77 * <ul> 78 * <li> In the midst of a substring that requires normalization. 79 * <li> If the following match is to be found, the position should not be the 80 * second character which requires swapping with the preceding 81 * character. Vice versa, if the preceding match is to be found, the 82 * position to search from should not be the first character which 83 * requires swapping with the next character. E.g certain Thai and 84 * Lao characters require swapping. 85 * <li> If a following pattern match is to be found, any position within a 86 * contracting sequence except the first will fail. Vice versa if a 87 * preceding pattern match is to be found, an invalid starting point 88 * would be any character within a contracting sequence except the last. 89 * </ul> 90 * <p> 91 * A {@link BreakIterator} can be used if only matches at logical breaks are desired. 92 * Using a {@link BreakIterator} will only give you results that exactly matches the 93 * boundaries given by the {@link BreakIterator}. For instance the pattern "e" will 94 * not be found in the string "\u00e9" if a character break iterator is used. 95 * <p> 96 * Options are provided to handle overlapping matches. 97 * E.g. In English, overlapping matches produces the result 0 and 2 98 * for the pattern "abab" in the text "ababab", where mutually 99 * exclusive matches only produces the result of 0. 100 * <p> 101 * Options are also provided to implement "asymmetric search" as described in 102 * <a href="http://www.unicode.org/reports/tr10/#Asymmetric_Search"> 103 * UTS #10 Unicode Collation Algorithm</a>, specifically the ElementComparisonType 104 * values. 105 * <p> 106 * Though collator attributes will be taken into consideration while 107 * performing matches, there are no APIs here for setting and getting the 108 * attributes. These attributes can be set by getting the collator 109 * from {@link #getCollator} and using the APIs in {@link RuleBasedCollator}. 110 * Lastly to update <tt>StringSearch</tt> to the new collator attributes, 111 * {@link #reset} has to be called. 112 * <p> 113 * Restriction: <br> 114 * Currently there are no composite characters that consists of a 115 * character with combining class > 0 before a character with combining 116 * class == 0. However, if such a character exists in the future, 117 * <tt>StringSearch</tt> does not guarantee the results for option 1. 118 * <p> 119 * Consult the {@link SearchIterator} documentation for information on 120 * and examples of how to use instances of this class to implement text 121 * searching. 122 * <p> 123 * Note, <tt>StringSearch</tt> is not to be subclassed. 124 * </p> 125 * @see SearchIterator 126 * @see RuleBasedCollator 127 * @author Laura Werner, synwee 128 * @stable ICU 2.0 129 */ 130 // internal notes: all methods do not guarantee the correct status of the 131 // characteriterator. the caller has to maintain the original index position 132 // if necessary. methods could change the index position as it deems fit 133 public final class StringSearch extends SearchIterator { 134 135 private Pattern pattern_; 136 private RuleBasedCollator collator_; 137 138 // positions within the collation element iterator is used to determine 139 // if we are at the start of the text. 140 private CollationElementIterator textIter_; 141 private CollationPCE textProcessedIter_; 142 143 // utility collation element, used throughout program for temporary 144 // iteration. 145 private CollationElementIterator utilIter_; 146 147 private Normalizer2 nfd_; 148 149 private int strength_; 150 int ceMask_; 151 int variableTop_; 152 153 private boolean toShift_; 154 155 // *** Boyer-Moore *** 156 // private char[] canonicalPrefixAccents_; 157 // private char[] canonicalSuffixAccents_; 158 159 /** 160 * Initializes the iterator to use the language-specific rules defined in 161 * the argument collator to search for argument pattern in the argument 162 * target text. The argument <code>breakiter</code> is used to define logical matches. 163 * See super class documentation for more details on the use of the target 164 * text and {@link BreakIterator}. 165 * @param pattern text to look for. 166 * @param target target text to search for pattern. 167 * @param collator {@link RuleBasedCollator} that defines the language rules 168 * @param breakiter A {@link BreakIterator} that is used to determine the 169 * boundaries of a logical match. This argument can be null. 170 * @throws IllegalArgumentException thrown when argument target is null, 171 * or of length 0 172 * @see BreakIterator 173 * @see RuleBasedCollator 174 * @stable ICU 2.0 175 */ StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator, BreakIterator breakiter)176 public StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator, 177 BreakIterator breakiter) { 178 179 // This implementation is ported from ICU4C usearch_open() 180 181 super(target, breakiter); 182 183 // string search does not really work when numeric collation is turned on 184 if (collator.getNumericCollation()) { 185 throw new UnsupportedOperationException("Numeric collation is not supported by StringSearch"); 186 } 187 188 collator_ = collator; 189 strength_ = collator.getStrength(); 190 ceMask_ = getMask(strength_); 191 toShift_ = collator.isAlternateHandlingShifted(); 192 variableTop_ = collator.getVariableTop(); 193 194 nfd_ = Normalizer2.getNFDInstance(); 195 196 pattern_ = new Pattern(pattern); 197 198 search_.setMatchedLength(0); 199 search_.matchedIndex_ = DONE; 200 201 utilIter_ = null; 202 textIter_ = new CollationElementIterator(target, collator); 203 204 textProcessedIter_ = null; 205 206 // This is done by super class constructor 207 /* 208 search_.isOverlap_ = false; 209 search_.isCanonicalMatch_ = false; 210 search_.elementComparisonType_ = ElementComparisonType.STANDARD_ELEMENT_COMPARISON; 211 search_.isForwardSearching_ = true; 212 search_.reset_ = true; 213 */ 214 ULocale collLocale = collator.getLocale(ULocale.VALID_LOCALE); 215 search_.internalBreakIter_ = BreakIterator.getCharacterInstance(collLocale == null ? ULocale.ROOT : collLocale); 216 search_.internalBreakIter_.setText((CharacterIterator)target.clone()); // We need to create a clone 217 218 initialize(); 219 } 220 221 /** 222 * Initializes the iterator to use the language-specific rules defined in 223 * the argument collator to search for argument pattern in the argument 224 * target text. No {@link BreakIterator}s are set to test for logical matches. 225 * @param pattern text to look for. 226 * @param target target text to search for pattern. 227 * @param collator {@link RuleBasedCollator} that defines the language rules 228 * @throws IllegalArgumentException thrown when argument target is null, 229 * or of length 0 230 * @see RuleBasedCollator 231 * @stable ICU 2.0 232 */ StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator)233 public StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator) { 234 this(pattern, target, collator, null); 235 } 236 237 /** 238 * Initializes the iterator to use the language-specific rules and 239 * break iterator rules defined in the argument locale to search for 240 * argument pattern in the argument target text. 241 * @param pattern text to look for. 242 * @param target target text to search for pattern. 243 * @param locale locale to use for language and break iterator rules 244 * @throws IllegalArgumentException thrown when argument target is null, 245 * or of length 0. ClassCastException thrown if the collator for 246 * the specified locale is not a RuleBasedCollator. 247 * @stable ICU 2.0 248 */ StringSearch(String pattern, CharacterIterator target, Locale locale)249 public StringSearch(String pattern, CharacterIterator target, Locale locale) { 250 this(pattern, target, ULocale.forLocale(locale)); 251 } 252 253 /** 254 * Initializes the iterator to use the language-specific rules and 255 * break iterator rules defined in the argument locale to search for 256 * argument pattern in the argument target text. 257 * See super class documentation for more details on the use of the target 258 * text and {@link BreakIterator}. 259 * @param pattern text to look for. 260 * @param target target text to search for pattern. 261 * @param locale locale to use for language and break iterator rules 262 * @throws IllegalArgumentException thrown when argument target is null, 263 * or of length 0. ClassCastException thrown if the collator for 264 * the specified locale is not a RuleBasedCollator. 265 * @see BreakIterator 266 * @see RuleBasedCollator 267 * @see SearchIterator 268 * @stable ICU 3.2 269 */ StringSearch(String pattern, CharacterIterator target, ULocale locale)270 public StringSearch(String pattern, CharacterIterator target, ULocale locale) { 271 this(pattern, target, (RuleBasedCollator) Collator.getInstance(locale), null); 272 } 273 274 /** 275 * Initializes the iterator to use the language-specific rules and 276 * break iterator rules defined in the default locale to search for 277 * argument pattern in the argument target text. 278 * @param pattern text to look for. 279 * @param target target text to search for pattern. 280 * @throws IllegalArgumentException thrown when argument target is null, 281 * or of length 0. ClassCastException thrown if the collator for 282 * the default locale is not a RuleBasedCollator. 283 * @stable ICU 2.0 284 */ StringSearch(String pattern, String target)285 public StringSearch(String pattern, String target) { 286 this(pattern, new StringCharacterIterator(target), 287 (RuleBasedCollator) Collator.getInstance(), null); 288 } 289 290 /** 291 * Gets the {@link RuleBasedCollator} used for the language rules. 292 * <p> 293 * Since <tt>StringSearch</tt> depends on the returned {@link RuleBasedCollator}, any 294 * changes to the {@link RuleBasedCollator} result should follow with a call to 295 * either {@link #reset()} or {@link #setCollator(RuleBasedCollator)} to ensure the correct 296 * search behavior. 297 * </p> 298 * @return {@link RuleBasedCollator} used by this <tt>StringSearch</tt> 299 * @see RuleBasedCollator 300 * @see #setCollator 301 * @stable ICU 2.0 302 */ getCollator()303 public RuleBasedCollator getCollator() { 304 return collator_; 305 } 306 307 /** 308 * Sets the {@link RuleBasedCollator} to be used for language-specific searching. 309 * <p> 310 * The iterator's position will not be changed by this method. 311 * @param collator to use for this <tt>StringSearch</tt> 312 * @throws IllegalArgumentException thrown when collator is null 313 * @see #getCollator 314 * @stable ICU 2.0 315 */ setCollator(RuleBasedCollator collator)316 public void setCollator(RuleBasedCollator collator) { 317 if (collator == null) { 318 throw new IllegalArgumentException("Collator can not be null"); 319 } 320 collator_ = collator; 321 ceMask_ = getMask(collator_.getStrength()); 322 323 ULocale collLocale = collator.getLocale(ULocale.VALID_LOCALE); 324 search_.internalBreakIter_ = BreakIterator.getCharacterInstance(collLocale == null ? ULocale.ROOT : collLocale); 325 search_.internalBreakIter_.setText((CharacterIterator)search_.text().clone()); // We need to create a clone 326 327 toShift_ = collator.isAlternateHandlingShifted(); 328 variableTop_ = collator.getVariableTop(); 329 textIter_ = new CollationElementIterator(pattern_.text_, collator); 330 utilIter_ = new CollationElementIterator(pattern_.text_, collator); 331 332 // initialize() _after_ setting the iterators for the new collator. 333 initialize(); 334 } 335 336 /** 337 * Returns the pattern for which <tt>StringSearch</tt> is searching for. 338 * @return the pattern searched for 339 * @stable ICU 2.0 340 */ getPattern()341 public String getPattern() { 342 return pattern_.text_; 343 } 344 345 /** 346 * Set the pattern to search for. 347 * The iterator's position will not be changed by this method. 348 * @param pattern for searching 349 * @see #getPattern 350 * @exception IllegalArgumentException thrown if pattern is null or of 351 * length 0 352 * @stable ICU 2.0 353 */ setPattern(String pattern)354 public void setPattern(String pattern) { 355 if (pattern == null || pattern.length() <= 0) { 356 throw new IllegalArgumentException( 357 "Pattern to search for can not be null or of length 0"); 358 } 359 pattern_.text_ = pattern; 360 initialize(); 361 } 362 363 /** 364 * Determines whether canonical matches (option 1, as described in the 365 * class documentation) is set. 366 * See setCanonical(boolean) for more information. 367 * @see #setCanonical 368 * @return true if canonical matches is set, false otherwise 369 * @stable ICU 2.8 370 */ 371 //TODO: hoist this to SearchIterator isCanonical()372 public boolean isCanonical() { 373 return search_.isCanonicalMatch_; 374 } 375 376 /** 377 * Set the canonical match mode. See class documentation for details. 378 * The default setting for this property is false. 379 * @param allowCanonical flag indicator if canonical matches are allowed 380 * @see #isCanonical 381 * @stable ICU 2.8 382 */ 383 //TODO: hoist this to SearchIterator setCanonical(boolean allowCanonical)384 public void setCanonical(boolean allowCanonical) { 385 search_.isCanonicalMatch_ = allowCanonical; 386 } 387 388 /** 389 * {@inheritDoc} 390 * @stable ICU 2.8 391 */ 392 @Override setTarget(CharacterIterator text)393 public void setTarget(CharacterIterator text) { 394 super.setTarget(text); 395 textIter_.setText(text); 396 } 397 398 /** 399 * {@inheritDoc} 400 * @stable ICU 2.8 401 */ 402 @Override getIndex()403 public int getIndex() { 404 int result = textIter_.getOffset(); 405 if (isOutOfBounds(search_.beginIndex(), search_.endIndex(), result)) { 406 return DONE; 407 } 408 return result; 409 } 410 411 /** 412 * {@inheritDoc} 413 * @stable ICU 2.8 414 */ 415 @Override setIndex(int position)416 public void setIndex(int position) { 417 // Java porting note: This method is equivalent to setOffset() in ICU4C. 418 // ICU4C SearchIterator::setOffset() is a pure virtual method, while 419 // ICU4J SearchIterator.setIndex() is not abstract method. 420 421 super.setIndex(position); 422 textIter_.setOffset(position); 423 } 424 425 /** 426 * {@inheritDoc} 427 * @stable ICU 2.8 428 */ 429 @Override reset()430 public void reset() { 431 // reset is setting the attributes that are already in 432 // string search, hence all attributes in the collator should 433 // be retrieved without any problems 434 435 boolean sameCollAttribute = true; 436 int ceMask; 437 boolean shift; 438 int varTop; 439 440 // **** hack to deal w/ how processed CEs encode quaternary **** 441 int newStrength = collator_.getStrength(); 442 if ((strength_ < Collator.QUATERNARY && newStrength >= Collator.QUATERNARY) 443 || (strength_ >= Collator.QUATERNARY && newStrength < Collator.QUATERNARY)) { 444 sameCollAttribute = false; 445 } 446 447 strength_ = collator_.getStrength(); 448 ceMask = getMask(strength_); 449 if (ceMask_ != ceMask) { 450 ceMask_ = ceMask; 451 sameCollAttribute = false; 452 } 453 454 shift = collator_.isAlternateHandlingShifted(); 455 if (toShift_ != shift) { 456 toShift_ = shift; 457 sameCollAttribute = false; 458 } 459 460 varTop = collator_.getVariableTop(); 461 if (variableTop_ != varTop) { 462 variableTop_ = varTop; 463 sameCollAttribute = false; 464 } 465 466 if (!sameCollAttribute) { 467 initialize(); 468 } 469 470 textIter_.setText(search_.text()); 471 472 search_.setMatchedLength(0); 473 search_.matchedIndex_ = DONE; 474 search_.isOverlap_ = false; 475 search_.isCanonicalMatch_ = false; 476 search_.elementComparisonType_ = ElementComparisonType.STANDARD_ELEMENT_COMPARISON; 477 search_.isForwardSearching_ = true; 478 search_.reset_ = true; 479 } 480 481 /** 482 * {@inheritDoc} 483 * @stable ICU 2.8 484 */ 485 @Override handleNext(int position)486 protected int handleNext(int position) { 487 if (pattern_.CELength_ == 0) { 488 search_.matchedIndex_ = search_.matchedIndex_ == DONE ? 489 getIndex() : search_.matchedIndex_ + 1; 490 search_.setMatchedLength(0); 491 textIter_.setOffset(search_.matchedIndex_); 492 if (search_.matchedIndex_ == search_.endIndex()) { 493 search_.matchedIndex_ = DONE; 494 } 495 } else { 496 if (search_.matchedLength() <= 0) { 497 // the flipping direction issue has already been handled 498 // in next() 499 // for boundary check purposes. this will ensure that the 500 // next match will not preceed the current offset 501 // note search_.matchedIndex_ will always be set to something 502 // in the code 503 search_.matchedIndex_ = position - 1; 504 } 505 506 textIter_.setOffset(position); 507 508 // ICU4C comment: 509 // if strsrch_->breakIter is always the same as m_breakiterator_ 510 // then we don't need to check the match boundaries here because 511 // usearch_handleNextXXX will already have done it. 512 if (search_.isCanonicalMatch_) { 513 // *could* actually use exact here 'cause no extra accents allowed... 514 handleNextCanonical(); 515 } else { 516 handleNextExact(); 517 } 518 519 if (search_.matchedIndex_ == DONE) { 520 textIter_.setOffset(search_.endIndex()); 521 } else { 522 textIter_.setOffset(search_.matchedIndex_); 523 } 524 525 return search_.matchedIndex_; 526 } 527 528 return DONE; 529 } 530 531 /** 532 * {@inheritDoc} 533 * @stable ICU 2.8 534 */ 535 @Override handlePrevious(int position)536 protected int handlePrevious(int position) { 537 if (pattern_.CELength_ == 0) { 538 search_.matchedIndex_ = 539 search_.matchedIndex_ == DONE ? getIndex() : search_.matchedIndex_; 540 if (search_.matchedIndex_ == search_.beginIndex()) { 541 setMatchNotFound(); 542 } else { 543 search_.matchedIndex_--; 544 textIter_.setOffset(search_.matchedIndex_); 545 search_.setMatchedLength(0); 546 } 547 } else { 548 textIter_.setOffset(position); 549 550 if (search_.isCanonicalMatch_) { 551 // *could* use exact match here since extra accents *not* allowed! 552 handlePreviousCanonical(); 553 } else { 554 handlePreviousExact(); 555 } 556 } 557 558 return search_.matchedIndex_; 559 } 560 561 // ------------------ Internal implementation code --------------------------- 562 563 private static final int INITIAL_ARRAY_SIZE_ = 256; 564 565 // *** Boyer-Moore *** 566 // private static final Normalizer2Impl nfcImpl_ = Norm2AllModes.getNFCInstance().impl; 567 // private static final int LAST_BYTE_MASK_ = 0xff; 568 // private static final int SECOND_LAST_BYTE_SHIFT_ = 8; 569 570 private static final int PRIMARYORDERMASK = 0xffff0000; 571 private static final int SECONDARYORDERMASK = 0x0000ff00; 572 private static final int TERTIARYORDERMASK = 0x000000ff; 573 574 /** 575 * Getting the mask for collation strength 576 * @param strength collation strength 577 * @return collation element mask 578 */ getMask(int strength)579 private static int getMask(int strength) { 580 switch (strength) { 581 case Collator.PRIMARY: 582 return PRIMARYORDERMASK; 583 case Collator.SECONDARY: 584 return SECONDARYORDERMASK | PRIMARYORDERMASK; 585 default: 586 return TERTIARYORDERMASK | SECONDARYORDERMASK | PRIMARYORDERMASK; 587 } 588 } 589 590 591 // *** Boyer-Moore *** 592 /* 593 private final char getFCD(String str, int offset) { 594 char ch = str.charAt(offset); 595 if (ch < 0x180) { 596 return (char) nfcImpl_.getFCD16FromBelow180(ch); 597 } else if (nfcImpl_.singleLeadMightHaveNonZeroFCD16(ch)) { 598 if (!Character.isHighSurrogate(ch)) { 599 return (char) nfcImpl_.getFCD16FromNormData(ch); 600 } else { 601 char c2; 602 if (++offset < str.length() && Character.isLowSurrogate(c2 = str.charAt(offset))) { 603 return (char) nfcImpl_.getFCD16FromNormData(Character.toCodePoint(ch, c2)); 604 } 605 } 606 } 607 return 0; 608 } 609 610 private final char getFCD(int c) { 611 return (char)nfcImpl_.getFCD16(c); 612 } 613 */ 614 615 /** 616 * Getting the modified collation elements taking into account the collation 617 * attributes. 618 * 619 * @param sourcece 620 * @return the modified collation element 621 */ getCE(int sourcece)622 private int getCE(int sourcece) { 623 // note for tertiary we can't use the collator->tertiaryMask, that 624 // is a preprocessed mask that takes into account case options. since 625 // we are only concerned with exact matches, we don't need that. 626 sourcece &= ceMask_; 627 628 if (toShift_) { 629 // alternate handling here, since only the 16 most significant digits 630 // is only used, we can safely do a compare without masking 631 // if the ce is a variable, we mask and get only the primary values 632 // no shifting to quartenary is required since all primary values 633 // less than variabletop will need to be masked off anyway. 634 if (variableTop_ > sourcece) { 635 if (strength_ >= Collator.QUATERNARY) { 636 sourcece &= PRIMARYORDERMASK; 637 } else { 638 sourcece = CollationElementIterator.IGNORABLE; 639 } 640 } 641 } else if (strength_ >= Collator.QUATERNARY && sourcece == CollationElementIterator.IGNORABLE) { 642 sourcece = 0xFFFF; 643 } 644 645 return sourcece; 646 } 647 648 /** 649 * Direct port of ICU4C static int32_t * addTouint32_tArray(...) in usearch.cpp 650 * (except not taking destination buffer size and status param). 651 * This is used for appending a PCE to Pattern.PCE_ buffer. We probably should 652 * implement this in Pattern class. 653 * 654 * @param destination target array 655 * @param offset destination offset to add value 656 * @param value to be added 657 * @param increments incremental size expected 658 * @return new destination array, destination if there was no new allocation 659 */ addToIntArray(int[] destination, int offset, int value, int increments)660 private static int[] addToIntArray(int[] destination, int offset, int value, int increments) { 661 int newlength = destination.length; 662 if (offset + 1 == newlength) { 663 newlength += increments; 664 int temp[] = new int[newlength]; 665 System.arraycopy(destination, 0, temp, 0, offset); 666 destination = temp; 667 } 668 destination[offset] = value; 669 return destination; 670 } 671 672 /** 673 * Direct port of ICU4C static int64_t * addTouint64_tArray(...) in usearch.cpp. 674 * This is used for appending a PCE to Pattern.PCE_ buffer. We probably should 675 * implement this in Pattern class. 676 * 677 * @param destination target array 678 * @param offset destination offset to add value 679 * @param destinationlength target array size 680 * @param value to be added 681 * @param increments incremental size expected 682 * @return new destination array, destination if there was no new allocation 683 */ addToLongArray(long[] destination, int offset, int destinationlength, long value, int increments)684 private static long[] addToLongArray(long[] destination, int offset, int destinationlength, 685 long value, int increments) { 686 int newlength = destinationlength; 687 if (offset + 1 == newlength) { 688 newlength += increments; 689 long temp[] = new long[newlength]; 690 System.arraycopy(destination, 0, temp, 0, offset); 691 destination = temp; 692 } 693 destination[offset] = value; 694 return destination; 695 } 696 697 /** 698 * Initializing the ce table for a pattern. 699 * Stores non-ignorable collation keys. 700 * Table size will be estimated by the size of the pattern text. Table 701 * expansion will be perform as we go along. Adding 1 to ensure that the table 702 * size definitely increases. 703 * @return total number of expansions 704 */ 705 // TODO: We probably do not need Pattern CE table. initializePatternCETable()706 private int initializePatternCETable() { 707 int[] cetable = new int[INITIAL_ARRAY_SIZE_]; 708 int patternlength = pattern_.text_.length(); 709 CollationElementIterator coleiter = utilIter_; 710 711 if (coleiter == null) { 712 coleiter = new CollationElementIterator(pattern_.text_, collator_); 713 utilIter_ = coleiter; 714 } else { 715 coleiter.setText(pattern_.text_); 716 } 717 718 int offset = 0; 719 int result = 0; 720 int ce; 721 722 while ((ce = coleiter.next()) != CollationElementIterator.NULLORDER) { 723 int newce = getCE(ce); 724 if (newce != CollationElementIterator.IGNORABLE /* 0 */) { 725 int[] temp = addToIntArray(cetable, offset, newce, 726 patternlength - coleiter.getOffset() + 1); 727 offset++; 728 cetable = temp; 729 } 730 result += (coleiter.getMaxExpansion(ce) - 1); 731 } 732 733 cetable[offset] = 0; 734 pattern_.CE_ = cetable; 735 pattern_.CELength_ = offset; 736 737 return result; 738 } 739 740 /** 741 * Initializing the pce table for a pattern. 742 * Stores non-ignorable collation keys. 743 * Table size will be estimated by the size of the pattern text. Table 744 * expansion will be perform as we go along. Adding 1 to ensure that the table 745 * size definitely increases. 746 * @return total number of expansions 747 */ initializePatternPCETable()748 private int initializePatternPCETable() { 749 long[] pcetable = new long[INITIAL_ARRAY_SIZE_]; 750 int pcetablesize = pcetable.length; 751 int patternlength = pattern_.text_.length(); 752 CollationElementIterator coleiter = utilIter_; 753 754 if (coleiter == null) { 755 coleiter = new CollationElementIterator(pattern_.text_, collator_); 756 utilIter_ = coleiter; 757 } else { 758 coleiter.setText(pattern_.text_); 759 } 760 761 int offset = 0; 762 int result = 0; 763 long pce; 764 765 CollationPCE iter = new CollationPCE(coleiter); 766 767 // ** Should processed CEs be signed or unsigned? 768 // ** (the rest of the code in this file seems to play fast-and-loose with 769 // ** whether a CE is signed or unsigned. For example, look at routine above this one.) 770 while ((pce = iter.nextProcessed(null)) != CollationPCE.PROCESSED_NULLORDER) { 771 long[] temp = addToLongArray(pcetable, offset, pcetablesize, pce, patternlength - coleiter.getOffset() + 1); 772 offset++; 773 pcetable = temp; 774 } 775 776 pcetable[offset] = 0; 777 pattern_.PCE_ = pcetable; 778 pattern_.PCELength_ = offset; 779 780 return result; 781 } 782 783 // TODO: This method only triggers initializePatternCETable(), which is probably no 784 // longer needed. initializePattern()785 private int initializePattern() { 786 // Since the strength is primary, accents are ignored in the pattern. 787 788 // *** Boyer-Moore *** 789 /* 790 if (strength_ == Collator.PRIMARY) { 791 pattern_.hasPrefixAccents_ = false; 792 pattern_.hasSuffixAccents_ = false; 793 } else { 794 pattern_.hasPrefixAccents_ = (getFCD(pattern_.text_, 0) >>> SECOND_LAST_BYTE_SHIFT_) != 0; 795 pattern_.hasSuffixAccents_ = (getFCD(pattern_.text_.codePointBefore(pattern_.text_.length())) & LAST_BYTE_MASK_) != 0; 796 } 797 */ 798 799 pattern_.PCE_ = null; 800 801 // since intializePattern is an internal method status is a success. 802 return initializePatternCETable(); 803 } 804 805 // *** Boyer-Moore *** 806 /* 807 private final void setShiftTable(char shift[], 808 char backshift[], 809 int cetable[], int cesize, 810 int expansionsize, 811 int defaultforward, 812 int defaultbackward) { 813 // No implementation 814 } 815 */ 816 817 // TODO: This method only triggers initializePattern(), which is probably no 818 // longer needed. initialize()819 private void initialize() { 820 /* int expandlength = */ initializePattern(); 821 822 // *** Boyer-Moore *** 823 /* 824 if (pattern_.CELength_ > 0) { 825 int cesize = pattern_.CELength_; 826 int minlength = cesize > expandlength ? cesize - expandlength : 1; 827 pattern_.defaultShiftSize_ = minlength; 828 setShiftTable(pattern_.shift_, pattern_.backShift_, pattern_.CE_, cesize, 829 expandlength, minlength, minlength); 830 return; 831 } 832 return pattern_.defaultShiftSize_; 833 */ 834 } 835 836 /** 837 * @internal 838 * @deprecated This API is ICU internal only. 839 */ 840 @Override 841 @Deprecated setMatchNotFound()842 protected void setMatchNotFound() { 843 super.setMatchNotFound(); 844 // SearchIterator#setMatchNotFound() does following: 845 // search_.matchedIndex_ = DONE; 846 // search_.setMatchedLength(0); 847 if (search_.isForwardSearching_) { 848 textIter_.setOffset(search_.text().getEndIndex()); 849 } else { 850 textIter_.setOffset(0); 851 } 852 } 853 854 /** 855 * Checks if the offset runs out of the text string range 856 * @param textstart offset of the first character in the range 857 * @param textlimit limit offset of the text string range 858 * @param offset to test 859 * @return true if offset is out of bounds, false otherwise 860 */ isOutOfBounds(int textstart, int textlimit, int offset)861 private static final boolean isOutOfBounds(int textstart, int textlimit, int offset) { 862 return offset < textstart || offset > textlimit; 863 } 864 865 /** 866 * Checks for identical match 867 * @param start offset of possible match 868 * @param end offset of possible match 869 * @return true if identical match is found 870 */ checkIdentical(int start, int end)871 private boolean checkIdentical(int start, int end) { 872 if (strength_ != Collator.IDENTICAL) { 873 return true; 874 } 875 // Note: We could use Normalizer::compare() or similar, but for short strings 876 // which may not be in FCD it might be faster to just NFD them. 877 String textstr = getString(targetText, start, end - start); 878 if (Normalizer.quickCheck(textstr, Normalizer.NFD, 0) == Normalizer.NO) { 879 textstr = Normalizer.decompose(textstr, false); 880 } 881 String patternstr = pattern_.text_; 882 if (Normalizer.quickCheck(patternstr, Normalizer.NFD, 0) == Normalizer.NO) { 883 patternstr = Normalizer.decompose(patternstr, false); 884 } 885 return textstr.equals(patternstr); 886 } 887 initTextProcessedIter()888 private boolean initTextProcessedIter() { 889 if (textProcessedIter_ == null) { 890 textProcessedIter_ = new CollationPCE(textIter_); 891 } else { 892 textProcessedIter_.init(textIter_); 893 } 894 return true; 895 } 896 897 /* 898 * Find the next break boundary after startIndex. If the UStringSearch object 899 * has an external break iterator, use that. Otherwise use the internal character 900 * break iterator. 901 */ nextBoundaryAfter(int startIndex)902 private int nextBoundaryAfter(int startIndex) { 903 BreakIterator breakiterator = search_.breakIter(); 904 905 if (breakiterator == null) { 906 breakiterator = search_.internalBreakIter_; 907 } 908 909 if (breakiterator != null) { 910 return breakiterator.following(startIndex); 911 } 912 913 return startIndex; 914 } 915 916 /* 917 * Returns true if index is on a break boundary. If the UStringSearch 918 * has an external break iterator, test using that, otherwise test 919 * using the internal character break iterator. 920 */ isBreakBoundary(int index)921 private boolean isBreakBoundary(int index) { 922 BreakIterator breakiterator = search_.breakIter(); 923 924 if (breakiterator == null) { 925 breakiterator = search_.internalBreakIter_; 926 } 927 928 return (breakiterator != null && breakiterator.isBoundary(index)); 929 } 930 931 932 // Java porting note: Followings are corresponding to UCompareCEsResult enum 933 private static final int CE_MATCH = -1; 934 private static final int CE_NO_MATCH = 0; 935 private static final int CE_SKIP_TARG = 1; 936 private static final int CE_SKIP_PATN = 2; 937 938 private static int CE_LEVEL2_BASE = 0x00000005; 939 private static int CE_LEVEL3_BASE = 0x00050000; 940 compareCE64s(long targCE, long patCE, ElementComparisonType compareType)941 private static int compareCE64s(long targCE, long patCE, ElementComparisonType compareType) { 942 if (targCE == patCE) { 943 return CE_MATCH; 944 } 945 if (compareType == ElementComparisonType.STANDARD_ELEMENT_COMPARISON) { 946 return CE_NO_MATCH; 947 } 948 949 long targCEshifted = targCE >>> 32; 950 long patCEshifted = patCE >>> 32; 951 long mask; 952 953 mask = 0xFFFF0000L; 954 int targLev1 = (int)(targCEshifted & mask); 955 int patLev1 = (int)(patCEshifted & mask); 956 if (targLev1 != patLev1) { 957 if (targLev1 == 0) { 958 return CE_SKIP_TARG; 959 } 960 if (patLev1 == 0 961 && compareType == ElementComparisonType.ANY_BASE_WEIGHT_IS_WILDCARD) { 962 return CE_SKIP_PATN; 963 } 964 return CE_NO_MATCH; 965 } 966 967 mask = 0x0000FFFFL; 968 int targLev2 = (int)(targCEshifted & mask); 969 int patLev2 = (int)(patCEshifted & mask); 970 if (targLev2 != patLev2) { 971 if (targLev2 == 0) { 972 return CE_SKIP_TARG; 973 } 974 if (patLev2 == 0 975 && compareType == ElementComparisonType.ANY_BASE_WEIGHT_IS_WILDCARD) { 976 return CE_SKIP_PATN; 977 } 978 return (patLev2 == CE_LEVEL2_BASE || 979 (compareType == ElementComparisonType.ANY_BASE_WEIGHT_IS_WILDCARD && 980 targLev2 == CE_LEVEL2_BASE)) ? CE_MATCH : CE_NO_MATCH; 981 } 982 983 mask = 0xFFFF0000L; 984 int targLev3 = (int)(targCE & mask); 985 int patLev3 = (int)(patCE & mask); 986 if (targLev3 != patLev3) { 987 return (patLev3 == CE_LEVEL3_BASE || 988 (compareType == ElementComparisonType.ANY_BASE_WEIGHT_IS_WILDCARD && 989 targLev3 == CE_LEVEL3_BASE) )? CE_MATCH: CE_NO_MATCH; 990 } 991 992 return CE_MATCH; 993 } 994 995 /** 996 * An object used for receiving matched index in search() and 997 * searchBackwards(). 998 */ 999 private static class Match { 1000 int start_ = -1; 1001 int limit_ = -1; 1002 } 1003 search(int startIdx, Match m)1004 private boolean search(int startIdx, Match m) { 1005 // Input parameter sanity check. 1006 if (pattern_.CELength_ == 0 1007 || startIdx < search_.beginIndex() 1008 || startIdx > search_.endIndex()) { 1009 throw new IllegalArgumentException("search(" + startIdx + ", m) - expected position to be between " + 1010 search_.beginIndex() + " and " + search_.endIndex()); 1011 } 1012 1013 if (pattern_.PCE_ == null) { 1014 initializePatternPCETable(); 1015 } 1016 1017 textIter_.setOffset(startIdx); 1018 CEBuffer ceb = new CEBuffer(this); 1019 1020 int targetIx = 0; 1021 CEI targetCEI = null; 1022 int patIx; 1023 boolean found; 1024 1025 int mStart = -1; 1026 int mLimit = -1; 1027 int minLimit; 1028 int maxLimit; 1029 1030 // Outer loop moves over match starting positions in the 1031 // target CE space. 1032 // Here we see the target as a sequence of collation elements, resulting from the following: 1033 // 1. Target characters were decomposed, and (if appropriate) other compressions and expansions are applied 1034 // (for example, digraphs such as IJ may be broken into two characters). 1035 // 2. An int64_t CE weight is determined for each resulting unit (high 16 bits are primary strength, next 1036 // 16 bits are secondary, next 16 (the high 16 bits of the low 32-bit half) are tertiary. Any of these 1037 // fields that are for strengths below that of the collator are set to 0. If this makes the int64_t 1038 // CE weight 0 (as for a combining diacritic with secondary weight when the collator strength is primary), 1039 // then the CE is deleted, so the following code sees only CEs that are relevant. 1040 // For each CE, the lowIndex and highIndex correspond to where this CE begins and ends in the original text. 1041 // If lowIndex==highIndex, either the CE resulted from an expansion/decomposition of one of the original text 1042 // characters, or the CE marks the limit of the target text (in which case the CE weight is UCOL_PROCESSED_NULLORDER). 1043 for (targetIx = 0; ; targetIx++) { 1044 found = true; 1045 // Inner loop checks for a match beginning at each 1046 // position from the outer loop. 1047 int targetIxOffset = 0; 1048 long patCE = 0; 1049 // For targetIx > 0, this ceb.get gets a CE that is as far back in the ring buffer 1050 // (compared to the last CE fetched for the previous targetIx value) as we need to go 1051 // for this targetIx value, so if it is non-NULL then other ceb.get calls should be OK. 1052 CEI firstCEI = ceb.get(targetIx); 1053 if (firstCEI == null) { 1054 throw new ICUException("CEBuffer.get(" + targetIx + ") returned null."); 1055 } 1056 1057 for (patIx = 0; patIx < pattern_.PCELength_; patIx++) { 1058 patCE = pattern_.PCE_[patIx]; 1059 targetCEI = ceb.get(targetIx + patIx + targetIxOffset); 1060 // Compare CE from target string with CE from the pattern. 1061 // Note that the target CE will be UCOL_PROCESSED_NULLORDER if we reach the end of input, 1062 // which will fail the compare, below. 1063 int ceMatch = compareCE64s(targetCEI.ce_, patCE, search_.elementComparisonType_); 1064 if (ceMatch == CE_NO_MATCH) { 1065 found = false; 1066 break; 1067 } else if (ceMatch > CE_NO_MATCH) { 1068 if (ceMatch == CE_SKIP_TARG) { 1069 // redo with same patCE, next targCE 1070 patIx--; 1071 targetIxOffset++; 1072 } else { // ceMatch == CE_SKIP_PATN 1073 // redo with same targCE, next patCE 1074 targetIxOffset--; 1075 } 1076 } 1077 } 1078 targetIxOffset += pattern_.PCELength_; // this is now the offset in target CE space to end of the match so far 1079 1080 if (!found && ((targetCEI == null) || (targetCEI.ce_ != CollationPCE.PROCESSED_NULLORDER))) { 1081 // No match at this targetIx. Try again at the next. 1082 continue; 1083 } 1084 1085 if (!found) { 1086 // No match at all, we have run off the end of the target text. 1087 break; 1088 } 1089 1090 // We have found a match in CE space. 1091 // Now determine the bounds in string index space. 1092 // There still is a chance of match failure if the CE range not correspond to 1093 // an acceptable character range. 1094 // 1095 CEI lastCEI = ceb.get(targetIx + targetIxOffset -1); 1096 1097 mStart = firstCEI.lowIndex_; 1098 minLimit = lastCEI.lowIndex_; 1099 1100 // Look at the CE following the match. If it is UCOL_NULLORDER the match 1101 // extended to the end of input, and the match is good. 1102 1103 // Look at the high and low indices of the CE following the match. If 1104 // they are the same it means one of two things: 1105 // 1. The match extended to the last CE from the target text, which is OK, or 1106 // 2. The last CE that was part of the match is in an expansion that extends 1107 // to the first CE after the match. In this case, we reject the match. 1108 CEI nextCEI = null; 1109 if (search_.elementComparisonType_ == ElementComparisonType.STANDARD_ELEMENT_COMPARISON) { 1110 nextCEI = ceb.get(targetIx + targetIxOffset); 1111 maxLimit = nextCEI.lowIndex_; 1112 if (nextCEI.lowIndex_ == nextCEI.highIndex_ && nextCEI.ce_ != CollationPCE.PROCESSED_NULLORDER) { 1113 found = false; 1114 } 1115 } else { 1116 for (;; ++targetIxOffset) { 1117 nextCEI = ceb.get(targetIx + targetIxOffset); 1118 maxLimit = nextCEI.lowIndex_; 1119 // If we are at the end of the target too, match succeeds 1120 if (nextCEI.ce_ == CollationPCE.PROCESSED_NULLORDER) { 1121 break; 1122 } 1123 // As long as the next CE has primary weight of 0, 1124 // it is part of the last target element matched by the pattern; 1125 // make sure it can be part of a match with the last patCE 1126 if ((((nextCEI.ce_) >>> 32) & 0xFFFF0000L) == 0) { 1127 int ceMatch = compareCE64s(nextCEI.ce_, patCE, search_.elementComparisonType_); 1128 if (ceMatch == CE_NO_MATCH || ceMatch == CE_SKIP_PATN ) { 1129 found = false; 1130 break; 1131 } 1132 // If lowIndex == highIndex, this target CE is part of an expansion of the last matched 1133 // target element, but it has non-zero primary weight => match fails 1134 } else if ( nextCEI.lowIndex_ == nextCEI.highIndex_ ) { 1135 found = false; 1136 break; 1137 // Else the target CE is not part of an expansion of the last matched element, match succeeds 1138 } else { 1139 break; 1140 } 1141 } 1142 } 1143 1144 // Check for the start of the match being within a combining sequence. 1145 // This can happen if the pattern itself begins with a combining char, and 1146 // the match found combining marks in the target text that were attached 1147 // to something else. 1148 // This type of match should be rejected for not completely consuming a 1149 // combining sequence. 1150 if (!isBreakBoundary(mStart)) { 1151 found = false; 1152 } 1153 1154 // Check for the start of the match being within an Collation Element Expansion, 1155 // meaning that the first char of the match is only partially matched. 1156 // With expansions, the first CE will report the index of the source 1157 // character, and all subsequent (expansions) CEs will report the source index of the 1158 // _following_ character. 1159 int secondIx = firstCEI.highIndex_; 1160 if (mStart == secondIx) { 1161 found = false; 1162 } 1163 1164 // Allow matches to end in the middle of a grapheme cluster if the following 1165 // conditions are met; this is needed to make prefix search work properly in 1166 // Indic, see #11750 1167 // * the default breakIter is being used 1168 // * the next collation element after this combining sequence 1169 // - has non-zero primary weight 1170 // - corresponds to a separate character following the one at end of the current match 1171 // (the second of these conditions, and perhaps both, may be redundant given the 1172 // subsequent check for normalization boundary; however they are likely much faster 1173 // tests in any case) 1174 // * the match limit is a normalization boundary 1175 boolean allowMidclusterMatch = 1176 breakIterator == null && 1177 (((nextCEI.ce_) >>> 32) & 0xFFFF0000L) != 0 && 1178 maxLimit >= lastCEI.highIndex_ && nextCEI.highIndex_ > maxLimit && 1179 (nfd_.hasBoundaryBefore(codePointAt(targetText, maxLimit)) || 1180 nfd_.hasBoundaryAfter(codePointBefore(targetText, maxLimit))); 1181 1182 // If those conditions are met, then: 1183 // * do NOT advance the candidate match limit (mLimit) to a break boundary; however 1184 // the match limit may be backed off to a previous break boundary. This handles 1185 // cases in which mLimit includes target characters that are ignorable with current 1186 // settings (such as space) and which extend beyond the pattern match. 1187 // * do NOT require that end of the combining sequence not extend beyond the match in CE space 1188 // * do NOT require that match limit be on a breakIter boundary 1189 1190 // Advance the match end position to the first acceptable match boundary. 1191 // This advances the index over any combining characters. 1192 mLimit = maxLimit; 1193 if (minLimit < maxLimit) { 1194 // When the last CE's low index is same with its high index, the CE is likely 1195 // a part of expansion. In this case, the index is located just after the 1196 // character corresponding to the CEs compared above. If the index is right 1197 // at the break boundary, move the position to the next boundary will result 1198 // incorrect match length when there are ignorable characters exist between 1199 // the position and the next character produces CE(s). See ticket#8482. 1200 if (minLimit == lastCEI.highIndex_ && isBreakBoundary(minLimit)) { 1201 mLimit = minLimit; 1202 } else { 1203 int nba = nextBoundaryAfter(minLimit); 1204 // Note that we can have nba < maxLimit && nba >= minLImit, in which 1205 // case we want to set mLimit to nba regardless of allowMidclusterMatch 1206 // (i.e. we back off mLimit to the previous breakIterator boundary). 1207 if (nba >= lastCEI.highIndex_ && (!allowMidclusterMatch || nba < maxLimit)) { 1208 mLimit = nba; 1209 } 1210 } 1211 } 1212 1213 if (!allowMidclusterMatch) { 1214 // If advancing to the end of a combining sequence in character indexing space 1215 // advanced us beyond the end of the match in CE space, reject this match. 1216 if (mLimit > maxLimit) { 1217 found = false; 1218 } 1219 1220 if (!isBreakBoundary(mLimit)) { 1221 found = false; 1222 } 1223 } 1224 1225 if (!checkIdentical(mStart, mLimit)) { 1226 found = false; 1227 } 1228 1229 if (found) { 1230 break; 1231 } 1232 } 1233 1234 // All Done. Store back the match bounds to the caller. 1235 // 1236 if (found == false) { 1237 mLimit = -1; 1238 mStart = -1; 1239 } 1240 1241 if (m != null) { 1242 m.start_ = mStart; 1243 m.limit_ = mLimit; 1244 } 1245 1246 return found; 1247 } 1248 codePointAt(CharacterIterator iter, int index)1249 private static int codePointAt(CharacterIterator iter, int index) { 1250 int currentIterIndex = iter.getIndex(); 1251 char codeUnit = iter.setIndex(index); 1252 int cp = codeUnit; 1253 if (Character.isHighSurrogate(codeUnit)) { 1254 char nextUnit = iter.next(); 1255 if (Character.isLowSurrogate(nextUnit)) { 1256 cp = Character.toCodePoint(codeUnit, nextUnit); 1257 } 1258 } 1259 iter.setIndex(currentIterIndex); // restore iter position 1260 return cp; 1261 } 1262 codePointBefore(CharacterIterator iter, int index)1263 private static int codePointBefore(CharacterIterator iter, int index) { 1264 int currentIterIndex = iter.getIndex(); 1265 iter.setIndex(index); 1266 char codeUnit = iter.previous(); 1267 int cp = codeUnit; 1268 if (Character.isLowSurrogate(codeUnit)) { 1269 char prevUnit = iter.previous(); 1270 if (Character.isHighSurrogate(prevUnit)) { 1271 cp = Character.toCodePoint(prevUnit, codeUnit); 1272 } 1273 } 1274 iter.setIndex(currentIterIndex); // restore iter position 1275 return cp; 1276 } 1277 searchBackwards(int startIdx, Match m)1278 private boolean searchBackwards(int startIdx, Match m) { 1279 //ICU4C_TODO comment: reject search patterns beginning with a combining char. 1280 1281 // Input parameter sanity check. 1282 if (pattern_.CELength_ == 0 1283 || startIdx < search_.beginIndex() 1284 || startIdx > search_.endIndex()) { 1285 throw new IllegalArgumentException("searchBackwards(" + startIdx + ", m) - expected position to be between " + 1286 search_.beginIndex() + " and " + search_.endIndex()); 1287 } 1288 1289 if (pattern_.PCE_ == null) { 1290 initializePatternPCETable(); 1291 } 1292 1293 CEBuffer ceb = new CEBuffer(this); 1294 int targetIx = 0; 1295 1296 /* 1297 * Pre-load the buffer with the CE's for the grapheme 1298 * after our starting position so that we're sure that 1299 * we can look at the CE following the match when we 1300 * check the match boundaries. 1301 * 1302 * This will also pre-fetch the first CE that we'll 1303 * consider for the match. 1304 */ 1305 if (startIdx < search_.endIndex()) { 1306 BreakIterator bi = search_.internalBreakIter_; 1307 int next = bi.following(startIdx); 1308 1309 textIter_.setOffset(next); 1310 1311 for (targetIx = 0; ; targetIx++) { 1312 if (ceb.getPrevious(targetIx).lowIndex_ < startIdx) { 1313 break; 1314 } 1315 } 1316 } else { 1317 textIter_.setOffset(startIdx); 1318 } 1319 1320 CEI targetCEI = null; 1321 int patIx; 1322 boolean found; 1323 1324 int limitIx = targetIx; 1325 int mStart = -1; 1326 int mLimit = -1; 1327 int minLimit; 1328 int maxLimit; 1329 1330 // Outer loop moves over match starting positions in the 1331 // target CE space. 1332 // Here, targetIx values increase toward the beginning of the base text (i.e. we get the text CEs in reverse order). 1333 // But patIx is 0 at the beginning of the pattern and increases toward the end. 1334 // So this loop performs a comparison starting with the end of pattern, and prcessd toward the beginning of the pattern 1335 // and the beginning of the base text. 1336 for (targetIx = limitIx; ; targetIx++) { 1337 found = true; 1338 // For targetIx > limitIx, this ceb.getPrevious gets a CE that is as far back in the ring buffer 1339 // (compared to the last CE fetched for the previous targetIx value) as we need to go 1340 // for this targetIx value, so if it is non-NULL then other ceb.getPrevious calls should be OK. 1341 CEI lastCEI = ceb.getPrevious(targetIx); 1342 if (lastCEI == null) { 1343 throw new ICUException("CEBuffer.getPrevious(" + targetIx + ") returned null."); 1344 } 1345 // Inner loop checks for a match beginning at each 1346 // position from the outer loop. 1347 int targetIxOffset = 0; 1348 for (patIx = pattern_.PCELength_ - 1; patIx >= 0; patIx--) { 1349 long patCE = pattern_.PCE_[patIx]; 1350 1351 targetCEI = ceb.getPrevious(targetIx + pattern_.PCELength_ - 1 - patIx + targetIxOffset); 1352 // Compare CE from target string with CE from the pattern. 1353 // Note that the target CE will be UCOL_NULLORDER if we reach the end of input, 1354 // which will fail the compare, below. 1355 int ceMatch = compareCE64s(targetCEI.ce_, patCE, search_.elementComparisonType_); 1356 if (ceMatch == CE_NO_MATCH) { 1357 found = false; 1358 break; 1359 } else if (ceMatch > CE_NO_MATCH) { 1360 if (ceMatch == CE_SKIP_TARG) { 1361 // redo with same patCE, next targCE 1362 patIx++; 1363 targetIxOffset++; 1364 } else { // ceMatch == CE_SKIP_PATN 1365 // redo with same targCE, next patCE 1366 targetIxOffset--; 1367 } 1368 } 1369 } 1370 1371 if (!found && ((targetCEI == null) || (targetCEI.ce_ != CollationPCE.PROCESSED_NULLORDER))) { 1372 // No match at this targetIx. Try again at the next. 1373 continue; 1374 } 1375 1376 if (!found) { 1377 // No match at all, we have run off the end of the target text. 1378 break; 1379 } 1380 1381 // We have found a match in CE space. 1382 // Now determine the bounds in string index space. 1383 // There still is a chance of match failure if the CE range not correspond to 1384 // an acceptable character range. 1385 // 1386 CEI firstCEI = ceb.getPrevious(targetIx + pattern_.PCELength_ - 1 + targetIxOffset); 1387 mStart = firstCEI.lowIndex_; 1388 1389 // Check for the start of the match being within a combining sequence. 1390 // This can happen if the pattern itself begins with a combining char, and 1391 // the match found combining marks in the target text that were attached 1392 // to something else. 1393 // This type of match should be rejected for not completely consuming a 1394 // combining sequence. 1395 if (!isBreakBoundary(mStart)) { 1396 found = false; 1397 } 1398 1399 // Look at the high index of the first CE in the match. If it's the same as the 1400 // low index, the first CE in the match is in the middle of an expansion. 1401 if (mStart == firstCEI.highIndex_) { 1402 found = false; 1403 } 1404 1405 minLimit = lastCEI.lowIndex_; 1406 1407 if (targetIx > 0) { 1408 // Look at the CE following the match. If it is UCOL_NULLORDER the match 1409 // extended to the end of input, and the match is good. 1410 1411 // Look at the high and low indices of the CE following the match. If 1412 // they are the same it means one of two things: 1413 // 1. The match extended to the last CE from the target text, which is OK, or 1414 // 2. The last CE that was part of the match is in an expansion that extends 1415 // to the first CE after the match. In this case, we reject the match. 1416 CEI nextCEI = ceb.getPrevious(targetIx - 1); 1417 1418 if (nextCEI.lowIndex_ == nextCEI.highIndex_ && nextCEI.ce_ != CollationPCE.PROCESSED_NULLORDER) { 1419 found = false; 1420 } 1421 1422 mLimit = maxLimit = nextCEI.lowIndex_; 1423 1424 // Allow matches to end in the middle of a grapheme cluster if the following 1425 // conditions are met; this is needed to make prefix search work properly in 1426 // Indic, see #11750 1427 // * the default breakIter is being used 1428 // * the next collation element after this combining sequence 1429 // - has non-zero primary weight 1430 // - corresponds to a separate character following the one at end of the current match 1431 // (the second of these conditions, and perhaps both, may be redundant given the 1432 // subsequent check for normalization boundary; however they are likely much faster 1433 // tests in any case) 1434 // * the match limit is a normalization boundary 1435 boolean allowMidclusterMatch = 1436 breakIterator == null && 1437 (((nextCEI.ce_) >>> 32) & 0xFFFF0000L) != 0 && 1438 maxLimit >= lastCEI.highIndex_ && nextCEI.highIndex_ > maxLimit && 1439 (nfd_.hasBoundaryBefore(codePointAt(targetText, maxLimit)) || 1440 nfd_.hasBoundaryAfter(codePointBefore(targetText, maxLimit))); 1441 1442 // If those conditions are met, then: 1443 // * do NOT advance the candidate match limit (mLimit) to a break boundary; however 1444 // the match limit may be backed off to a previous break boundary. This handles 1445 // cases in which mLimit includes target characters that are ignorable with current 1446 // settings (such as space) and which extend beyond the pattern match. 1447 // * do NOT require that end of the combining sequence not extend beyond the match in CE space 1448 // * do NOT require that match limit be on a breakIter boundary 1449 1450 // Advance the match end position to the first acceptable match boundary. 1451 // This advances the index over any combining characters. 1452 if (minLimit < maxLimit) { 1453 int nba = nextBoundaryAfter(minLimit); 1454 // Note that we can have nba < maxLimit && nba >= minLImit, in which 1455 // case we want to set mLimit to nba regardless of allowMidclusterMatch 1456 // (i.e. we back off mLimit to the previous breakIterator boundary). 1457 if (nba >= lastCEI.highIndex_ && (!allowMidclusterMatch || nba < maxLimit)) { 1458 mLimit = nba; 1459 } 1460 } 1461 1462 if (!allowMidclusterMatch) { 1463 // If advancing to the end of a combining sequence in character indexing space 1464 // advanced us beyond the end of the match in CE space, reject this match. 1465 if (mLimit > maxLimit) { 1466 found = false; 1467 } 1468 1469 // Make sure the end of the match is on a break boundary 1470 if (!isBreakBoundary(mLimit)) { 1471 found = false; 1472 } 1473 } 1474 1475 } else { 1476 // No non-ignorable CEs after this point. 1477 // The maximum position is detected by boundary after 1478 // the last non-ignorable CE. Combining sequence 1479 // across the start index will be truncated. 1480 int nba = nextBoundaryAfter(minLimit); 1481 mLimit = maxLimit = (nba > 0) && (startIdx > nba) ? nba : startIdx; 1482 } 1483 1484 if (!checkIdentical(mStart, mLimit)) { 1485 found = false; 1486 } 1487 1488 if (found) { 1489 break; 1490 } 1491 } 1492 1493 // All Done. Store back the match bounds to the caller. 1494 // 1495 if (found == false) { 1496 mLimit = -1; 1497 mStart = -1; 1498 } 1499 1500 if (m != null) { 1501 m.start_ = mStart; 1502 m.limit_ = mLimit; 1503 } 1504 1505 return found; 1506 } 1507 1508 // Java porting note: 1509 // 1510 // ICU4C usearch_handleNextExact() is identical to usearch_handleNextCanonical() 1511 // for the linear search implementation. The differences are addressed in search(). 1512 // handleNextExact()1513 private boolean handleNextExact() { 1514 return handleNextCommonImpl(); 1515 } 1516 handleNextCanonical()1517 private boolean handleNextCanonical() { 1518 return handleNextCommonImpl(); 1519 } 1520 handleNextCommonImpl()1521 private boolean handleNextCommonImpl() { 1522 int textOffset = textIter_.getOffset(); 1523 Match match = new Match(); 1524 1525 if (search(textOffset, match)) { 1526 search_.matchedIndex_ = match.start_; 1527 search_.setMatchedLength(match.limit_ - match.start_); 1528 return true; 1529 } else { 1530 setMatchNotFound(); 1531 return false; 1532 } 1533 } 1534 1535 // Java porting note: 1536 // 1537 // ICU4C usearch_handlePreviousExact() is identical to usearch_handlePreviousCanonical() 1538 // for the linear search implementation. The differences are addressed in searchBackwards(). 1539 // handlePreviousExact()1540 private boolean handlePreviousExact() { 1541 return handlePreviousCommonImpl(); 1542 } 1543 handlePreviousCanonical()1544 private boolean handlePreviousCanonical() { 1545 return handlePreviousCommonImpl(); 1546 } 1547 handlePreviousCommonImpl()1548 private boolean handlePreviousCommonImpl() { 1549 int textOffset; 1550 1551 if (search_.isOverlap_) { 1552 if (search_.matchedIndex_ != DONE) { 1553 textOffset = search_.matchedIndex_ + search_.matchedLength() - 1; 1554 } else { 1555 // move the start position at the end of possible match 1556 initializePatternPCETable(); 1557 if (!initTextProcessedIter()) { 1558 setMatchNotFound(); 1559 return false; 1560 } 1561 for (int nPCEs = 0; nPCEs < pattern_.PCELength_ - 1; nPCEs++) { 1562 long pce = textProcessedIter_.nextProcessed(null); 1563 if (pce == CollationPCE.PROCESSED_NULLORDER) { 1564 // at the end of the text 1565 break; 1566 } 1567 } 1568 textOffset = textIter_.getOffset(); 1569 } 1570 } else { 1571 textOffset = textIter_.getOffset(); 1572 } 1573 1574 Match match = new Match(); 1575 if (searchBackwards(textOffset, match)) { 1576 search_.matchedIndex_ = match.start_; 1577 search_.setMatchedLength(match.limit_ - match.start_); 1578 return true; 1579 } else { 1580 setMatchNotFound(); 1581 return false; 1582 } 1583 } 1584 1585 /** 1586 * Gets a substring out of a CharacterIterator 1587 * 1588 * Java porting note: Not available in ICU4C 1589 * 1590 * @param text CharacterIterator 1591 * @param start start offset 1592 * @param length of substring 1593 * @return substring from text starting at start and length length 1594 */ getString(CharacterIterator text, int start, int length)1595 private static final String getString(CharacterIterator text, int start, int length) { 1596 StringBuilder result = new StringBuilder(length); 1597 int offset = text.getIndex(); 1598 text.setIndex(start); 1599 for (int i = 0; i < length; i++) { 1600 result.append(text.current()); 1601 text.next(); 1602 } 1603 text.setIndex(offset); 1604 return result.toString(); 1605 } 1606 1607 /** 1608 * Java port of ICU4C struct UPattern (usrchimp.h) 1609 */ 1610 private static final class Pattern { 1611 /** Pattern string */ 1612 String text_; 1613 1614 long[] PCE_; 1615 int PCELength_ = 0; 1616 1617 // TODO: We probably do not need CE_ / CELength_ 1618 @SuppressWarnings("unused") 1619 int[] CE_; 1620 int CELength_ = 0; 1621 1622 // *** Boyer-Moore *** 1623 // boolean hasPrefixAccents_ = false; 1624 // boolean hasSuffixAccents_ = false; 1625 // int defaultShiftSize_; 1626 // char[] shift_; 1627 // char[] backShift_; 1628 Pattern(String pattern)1629 protected Pattern(String pattern) { 1630 text_ = pattern; 1631 } 1632 } 1633 1634 /** 1635 * Java port of ICU4C UCollationPCE (usrchimp.h) 1636 */ 1637 private static class CollationPCE { 1638 public static final long PROCESSED_NULLORDER = -1; 1639 1640 private static final int DEFAULT_BUFFER_SIZE = 16; 1641 private static final int BUFFER_GROW = 8; 1642 1643 // Note: PRIMARYORDERMASK is also duplicated in StringSearch class 1644 private static final int PRIMARYORDERMASK = 0xffff0000; 1645 private static final int CONTINUATION_MARKER = 0xc0; 1646 1647 private PCEBuffer pceBuffer_ = new PCEBuffer(); 1648 private CollationElementIterator cei_; 1649 private int strength_; 1650 private boolean toShift_; 1651 private boolean isShifted_; 1652 private int variableTop_; 1653 CollationPCE(CollationElementIterator iter)1654 public CollationPCE(CollationElementIterator iter) { 1655 init(iter); 1656 } 1657 init(CollationElementIterator iter)1658 public void init(CollationElementIterator iter) { 1659 cei_ = iter; 1660 init(iter.getRuleBasedCollator()); 1661 } 1662 init(RuleBasedCollator coll)1663 private void init(RuleBasedCollator coll) { 1664 strength_ = coll.getStrength(); 1665 toShift_ = coll.isAlternateHandlingShifted(); 1666 isShifted_ = false; 1667 variableTop_ = coll.getVariableTop(); 1668 } 1669 1670 @SuppressWarnings("fallthrough") processCE(int ce)1671 private long processCE(int ce) { 1672 long primary = 0, secondary = 0, tertiary = 0, quaternary = 0; 1673 1674 // This is clean, but somewhat slow... 1675 // We could apply the mask to ce and then 1676 // just get all three orders... 1677 switch (strength_) { 1678 default: 1679 tertiary = CollationElementIterator.tertiaryOrder(ce); 1680 /* note fall-through */ 1681 1682 case Collator.SECONDARY: 1683 secondary = CollationElementIterator.secondaryOrder(ce); 1684 /* note fall-through */ 1685 1686 case Collator.PRIMARY: 1687 primary = CollationElementIterator.primaryOrder(ce); 1688 } 1689 1690 // **** This should probably handle continuations too. **** 1691 // **** That means that we need 24 bits for the primary **** 1692 // **** instead of the 16 that we're currently using. **** 1693 // **** So we can lay out the 64 bits as: 24.12.12.16. **** 1694 // **** Another complication with continuations is that **** 1695 // **** the *second* CE is marked as a continuation, so **** 1696 // **** we always have to peek ahead to know how long **** 1697 // **** the primary is... **** 1698 if ((toShift_ && variableTop_ > ce && primary != 0) || (isShifted_ && primary == 0)) { 1699 1700 if (primary == 0) { 1701 return CollationElementIterator.IGNORABLE; 1702 } 1703 1704 if (strength_ >= Collator.QUATERNARY) { 1705 quaternary = primary; 1706 } 1707 1708 primary = secondary = tertiary = 0; 1709 isShifted_ = true; 1710 } else { 1711 if (strength_ >= Collator.QUATERNARY) { 1712 quaternary = 0xFFFF; 1713 } 1714 1715 isShifted_ = false; 1716 } 1717 1718 return primary << 48 | secondary << 32 | tertiary << 16 | quaternary; 1719 } 1720 1721 /** 1722 * Get the processed ordering priority of the next collation element in the text. 1723 * A single character may contain more than one collation element. 1724 * 1725 * Note: This is equivalent to 1726 * UCollationPCE::nextProcessed(int32_t *ixLow, int32_t *ixHigh, UErrorCode *status); 1727 * 1728 * @param range receiving the iterator index before/after fetching the CE. 1729 * @return The next collation elements ordering, otherwise returns PROCESSED_NULLORDER 1730 * if an error has occurred or if the end of string has been reached 1731 */ nextProcessed(Range range)1732 public long nextProcessed(Range range) { 1733 long result = CollationElementIterator.IGNORABLE; 1734 int low = 0, high = 0; 1735 1736 pceBuffer_.reset(); 1737 1738 do { 1739 low = cei_.getOffset(); 1740 int ce = cei_.next(); 1741 high = cei_.getOffset(); 1742 1743 if (ce == CollationElementIterator.NULLORDER) { 1744 result = PROCESSED_NULLORDER; 1745 break; 1746 } 1747 1748 result = processCE(ce); 1749 } while (result == CollationElementIterator.IGNORABLE); 1750 1751 if (range != null) { 1752 range.ixLow_ = low; 1753 range.ixHigh_ = high; 1754 } 1755 1756 return result; 1757 } 1758 1759 /** 1760 * Get the processed ordering priority of the previous collation element in the text. 1761 * A single character may contain more than one collation element. 1762 * 1763 * Note: This is equivalent to 1764 * UCollationPCE::previousProcessed(int32_t *ixLow, int32_t *ixHigh, UErrorCode *status); 1765 * 1766 * @param range receiving the iterator index before/after fetching the CE. 1767 * @return The previous collation elements ordering, otherwise returns 1768 * PROCESSED_NULLORDER if an error has occurred or if the start of 1769 * string has been reached. 1770 */ previousProcessed(Range range)1771 public long previousProcessed(Range range) { 1772 long result = CollationElementIterator.IGNORABLE; 1773 int low = 0, high = 0; 1774 1775 // pceBuffer_.reset(); 1776 1777 while (pceBuffer_.empty()) { 1778 // buffer raw CEs up to non-ignorable primary 1779 RCEBuffer rceb = new RCEBuffer(); 1780 int ce; 1781 1782 boolean finish = false; 1783 1784 // **** do we need to reset rceb, or will it always be empty at this point **** 1785 do { 1786 high = cei_.getOffset(); 1787 ce = cei_.previous(); 1788 low = cei_.getOffset(); 1789 1790 if (ce == CollationElementIterator.NULLORDER) { 1791 if (!rceb.empty()) { 1792 break; 1793 } 1794 1795 finish = true; 1796 break; 1797 } 1798 1799 rceb.put(ce, low, high); 1800 } while ((ce & PRIMARYORDERMASK) == 0 || isContinuation(ce)); 1801 1802 if (finish) { 1803 break; 1804 } 1805 1806 // process the raw CEs 1807 while (!rceb.empty()) { 1808 RCEI rcei = rceb.get(); 1809 1810 result = processCE(rcei.ce_); 1811 1812 if (result != CollationElementIterator.IGNORABLE) { 1813 pceBuffer_.put(result, rcei.low_, rcei.high_); 1814 } 1815 } 1816 } 1817 1818 if (pceBuffer_.empty()) { 1819 // **** Is -1 the right value for ixLow, ixHigh? **** 1820 if (range != null) { 1821 range.ixLow_ = -1; 1822 range.ixHigh_ = -1; 1823 } 1824 return CollationElementIterator.NULLORDER; 1825 } 1826 1827 PCEI pcei = pceBuffer_.get(); 1828 1829 if (range != null) { 1830 range.ixLow_ = pcei.low_; 1831 range.ixHigh_ = pcei.high_; 1832 } 1833 1834 return pcei.ce_; 1835 } 1836 isContinuation(int ce)1837 private static boolean isContinuation(int ce) { 1838 return ((ce & CONTINUATION_MARKER) == CONTINUATION_MARKER); 1839 } 1840 1841 public static final class Range { 1842 int ixLow_; 1843 int ixHigh_; 1844 } 1845 1846 /** Processed collation element buffer stuff ported from ICU4C ucoleitr.cpp */ 1847 private static final class PCEI { 1848 long ce_; 1849 int low_; 1850 int high_; 1851 } 1852 1853 private static final class PCEBuffer { 1854 private PCEI[] buffer_ = new PCEI[DEFAULT_BUFFER_SIZE]; 1855 private int bufferIndex_ = 0; 1856 reset()1857 void reset() { 1858 bufferIndex_ = 0; 1859 } 1860 empty()1861 boolean empty() { 1862 return bufferIndex_ <= 0; 1863 } 1864 put(long ce, int ixLow, int ixHigh)1865 void put(long ce, int ixLow, int ixHigh) 1866 { 1867 if (bufferIndex_ >= buffer_.length) { 1868 PCEI[] newBuffer = new PCEI[buffer_.length + BUFFER_GROW]; 1869 System.arraycopy(buffer_, 0, newBuffer, 0, buffer_.length); 1870 buffer_ = newBuffer; 1871 } 1872 buffer_[bufferIndex_] = new PCEI(); 1873 buffer_[bufferIndex_].ce_ = ce; 1874 buffer_[bufferIndex_].low_ = ixLow; 1875 buffer_[bufferIndex_].high_ = ixHigh; 1876 1877 bufferIndex_ += 1; 1878 } 1879 get()1880 PCEI get() { 1881 if (bufferIndex_ > 0) { 1882 return buffer_[--bufferIndex_]; 1883 } 1884 return null; 1885 } 1886 } 1887 1888 /** Raw collation element buffer stuff ported from ICU4C ucoleitr.cpp */ 1889 private static final class RCEI { 1890 int ce_; 1891 int low_; 1892 int high_; 1893 } 1894 1895 private static final class RCEBuffer { 1896 private RCEI[] buffer_ = new RCEI[DEFAULT_BUFFER_SIZE]; 1897 private int bufferIndex_ = 0; 1898 empty()1899 boolean empty() { 1900 return bufferIndex_ <= 0; 1901 } 1902 put(int ce, int ixLow, int ixHigh)1903 void put(int ce, int ixLow, int ixHigh) { 1904 if (bufferIndex_ >= buffer_.length) { 1905 RCEI[] newBuffer = new RCEI[buffer_.length + BUFFER_GROW]; 1906 System.arraycopy(buffer_, 0, newBuffer, 0, buffer_.length); 1907 buffer_ = newBuffer; 1908 } 1909 buffer_[bufferIndex_] = new RCEI(); 1910 buffer_[bufferIndex_].ce_ = ce; 1911 buffer_[bufferIndex_].low_ = ixLow; 1912 buffer_[bufferIndex_].high_ = ixHigh; 1913 1914 bufferIndex_ += 1; 1915 } 1916 get()1917 RCEI get() { 1918 if (bufferIndex_ > 0) { 1919 return buffer_[--bufferIndex_]; 1920 } 1921 return null; 1922 } 1923 } 1924 } 1925 1926 /** 1927 * Java port of ICU4C CEI (usearch.cpp) 1928 * 1929 * CEI Collation Element + source text index. 1930 * These structs are kept in the circular buffer. 1931 */ 1932 private static class CEI { 1933 long ce_; 1934 int lowIndex_; 1935 int highIndex_; 1936 } 1937 1938 /** 1939 * CEBuffer A circular buffer of CEs from the text being searched 1940 */ 1941 private static class CEBuffer { 1942 // Java porting note: ICU4C uses the size for stack buffer 1943 // static final int DEFAULT_CEBUFFER_SIZE = 96; 1944 1945 static final int CEBUFFER_EXTRA = 32; 1946 static final int MAX_TARGET_IGNORABLES_PER_PAT_JAMO_L = 8; 1947 static final int MAX_TARGET_IGNORABLES_PER_PAT_OTHER = 3; 1948 1949 CEI[] buf_; 1950 int bufSize_; 1951 int firstIx_; 1952 int limitIx_; 1953 1954 // Java porting note: No references in ICU4C implementation 1955 // CollationElementIterator ceIter_; 1956 1957 StringSearch strSearch_; 1958 CEBuffer(StringSearch ss)1959 CEBuffer(StringSearch ss) { 1960 strSearch_ = ss; 1961 bufSize_ = ss.pattern_.PCELength_ + CEBUFFER_EXTRA; 1962 if (ss.search_.elementComparisonType_ != ElementComparisonType.STANDARD_ELEMENT_COMPARISON) { 1963 String patText = ss.pattern_.text_; 1964 if (patText != null) { 1965 for (int i = 0; i < patText.length(); i++) { 1966 char c = patText.charAt(i); 1967 if (MIGHT_BE_JAMO_L(c)) { 1968 bufSize_ += MAX_TARGET_IGNORABLES_PER_PAT_JAMO_L; 1969 } else { 1970 // No check for surrogates, we might allocate slightly more buffer than necessary. 1971 bufSize_ += MAX_TARGET_IGNORABLES_PER_PAT_OTHER; 1972 } 1973 } 1974 } 1975 } 1976 1977 // Not used - see above 1978 // ceIter_ = ss.textIter_; 1979 1980 firstIx_ = 0; 1981 limitIx_ = 0; 1982 1983 if (!ss.initTextProcessedIter()) { 1984 return; 1985 } 1986 1987 buf_ = new CEI[bufSize_]; 1988 } 1989 1990 // Get the CE with the specified index. 1991 // Index must be in the range 1992 // n-history_size < index < n+1 1993 // where n is the largest index to have been fetched by some previous call to this function. 1994 // The CE value will be UCOL__PROCESSED_NULLORDER at end of input. 1995 // get(int index)1996 CEI get(int index) { 1997 int i = index % bufSize_; 1998 1999 if (index >= firstIx_ && index < limitIx_) { 2000 // The request was for an entry already in our buffer. 2001 // Just return it. 2002 return buf_[i]; 2003 } 2004 2005 // Caller is requesting a new, never accessed before, CE. 2006 // Verify that it is the next one in sequence, which is all 2007 // that is allowed. 2008 if (index != limitIx_) { 2009 assert(false); 2010 return null; 2011 } 2012 2013 // Manage the circular CE buffer indexing 2014 limitIx_++; 2015 2016 if (limitIx_ - firstIx_ >= bufSize_) { 2017 // The buffer is full, knock out the lowest-indexed entry. 2018 firstIx_++; 2019 } 2020 2021 CollationPCE.Range range = new CollationPCE.Range(); 2022 if (buf_[i] == null) { 2023 buf_[i] = new CEI(); 2024 } 2025 buf_[i].ce_ = strSearch_.textProcessedIter_.nextProcessed(range); 2026 buf_[i].lowIndex_ = range.ixLow_; 2027 buf_[i].highIndex_ = range.ixHigh_; 2028 2029 return buf_[i]; 2030 } 2031 2032 // Get the CE with the specified index. 2033 // Index must be in the range 2034 // n-history_size < index < n+1 2035 // where n is the largest index to have been fetched by some previous call to this function. 2036 // The CE value will be UCOL__PROCESSED_NULLORDER at end of input. 2037 // getPrevious(int index)2038 CEI getPrevious(int index) { 2039 int i = index % bufSize_; 2040 2041 if (index >= firstIx_ && index < limitIx_) { 2042 // The request was for an entry already in our buffer. 2043 // Just return it. 2044 return buf_[i]; 2045 } 2046 2047 // Caller is requesting a new, never accessed before, CE. 2048 // Verify that it is the next one in sequence, which is all 2049 // that is allowed. 2050 if (index != limitIx_) { 2051 assert(false); 2052 return null; 2053 } 2054 2055 // Manage the circular CE buffer indexing 2056 limitIx_++; 2057 2058 if (limitIx_ - firstIx_ >= bufSize_) { 2059 // The buffer is full, knock out the lowest-indexed entry. 2060 firstIx_++; 2061 } 2062 2063 CollationPCE.Range range = new CollationPCE.Range(); 2064 if (buf_[i] == null) { 2065 buf_[i] = new CEI(); 2066 } 2067 buf_[i].ce_ = strSearch_.textProcessedIter_.previousProcessed(range); 2068 buf_[i].lowIndex_ = range.ixLow_; 2069 buf_[i].highIndex_ = range.ixHigh_; 2070 2071 return buf_[i]; 2072 } 2073 MIGHT_BE_JAMO_L(char c)2074 static boolean MIGHT_BE_JAMO_L(char c) { 2075 return (c >= 0x1100 && c <= 0x115E) 2076 || (c >= 0x3131 && c <= 0x314E) 2077 || (c >= 0x3165 && c <= 0x3186); 2078 } 2079 } 2080 } 2081