/*
 * Copyright (c) 1999, 2010, Oracle and/or its affiliates. All rights reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.  Oracle designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Oracle in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
 * or visit www.oracle.com if you need additional information or have any
 * questions.
 */

/*
 *
 * (C) Copyright Taligent, Inc. 1996, 1997 - All Rights Reserved
 * (C) Copyright IBM Corp. 1996 - 2002 - All Rights Reserved
 *
 * The original version of this source code and documentation
 * is copyrighted and owned by Taligent, Inc., a wholly-owned
 * subsidiary of IBM. These materials are provided under terms
 * of a License Agreement between Taligent and Sun. This technology
 * is protected by multiple US and International patents.
 *
 * This notice and attribution to Taligent may not be removed.
 * Taligent is a registered trademark of Taligent, Inc.
 */


package java.text;

/**
 * <p>A subclass of BreakIterator whose behavior is specified using a list of rules.</p>
 *
 * <p>There are two kinds of rules, which are separated by semicolons: <i>substitutions</i>
 * and <i>regular expressions.</i></p>
 *
 * <p>A substitution rule defines a name that can be used in place of an expression. It
 * consists of a name, which is a string of characters contained in angle brackets, an equals
 * sign, and an expression. (There can be no whitespace on either side of the equals sign.)
 * To keep its syntactic meaning intact, the expression must be enclosed in parentheses or
 * square brackets. A substitution is visible after its definition, and is filled in using
 * simple textual substitution. Substitution definitions can contain other substitutions, as
 * long as those substitutions have been defined first. Substitutions are generally used to
 * make the regular expressions (which can get quite complex) shorter and easier to read.
 * They typically define either character categories or commonly-used subexpressions.</p>
 *
 * <p>There is one special substitution. If the description defines a substitution
 * called "&lt;ignore&gt;", the expression must be a [] expression, and the
 * expression defines a set of characters (the "<em>ignore characters</em>") that
 * will be transparent to the BreakIterator. A sequence of characters will break the
 * same way it would if any ignore characters it contains are taken out. Break
 * positions never occur before ignore characters.</p>
 *
 * <p>A regular expression uses a subset of the normal Unix regular-expression syntax, and
 * defines a sequence of characters to be kept together. With one significant exception, the
 * iterator uses a longest-possible-match algorithm when matching text to regular
 * expressions.
 * The iterator also treats descriptions containing multiple regular expressions
 * as if they were ORed together (i.e., as if they were separated by |).</p>
 *
 * <p>The special characters recognized by the regular-expression parser are as follows:</p>
 *
 * <blockquote>
 *   <table border="1" width="100%">
 *     <tr>
 *       <td width="6%">*</td>
 *       <td width="94%">Specifies that the expression preceding the asterisk may occur any number
 *       of times (including not at all).</td>
 *     </tr>
 *     <tr>
 *       <td width="6%">{}</td>
 *       <td width="94%">Encloses a sequence of characters that is optional.</td>
 *     </tr>
 *     <tr>
 *       <td width="6%">()</td>
 *       <td width="94%">Encloses a sequence of characters. If followed by *, the sequence
 *       repeats. Otherwise, the parentheses are just a grouping device and a way to delimit
 *       the ends of expressions containing |.</td>
 *     </tr>
 *     <tr>
 *       <td width="6%">|</td>
 *       <td width="94%">Separates two alternative sequences of characters. Either one
 *       sequence or the other, but not both, matches this expression. The | character can
 *       only occur inside ().</td>
 *     </tr>
 *     <tr>
 *       <td width="6%">.</td>
 *       <td width="94%">Matches any character.</td>
 *     </tr>
 *     <tr>
 *       <td width="6%">*?</td>
 *       <td width="94%">Specifies a non-greedy asterisk. *? works the same way as *, except
 *       when there is overlap between the last group of characters in the expression preceding
 *       the * and the first group of characters following the *. When there is this kind of
 *       overlap, * will match the longest sequence of characters that match the expression before
 *       the *, and *? will match the shortest sequence of characters matching the expression
 *       before the *?.
 *       For example, if you have "xxyxyyyxyxyxxyxyxyy" in the text,
 *       "x[xy]*x" will match through to the last x (i.e., "<strong>xxyxyyyxyxyxxyxyx</strong>yy"),
 *       but "x[xy]*?x" will only match the first two xes ("<strong>xx</strong>yxyyyxyxyxxyxyxyy").</td>
 *     </tr>
 *     <tr>
 *       <td width="6%">[]</td>
 *       <td width="94%">Specifies a group of alternative characters. A [] expression will
 *       match any single character that is specified in the [] expression. For more on the
 *       syntax of [] expressions, see below.</td>
 *     </tr>
 *     <tr>
 *       <td width="6%">/</td>
 *       <td width="94%">Specifies where the break position should go if text matches this
 *       expression. (e.g., "[a-z]&#42;/[:Zs:]*[0-9]" will match if the iterator sees a run
 *       of letters, followed by a run of whitespace, followed by a digit, but the break position
 *       will actually go before the whitespace). Expressions that don't contain / put the
 *       break position at the end of the matching text.</td>
 *     </tr>
 *     <tr>
 *       <td width="6%">\</td>
 *       <td width="94%">Escape character. The \ itself is ignored, but causes the next
 *       character to be treated as a literal character. This has no effect for many
 *       characters, but for the characters listed above, this deprives them of their special
 *       meaning. (There are no special escape sequences for Unicode characters, or tabs and
 *       newlines; these are all handled by a higher-level protocol. In a Java string,
 *       "\n" will be converted to a literal newline character by the time the
 *       regular-expression parser sees it. Of course, this means that \ sequences that are
 *       visible to the regexp parser must be written as \\ when inside a Java string.)
 *       All characters in the ASCII range except for letters, digits, and control characters are
 *       reserved characters to the parser and must be preceded by \ even if they currently don't
 *       mean anything.</td>
 *     </tr>
 *     <tr>
 *       <td width="6%">!</td>
 *       <td width="94%">If ! appears at the beginning of a regular expression, it tells the regexp
 *       parser that this expression specifies the backwards-iteration behavior of the iterator,
 *       and not its normal iteration behavior. This is generally only used in situations
 *       where the automatically-generated backwards-iteration behavior doesn't produce
 *       satisfactory results and must be supplemented with extra client-specified rules.</td>
 *     </tr>
 *     <tr>
 *       <td width="6%"><em>(all others)</em></td>
 *       <td width="94%">All other characters are treated as literal characters, which must match
 *       the corresponding character(s) in the text exactly.</td>
 *     </tr>
 *   </table>
 * </blockquote>
 *
 * <p>Within a [] expression, a number of other special characters can be used to specify
 * groups of characters:</p>
 *
 * <blockquote>
 *   <table border="1" width="100%">
 *     <tr>
 *       <td width="6%">-</td>
 *       <td width="94%">Specifies a range of matching characters. For example
 *       "[a-p]" matches all lowercase Latin letters from a to p (inclusive). The -
 *       sign specifies ranges of continuous Unicode numeric values, not ranges of characters in a
 *       language's alphabetical order: "[a-z]" doesn't include capital letters, nor does
 *       it include accented letters such as a-umlaut.</td>
 *     </tr>
 *     <tr>
 *       <td width="6%">::</td>
 *       <td width="94%">A pair of colons containing a one- or two-letter code matches all
 *       characters in the corresponding Unicode category. The two-letter codes are the same
 *       as the two-letter codes in the Unicode database (for example, "[:Sc::Sm:]"
 *       matches all currency symbols and all math symbols).
 *       Specifying a one-letter code is
 *       the same as specifying all two-letter codes that begin with that letter (for example,
 *       "[:L:]" matches all letters, and is equivalent to
 *       "[:Lu::Ll::Lo::Lm::Lt:]"). Anything other than a valid two-letter Unicode
 *       category code or a single letter that begins a Unicode category code is illegal within
 *       colons.</td>
 *     </tr>
 *     <tr>
 *       <td width="6%">[]</td>
 *       <td width="94%">[] expressions can nest. This has no effect, except when used in
 *       conjunction with the ^ token.</td>
 *     </tr>
 *     <tr>
 *       <td width="6%">^</td>
 *       <td width="94%">Excludes the character (or the characters in the [] expression) following
 *       it from the group of characters. For example, "[a-z^p]" matches all Latin
 *       lowercase letters except p. "[:L:^[\u4e00-\u9fff]]" matches all letters
 *       except the Han ideographs.</td>
 *     </tr>
 *     <tr>
 *       <td width="6%"><em>(all others)</em></td>
 *       <td width="94%">All other characters are treated as literal characters. (For
 *       example, "[aeiou]" specifies just the letters a, e, i, o, and u.)</td>
 *     </tr>
 *   </table>
 * </blockquote>
 *
 * <p>For a more complete explanation, see <a
 * href="http://www.ibm.com/java/education/boundaries/boundaries.html">http://www.ibm.com/java/education/boundaries/boundaries.html</a>.
 * For examples, see the resource data (which is annotated).</p>
 *
 * @author Richard Gillam
 */
class IcuIteratorWrapper extends BreakIterator {

    /* The wrapped ICU implementation. Non-final for #clone() */
    private android.icu.text.BreakIterator wrapped;

    /**
     * Constructs an IcuIteratorWrapper around the provided ICU BreakIterator.
     */
    IcuIteratorWrapper(android.icu.text.BreakIterator iterator) {
        wrapped = iterator;
    }

    /**
     * Clones this iterator.
     *
     * @return A newly-constructed IcuIteratorWrapper with the same
     * behavior as this one.
     */
    public Object clone() {
        IcuIteratorWrapper result = (IcuIteratorWrapper) super.clone();
        result.wrapped = (android.icu.text.BreakIterator) wrapped.clone();
        return result;
    }

    /**
     * Returns true if both BreakIterators are of the same class, have the same
     * rules, and iterate over the same text.
     */
    public boolean equals(Object that) {
        if (!(that instanceof IcuIteratorWrapper)) {
            return false;
        }
        return wrapped.equals(((IcuIteratorWrapper) that).wrapped);
    }

    //=======================================================================
    // BreakIterator overrides
    //=======================================================================

    /**
     * Returns the string representation of the wrapped iterator.
     */
    public String toString() {
        return wrapped.toString();
    }

    /**
     * Computes a hash code for this BreakIterator.
     *
     * @return A hash code
     */
    public int hashCode() {
        return wrapped.hashCode();
    }

    /**
     * Sets the current iteration position to the beginning of the text
     * (i.e., the CharacterIterator's starting offset).
     *
     * @return The offset of the beginning of the text.
     */
    public int first() {
        return wrapped.first();
    }

    /**
     * Sets the current iteration position to the end of the text
     * (i.e., the CharacterIterator's ending offset).
     *
     * @return The text's past-the-end offset.
     */
    public int last() {
        return wrapped.last();
    }

    /**
     * Advances the iterator either forward or backward the specified number of steps.
     * Negative values move backward, and positive values move forward. This is
     * equivalent to repeatedly calling next() or previous().
     *
     * @param n The number of steps to move.
     * The sign indicates the direction
     * (negative is backwards, and positive is forwards).
     * @return The character offset of the boundary position n boundaries away from
     * the current one.
     */
    public int next(int n) {
        return wrapped.next(n);
    }

    /**
     * Advances the iterator to the next boundary position.
     *
     * @return The position of the first boundary after this one.
     */
    public int next() {
        return wrapped.next();
    }

    /**
     * Advances the iterator backwards, to the last boundary preceding this one.
     *
     * @return The position of the last boundary position preceding this one.
     */
    public int previous() {
        return wrapped.previous();
    }

    /**
     * Throws IllegalArgumentException unless begin &lt;= offset &lt;= end.
     */
    protected static final void checkOffset(int offset, CharacterIterator text) {
        if (offset < text.getBeginIndex() || offset > text.getEndIndex()) {
            throw new IllegalArgumentException("offset out of bounds");
        }
    }

    /**
     * Sets the iterator to refer to the first boundary position following
     * the specified position.
     *
     * @param offset The position from which to begin searching for a break position.
     * @return The position of the first break after the specified position.
     */
    public int following(int offset) {
        CharacterIterator text = getText();
        checkOffset(offset, text);
        return wrapped.following(offset);
    }

    /**
     * Sets the iterator to refer to the last boundary position before the
     * specified position.
     *
     * @param offset The position to begin searching for a break from.
     * @return The position of the last boundary before the starting position.
     */
    public int preceding(int offset) {
        // Validate the offset here, then let the wrapped ICU iterator
        // carry out the search.
        CharacterIterator text = getText();
        checkOffset(offset, text);
        return wrapped.preceding(offset);
    }

    /**
     * Returns true if the specified position is a boundary position. As a side
     * effect, leaves the iterator pointing to the first boundary position at
     * or after "offset".
     *
     * @param offset the offset to check.
     * @return True if "offset" is a boundary position.
     */
    public boolean isBoundary(int offset) {
        CharacterIterator text = getText();
        checkOffset(offset, text);
        return wrapped.isBoundary(offset);
    }

    /**
     * Returns the current iteration position.
     *
     * @return The current iteration position.
     */
    public int current() {
        return wrapped.current();
    }

    /**
     * Return a CharacterIterator over the text being analyzed. This version
     * of this method returns the actual CharacterIterator we're using internally.
     * Changing the state of this iterator can have undefined consequences. If
     * you need to change it, clone it first.
     *
     * @return An iterator over the text being analyzed.
     */
    public CharacterIterator getText() {
        return wrapped.getText();
    }

    /**
     * Set the iterator to analyze a new piece of text.
     *
     * @param newText The text to analyze.
     */
    public void setText(String newText) {
        wrapped.setText(newText);
    }

    /**
     * Set the iterator to analyze a new piece of text. This function resets
     * the current iteration position to the beginning of the text.
     *
     * @param newText An iterator over the text to analyze.
     */
    public void setText(CharacterIterator newText) {
        // Dereferences newText up front, so a null argument fails here
        // before it reaches the wrapped iterator.
        newText.current();
        wrapped.setText(newText);
    }
}
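
The class above is package-private, so callers normally reach it only through the public `java.text.BreakIterator` factory methods. The following is a minimal sketch of that calling pattern, not part of this file: `BreakIteratorDemo` and `boundaries` are hypothetical names introduced for illustration, and the sketch uses only the public `BreakIterator` API (`setText`, `first`, `next`, `following`, `isBoundary`). Whether a given runtime actually returns an `IcuIteratorWrapper` from these factories is an assumption about the platform, not something the sketch depends on.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical demo class; not part of the file above. */
final class BreakIteratorDemo {

    /** Collects every boundary offset the iterator reports for the text. */
    static List<Integer> boundaries(BreakIterator it, String text) {
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        // first() moves to the start of the text; next() returns DONE
        // once the iterator has passed the last boundary.
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Character boundaries of a plain ASCII string fall between every char.
        BreakIterator chars = BreakIterator.getCharacterInstance(Locale.ROOT);
        System.out.println(boundaries(chars, "abc")); // prints [0, 1, 2, 3]

        // Word boundaries depend on the rule data shipped with the runtime,
        // so no expected output is claimed here.
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        System.out.println(boundaries(words, "Hi there"));

        // following() and isBoundary() take an offset inside the text;
        // the wrapper above rejects out-of-range offsets via checkOffset().
        words.setText("Hi there");
        System.out.println(words.following(0));
        System.out.println(words.isBoundary(0));
    }
}
```

Keeping the loop in a helper that takes the iterator as a parameter means the same code works with whichever concrete `BreakIterator` subclass the platform hands back.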