---
layout: default
title: Boundary Analysis
nav_order: 10
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Boundary Analysis
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview of Text Boundary Analysis

Text boundary analysis is the process of locating linguistic boundaries while
formatting and handling text. Examples of this process include:

1. Locating appropriate points to word-wrap text to fit within specific margins
   while displaying or printing.

2. Locating the beginning of a word that the user has selected.

3. Counting characters, words, sentences, or paragraphs.

4. Determining how far to move the text cursor when the user hits an arrow key
   (some characters require more than one position in the text store and some
   characters in the text store do not display at all).

5. Making a list of the unique words in a document.

6. Figuring out if a given range of text contains only whole words.

7. Capitalizing the first letter of each word.

8. Locating a particular unit of the text (for example, finding the third word
   in the document).

The `BreakIterator` classes were designed to support these kinds of tasks. A
`BreakIterator` object maintains a location between two characters in the text.
This location is always a text boundary. Clients can move the location
forward to the next boundary or backward to the previous boundary. Clients can
also check whether a particular location within a source text is on a boundary,
or find the boundary that is before or after a particular location.

## Four Types of BreakIterator

ICU `BreakIterator`s can be used to locate the following kinds of text boundaries:

1. Character Boundary

2. Word Boundary

3. Line-break Boundary

4. Sentence Boundary

Each type of boundary is found in accordance with the rules specified by Unicode
Standard Annex #29, *Unicode Text Segmentation*
(<https://www.unicode.org/reports/tr29/>) or Unicode Standard Annex #14, *Unicode
Line Breaking Algorithm* (<https://www.unicode.org/reports/tr14/>).

### Character Boundary

The character-boundary iterator locates the boundaries according to the rules
defined in <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>.
These boundaries try to match what a user would think of as a "character" (a
basic unit of a writing system for a language), which may be more than just a
single Unicode code point.

The letter `Ä`, for example, can be represented in Unicode either with a single
code-point value or with two code-point values (one representing the `A` and
another representing the umlaut `¨`). The character-boundary iterator will treat
either representation as a single character.

End-user characters, as described above, are also called grapheme clusters, in
an attempt to limit the confusion caused by multiple meanings for the word
"character".

### Word Boundary

The word-boundary iterator locates the boundaries of words, for purposes such as
double-click selection or "Find whole words" operations.

Word boundaries are identified according to the rules in
<https://www.unicode.org/reports/tr29/#Word_Boundaries>, supplemented by a word
dictionary for text in languages such as Chinese, Japanese, Thai or Khmer. The
rules used for locating word breaks take into account the alphabets and
conventions used by different languages.

Here's an example of a sentence, showing the boundary locations that will be
identified by a word break iterator:

> :point_right: **Note**: TODO: An example needs to be added here.

### Line-break Boundary

The line-break iterator locates positions that would be appropriate points to
wrap lines when displaying the text. The boundary rules are defined here:
<https://www.unicode.org/reports/tr14/>

This example shows the differences in the break locations produced by word and
line break iterators:

> :point_right: **Note**: TODO: An example needs to be added here.

### Sentence Boundary

A sentence-break iterator locates sentence boundaries according to the rules
defined here: <https://www.unicode.org/reports/tr29/#Sentence_Boundaries>

## Dictionary-Based BreakIterator

Some languages are written without spaces, and word and line breaking requires
more than rules over character sequences. ICU provides dictionary support for
word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese.

Use of the dictionaries is automatic when text in one of the dictionary
languages is encountered. There is no separate API, and no extra programming
steps are required of applications making use of the dictionaries.

## Usage

To locate boundaries in a document, create a `BreakIterator` using the
`BreakIterator::create***Instance` family of methods in C++, or the `ubrk_open()`
function (C), where "`***`" is `Character`, `Word`, `Line` or `Sentence`,
depending on the type of iterator wanted. These factory methods also take a
parameter that specifies the locale for the language of the text to be processed.

The behavior of the `BreakIterator` obtained may be specialized in some way for
the locale specified at creation. For most locales the default break iterator
behavior is used.

Applications also may register customized `BreakIterator`s for use in specific
locales.
Once such a break iterator has been registered, any requests for break
iterators for that locale will return copies of the registered break iterator.

ICU may cache service instances. Therefore, registration should be done during
startup, before opening services by locale ID.

In the general usage model, applications will use the following basic steps to
analyze a piece of text for boundaries:

1. Create a `BreakIterator` with the desired behavior.

2. Use the `setText()` method to set the iterator to analyze a particular piece
   of text.

3. Locate the desired boundaries using the appropriate combination of `first()`,
   `last()`, `next()`, `previous()`, `preceding()`, and `following()` methods.

The `setText()` method can be called more than once, allowing reuse of a
`BreakIterator` on new pieces of text. Because the creation of a `BreakIterator`
can be relatively time-consuming, it makes good sense to reuse them when
practical.

The iterator always points to a boundary position between two characters. The
numerical value of the position, as returned by `current()`, is the zero-based
index of the character following the boundary. Thus a position of zero
represents a boundary preceding the first character of the text, and a position
of one represents a boundary between the first and second characters.

The `first()` and `last()` methods reset the iterator's current position to the
beginning or end of the text (the beginning and the end are always considered
boundaries). The `next()` and `previous()` methods advance the iterator one
boundary forward or backward from the current position. If `next()` runs off
the end of the text, or `previous()` runs off the beginning, it returns
`BreakIterator::DONE`. The `current()` method returns the current position.

The `following()` and `preceding()` methods are used for random access, to move
the iterator to an arbitrary position within the text. They never set the
iterator to the position specified by the caller, even if that position is, in
fact, a boundary. Instead, `following()` moves the iterator to the first
boundary after the specified position, and `preceding()` moves it to the last
boundary before the specified position.

`isBoundary()` returns true if the specified position is a boundary.

### Thread Safety

`BreakIterator`s are not thread safe. This is inherent in their design: break
iterators are stateful, holding a reference to and position in the text, meaning
that a single instance cannot operate in parallel on multiple texts.

For concurrent break iteration, each thread must use its own break iterator.
These can be obtained by creating separate break iterators of the desired type,
or by initially creating a main break iterator and then creating a clone for
each thread.

### Line Breaking Strictness, a CSS Property

CSS has the concept of "[Line Breaking
Strictness](https://www.w3.org/TR/css-text-3/#line-break-property)". This
property specifies the strictness of line-breaking rules applied within an
element, especially how wrapping interacts with punctuation and symbols. ICU
line break iterators can choose a strictness using locale tags:

| Locale       | Behavior    |
| ------------ | ----------- |
| `en@lb=strict` <br/> `ja@lb=strict` | Breaks text using the most stringent set of line-breaking rules. |
| `en@lb=normal` <br/> `ja@lb=normal` | Breaks text using the most common set of line-breaking rules. |
| `en@lb=loose` <br/> `ja@lb=loose`   | Breaks text using the least restrictive set of line-breaking rules. Typically used for short lines, such as in newspapers. |

### Sentence Break Filters

Sentence breaking can return false positives (boundaries reported where no
sentence actually ends) in the presence of abbreviations. For example,
consider the sentence

> In the meantime Mr. Weston arrived with his small ship.

The default sentence break iterator reports a false boundary following "Mr."

ICU includes lists of common abbreviations that can be used to filter out
(ignore) these false sentence boundaries. Filtering is enabled by the presence
of the `ss` locale tag when creating the break iterator.

| Locale           | Behavior                                                |
| ---------------- | ------------------------------------------------------- |
| `en`             | No filtering.                                           |
| `en@ss=standard` | Filter based on common English language abbreviations.  |
| `es@ss=standard` | Filter with common Spanish abbreviations.               |

Abbreviation lists are available (as of ICU 64) for English, German, Spanish,
French, Italian and Portuguese.

## Accuracy

ICU's break iterators are based on the default boundary rules described in the
Unicode Standard Annexes [14](https://www.unicode.org/reports/tr14/) and
[29](https://www.unicode.org/reports/tr29/). These are relatively
simple boundary rules that can be implemented efficiently, and are sufficient
for many purposes and languages. However, some languages and applications will
require a more sophisticated linguistic analysis of the text in order to find
boundaries with good accuracy. Such an analysis is not directly available from
ICU at this time.

Break iterators based on custom, user-supplied boundary rules can be created and
used by applications with requirements that are not met by the standard default
boundary rules.

## BreakIterator Boundary Analysis Examples

### Print out all the word-boundary positions in a UnicodeString

**In C++:**

```c++
#include <stdio.h>
#include "unicode/brkiter.h"
using namespace icu;

void listWordBoundaries(const UnicodeString& s) {
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
    if (U_FAILURE(status)) {
        return;  // no iterator was created
    }
    bi->setText(s);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d\n", p);
        p = bi->next();
    }
    delete bi;
}
```

**In C:**

```c
#include <stdio.h>
#include "unicode/ubrk.h"

void listWordBoundaries(const UChar* s, int32_t len) {
    UBreakIterator* bi;
    int32_t p;
    UErrorCode err = U_ZERO_ERROR;
    /* A NULL locale selects the default locale. */
    bi = ubrk_open(UBRK_WORD, NULL, s, len, &err);
    if (U_FAILURE(err)) return;
    p = ubrk_first(bi);
    while (p != UBRK_DONE) {
        printf("Boundary at position %d\n", p);
        p = ubrk_next(bi);
    }
    ubrk_close(bi);
}
```

### Get the boundaries of the word that contains a double-click position

**In C++:**

```c++
void wordContaining(BreakIterator& wordBrk,
                    int32_t idx,
                    const UnicodeString& s,
                    int32_t& start,
                    int32_t& end) {
    // This function is written to assume that we have an
    // appropriate BreakIterator stored in an object or a
    // global variable somewhere. When possible, programmers
    // should avoid having the create() and delete calls in
    // a function of this nature.
    if (s.isEmpty())
        return;
    wordBrk.setText(s);
    start = wordBrk.preceding(idx + 1);
    end = wordBrk.next();
    // NOTE: for this and similar operations, use preceding() and next()
    // as shown here, not following() and previous(). preceding() is
    // faster than following() and next() is faster than previous().
    // NOTE: By using preceding(idx + 1) above, we're adopting the convention
    // that if the double-click comes right on top of a word boundary, it
    // selects the word that _begins_ on that boundary (preceding(idx) would
    // instead select the word that _ends_ on that boundary).
}
```

**In C:**

```c
void wordContaining(UBreakIterator* wordBrk,
                    int32_t idx,
                    const UChar* s,
                    int32_t sLen,
                    int32_t* start,
                    int32_t* end,
                    UErrorCode* err) {
    /* Do nothing if there is no error code or it already signals failure. */
    if (err == NULL || U_FAILURE(*err)) {
        return;
    }
    if (wordBrk == NULL || s == NULL || start == NULL || end == NULL) {
        *err = U_ILLEGAL_ARGUMENT_ERROR;
        return;
    }
    ubrk_setText(wordBrk, s, sLen, err);
    if (U_SUCCESS(*err)) {
        *start = ubrk_preceding(wordBrk, idx + 1);
        *end = ubrk_next(wordBrk);
    }
}
```

### Check for Whole Words

Use the following to check if a range of text is a "whole word":

**In C++:**

```c++
UBool isWholeWord(BreakIterator& wordBrk,
                  const UnicodeString& s,
                  int32_t start,
                  int32_t end) {
    if (s.isEmpty())
        return false;
    wordBrk.setText(s);
    if (!wordBrk.isBoundary(start))
        return false;
    return wordBrk.isBoundary(end);
}
```

**In C:**

```c
#include <stdbool.h>
#include "unicode/ubrk.h"

UBool isWholeWord(UBreakIterator* wordBrk,
                  const UChar* s,
                  int32_t sLen,
                  int32_t start,
                  int32_t end,
                  UErrorCode* err) {
    UBool result = false;
    if (err == NULL || U_FAILURE(*err)) {
        return false;
    }
    if (wordBrk == NULL || s == NULL) {
        *err = U_ILLEGAL_ARGUMENT_ERROR;
        return false;
    }
    ubrk_setText(wordBrk, s, sLen, err);
    if (U_SUCCESS(*err)) {
        result = ubrk_isBoundary(wordBrk, start) && ubrk_isBoundary(wordBrk, end);
    }
    return result;
}
```

### Count the words in a document (C++ only)

```c++
int32_t countWords(RuleBasedBreakIterator& bi, const UnicodeString& s) {
    bi.setText(s);
    int32_t count = 0;
    // Start at the beginning of the text, then visit each later boundary.
    bi.first();
    while (bi.next() != BreakIterator::DONE) {
        // The rule status describes the text between the previous
        // boundary and the one just found.
        int32_t breakType = bi.getRuleStatus();
        if (breakType != UBRK_WORD_NONE) {
            // Exclude spaces, punctuation, and the like.
            // A status value of UBRK_WORD_NONE indicates that the text
            // preceding the boundary is not a word or number.
            ++count;
        }
    }
    return count;
}
```

The function `getRuleStatus()` returns an enum giving additional information on
the text preceding the last break position found. Using this value, it is
possible to distinguish between numbers, words, words containing kana
characters, words containing ideographic characters, and non-word characters,
such as spaces or punctuation. The sample uses the break status value to filter
out, and not count, boundaries associated with non-word characters.

### Word-wrap a document (C++ only)

The sample function below wraps a paragraph so that each line is less than or
equal to 72 characters. The function fills in an array passed in by the caller
with the starting offsets of each line in the document. It also fills in a
second array that records how many trailing white space characters there are in
each line. For simplicity, it is assumed that an outside process has already
broken the document into paragraphs. That is, every string passed to the
function is assumed to contain a single newline, located at the end.

```c++
int32_t wrapParagraph(const UnicodeString& s,
                      const Locale& locale,
                      int32_t lineStarts[],
                      int32_t trailingwhitespace[],
                      int32_t maxLines,
                      UErrorCode &status) {

    int32_t numLines = 0;
    int32_t p, q;
    const int32_t MAX_CHARS_PER_LINE = 72;
    UChar c;

    BreakIterator *bi = BreakIterator::createLineInstance(locale, status);
    if (U_FAILURE(status)) {
        delete bi;
        return 0;
    }
    bi->setText(s);

    p = 0;
    // stop at the end of the text or when the output arrays are full
    while (p < s.length() && numLines < maxLines) {
        // jump ahead in the paragraph by the maximum number of
        // characters that will fit
        q = p + MAX_CHARS_PER_LINE;

        // if this puts us on a white space character, a control character
        // (which includes newlines), or a non-spacing mark, seek forward
        // and stop on the next character that is not any of these things
        // (since none of these characters will be visible at the end of a
        // line, we can ignore them for the purposes of figuring out how
        // many characters will fit on the line)
        while (q < s.length()) {
            c = s[q];
            if (!(u_isspace(c)
                   || u_charType(c) == U_CONTROL_CHAR
                   || u_charType(c) == U_NON_SPACING_MARK)) {
                break;
            }
            ++q;
        }
        if (q > s.length()) {
            q = s.length();
        }

        // then locate the last legal line-break decision at or before
        // the current position ("at or before" is what causes the "+ 1")
        q = bi->preceding(q + 1);

        // if this causes us to wind back to (or past) where we started,
        // then the line has no legal line-break positions. Record the
        // start of this line (p) and break the line at the maximum
        // number of characters
        if (q <= p) {
            lineStarts[numLines] = p;
            trailingwhitespace[numLines] = 0;
            p += MAX_CHARS_PER_LINE;
            ++numLines;
        }
        // otherwise, we got a good line-break position. Record the start of this
        // line (p) and then seek back from the end of this line (q) until you find
        // a non-white space character (same criteria as above) and
        // record the number of white space characters at the end of the
        // line in the other results array
        else {
            lineStarts[numLines] = p;
            int32_t nextLineStart = q;

            for (q--; q > p; q--) {
                c = s[q];
                if (!(u_isspace(c)
                       || u_charType(c) == U_CONTROL_CHAR
                       || u_charType(c) == U_NON_SPACING_MARK)) {
                    break;
                }
            }
            trailingwhitespace[numLines] = nextLineStart - q - 1;
            p = nextLineStart;
            ++numLines;
        }
    }
    delete bi;
    return numLines;
}
```

Most text editors would not break lines based on the number of characters on a
line. Even with a monospaced font, there are still many Unicode characters that
are not displayed and therefore should be filtered out of the calculation. With
a proportional font, character widths are added up until a maximum line width is
exceeded or an end-of-paragraph marker is reached.

Trailing white space does not need to be counted in the line-width measurement
because it does not need to be displayed at the end of a line. The sample code
above returns an array of trailing white space values because an external
rendering process needs to be able to measure the length of the line (without
the trailing white space) to justify the lines. For example, if the text is
right-justified, the invisible white space would be drawn outside the margin.
The line would actually end with the last visible character.

In either case, the basic principle is to jump ahead in the text to the location
where the line would break (without taking word breaks into account). Then, move
backwards using the `preceding()` method to find the last legal breaking position
before that location.
Iterating straight through the text with the `next()` method
will generally be slower.

## ICU BreakIterator Data Files

The source code for the ICU break rules for the standard boundary types is
located in the directory
[icu4c/source/data/brkitr/rules](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules).
These files will be built, and the corresponding binary state tables
incorporated into ICU's data, by the standard ICU4C build process.

The dictionary word lists used by word break and, for some languages, line break
are in
[icu4c/source/data/brkitr/dictionaries](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/dictionaries).

The same data is used by both ICU4C and ICU4J. In the normal ICU build process,
the source data is processed into a binary form using ICU4C, and the resulting
binary tables are incorporated into ICU4J.