---
layout: default
title: Boundary Analysis
nav_order: 10
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Boundary Analysis
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview of Text Boundary Analysis

Text boundary analysis is the process of locating linguistic boundaries while
formatting and handling text. Examples of this process include:

1. Locating appropriate points to word-wrap text to fit within specific margins
   while displaying or printing.

2. Locating the beginning of a word that the user has selected.

3. Counting characters, words, sentences, or paragraphs.

4. Determining how far to move the text cursor when the user hits an arrow key.
   (Some characters require more than one position in the text store, and some
   characters in the text store do not display at all.)

5. Making a list of the unique words in a document.

6. Figuring out if a given range of text contains only whole words.

7. Capitalizing the first letter of each word.

8. Locating a particular unit of the text (for example, finding the third word
   in the document).

The `BreakIterator` classes were designed to support these kinds of tasks. A
`BreakIterator` object maintains a location between two characters in the text.
This location is always a text boundary. Clients can move the location
forward to the next boundary or backward to the previous boundary. Clients can
also check whether a particular location within a source text is on a boundary,
or find the boundary that is before or after a particular location.

## Four Types of BreakIterator

ICU `BreakIterator`s can be used to locate the following kinds of text boundaries:

1. Character Boundary

2. Word Boundary

3. Line-break Boundary

4. Sentence Boundary

Each type of boundary is found in accordance with the rules specified by Unicode
Standard Annex #29, *Unicode Text Segmentation*
(<https://www.unicode.org/reports/tr29/>), or Unicode Standard Annex #14, *Unicode
Line Breaking Algorithm* (<https://www.unicode.org/reports/tr14/>).

### Character Boundary

The character-boundary iterator locates the boundaries according to the rules
defined in <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>.
These boundaries try to match what a user would think of as a "character" (a
basic unit of a writing system for a language), which may be more than just a
single Unicode code point.

The letter `Ä`, for example, can be represented in Unicode either with a single
code-point value or with two code-point values (one representing the `A` and
another representing the umlaut `¨`). The character-boundary iterator will treat
either representation as a single character.

End-user characters, as described above, are also called grapheme clusters, in
an attempt to limit the confusion caused by multiple meanings for the word
"character".
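
The `Ä` example above can be illustrated with a small standalone sketch. It is
not ICU code and not the UAX #29 algorithm: it assumes a single simplified rule
(a code unit in the Combining Diacritical Marks block, U+0300 to U+036F, never
starts a new character), and the names `isCombiningDiacritic` and
`characterBoundaries` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for one grapheme-cluster rule: a combining diacritical
// mark (U+0300..U+036F) attaches to the preceding character. The real rules
// in UAX #29 cover many more cases.
inline bool isCombiningDiacritic(char16_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Returns the boundary positions as indexes into the UTF-16 text.
// The beginning and the end of the text are always boundaries.
std::vector<int> characterBoundaries(const std::u16string& text) {
    std::vector<int> boundaries{0};
    for (std::size_t i = 1; i < text.size(); ++i) {
        if (!isCombiningDiacritic(text[i])) {
            boundaries.push_back(static_cast<int>(i));
        }
    }
    if (!text.empty()) {
        boundaries.push_back(static_cast<int>(text.size()));
    }
    assert(boundaries.back() == static_cast<int>(text.size()));
    return boundaries;
}
```

Under this model, `Ä` followed by `B` yields two characters whether `Ä` is
stored as one code point or as `A` plus U+0308; only the boundary offsets
differ between the two representations.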

### Word Boundary

The word-boundary iterator locates the boundaries of words, for purposes such as
double-click selection or "Find whole words" operations.

Word boundaries are identified according to the rules in
<https://www.unicode.org/reports/tr29/#Word_Boundaries>, supplemented by a word
dictionary for text in Chinese, Japanese, Thai or Khmer. The rules used for
locating word breaks take into account the alphabets and conventions used by
different languages.

Here's an example of a sentence, showing the boundary locations that will be
identified by a word break iterator:

> :point_right: **Note**: TODO: An example needs to be added here.

### Line-break Boundary

The line-break iterator locates positions that would be appropriate points to
wrap lines when displaying the text. The boundary rules are defined here:
<https://www.unicode.org/reports/tr14/>

This example shows the differences in the break locations produced by word and
line break iterators:

> :point_right: **Note**: TODO: An example needs to be added here.

### Sentence Boundary

A sentence-break iterator locates sentence boundaries according to the rules
defined here: <https://www.unicode.org/reports/tr29/#Sentence_Boundaries>

## Dictionary-Based BreakIterator

Some languages are written without spaces, so word and line breaking require
more than rules over character sequences. ICU provides dictionary support for
word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese.

Use of the dictionaries is automatic when text in one of the dictionary
languages is encountered. There is no separate API, and no extra programming
steps are required of applications making use of the dictionaries.

## Usage

To locate boundaries in a document, create a `BreakIterator` using the
`BreakIterator::create***Instance` family of methods in C++, where "`***`" is
`Character`, `Word`, `Line` or `Sentence`, depending on the type of iterator
wanted, or using the `ubrk_open()` function in C. These factory methods also
take a parameter that specifies the locale for the language of the text to be
processed.

The behavior of the `BreakIterator` obtained may be specialized in some way for
the specified locale. For most locales the default break iterator behavior is
used.

Applications also may register customized `BreakIterator`s for use in specific
locales. Once such a break iterator has been registered, any requests for break
iterators for that locale will return copies of the registered break iterator.

ICU may cache service instances. Therefore, registration should be done during
startup, before opening services by locale ID.

In the general usage model, applications use the following basic steps to
analyze a piece of text for boundaries:

1. Create a `BreakIterator` with the desired behavior.

2. Use the `setText()` method to set the iterator to analyze a particular piece
   of text.

3. Locate the desired boundaries using the appropriate combination of `first()`,
   `last()`, `next()`, `previous()`, `preceding()`, and `following()` methods.

The `setText()` method can be called more than once, allowing reuse of a
`BreakIterator` on new pieces of text. Because the creation of a `BreakIterator` can
be relatively time-consuming, it makes good sense to reuse them when practical.

The iterator always points to a boundary position between two characters. The
numerical value of the position, as returned by `current()`, is the zero-based
index of the character following the boundary. Thus a position of zero
represents a boundary preceding the first character of the text, and a position
of one represents a boundary between the first and second characters.

The `first()` and `last()` methods reset the iterator's current position to the
beginning or end of the text (the beginning and the end are always considered
boundaries). The `next()` and `previous()` methods advance the iterator one boundary
forward or backward from the current position. If `next()` or `previous()` runs
off the end or the beginning of the text, it returns `BreakIterator::DONE`. The
`current()` method returns the current position.

The `following()` and `preceding()` methods are used for random access, to move the
iterator to an arbitrary position within the text. Since a `BreakIterator` always
points to a boundary position, the `following()` and `preceding()` methods will
never set the iterator to point to the position specified by the caller (even if
it is, in fact, a boundary position). Instead, they set the iterator to the
nearest boundary position after (`following()`) or before (`preceding()`) the
specified position.
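
The "strictly after" and "strictly before" behavior of `following()` and
`preceding()` can be modeled over a plain sorted list of boundary positions.
The sketch below is a standalone illustration, not ICU's implementation; the
names `kDone`, `followingBoundary` and `precedingBoundary` are hypothetical,
and `kDone` is assumed to mirror `BreakIterator::DONE` (-1).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for BreakIterator::DONE.
constexpr int32_t kDone = -1;

// Models following(): the first boundary strictly greater than `offset`,
// or kDone when no such boundary exists.
int32_t followingBoundary(const std::vector<int32_t>& boundaries, int32_t offset) {
    assert(std::is_sorted(boundaries.begin(), boundaries.end()));
    auto it = std::upper_bound(boundaries.begin(), boundaries.end(), offset);
    return it == boundaries.end() ? kDone : *it;
}

// Models preceding(): the last boundary strictly less than `offset`,
// or kDone when no such boundary exists.
int32_t precedingBoundary(const std::vector<int32_t>& boundaries, int32_t offset) {
    assert(std::is_sorted(boundaries.begin(), boundaries.end()));
    auto it = std::lower_bound(boundaries.begin(), boundaries.end(), offset);
    return it == boundaries.begin() ? kDone : *(it - 1);
}
```

With boundaries at 0, 5, 6 and 11 (the word boundaries of a string such as
"Hello world"), asking for the boundary following 5 gives 6 and the boundary
preceding 5 gives 0, even though 5 is itself a boundary.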

`isBoundary()` returns true if the specified position is a boundary.

### Thread Safety

`BreakIterator`s are not thread safe. This is inherent in their design: break
iterators are stateful, holding a reference to and position in the text, meaning
that a single instance cannot operate in parallel on multiple texts.

For concurrent break iteration, each thread must use its own break iterator.
These can be obtained by creating separate break iterators of the desired type,
or by initially creating a main break iterator and then creating a clone for
each thread.
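
The clone-per-thread pattern can be sketched as follows. `SpaceBreaker` is a
hypothetical stand-in for a stateful break iterator (it simply breaks after
spaces) so that the example builds without ICU; with ICU, each worker would
obtain its own iterator from the main one in the same way.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stateful iterator: like a break iterator, it holds a reference
// to the text and a current position, so one instance must not be shared.
class SpaceBreaker {
public:
    void setText(const std::string& text) { text_ = &text; pos_ = 0; }
    // Moves past the next space (or to the end); returns -1 when exhausted.
    int next() {
        assert(text_ != nullptr);
        if (pos_ >= text_->size()) return -1;
        const std::size_t sp = text_->find(' ', pos_);
        pos_ = (sp == std::string::npos) ? text_->size() : sp + 1;
        return static_cast<int>(pos_);
    }
    SpaceBreaker clone() const { return *this; }
private:
    const std::string* text_ = nullptr;
    std::size_t pos_ = 0;
};

// Counts space-delimited chunks in each text, using one thread and one
// private clone of the main iterator per text.
std::vector<int> countChunksConcurrently(const SpaceBreaker& mainIterator,
                                         const std::vector<std::string>& texts) {
    std::vector<int> counts(texts.size(), 0);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < texts.size(); ++i) {
        workers.emplace_back([&mainIterator, &texts, &counts, i] {
            SpaceBreaker mine = mainIterator.clone();  // never shared
            mine.setText(texts[i]);
            while (mine.next() != -1) ++counts[i];
        });
    }
    for (auto& w : workers) w.join();
    return counts;
}
```

The main iterator is only read (cloned) by the workers; all mutation happens on
the per-thread copies.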

### Line Breaking Strictness, a CSS Property

CSS has the concept of "[Line Breaking
Strictness](https://www.w3.org/TR/css-text-3/#line-break-property)". This
property specifies the strictness of line-breaking rules applied within an
element: especially how wrapping interacts with punctuation and symbols. ICU
line break iterators can choose a strictness using locale tags:

| Locale       | Behavior    |
| ------------ | ----------- |
| `en@lb=strict` <br/> `ja@lb=strict`  | Breaks text using the most stringent set of line-breaking rules. |
| `en@lb=normal` <br/> `ja@lb=normal`  | Breaks text using the most common set of line-breaking rules. |
| `en@lb=loose`  <br/> `ja@lb=loose`   | Breaks text using the least restrictive set of line-breaking rules. Typically used for short lines, such as in newspapers. |
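
Locale IDs such as those in the table can also be assembled programmatically.
The helper below is a hypothetical convenience function, not part of ICU; it
only relies on the locale ID keyword syntax, in which the first keyword is
introduced by `@` and further keywords are separated by `;`.

```cpp
#include <cassert>
#include <string>

// Appends "key=value" to a locale ID: "@" introduces the first keyword,
// ";" separates any further ones.
std::string withKeyword(const std::string& localeId,
                        const std::string& key,
                        const std::string& value) {
    assert(!key.empty() && !value.empty());
    const char separator = (localeId.find('@') == std::string::npos) ? '@' : ';';
    return localeId + separator + key + "=" + value;
}
```

For example, `withKeyword("ja", "lb", "strict")` produces `ja@lb=strict`. The
resulting string can then be passed to `ubrk_open()` as the locale argument, or
turned into a `Locale` with `Locale::createFromName()` before calling
`BreakIterator::createLineInstance()`.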

### Sentence Break Filters

Sentence breaking can return false positives (indications that a sentence ends
at an incorrect position) in the presence of abbreviations. For example,
consider the sentence

> In the meantime Mr. Weston arrived with his small ship.

The default sentence break iterator reports a false boundary following the "Mr."

ICU includes lists of common abbreviations that can be used to filter out these
false sentence boundaries. Filtering is enabled by the presence of
the `ss` locale tag when creating the break iterator.

| Locale           | Behavior                                                |
| ---------------- | ------------------------------------------------------- |
| `en`             |  No filtering.                                          |
| `en@ss=standard` |  Filter based on common English language abbreviations. |
| `es@ss=standard` |  Filter with common Spanish abbreviations.              |

Abbreviation lists are available (as of ICU 64) for English, German, Spanish,
French, Italian and Portuguese.
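
The idea behind abbreviation filtering can be shown with a deliberately naive
standalone sketch. It is not how ICU implements the `ss` filter and does not
follow UAX #29: it assumes a sentence may end after a period followed by a
space, and it suppresses that candidate when the word ending at the period is
in a caller-supplied abbreviation list. The name `sentenceBreaks` is
hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Returns the offsets at which a new sentence starts, plus the end of text.
std::vector<std::size_t> sentenceBreaks(const std::string& text,
                                        const std::set<std::string>& abbreviations) {
    std::vector<std::size_t> breaks;
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        if (text[i] != '.' || text[i + 1] != ' ') continue;
        // Find the word that ends with this period.
        std::size_t wordStart = text.rfind(' ', i);
        wordStart = (wordStart == std::string::npos) ? 0 : wordStart + 1;
        assert(wordStart <= i);
        const std::string word = text.substr(wordStart, i - wordStart + 1);
        if (abbreviations.count(word) == 0) {
            breaks.push_back(i + 2);  // the boundary falls after the space
        }
    }
    breaks.push_back(text.size());  // the end of text is always a boundary
    return breaks;
}
```

With an empty abbreviation list, the example sentence above gets a break right
before "Weston"; with `Mr.` in the list, that break is suppressed and only the
end of the text remains.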

## Accuracy

ICU's break iterators are based on the default boundary rules described in the
Unicode Standard Annexes [14](https://www.unicode.org/reports/tr14/) and
[29](https://www.unicode.org/reports/tr29/). These are relatively
simple boundary rules that can be implemented efficiently, and are sufficient
for many purposes and languages. However, some languages and applications will
require a more sophisticated linguistic analysis of the text in order to find
boundaries with good accuracy. Such an analysis is not directly available from
ICU at this time.

Break iterators based on custom, user-supplied boundary rules can be created and
used by applications with requirements that are not met by the standard default
boundary rules.

## BreakIterator Boundary Analysis Examples

### Print out all the word-boundary positions in a UnicodeString

**In C++:**

```c++
#include <stdio.h>
#include <unicode/brkiter.h>

using namespace icu;

void listWordBoundaries(const UnicodeString& s) {
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
    if (U_FAILURE(status)) {
        delete bi;
        return;
    }
    bi->setText(s);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d\n", p);
        p = bi->next();
    }
    delete bi;
}
```

**In C:**

```c
#include <stdio.h>
#include <unicode/ubrk.h>

void listWordBoundaries(const UChar* s, int32_t len) {
    UBreakIterator* bi;
    int32_t p;
    UErrorCode err = U_ZERO_ERROR;
    /* A NULL locale is used here in place of a locale name. */
    bi = ubrk_open(UBRK_WORD, NULL, s, len, &err);
    if (U_FAILURE(err)) return;
    p = ubrk_first(bi);
    while (p != UBRK_DONE) {
        printf("Boundary at position %d\n", p);
        p = ubrk_next(bi);
    }
    ubrk_close(bi);
}
```

### Get the boundaries of the word that contains a double-click position

**In C++:**

```c++
void wordContaining(BreakIterator& wordBrk,
        int32_t idx,
        const UnicodeString& s,
        int32_t& start,
        int32_t& end) {
    // This function assumes that an appropriate BreakIterator is stored
    // in an object or a global variable somewhere. When possible,
    // programmers should avoid having the create() and delete calls in
    // a function of this nature.
    if (s.isEmpty())
        return;
    wordBrk.setText(s);
    start = wordBrk.preceding(idx + 1);
    end = wordBrk.next();
    // NOTE: For this and similar operations, use preceding() and next()
    // as shown here, not following() and previous(). preceding() is
    // faster than following(), and next() is faster than previous().
    // NOTE: By using preceding(idx + 1) above, we're adopting the convention
    // that if the double-click comes right on top of a word boundary, it
    // selects the word that _begins_ on that boundary (preceding(idx) would
    // instead select the word that _ends_ on that boundary).
}
```

**In C:**

```c
void wordContaining(UBreakIterator* wordBrk,
    int32_t idx,
    const UChar* s,
    int32_t sLen,
    int32_t* start,
    int32_t* end,
    UErrorCode* err) {
    /* Follow the ICU convention: do nothing if an error is already set. */
    if (err == NULL || U_FAILURE(*err)) {
        return;
    }
    if (wordBrk == NULL || s == NULL || start == NULL || end == NULL) {
        *err = U_ILLEGAL_ARGUMENT_ERROR;
        return;
    }
    ubrk_setText(wordBrk, s, sLen, err);
    if (U_SUCCESS(*err)) {
        *start = ubrk_preceding(wordBrk, idx + 1);
        *end = ubrk_next(wordBrk);
    }
}
```

### Check for Whole Words

Use the following to check if a range of text is a "whole word":

**In C++:**

```c++
UBool isWholeWord(BreakIterator& wordBrk,
    const UnicodeString& s,
    int32_t start,
    int32_t end) {
    if (s.isEmpty())
        return false;
    wordBrk.setText(s);
    // The range is a whole word only if both ends fall on word boundaries.
    if (!wordBrk.isBoundary(start))
        return false;
    return wordBrk.isBoundary(end);
}
```

**In C:**

```c
UBool isWholeWord(UBreakIterator* wordBrk,
    const UChar* s,
    int32_t sLen,
    int32_t start,
    int32_t end,
    UErrorCode* err) {
    UBool result = false;
    if (wordBrk == NULL || s == NULL) {
        *err = U_ILLEGAL_ARGUMENT_ERROR;
        return false;
    }
    ubrk_setText(wordBrk, s, sLen, err);
    if (U_SUCCESS(*err)) {
        result = ubrk_isBoundary(wordBrk, start) && ubrk_isBoundary(wordBrk, end);
    }
    return result;
}
```

### Count the words in a document (C++ only)

```c++
int32_t countWords(RuleBasedBreakIterator& bi, const UnicodeString& s) {
    bi.setText(s);
    bi.first();
    int32_t count = 0;
    // Each call to next() moves to the following boundary; getRuleStatus()
    // then describes the text between the previous boundary and this one.
    while (bi.next() != BreakIterator::DONE) {
        int32_t breakType = bi.getRuleStatus();
        if (breakType != UBRK_WORD_NONE) {
            // Exclude spaces, punctuation, and the like.
            // A status value of UBRK_WORD_NONE indicates that the text
            // preceding the boundary is not a word or number.
            ++count;
        }
    }
    return count;
}
```

The function `getRuleStatus()` returns an enum giving additional information on
the text preceding the last break position found. Using this value, it is
possible to distinguish between numbers, words, words containing kana
characters, words containing ideographic characters, and non-word characters,
such as spaces or punctuation. The sample uses the break status value to filter
out, and not count, boundaries associated with non-word characters.
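
The categories listed above correspond to numeric ranges of the status value.
The standalone sketch below classifies a status value by range; the constants
are assumed to mirror the `UWordBreak` enum in ICU's `unicode/ubrk.h` (each
category spans a half-open range up to its `_LIMIT` value), and the function
name `describeWordStatus` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Upper limits of each category, mirroring UBRK_WORD_NONE_LIMIT,
// UBRK_WORD_NUMBER_LIMIT, UBRK_WORD_LETTER_LIMIT, UBRK_WORD_KANA_LIMIT
// and UBRK_WORD_IDEO_LIMIT.
constexpr int32_t kWordNoneLimit = 100;
constexpr int32_t kWordNumberLimit = 200;
constexpr int32_t kWordLetterLimit = 300;
constexpr int32_t kWordKanaLimit = 400;
constexpr int32_t kWordIdeoLimit = 500;

// Maps a word-break rule status value to a human-readable category.
std::string describeWordStatus(int32_t status) {
    assert(status >= 0);
    if (status < kWordNoneLimit) return "none";      // spaces, punctuation
    if (status < kWordNumberLimit) return "number";
    if (status < kWordLetterLimit) return "letter";
    if (status < kWordKanaLimit) return "kana";
    if (status < kWordIdeoLimit) return "ideographic";
    return "unknown";
}
```

Testing against ranges rather than single values keeps such code working with
custom rule sets that return other values within a category's range.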

### Word-wrap a document (C++ only)

The sample function below wraps a paragraph so that each line is less than or
equal to 72 characters. The function fills in an array passed in by the caller
with the starting offsets of each line in the document. It also fills in a
second array to track how many trailing white space characters there are in
each line. For simplicity, it is assumed that an outside process has already
broken the document into paragraphs. That is, it is assumed that every string
the function is passed has a single newline, at the end only.

```c++
#include <unicode/brkiter.h>
#include <unicode/uchar.h>

using namespace icu;

int32_t wrapParagraph(const UnicodeString& s,
                   const Locale& locale,
                   int32_t lineStarts[],
                   int32_t trailingwhitespace[],
                   int32_t maxLines,
                   UErrorCode &status) {

    int32_t        numLines = 0;
    int32_t        p, q;
    const int32_t MAX_CHARS_PER_LINE = 72;
    UChar          c;

    BreakIterator *bi = BreakIterator::createLineInstance(locale, status);
    if (U_FAILURE(status)) {
        delete bi;
        return 0;
    }
    bi->setText(s);

    p = 0;
    while (p < s.length()) {
        // jump ahead in the paragraph by the maximum number of
        // characters that will fit
        q = p + MAX_CHARS_PER_LINE;

        // if this puts us on a white space character, a control character
        // (which includes newlines), or a non-spacing mark, seek forward
        // and stop on the next character that is not any of these things
        // (since none of these characters will be visible at the end of a
        // line, we can ignore them for the purposes of figuring out how
        // many characters will fit on the line)
        while (q < s.length()
               && (u_isspace(s[q])
                   || u_charType(s[q]) == U_CONTROL_CHAR
                   || u_charType(s[q]) == U_NON_SPACING_MARK)) {
            ++q;
        }

        // then locate the last legal line-break decision at or before
        // the current position ("at or before" is what causes the "+ 1")
        q = bi->preceding(q + 1);

        // if this causes us to wind back to (or before) where we started,
        // then the line has no legal line-break positions. Record the
        // start of this line and break it at the maximum number of
        // characters
        if (q <= p) {
            lineStarts[numLines] = p;
            trailingwhitespace[numLines] = 0;
            ++numLines;
            p += MAX_CHARS_PER_LINE;
        }
        // otherwise, we got a good line-break position. Record the start of this
        // line (p) and then seek back from the end of this line (q) until you find
        // a non-white space character (same criteria as above) and
        // record the number of white space characters at the end of the
        // line in the other results array
        else {
            lineStarts[numLines] = p;
            int32_t nextLineStart = q;

            for (q--; q > p; q--) {
                c = s[q];
                if (!(u_isspace(c)
                       || u_charType(c) == U_CONTROL_CHAR
                       || u_charType(c) == U_NON_SPACING_MARK)) {
                    break;
                }
            }
            trailingwhitespace[numLines] = nextLineStart - q - 1;
            p = nextLineStart;
            ++numLines;
        }
        if (numLines >= maxLines) {
            break;
        }
    }
    delete bi;
    return numLines;
}
```

Most text editors would not break lines based on the number of characters on a
line. Even with a monospaced font, there are still many Unicode characters that
are not displayed and therefore should be filtered out of the calculation. With
a proportional font, character widths are added up until a maximum line width is
exceeded or an end of the paragraph marker is reached.

Trailing white space does not need to be counted in the line-width measurement
because it does not need to be displayed at the end of a line. The sample code
above returns an array of trailing white space values because an external
rendering process needs to be able to measure the length of the line (without
the trailing white space) to justify the lines. For example, if the text is
right-justified, the invisible white space would be drawn outside the margin.
The line would actually end with the last visible character.

In either case, the basic principle is to jump ahead in the text to the location
where the line would break (without taking word breaks into account). Then, move
backwards using the `preceding()` method to find the last legal breaking position
before that location. Iterating straight through the text with the `next()` method
will generally be slower.

## ICU BreakIterator Data Files

The source code for the ICU break rules for the standard boundary types is
located in the directory
[icu4c/source/data/brkitr/rules](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules).
These files will be built, and the corresponding binary state tables
incorporated into ICU's data, by the standard ICU4C build process.

The dictionary word lists used by word break, and for some languages, line break,
are in
[icu4c/source/data/brkitr/dictionaries](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/dictionaries).

The same data is used by both ICU4C and ICU4J. In the normal ICU build process,
the source data is processed into a binary form using ICU4C, and the resulting
binary tables are incorporated into ICU4J.