---
layout: default
title: CharacterIterator
nav_order: 3
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# CharacterIterator Class

## Overview

CharacterIterator is the abstract base class that defines a protocol for
accessing characters in a text-storage object. This class has methods for
iterating forward and backward over Unicode characters to return either the
individual Unicode characters or their corresponding index values.

Using CharacterIterator, ICU can iterate over text independently of how the
text is stored. The text can be stored locally or remotely in a string, file,
database, or other storage. The CharacterIterator methods make the text appear
as if it is local.

The CharacterIterator keeps track of its current position and index in the text
and can do the following:

1.  Move forward or backward one Unicode character at a time

2.  Jump to a new location using absolute or relative positioning

3.  Move to the beginning or end of its range

4.  Return a character or the index to a character

The text under iteration can be restricted to a sub-range of characters, can be
a large block of text that is iterated as a whole, or can be broken into
smaller blocks for the purpose of iteration.

> :point_right: **Note**: *CharacterIterator is different from
[Normalizer](../transforms/normalization/index) in that CharacterIterator
walks through the Unicode characters without interpretation.*

Prior to ICU release 1.6, the CharacterIterator class allowed access to a single
UChar at a time and did not support variable-width encoding. Access limited to
single UChars makes it difficult to support supplementary characters in UTF-16.
Beginning with ICU release 1.6, the CharacterIterator class efficiently
supports UTF-16 and provides new APIs that return UTF-32 values (code points).
The API names differ because the UTF-32 APIs include "32" within their naming
structure. For example, CharacterIterator::current() returns a code unit and
CharacterIterator::current32() returns a code point.

## Base class inherited by CharacterIterator

The class
[ForwardCharacterIterator](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classForwardCharacterIterator.html)
is a superclass of the CharacterIterator class. This superclass provides methods
for forward iteration only, for both UTF-16 and UTF-32 access, and is based on
an efficient forward iteration mechanism. In some situations, where you need to
iterate over text that does not allow random access, the
ForwardCharacterIterator superclass is the most efficient method. For example,
this is the case when iterating over the output of a character converter with the
[ucnv_getNextUChar()](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html)
function.

## Subclasses of CharacterIterator provided by ICU

ICU provides the following concrete subclasses of the CharacterIterator class:

1.  [UCharCharacterIterator](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classUCharCharacterIterator.html)
    subclass iterates over a `UChar[]` array.

2.  [StringCharacterIterator](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classStringCharacterIterator.html)
    subclass extends from `UCharCharacterIterator` and iterates over the contents
    of a `UnicodeString`.

## Usage

To use the methods specified in the CharacterIterator class, do one of the
following:

1.  Make a subclass that inherits from the CharacterIterator class

2.  Use the StringCharacterIterator subclass

3.  Use the UCharCharacterIterator subclass

A CharacterIterator object keeps track of its current position within the text
that is iterated over. The CharacterIterator class uses an object similar to a
cursor that gets initialized to the beginning of the text and advances according
to the operations that are used on the object. The current index can move
between two positions (a start and a limit) that are set with the text. The
limit position is one past the position of the last UChar that is used.

### Forward iteration

For efficiency, ICU can iterate over text using post-increment semantics, or
Forward Iteration. Forward Iteration is an access method that reads the
character at the current index position and then moves the index forward. It
returns the character read and leaves the index just after that character. ICU
can use nextPostInc() or next32PostInc() calls with hasNext() to perform Forward
Iteration. These calls are the only character access methods provided by the
ForwardCharacterIterator. An iteration loop can be started with the
setToStart(), firstPostInc(), or first32PostInc() calls. (The setToStart() call
is implied after instantiating the iterator or setting the text.)

The less efficient forward iteration mechanism, which is available for
compatibility with Java™, provides pre-increment semantics. With these methods,
the current character is skipped, and then the following character is read and
returned. This is less efficient for a variable-width encoding because the
width of each character is determined twice: once to read it and once to skip
it the next time ICU calls the method. The methods used for this kind of
iteration are the next() or next32() calls. An iteration loop must start with
first() or first32() calls to get the first character.

### Backward iteration

Backward Iteration has pre-decrement semantics, which are the exact opposite of
the post-increment Forward Iteration. The iterator reads the character that
precedes the current index, returns it, and leaves the index at the beginning
of that character. The methods used for Backward Iteration are the
previous() or previous32() calls with the hasPrevious() call. An iteration loop
can be started with setToEnd(), last(), or last32() calls.

### Direct index manipulation

The index can be set and moved directly without iteration to start iterating at
an arbitrary position, skip some characters, or reset the index to an earlier
position. It is possible to set the index to one past the last code unit of the
text in order to start backward iteration.

The setIndex() and setIndex32() calls set the index to a new position and return
the character at that new position. The setIndex32() call ensures that the new
position is at the beginning of the character (on its first code unit). Since
the character at the new position is returned, these functions can be used for
both pre-increment and post-increment iteration semantics.
Similarly, the current() and current32() calls return the character at the
current index without modifying the index. The current32() call retrieves the
complete character whether the index is on the first code unit or not.

The index and the iteration boundaries can be retrieved using separate
functions. The following relation always holds: startIndex() <= getIndex() <=
endIndex().

Without accessing the text, the setToStart() and setToEnd() calls set the index
to the start or to the end of the text. Therefore, these calls are an efficient
way to start a forward (post-increment) or backward iteration.

The most general functions for manipulating the index position are the move()
and move32() calls. These calls allow you to move the index forward or backward
relative to its current position, to the start of the iteration range, or to
the end of the iteration range. The move() and move32() calls return the new
index rather than a character and are best used for skipping part of the text.
The move32() call skips complete code points, like the next32PostInc() call and
other UChar32-access methods.

### Access to the iteration text

The CharacterIterator class provides the following access methods for the entire
text under iteration:

1.  getText() copies the text into a UnicodeString

2.  getLength() returns just the length of the text

This text (and the length) may include more than the actual iteration area
because the start and end indexes may not be the start and end of the entire
text. The text and the iteration range are set in the implementing subclasses.

## Additional Sample Code

C/C++: See
[icu4c/source/samples/citer/](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/citer/)
in the ICU source distribution for code samples.