---
layout: default
title: ICU Services
nav_order: 4
parent: ICU
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# ICU Services
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview of the ICU Services

ICU enables you to write language-independent C/C++ and Java code that is used on
separate, localized resources to get language-specific results. ICU supports
many features, including language-sensitive text, dates, time, numbers,
currency, messages, sorting, and searching. ICU provides language-specific results
for a broad range of languages.

### Strings, Properties and CharacterIterator

ICU provides basic Unicode support for the following:

* [Unicode strings](strings/index.md)

  ICU includes type definitions for UTF-16 strings and code points. It also
  contains many C `u_string` functions and the C++ `UnicodeString` class with many
  additional string functions.

* [Unicode properties](strings/properties.md)

  ICU includes the C definitions and functions found in `uchar.h` as well as
  some macros found in `utf.h`. It also includes the C++ Unicode class.

* [Unicode string iteration](strings/characteriterator.md)

  In C, ICU uses the macros in `utf.h` for the iteration of strings. In C++, ICU
  uses the `CharacterIterator` class and its subclasses.

### Conversion Basics

A converter is used to transform text from one encoding type to another. In the
case of Unicode, ICU transforms text from one encoding codepage to Unicode and
back. An encoding is a mapping from a given character set definition to the
actual bits used to represent the data.

### Locale and Resources

The ICU package contains the locale and resource bundles as well as the classes
that implement them.
Also, the ICU package contains the locale data (plain text
resource bundles) and provides APIs to access and make use of that data in
various services. Users need to understand these terms and the relationship
between them.

A locale identifies a group of users who have similar cultural and linguistic
expectations for how their computers interact with them and process data. This
is an abstract concept that is typically expressed by one of the following:

* A locale ID specifies a language and region, enabling the software to support
  culturally and linguistically appropriate information for each user.

* A locale object represents a specific geographical, political, or cultural
  region. As a programmatic expression of locale IDs, ICU provides the C++
  `Locale` class. In C, Application Programming Interfaces (APIs) use plain C
  strings for locale IDs.

ICU stores locale-specific data in resource bundles, which provide a general
mechanism to access strings and other objects for ICU services to perform
according to locale conventions. ICU contains data for its services to support
many locales. Resource bundles contain the locale data of applications that use
ICU. In C++, the `ResourceBundle` class provides access to this data. In C, this
feature is provided by the `ures_` interface.

In addition to storing system-level data in ICU's resource bundles, applications
typically also need to use resource bundles of their own to store
locale-dependent application data. ICU provides the generic resource bundle APIs
to access these bundles and also provides the tools to build them.

> :point_right: **Note**: *Display strings, which are displayed to a user of a program, are bundled in a
separate file instead of being embedded in the lines of the program.*

### Locales and Services

The interaction between locales and services is fundamental to ICU.
Please refer to [Locales and Services](./locale/index#locales-and-services).

### Transliteration

Transliteration was originally designed to convert characters from one script to
another (for example, from Greek to Latin, or Japanese Katakana to Latin). Now,
transliteration is a more flexible mechanism that has pre-built transformations
for case conversions, normalization conversions, the removal of given
characters, and also for a variety of language and script transliterations.
Transliterations can be chained together to perform a series of operations and
each step of the process can use a `UnicodeSet` to restrict the characters that
are affected. There are two basic types of transliterators:

Most natural language transliterators (such as Greek-Latin) are written as
rule-based transliterators.

Transliterators can be written as text files using a
simple language that is similar to regular expression syntax.

### `Date` and `Time` Classes

Date and time routines work with date and time values expressed in
milliseconds since January 1, 1970 (0:00:00.000 UTC). Points in time before then
are represented as negative numbers.

ICU provides the following [classes](datetime/index.md) to support calendars and
time zones:

* [`Calendar`](datetime/calendar/index#calendar)

  The abstract superclass for extracting calendar-related attributes from a `Date` value.

* [`GregorianCalendar`](datetime/calendar/index#gregoriancalendar)

  A concrete class for representing a Gregorian calendar.

* [`TimeZone`](datetime/timezone/index.md)

  An abstract superclass for representing a time zone.

* [`SimpleTimeZone`](datetime/timezone/index.md)

  A concrete class for representing a time zone for use with a Gregorian calendar.
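
The millisecond representation and the calendar classes listed above can be sketched as follows. This is a minimal illustration only: it assumes the JDK's `java.util.GregorianCalendar` and `java.util.TimeZone` as stand-ins for the ICU classes of the same names, and the class `EpochMillisSketch` and its helper `fieldsOf` are hypothetical names introduced for this example.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch class; it uses JDK classes, not the ICU API itself.
class EpochMillisSketch {
    /** Returns {year, month (1-based), day} in UTC for a millisecond count since the 1970 epoch. */
    static int[] fieldsOf(long epochMillis) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.ROOT);
        cal.setTimeInMillis(epochMillis);
        return new int[] {
            cal.get(Calendar.YEAR),
            cal.get(Calendar.MONTH) + 1, // Calendar months are zero-based
            cal.get(Calendar.DAY_OF_MONTH)
        };
    }

    public static void main(String[] args) {
        int[] epoch = fieldsOf(0L);   // the epoch itself
        int[] before = fieldsOf(-1L); // one millisecond before the epoch
        System.out.println(epoch[0] + "-" + epoch[1] + "-" + epoch[2]);
        System.out.println(before[0] + "-" + before[1] + "-" + before[2]);
    }
}
```

A negative count simply lands before the epoch: one millisecond before zero falls on the last day of 1969 in UTC.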

> :point_right: **Note**: *The C APIs provide the same functionality as the C++ classes with the exception
of subclassing.*

### Format and Parse

Formatters translate between non-text data values and textual representations of
those values. The result of formatting is a string of text that represents the
internal value. A formatter can also parse a string and convert a textual
representation of some value (if it finds one it understands) back into its
internal representation. For example, when the formatter reads the characters
1, 0, and 3 followed by something other than a digit, it produces the value 103
in its internal binary representation.

A formatter takes a value and produces a user-readable string that represents
that value or takes a string and parses it to produce a value.

ICU provides the following areas and classes for general formatting, formatting
numbers, formatting dates and times, and formatting messages:

#### General Formatting

See [Formatting and Parsing Classes](format_parse/index#formatting-and-parsing-classes) for an introduction to the following:

* `Format`
* `FieldPosition`
* `ParsePosition`
* `Formattable`

#### Formatting Numbers

* [`NumberFormat`](format_parse/numbers/index#formatting-numbers)

  `NumberFormat` provides the basic fields and methods to format number objects
  and number primitives into localized strings and parse localized strings to
  number objects.

* [`DecimalFormat`](format_parse/numbers/index#decimalformat)

  `DecimalFormat` provides the methods used to format number objects and number
  primitives into localized strings and parse localized strings into number
  objects in base 10.

* [`DecimalFormatSymbols`](format_parse/numbers/index#decimalformatsymbols)

  `DecimalFormatSymbols` is a concrete class used by `DecimalFormat` to access
  localized number strings such as the grouping separators, the decimal
  separator, and the percent sign.

#### Formatting Dates and Times

* [`DateFormat`](format_parse/datetime/index.md)

  `DateFormat` provides the basic fields and methods for formatting date objects
  to localized strings and parsing date and time strings to date objects.

* [`SimpleDateFormat`](format_parse/datetime/index.md)

  `SimpleDateFormat` is a concrete class used to format date objects to
  localized strings and to parse date and time strings to date objects using a
  `GregorianCalendar`.

* [`DateFormatSymbols`](format_parse/datetime/index.md)

  `DateFormatSymbols` is a concrete class used to access localized date and time
  formatting strings, such as names of the months, days of the week, and the
  time zone.

#### Formatting Messages

* [`MessageFormat`](format_parse/messages/index.md)

  `MessageFormat` is a concrete class used to produce a language-specific user
  message that contains numbers, currency, percentages, date, time, and string
  variables.

* [`ChoiceFormat`](format_parse/messages/index.md)

  `ChoiceFormat` is a concrete class used to map strings to ranges of numbers
  and to handle plural words and name series in user messages.

> :point_right: **Note**: *The C APIs provide the same functionality as the C++ classes with the exception
of subclassing.*

### Searching and Sorting

Sorting and searching non-English text presents a number of challenges that many
English speakers are unaware of.
The primary source of difficulty is accents,
which have very different meanings in different languages, and sometimes even
within the same language:

* Many accented letters, such as the é in café, are treated as minor variants
  on the letter that is accented.

* Sometimes the accented form of a letter is treated as a distinct letter for
  the purposes of comparison. For example, Å in Danish is treated as a
  separate letter that sorts after Z, at the end of the alphabet.

* In some cases, an accented letter is treated as if it were two letters. In
  traditional German, for example, ä is compared as if it were ae.

Searching and sorting is done through collation using the `Collator` class and
its subclass `RuleBasedCollator`, together with the `CollationElementIterator`
class and the `CollationKey` object. Collation determines the proper sort
sequence for two or more natural language strings. It also can determine if two
strings are equivalent for the purpose of searching.

The `Collator` class and its subclass `RuleBasedCollator` perform locale-sensitive
string comparisons to create sorting and searching routines for natural language
text. `Collator` and `RuleBasedCollator` can distinguish between differences in
base characters (such as 'a' and 'b'), accent marks (such as
'ò' and 'ó'), and uppercase or lowercase properties (such as 'a' and 'A').

ICU provides the following collation classes for sorting and searching natural
language text according to locale-specific rules:

* [`Collator`](collation/architecture.md) is the abstract base class of all classes that compare strings.

* [`CollationElementIterator`](collation/architecture.md) is a concrete iterator class that provides an
  iterator for stepping through each character of a locale-specific string
  according to the rules of a specific collator object.

* [`RuleBasedCollator`](collation/architecture.md) is the only built-in
  implementation of the collator. It
  provides a sophisticated mechanism for comparing strings in a
  language-specific manner, and an interface that allows the user to
  specifically customize the sorting order.

* [`CollationKey`](collation/architecture.md) is an object that enables the fast sorting of strings by
  representing a string as a sort key under the rules of a specific collator
  object.

> :point_right: **Note**: *The C APIs provide the same functionality as the C++ classes with the exception
of subclassing.*

### Text Analysis

The `BreakIterator` services can be used for formatting and handling text;
locating the beginning and ending points of a word; counting words, sentences,
and paragraphs; and listing unique words. Specifically, locating linguistic
boundaries supports text operations such as the following:

* Display text on the screen and locate places in the text where the
  `BreakIterator` can perform word-wrapping to fit the text within the margins

* Locate the beginning and end of a word that the user has selected

* Count graphemes (or characters), words, sentences, or paragraphs

* Determine how far to move in the text store when the user hits an arrow key
  to move forward or backward one grapheme

* Make a list of all the unique words in a document

* Figure out whether or not a range of text contains only whole words

* Capitalize the first letter of each word

* Extract a particular unit from the text such as "find me the third grapheme
  in this document"

The `BreakIterator` services were designed and developed around an "iterator" or
"cursor" style of interface. The object points to a particular place in the
text. You can move the pointer forward or backward to search the text for
boundaries.
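
The cursor style of interface can be sketched as a loop that moves from one boundary to the next. This is a minimal illustration under stated assumptions: it uses the JDK's `java.text.BreakIterator` as a stand-in for ICU's class of the same name, and the class `WordBoundarySketch` and its method `words` are hypothetical names used only for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch class; it uses the JDK BreakIterator, not the ICU API itself.
class WordBoundarySketch {
    /** Collects the word-like segments of the text by walking from boundary to boundary. */
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Segments between boundaries also include spaces and punctuation;
            // keep only those that begin with a letter or digit.
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                result.add(text.substring(start, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.US));
    }
}
```

The same loop shape applies to the other boundary types; only the factory method that creates the iterator would change.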

The `BreakIterator` class makes it possible to iterate over user characters. A
`BreakIterator` can find the location of a character, word, sentence or potential
line-break boundary. This makes it possible for a software program to properly
select characters for text operations such as highlighting a character, cutting
a word, moving to the next sentence, or wrapping words at a line ending.
`BreakIterator` performs these operations in a locale-sensitive manner, meaning
that it recognizes text boundaries according to the particular locale ID.

ICU provides the following classes for iterating over locale-specific text:

* [`BreakIterator`](boundaryanalysis/index.md)

  The abstract base class that defines the operations for finding and getting
  the positions of logical breaks in a string of text: characters, words,
  sentences, and potential line breaks.

* [`CharacterIterator`](strings/characteriterator.md)

  The abstract base class for forward and backward iteration over a string of
  Unicode characters.

* [`StringCharacterIterator`](strings/index.md)

  A concrete class for forward and backward iteration over a string of Unicode
  characters. `StringCharacterIterator` inherits from `CharacterIterator`.

### Paragraph Layout

See [Paragraph Layout](./layoutengine/paragraph.md) for more details.

## Locale-Dependent Operations

Many of the ICU classes are locale-sensitive, meaning that you have to create a
separate instance for each locale.
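
As a minimal sketch of what locale sensitivity means in practice, the example below creates one number formatter per locale and formats the same value with each. It assumes the JDK's `java.text.NumberFormat` as a stand-in for the ICU class of the same name; the class `PerLocaleFormatSketch` and its method `format` are hypothetical names used only here.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical sketch class; it uses the JDK NumberFormat, not the ICU API itself.
class PerLocaleFormatSketch {
    /** Formats the value with a formatter created specifically for the given locale. */
    static String format(double value, Locale locale) {
        NumberFormat formatter = NumberFormat.getInstance(locale);
        return formatter.format(value);
    }

    public static void main(String[] args) {
        // The same value yields different text depending on the formatter's locale.
        System.out.println(format(1234.5, Locale.US));
        System.out.println(format(1234.5, Locale.GERMANY));
    }
}
```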

| C API | C++ Class | Description |
|-------|-----------|-------------|
| `ubrk_` | `BreakIterator` | The `BreakIterator` class implements methods to find the location of boundaries in the text. |
| `ucal_` | `Calendar` | The `Calendar` class is an abstract base class that converts between a `UDate` object and a set of integer fields such as `YEAR`, `MONTH`, `DAY`, `HOUR`, and so on. |
| `umsg.h` | `ChoiceFormat` | A `ChoiceFormat` class enables you to attach a format to a range of numbers. |
| `ucol_` | `CollationElementIterator` | The `CollationElementIterator` class is used as an iterator to walk through each character of an international string. |
| `ucol_` | `CollationKey` | The `Collator` class generates the collation keys. |
| `ucol_` | `Collator` | The `Collator` class performs locale-sensitive string comparison. |
| `udat_` | `DateFormat` | `DateFormat` is an abstract class for a family of classes. `DateFormat` converts dates and times from their internal representations to textual form and back in a language-independent manner. |
| `udat_` | `DateFormatSymbols` | `DateFormatSymbols` is a public class that encapsulates localized date and time formatting data. This information includes time zone information. |
| `unum_` | `DecimalFormatSymbols` | This class represents the set of symbols needed by `DecimalFormat` to format numbers. |
| `umsg.h` | `Format` | The `Format` class is the base class for all formats. |
| `ucal_` | `GregorianCalendar` | `GregorianCalendar` is a concrete class that provides the standard calendar used in many locations. |
| `uloc_` | `Locale` | A `Locale` object represents a specific geographical, political, or cultural region. |
| `umsg.h` | `MessageFormat` | `MessageFormat` provides a means to produce concatenated messages in a language-neutral way. |
| `unum_` | `NumberFormat` | `NumberFormat` is an abstract base class for all number formats. |
| `ures_` | `ResourceBundle` | `ResourceBundle` provides a means to access a collection of locale-specific information. |
| `ucol_` | `RuleBasedCollator` | The `RuleBasedCollator` provides the implementation of the `Collator` class using data-driven tables. |
| `udat_` | `SimpleDateFormat` | `SimpleDateFormat` is a concrete class used to format and parse dates in a language-independent way. |
| `ucal_` | `SimpleTimeZone` | `SimpleTimeZone` is a concrete subclass of `TimeZone` that represents a time zone for use with a Gregorian calendar. |
| `usearch_` | `StringSearch` | `StringSearch` provides a way to search text in a locale-sensitive manner. |
| `ucal_` | `TimeZone` | `TimeZone` represents a time zone offset, and also determines daylight saving time settings. |

## Locale-Independent Operations

The following ICU services can be used in all locales as they provide
locale-independent services and users do not need to specify a locale ID:

| C API | C++ Class | Description |
|-------|-----------|-------------|
| `ubidi_` | | `UBiDi` is used for implementing the Unicode BiDi algorithm. |
| `utf.h` | `CharacterIterator` | `CharacterIterator` is an abstract class that defines an API for iteration on text objects. It is an interface for forward and backward iteration and for the random access of a text object. Also, it provides backward compatibility to the Java and older ICU `CharacterIterator` classes. |
| n/a | `Formattable` | `Formattable` is a thin wrapper class that converts between the primitive numeric types (`double`, `long`, and so on) and the `UDate` and `UnicodeString` classes. `Formattable` objects can be passed to the `Format` class or its subclasses for formatting. |
| `unorm_` | `Normalizer` | `Normalizer` transforms Unicode text into an equivalent composed or decomposed form to allow for easier sorting and searching of text. |
| n/a | `ParsePosition` | `ParsePosition` is a simple class used by the `Format` class and its subclasses to keep track of the current position during parsing. |
| `uidna_` | | An implementation of the IDNA protocol as defined in RFC 3490. |
| `utf.h` | `StringCharacterIterator` | A concrete subclass of `CharacterIterator` that iterates over the characters (code units or code points) in a `UnicodeString`. |
| `utf.h` | `UCharCharacterIterator` | A concrete subclass of `CharacterIterator` that iterates over the characters (code units or code points) in a `UChar` array. |
| `uchar.h` | | The Unicode character properties API allows you to query the properties associated with individual Unicode character values. |
| `uregex_` | `RegexMatcher` | `RegexMatcher` is a regular expressions implementation. This allows you to perform string matching based upon a pattern. |
| `utrans_` | `Transliterator` | `Transliterator` is an abstract class that transliterates text from one format to another. The most common kind of transliterator is a script (or alphabet) transliterator. |
| `uset_` | `UnicodeSet` | Objects of the `UnicodeSet` class represent character classes used in regular expressions. These classes specify a subset of the set of all Unicode characters. This is a mutable set of Unicode characters. |
| `ustring.h` | `UnicodeString` | `UnicodeString` is a string class that stores Unicode characters directly. This class is a concrete implementation of the abstract class `Replaceable`. |
| `ushape.h` | | Provides operations to transform (shape) between Arabic characters and their presentation forms. |
| `ucnv_` | | The Unicode conversion API allows you to convert data written in one codepage/encoding to and from UTF-16. |
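
The conversion entry in the table above, which describes converting codepage data to and from UTF-16, can be illustrated with a short round trip. This sketch does not call ICU: it assumes the JDK's built-in charsets as a stand-in for the ICU converters, and the class `CodepageRoundTripSketch` with its methods `toUnicode` and `fromUnicode` are hypothetical names used only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch class; it uses JDK charsets, not the ICU conversion API.
class CodepageRoundTripSketch {
    /** Decodes bytes written in the given codepage into a UTF-16 Java string. */
    static String toUnicode(byte[] encoded, Charset codepage) {
        return new String(encoded, codepage);
    }

    /** Encodes a UTF-16 Java string into bytes of the given codepage. */
    static byte[] fromUnicode(String text, Charset codepage) {
        return text.getBytes(codepage);
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1: the é is the single byte 0xE9.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        String text = toUnicode(latin1, StandardCharsets.ISO_8859_1);
        // The same text needs two bytes for é when encoded as UTF-8.
        byte[] utf8 = fromUnicode(text, StandardCharsets.UTF_8);
        System.out.println(text.length() + " UTF-16 code units, " + utf8.length + " UTF-8 bytes");
    }
}
```

The bits differ between the two encodings while the decoded text stays the same, which is the sense in which an encoding maps a character set to the bits used to represent the data.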