---
layout: default
title: C/POSIX Migration
nav_order: 6
parent: ICU
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# C/POSIX Migration
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Migration from Standard C and POSIX APIs

The ISO C and POSIX standards define a number of APIs for string handling and
internationalization in C. They do not support Unicode well because they were
initially designed before Unicode/ISO 10646 were developed, and the POSIX APIs
are also problematic for other aspects of internationalization.

This chapter discusses the C/POSIX APIs and their problems, and shows which ICU
APIs to use instead.

> :point_right:  **Note**: *We use the term "POSIX" to mean the POSIX.1 standard (IEEE Std 1003.1), which
defines system interfaces and headers with relevance for string handling and
internationalization. The XPG3, XPG4, Single Unix Specification (SUS) and other
standards include POSIX.1 as a subset, adding other specifications that are
irrelevant for this topic.*

> :construction: This chapter is not complete yet – more POSIX APIs are expected to be discussed
in the future.

## Strings and Characters

### Character Sets and Encodings

#### ISO C

The ISO C standard provides two basic character types (`char` and `wchar_t`) and
defines strings as arrays of units of these types. The standard allows nearly
arbitrary character sets and encodings for characters and strings, which was
necessary when there was no single character set that worked everywhere.

For portable C programs, characters and strings are opaque, i.e., a program
cannot assume that any particular character is represented by any particular
code or sequence of codes. Programs use standard library functions to handle
characters and strings. Only a small set of characters (usually the set of
graphic characters available in US-ASCII) can be reliably accessed via
character and string literals.

#### Problems

1.  Many different encodings are used on each platform, making it difficult for
    multiple programs and libraries to process the same text.

2.  Programs often need to know the codes of special characters. For example,
    code that parses a filename needs to know how the path and file separators
    are encoded; this is commonly possible because filenames deliberately use
    US-ASCII characters, but any software that uses non-ASCII characters becomes
    platform-dependent. It is practically impossible to provide sophisticated
    text processing without knowledge of the character set, its string encoding,
    and other detailed features.

3.  The C/POSIX standards provide only a very limited set of useful functions
    for character and string handling; many of the functions that are provided
    do not work for non-trivial cases.

4.  While the size of the `char` type is in practice fixed to 8 bits in modern
    compilers, and its common encodings are reasonably well documented, the size
    of `wchar_t` varies between 8/16/32 bits depending on the compiler, and only
    a few of the string encodings used with it are documented.

5.  See also [What size wchar_t do I need for
    Unicode?](http://icu-project.org/docs/papers/unicode_wchar_t.html)

6.  A program based on this model must be recompiled for each platform. Usually,
    it must be recompiled for each supported language or family of languages.

7.  The ISO C standard basically requires, by how its standard functions are
    defined, that the data type for a single character code in a large character
    set is the same as the string base unit type (`wchar_t`). This has led to C
    standard library implementations using Unicode encodings which are either
    limited for single-character functions to only part of Unicode, or suffer
    from reduced interoperability with most Unicode-aware software.

#### ICU

ICU always processes Unicode text. Unicode covers all languages and allows safe
hard coding of character codes, in addition to providing many standard or
recommended algorithms and a lot of useful character property data. See the
chapters about [Unicode Basics](unicode.md) and [Strings](strings/index.md), among others.

ICU uses the 16-bit encoding form of Unicode (UTF-16) for processing, making it
fully interoperable with most Unicode-aware software. See [UTF-16 for
Processing](http://www.unicode.org/notes/tn12/). For ICU4J, this follows
naturally because the Java language and the JDK use UTF-16.

ICU uses and/or provides direct access to all of the [Unicode
properties](strings/properties.md), which provide a much finer-grained
classification of characters than [C/POSIX character
classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html).

In character and string literals in C/C++ source code, ICU uses only "invariant"
characters. They are the subset of graphic ASCII characters that are almost
always encoded with the same byte values on all systems. (One set of byte values
for ASCII-based systems, and another such set of byte values for EBCDIC
systems.) See
[`utypes.h`](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utypes.h)
for the set of "invariant" characters.

With the use of Unicode, the implementation of many of the Unicode standard
algorithms, and its cross-platform availability, ICU provides for consistent,
portable, and reliable text processing.

### Case Mappings

#### ISO C

The standard C functions `tolower()`, `toupper()`, etc. take and return one
character code each.

#### Problems

1.  This does not work for German, where the character "ß" (sharp s) uppercases
    to the two characters "SS". (It "expands".)

2.  It does not work for Greek, where the character "Σ" (capital sigma)
    lowercases to either "ς" (small final sigma) or "σ" (small sigma) depending
    on whether the capital sigma is the last letter in a word. (It is
    context-dependent.)

3.  It does not work for Lithuanian and Turkic languages where a "combining dot
    above" character may need to be removed in certain cases. (It "contracts"
    and is language- and context-dependent.)

4.  There are a number of other such cases.

5.  There are no standard functions for title-casing strings.

6.  There are no standard functions for case-folding strings. (Case-folding is
    used for case-insensitive comparisons; there are C/POSIX functions for
    direct, case-insensitive comparisons of pairs of strings. Case-folding is
    useful when one string is compared to many others, or as part of a chain of
    transformations of a string.)
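
The one-in, one-out shape of the standard functions can be seen in a minimal sketch like the following, which uses only `toupper()` from `<ctype.h>`. The helper name `upper_per_char` is hypothetical. Because each `char` is replaced by exactly one `char`, the string can never grow or shrink, so an expanding mapping such as "ß" to "SS" has no way to be expressed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: uppercases s in place, one char at a time, and
 * returns the string length, which is always the same as before the call. */
size_t upper_per_char(char *s) {
    size_t i;
    assert(s != NULL);
    for (i = 0; s[i] != '\0'; ++i) {
        /* toupper() maps one character code to one character code */
        s[i] = (char)toupper((unsigned char)s[i]);
    }
    return i;
}
```

The cast to `unsigned char` matters: passing a negative `char` value other than `EOF` to `toupper()` is undefined behavior.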

#### ICU

Case mappings are operations that take and return strings, in order to support
length changes and context dependencies. Unicode provides algorithms and data for
proper case mappings, and ICU provides APIs for them. (See the API references
for various string functions and for Transforms/Transliteration.)

### Character Classes

#### ISO C

The standard C functions `isalpha()`, `isdigit()`, etc. each take a character code
and return a boolean value indicating whether the character belongs to the current
locale's respective character class.

#### Problems

1.  Character classes are bound to locales, instead of providing consistent
    classifications for characters.

2.  The same character may have different classifications depending on the
    locale and the platform.

3.  There are very few POSIX character classes, and they are not well
    defined. For example, there is a class for punctuation characters but not
    one for symbols.

4.  For example, the dollar symbol (“$”) may or may not belong to the `punct`
    class depending on the locale, even on the same system.

5.  The standard allows at most two sets of decimal digits: The digits of the
    “portable character set” (i.e., those in the ASCII repertoire) and one more.
    Some implementations only recognize ASCII digits in the `isdigit()` function.
    However, there are many sets of decimal digits in a multilingual character
    set like Unicode.

6.  The POSIX standard assumes that each locale definition file carries the
    character class data for all relevant characters. With many locales using
    overlapping character repertoires, this can lead to a lot of duplication.
    For efficiency, many UTF-8 locales define character classes only for very
    few characters instead of for all of Unicode. For example, some `de_DE.utf-8`
    locales only define character classes for characters used in German, or for
    the repertoire of ISO 8859-1 – in other words, for only a tiny fraction of
    the representable Unicode repertoire. Processing of text using more than
    this repertoire is not possible with such an implementation.

7.  For more about the problems with POSIX character classes in a Unicode
    context see [Annex C: Compatibility Properties in Unicode
    Technical Standard #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/#Compatibility_Properties)
    and see the mailing list archives for the unicode list (on unicode.org). See
    also the ICU design document about [C/POSIX character
    classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html).

#### ICU

ICU provides locale-independent access to all [Unicode
properties](strings/properties.md) (except Unihan.txt properties), as well as to
the POSIX character classes, via functions defined in `uchar.h` and in ICU4J's
`UCharacter` class (see API references) as well as via `UnicodeSet`. The POSIX
character classes are implemented according to the recommendations in UTS #18.

The Unicode Character Database defines more than 70 character properties. Their
values are designed for the large character set as well as for real text
processing, and they are updated with each version of Unicode. The UCD is
available online, facilitating industry-wide consistency in the implementation
of Unicode properties.

## Formatting and Parsing

### Currency Formatting

#### POSIX

The `strfmon()` function is used to format monetary values. The default format and
the currency display symbol or display name are selected by the `LC_MONETARY`
locale ID. The number formatting can also be controlled with a formatting string
resembling what `printf()` uses.

#### Problems

1.  Selection of the currency via a locale ID is unreliable: Countries change
    currencies over time, and the locale data for a particular country may not
    be available. This results in using the wrong currency. For example, an
    application may assume that a country has switched from a previous currency
    to the Euro, but it may run on an OS that predates the switch.

2.  Using a single locale ID for the whole format makes it very difficult to
    format values for multiple currencies with the same number format (for
    example, for an exchange rate list or for showing the price of an item
    adjusted for several currencies). `strfmon()` allows the number format to be
    specified fully, but then the application cannot use a country's default
    number format.

3.  The set of formattable currencies is limited to those that are available via
    locale IDs on a particular system.

4.  There does not appear to be a function to parse currency values.

#### ICU

ICU number formatting APIs have separate, orthogonal settings for the number
format, which can be selected with a locale ID, and the currency, which is
specified with an ISO code. See the [Formatting
Numbers](format_parse/numbers/index.md) chapter for details.