---
layout: default
title: C/POSIX Migration
nav_order: 6
parent: ICU
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# C/POSIX Migration
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Migration from Standard C and POSIX APIs

The ISO C and POSIX standards define a number of APIs for string handling and
internationalization in C. They do not support Unicode well because they were
initially designed before Unicode/ISO 10646 were developed, and the POSIX APIs
are also problematic for other internationalization aspects.

This chapter discusses C/POSIX APIs with their problems, and shows which ICU
APIs to use instead.

> :point_right: **Note**: *We use the term "POSIX" to mean the POSIX.1 standard (IEEE Std 1003.1) which
defines system interfaces and headers with relevance for string handling and
internationalization. The XPG3, XPG4, Single Unix Specification (SUS) and other
standards include POSIX.1 as a subset, adding other specifications that are
irrelevant for this topic.*

> :construction: This chapter is not complete yet – more POSIX APIs are expected to be discussed
in the future.

## Strings and Characters

### Character Sets and Encodings

#### ISO C

The ISO C standard provides two basic character types (`char` and `wchar_t`) and
defines strings as arrays of units of these types. The standard allows nearly
arbitrary character and string character sets and encodings, which was necessary
when there was no single character set that worked everywhere.

For portable C programs, characters and strings are opaque, i.e., a program
cannot assume that any particular character is represented by any particular
code or sequence of codes. Programs use standard library functions to handle
characters and strings.
Only a small set of characters (usually the set of
graphic characters available in US-ASCII) can be reliably accessed via
character and string literals.

#### Problems

1. Many different encodings are used on each platform, making it difficult for
   multiple programs and libraries to process the same text.

2. Programs often need to know the codes of special characters. For example,
   code that parses a filename needs to know how the path and file separators
   are encoded; this is commonly possible because filenames deliberately use
   US-ASCII characters, but any software that uses non-ASCII characters becomes
   platform-dependent. It is practically impossible to provide sophisticated
   text processing without knowledge of the character set, its string encoding,
   and other detailed features.

3. The C/POSIX standards only provide a very limited set of useful functions
   for character and string handling; many functions that are provided do not
   work for non-trivial cases.

4. While the size of the `char` type is in practice fixed to 8 bits in modern
   compilers, and its common encodings are reasonably well documented, the size
   of `wchar_t` varies between 8/16/32 bits depending on the compiler, and only
   a few of the string encodings used with it are documented.

5. See also [What size wchar_t do I need for
   Unicode?](http://icu-project.org/docs/papers/unicode_wchar_t.html)

6. A program based on this model must be recompiled for each platform. Usually,
   it must be recompiled for each supported language or family of languages.

7. The ISO C standard basically requires, by how its standard functions are
   defined, that the data type for a single character code in a large character
   set is the same as the string base unit type (`wchar_t`).
   This has led to C
   standard library implementations using Unicode encodings which are either
   limited for single-character functions to only part of Unicode, or suffer
   from reduced interoperability with most Unicode-aware software.

#### ICU

ICU always processes Unicode text. Unicode covers all languages and allows safe
hard coding of character codes, in addition to providing many standard or
recommended algorithms and a lot of useful character property data. See the
chapters about [Unicode Basics](unicode.md) and [Strings](strings/index.md) and others.

ICU uses the 16-bit encoding form of Unicode (UTF-16) for processing, making it
fully interoperable with most Unicode-aware software. See [UTF-16 for
Processing](http://www.unicode.org/notes/tn12/). For ICU4J this comes
naturally, because the Java language and the JDK use UTF-16.

ICU uses and/or provides direct access to all of the [Unicode
properties](strings/properties.md) which provide a much finer-grained
classification of characters than [C/POSIX character
classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html).

In C/C++ source code character and string literals, ICU uses only "invariant"
characters. They are the subset of graphic ASCII characters that are almost
always encoded with the same byte values on all systems. (One set of byte values
for ASCII-based systems, and another such set of byte values for EBCDIC
systems.) See
[`utypes.h`](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utypes.h)
for the set of "invariant" characters.

With the use of Unicode, the implementation of many of the Unicode standard
algorithms, and its cross-platform availability, ICU provides for consistent,
portable, and reliable text processing.
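
The difference between a single character code and a string base unit, which
the `wchar_t` model conflates, can be sketched in plain C. The following is a
minimal illustration only, not ICU code: the function names `utf16_append` and
`utf16_next` are hypothetical, and real code would normally use the `U16_*`
macros from ICU's `unicode/utf16.h` instead. It shows a 32-bit code point being
stored as one or two 16-bit UTF-16 code units and read back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: appends code point c to buf as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 if c is
   not a Unicode scalar value or the buffer is too small. */
size_t utf16_append(uint16_t *buf, size_t capacity, uint32_t c)
{
    assert(buf != NULL);
    if (c < 0x10000) {
        /* Surrogate code points cannot be encoded on their own. */
        if ((c >= 0xD800 && c <= 0xDFFF) || capacity < 1) {
            return 0;
        }
        buf[0] = (uint16_t)c;
        return 1;
    }
    if (c > 0x10FFFF || capacity < 2) {
        return 0;
    }
    c -= 0x10000;
    buf[0] = (uint16_t)(0xD800 + (c >> 10));   /* lead surrogate */
    buf[1] = (uint16_t)(0xDC00 + (c & 0x3FF)); /* trail surrogate */
    return 2;
}

/* Hypothetical helper: reads the code point starting at s[*i] and
   advances *i past it. An unpaired surrogate is returned unchanged. */
uint32_t utf16_next(const uint16_t *s, size_t length, size_t *i)
{
    uint32_t c;
    assert(s != NULL && i != NULL && *i < length);
    c = s[(*i)++];
    if (c >= 0xD800 && c <= 0xDBFF && *i < length) {
        uint32_t trail = s[*i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++*i;
            c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return c;
}
```

Keeping the code point type (32 bits) separate from the string unit type
(16 bits) is what lets single-character functions cover all of Unicode while
strings stay in UTF-16.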

### Case Mappings

#### ISO C

The standard C functions `tolower()`, `toupper()`, etc. take and return one
character code each.

#### Problems

1. This does not work for German, where the character "ß" (sharp s) uppercases
   to the two characters "SS". (It "expands".)

2. It does not work for Greek, where the character "Σ" (capital sigma)
   lowercases to either "ς" (small final sigma) or "σ" (small sigma) depending
   on whether the capital sigma is the last letter in a word. (It is
   context-dependent.)

3. It does not work for Lithuanian and Turkic languages where a "combining dot
   above" character may need to be removed in certain cases. (It "contracts"
   and is language- and context-dependent.)

4. There are a number of other such cases.

5. There are no standard functions for title-casing strings.

6. There are no standard functions for case-folding strings. (Case-folding is
   used for case-insensitive comparisons; there are C/POSIX functions for
   direct, case-insensitive comparisons of pairs of strings. Case-folding is
   useful when one string is compared to many others, or as part of a chain of
   transformations of a string.)

#### ICU

Case mappings are operations taking and returning strings, to support length
changes and context dependencies. Unicode provides algorithms and data for
proper case mappings, and ICU provides APIs for them. (See the API references
for various string functions and for Transforms/Transliteration.)

### Character Classes

#### ISO C

The standard C functions `isalpha()`, `isdigit()`, etc. take a character code each
and return boolean values for whether the character belongs to the current
locale's respective character class.

#### Problems

1. Character classes are bound to locales, instead of providing consistent
   classifications for characters.

2.
   The same character may have different classifications depending on the
   locale and the platform.

3. There are only very few POSIX character classes, and they are not well
   defined. For example, there is a class for punctuation characters but not
   one for symbols.

4. For example, the dollar symbol (“$”) may or may not belong to the `punct`
   class depending on the locale, even on the same system.

5. The standard allows at most two sets of decimal digits: The digits of the
   “portable character set” (i.e., those in the ASCII repertoire) and one more.
   Some implementations only recognize ASCII digits in the `isdigit()` function.
   However, there are many sets of decimal digits in a multilingual character
   set like Unicode.

6. The POSIX standard assumes that each locale definition file carries the
   character class data for all relevant characters. With many locales using
   overlapping character repertoires, this can lead to a lot of duplication.
   For efficiency, many UTF-8 locales define character classes only for very
   few characters instead of for all of Unicode. For example, some `de_DE.utf-8`
   locales only define character classes for characters used in German, or for
   the repertoire of ISO 8859-1 – in other words, for only a tiny fraction of
   the representable Unicode repertoire. Processing of text using more than
   this repertoire is not possible with such an implementation.

7. For more about the problems with POSIX character classes in a Unicode
   context see [Annex C: Compatibility Properties in Unicode
   Technical Standard #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/#Compatibility_Properties)
   and see the mailing list archives for the unicode list (on unicode.org). See
   also the ICU design document about [C/POSIX character
   classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html).
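
The point in item 5 about multiple sets of decimal digits can be made concrete
with a small locale-independent sketch. This is an illustration under stated
assumptions, not ICU code: the table `kDigitZeros` and the function
`decimal_digit_value` are hypothetical names, and the table covers only five of
the many decimal digit sets in Unicode. ICU's `u_charDigitValue()` in `uchar.h`
serves the same purpose for the whole Unicode repertoire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code points of the digit zero in a few Unicode decimal digit sets.
   Illustrative subset only; the Unicode Character Database has many more. */
static const uint32_t kDigitZeros[] = {
    0x0030, /* ASCII */
    0x0660, /* Arabic-Indic */
    0x06F0, /* Extended Arabic-Indic */
    0x0966, /* Devanagari */
    0xFF10  /* Fullwidth */
};

/* Hypothetical helper: returns 0..9 if c is a decimal digit in one of the
   listed sets, or -1 otherwise. The result does not depend on any locale. */
int decimal_digit_value(uint32_t c)
{
    size_t i;
    assert(c <= 0x10FFFF); /* precondition: c is a Unicode code point */
    for (i = 0; i < sizeof kDigitZeros / sizeof kDigitZeros[0]; ++i) {
        /* Each decimal digit set occupies ten consecutive code points. */
        if (c >= kDigitZeros[i] && c <= kDigitZeros[i] + 9) {
            return (int)(c - kDigitZeros[i]);
        }
    }
    return -1;
}
```

An `isdigit()` implementation that only knows the ASCII row of this table would
reject every digit from the other four rows.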

#### ICU

ICU provides locale-independent access to all [Unicode
properties](strings/properties.md) (except Unihan.txt properties), as well as to
the POSIX character classes, via functions defined in `uchar.h` and in ICU4J's
`UCharacter` class (see API references) as well as via `UnicodeSet`. The POSIX
character classes are implemented according to the recommendations in UTS #18.

The Unicode Character Database defines more than 70 character properties. Their
values are designed for the large character set as well as for real text
processing, and they are updated with each version of Unicode. The UCD is
available online, facilitating industry-wide consistency in the implementation
of Unicode properties.

## Formatting and Parsing

### Currency Formatting

#### POSIX

The `strfmon()` function is used to format monetary values. The default format and
the currency display symbol or display name are selected by the `LC_MONETARY`
locale ID. The number formatting can also be controlled with a formatting string
resembling what `printf()` uses.

#### Problems

1. Selection of the currency via a locale ID is unreliable: Countries change
   currencies over time, and the locale data for a particular country may not
   be available. This results in using the wrong currency. For example, an
   application may assume that a country has switched from a previous currency
   to the Euro, but it may run on an OS that predates the switch.

2. Using a single locale ID for the whole format makes it very difficult to
   format values for multiple currencies with the same number format (for
   example, for an exchange rate list or for showing the price of an item
   adjusted for several currencies). `strfmon()` lets the application specify
   the number format fully, but then the application cannot use a country's
   default number format.

3.
   The set of formattable currencies is limited to those that are available via
   locale IDs on a particular system.

4. There does not appear to be a function to parse currency values.

#### ICU

ICU number formatting APIs have separate, orthogonal settings for the number
format, which can be selected with a locale ID, and the currency, which is
specified with an ISO code. See the [Formatting
Numbers](format_parse/numbers/index.md) chapter for details.
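
The idea of orthogonal settings can be sketched without ICU. The following toy
example is only an illustration of the design, not the ICU API: `NumberStyle`
and `format_amount` are hypothetical names, the separators are hard-coded where
ICU would take them from locale data, and every currency is assumed to have two
fraction digits, which is not true in general. In ICU4C the corresponding
functionality is declared in `unum.h`, for example
`unum_formatDoubleCurrency()`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical number style: the separators that a locale's number
   format would normally supply. */
typedef struct {
    char grouping; /* thousands separator */
    char decimal;  /* decimal separator */
} NumberStyle;

/* Hypothetical helper: formats an amount given in hundredths of the
   currency unit, using the separators in style, followed by the
   three-letter ISO 4217 code. The number style and the currency are
   independent parameters. Returns the length written, or -1 if the
   result does not fit into out. */
int format_amount(char *out, size_t capacity, const NumberStyle *style,
                  const char *iso_code, unsigned long hundredths)
{
    char digits[32];
    char grouped[48];
    int n, i, j = 0, written;

    assert(out != NULL && style != NULL && iso_code != NULL);
    assert(strlen(iso_code) == 3);

    /* Integer part, then insert a separator before each group of three. */
    n = snprintf(digits, sizeof digits, "%lu", hundredths / 100);
    for (i = 0; i < n; ++i) {
        if (i > 0 && (n - i) % 3 == 0) {
            grouped[j++] = style->grouping;
        }
        grouped[j++] = digits[i];
    }
    grouped[j] = '\0';

    written = snprintf(out, capacity, "%s%c%02lu %s", grouped,
                       style->decimal, hundredths % 100, iso_code);
    return (written < 0 || (size_t)written >= capacity) ? -1 : written;
}
```

With this separation, one number style can be reused for a list of prices in
several currencies, which is the case that a single `LC_MONETARY` locale ID
makes difficult.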