---
layout: default
title: Internationalization
nav_order: 1
parent: ICU
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Software Internationalization
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview of Software Internationalization

Developing globalized software is a continuous balancing act as software
developers and project managers inadvertently underestimate the level of effort
and detail required to create foreign-language software releases.

Software developers must understand the ICU services to design and deploy
successful software releases. The services can save ICU users time in dealing
with the kinds of problems that typically arise during critical stages of the
software life cycle.

In general, the standard process for creating globalized software includes
"internationalization", which covers generic coding and design issues, and
"localization", which involves translating and customizing a product for a
specific market.

Software developers must understand the intricacies of internationalization
since they write the actual underlying code. How well they use established
services to achieve mission objectives determines the overall success of the
project. At a fundamental level, code and feature design affect how a product is
translated and customized. Therefore, software developers need to understand key
localization concepts.

From a geographic perspective, a locale is a place. From a software perspective,
a locale is an ID used to select information associated with a language and/or
a place. ICU locale information includes the name and identifier of the spoken
language, sorting and collating requirements, currency usage, numeric display
preferences, and text direction (left-to-right or right-to-left, horizontal or
vertical).
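
The idea of a locale as an ID can be sketched in a few lines of Java. This is
only an illustration: it uses the JDK's `java.util.Locale` as a stand-in (ICU4J
offers comparable accessors on `com.ibm.icu.util.ULocale`), and the class name
`LocaleIdSketch` and its helper methods are hypothetical names chosen for this
example.

```java
import java.util.Locale;

/** Hypothetical helper class; the name is not part of ICU or the JDK. */
final class LocaleIdSketch {

    /** Returns the language part of a BCP 47 tag, for example "de" for "de-AT". */
    static String language(String languageTag) {
        return Locale.forLanguageTag(languageTag).getLanguage();
    }

    /** Returns the region part of the tag, or an empty string if the tag has none. */
    static String region(String languageTag) {
        return Locale.forLanguageTag(languageTag).getCountry();
    }

    public static void main(String[] args) {
        // The same language can be paired with several places.
        for (String tag : new String[] {"de-DE", "de-AT", "de-CH"}) {
            System.out.println(tag + ": language=" + language(tag)
                    + ", region=" + region(tag));
        }
    }
}
```

The ID itself carries no data. It is a key that services use to look up the
language-specific and place-specific information described above.
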

General locale-sensitive standards include keyboard layouts, default paper and
envelope sizes, common printers and monitor resolutions, character sets or
encoding ranges, and input methods.

## ICU Services Overview

The ICU services support all major locales with language and sub-language pairs.
The sub-language generally corresponds to a country. One way to think of this is
in terms of the phrase "X language as spoken in Y country." The way people speak
or write a particular language might not change dramatically from one country to
the next (for example, German is spoken in Austria, Germany, and Switzerland).
However, cultural conventions and national standards often differ a great deal.

A key advantage of using the ICU services is reduced time to market. Display
strings are bundled in separate text files for translation. A team of
programmers and translators no longer needs to search the source code in order
to rewrite the software for each country and language.

## Internationalization and Unicode

Unicode enables a program to use a standard encoding scheme for all textual data
within the program's environment. Conversion has to be done with incoming and
outgoing data only. Operations on the text (while it is in the environment) are
simplified since you do not have to keep track of the encoding of a particular
text.

Unicode supports multilingual data since it encodes characters for all world
languages. You do not have to tag pieces of data with their encoding to enable
the right characters, and you can mix languages within a single piece of text.

Some of the advantages of using ICU to internationalize your program include the
following:

* It can handle text in any language or combination of languages.

* The source code can be written so that the program can work for many
  locales.

* Configurable, pluggable localization is enabled.

* Multiple locales are supported at the same time.

* Non-technical people can be given access to information and you don't have
  to open the source code to them.

* Software can be developed so that the same code can be ported to various
  platforms.

## Project Management Tips for Internationalizing Software

The following two processes are key when managing, developing and designing a
successful internationalized software deliverable:

1. Separate the program's executable code from its UI elements.

2. Avoid making cultural assumptions.

Keep static information (such as pictures and window layouts) separate from the
program code. Also ensure that the text which the program generates on the fly
(such as numbers and dates) comes out in the right language. The text must be
formatted correctly for the targeted user community.

Make sure the analysis and manipulation of both text and other kinds of data
(such as dates) is done in a manner that can be easily adapted for different
languages and user communities. This includes tasks such as alphabetizing lists
and looking for line-break positions.

Characters must display on the screen correctly (the text's storage format must
be translated to the proper visual images). They must also be accepted as input
(translated from keystrokes, voice input or another kind of input into the
text's storage format). These processes are relatively easy for English, but
quite challenging for other languages.

### Separating Executable Code from UI Elements

Good software design requires that the programming code implementing the user
interface (UI) be kept separate from code implementing the underlying
functionality. The description of the UI must also be kept separate from the
code implementing it.

The description of the UI contains items that the user sees, including the
various messages, buttons, and menu commands.
It also contains information about
how dialog boxes are to be laid out, and how icons, colors or other visual
elements are to be used. These layouts often depend on the language, because
translated text varies in length. For example, German words tend to be longer
since they contain grammatical suffixes that English has lost in the last 800
years. The following table shows how word lengths can differ among languages.

|English|German|Cyrillic-Serbian|
|--------|--------|-------------|
|cut|ausschneiden|исеци|
|copy|kopieren|копирај|
|paste|einfügen|залепи|

The description of the UI, especially user-visible pieces of text, must be kept
together and not embedded in the program's executable code. ICU provides the
ResourceBundle services for this purpose.

### Avoiding Cultural/Hidden Assumptions

Another difficulty encountered when designing and implementing code is making
it flexible enough to handle different ways of doing things in other countries
and cultures. Most programmers make unconscious assumptions about their users'
language and customs when they design their programs. For example, in Thailand,
the official calendar is the Buddhist calendar and not the Gregorian calendar.

These assumptions make it difficult to translate the user interface portion of
the code for some user communities without rewriting the underlying program. The
ICU libraries provide flexible APIs that can be used to perform the most common
and important tasks. They contain pre-built supporting data that enables them to
work correctly in 75 languages and more than 200 locales. The key is
understanding when, where, why, and how to use the APIs effectively.

The remainder of this section provides an overview of some areas where cultural
and hidden assumptions arise.
See the list of topics below:

* [Numbers and Dates](#numbers-and-dates)
* [Messages](#messages)
* [Measuring Units](#measuring-units)
* [Alphabetical Order of Characters](#alphabetical-order-of-characters)
* [Characters](#characters)
* [Text Input and Layout](#text-input-and-layout)
* [Text Manipulation](#text-manipulation)
* [Date/Time Formatting](#datetime-formatting)
* [Distributed Locale Support](#distributed-locale-support)
* [LayoutEngine](#layoutengine)

#### Numbers and Dates

Numbers and dates are represented differently in different languages. Do not
implement routines for converting numbers into strings, and do not call
low-level system interfaces like `sprintf()` that do not produce
language-sensitive results. Instead, see how ICU's
[NumberFormat](format_parse/numbers/index.md) and
[DateFormat](format_parse/datetime/index.md) services can be used more
effectively.

#### Messages

Be careful about assuming how individual pieces of text are used together to
create a complete sentence (for example, when error messages are generated). The
elements might go together in a different order if the message is translated
into a new language. ICU provides
[MessageFormat](format_parse/messages/index.md) and
[ChoiceFormat](format_parse/messages/index.md) to help with these
occurrences.

> :point_right: **Note**: There might also be situations where parts of the sentence change when other
parts of the sentence also change (selecting between singular and plural nouns
that go after a number is the most common example).

#### Measuring Units

Numerical representations can change with regard to measurement units and
currency values. Currency values can vary by country. A good example of this is
the representation of $1,000. This amount can represent either U.S. or
Canadian dollar values.
U.S. dollars can be displayed as USD while Canadian
dollars can be displayed as CAD, depending on the locale. In this case, the
way the amount is displayed might change, and the number itself might also
change. [NumberFormat](format_parse/numbers/index.md) provides some support for
this.

#### Alphabetical Order of Characters

Not all languages (even those using the same alphabet) have the same concept of
alphabetical order. Do not assume that alphabetical order is the same as the
numerical order of the characters' code-point values. In practice, 'a' is
distinct from 'A' and 'b' is distinct from 'B'. Each has a different code
point. This means that you cannot use a byte-wise lexical comparison (such as
what `strcmp()` provides) to sort user-visible lists.

Not all languages interpret the same characters as equivalent. If a character's
case is changed, it is not always a one-to-one mapping. Accent differences, the
presence or absence of certain characters, and even spelling differences might
be insignificant when determining whether two strings are equal. The
[Collator](collation/index.md) services provide significant help in this area.

#### Characters

A character does not necessarily correspond to a single code-point position in
the backing store. Not all languages have the same definition of a word, and a
group of characters separated by white space is not always an acceptable
approximation of a word. ICU provides the
[BreakIterator](boundaryanalysis/index.md) services to help locate boundaries
and count units of text.

When checking characters for membership in a particular class, do not list the
specific characters you are interested in, and do not assume they come in any
particular order in the encoding scheme.
For example, `/A-Za-z/` does not mean all
letters in most European languages, and `/0-9/` does not mean all digits in many
writing systems. This also holds true when using C interfaces such as `isupper()`
and `islower()`. ICU provides a large group of utility functions for testing
character properties, such as `u_isupper()` and `u_islower()`.

#### Text Input and Layout

Do not assume anything about how a piece of text might be drawn on the screen,
including how much room it takes up, the direction it flows, or where on the
screen it should start. All of these text elements vary according to language.
Likewise, there might not be a one-to-one relationship between characters and
keystrokes. One-to-many, many-to-one, and many-to-many relationships between
characters and keystrokes all occur in real text in some languages.

#### Text Manipulation

Do not assume that the textual data which the program stores and manipulates
is in any particular language or writing system. ICU provides many methods that
help with text storage. The `UnicodeString` class and `u_strxxx` functions are
provided for Unicode-based character manipulation. For example, characters can
be appended to, removed from, or extracted out of an existing Unicode character
buffer.

A good example of multilingual text is the Rosetta Stone. The same text is
written on it in Hieroglyphic, Greek and Demotic. ICU provides the services to
process multilingual text such as this correctly.

#### Date/Time Formatting

Conventions for measuring time vary in many ways, such as the lengths of months
or years, which day is the first day of the week, and the allowable range of
values for fields like month and year. These conventions affect how dates are
formatted (with `DateFormat`). The time zone you are in (handled with
`TimeZone`) and the date when daylight saving time starts also vary. ICU
provides the Calendar services needed to handle these issues.
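
Two of the calendar questions above can be sketched briefly in Java. As an
assumption for illustration, the example uses the JDK's `java.util.Calendar`
rather than ICU4J's `com.ibm.icu.util.Calendar`, which offers methods with the
same names (`getFirstDayOfWeek()` and `getActualMaximum()`). The class name
`DateConventionSketch` and its helpers are hypothetical.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

/** Hypothetical helper class; the name is not part of ICU or the JDK. */
final class DateConventionSketch {

    /** Returns the Calendar day-of-week constant that starts the week in this locale. */
    static int firstDayOfWeek(Locale locale) {
        return Calendar.getInstance(locale).getFirstDayOfWeek();
    }

    /** Returns the number of days in a Gregorian month (month is 0-based, as in Calendar). */
    static int daysInMonth(int year, int month) {
        Calendar calendar = new GregorianCalendar(year, month, 1);
        return calendar.getActualMaximum(Calendar.DAY_OF_MONTH);
    }

    public static void main(String[] args) {
        // Ask the calendar instead of hard-coding "weeks start on Sunday".
        System.out.println("US starts the week on Sunday: "
                + (firstDayOfWeek(Locale.US) == Calendar.SUNDAY));
        System.out.println("France starts the week on Monday: "
                + (firstDayOfWeek(Locale.FRANCE) == Calendar.MONDAY));
        // Ask the calendar instead of hard-coding month lengths.
        System.out.println("Days in February 2024: "
                + daysInMonth(2024, Calendar.FEBRUARY));
    }
}
```

In both helpers the program asks the calendar service for the answer instead of
embedding an assumption about weeks or month lengths in its own code.
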

#### Distributed Locale Support

In most server applications, do not assume that all clients connected to the
server interact with their users in the same language. Also do not assume that a
session stops and restarts whenever a user speaking one language replaces
another user speaking a different language. ICU provides sufficient flexibility
for a program to handle multiple locales at the same time.

For example, a Web server needs to serve pages to different users, in different
languages and with different date formats, at the same time.

#### LayoutEngine

The ICU LayoutEngine is an open source library that provides a uniform,
easy-to-use interface for preparing complex scripts or text for display. The
Latin script, which is the most commonly used script among software developers,
is also the least complex script to display, especially when it is used to write
English. Using the Latin script, characters can be displayed from left to right
in the order that they are stored in memory. Some scripts require rendering
behavior that is more complicated than that of the Latin script. We refer to
these scripts as "complex scripts" and to text written in these scripts as
"complex text."