---
layout: default
title: Localizing with ICU
nav_order: 3
parent: Locales and Resources
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Localizing with ICU
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

There are many different formats for software localization, i.e., for resource
bundles. The most important file format feature for translation of text elements
is to represent key-value pairs where the values are strings.

Each format was designed for a certain purpose. Many but not all formats are
recognized by translation tools. For localization it is best to use a source
format that is optimized for translation, and to convert from it to the
platform-specific formats at build time.

This overview concentrates on the formats that are relevant for working with
ICU. The examples below show only lists of strings, which is the lowest common
denominator for resource bundles.

## Recommendation

The most promising long-term approach is to author localizable data in XLIFF
format (see the [XLIFF](#xliff) (§) section below) and to convert it to native,
platform/tool-specific formats at build time.

Short-term, due to the lack of ICU tools for XLIFF, either custom tools must be
used to convert from some authoring/translation format to Java/ICU formats, or
one of the Java/ICU formats needs to be used for authoring and translation.

## Java and ICU4J

### .properties files

Java `PropertyResourceBundle` uses runtime-parsed .properties files. They contain
key-value pairs where both keys and values are Unicode strings. No other native
data types (e.g., integers or binaries) are supported. There is no way to
specify a charset, therefore .properties files must be in ISO 8859-1 with \u
escape sequences (see the Java `native2ascii` tool).

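The handling of \u escape sequences can be illustrated with a minimal sketch. It assumes only the standard `java.util.PropertyResourceBundle(Reader)` constructor; the class name `PropertiesEscapeDemo` and the helper `lookup` are hypothetical names introduced for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;

// Hypothetical demo class, not part of Java or ICU.
class PropertiesEscapeDemo {
    // Parses .properties text held in memory and returns the value for one key.
    static String lookup(String propertiesText, String key) {
        try {
            PropertyResourceBundle bundle =
                    new PropertyResourceBundle(new StringReader(propertiesText));
            return bundle.getString(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The doubled backslash puts a literal escape sequence (backslash, u,
        // then four hex digits) into the text, as it would appear in the file.
        String text = "key1=Deutsche Sprache schwere Sprache\n"
                    + "key2=D\\u00FCsseldorf\n";
        // The parser decodes the escape, so the value contains U+00FC.
        System.out.println(lookup(text, "key2"));
    }
}
```

The value returned for `key2` contains the single character U+00FC rather than the six-character escape sequence, which is why pure-ASCII .properties files can still carry arbitrary Unicode text.
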

Defined at: http://java.sun.com/j2se/1.4/docs/api/java/util/PropertyResourceBundle.html

Example: (`example_de.properties`)

```properties
key1=Deutsche Sprache schwere Sprache
key2=Düsseldorf
```

### .java ListResourceBundle files

Java `ListResourceBundle` files provide implementation subclasses of the
`ListResourceBundle` abstract base class. **They are Java code!** Source files are
.java files that are compiled as usual with the javac compiler. Syntactic rules
of Java apply. As Java source code, they can contain arbitrary Java objects and
can be nested.

Although the Java compiler allows specifying a charset on the command line, this
is uncommon, and .java resource bundle files are therefore usually encoded in
ISO 8859-1 with \u escapes like .properties files.

Defined at: http://java.sun.com/j2se/1.4/docs/api/java/util/ListResourceBundle.html

Example: (`example_de.java`)

```java
import java.util.ListResourceBundle;

public class example_de extends ListResourceBundle {
    // Returns the key-value table for this locale.
    public Object[][] getContents() {
        return contents;
    }
    static final Object[][] contents={
        { "key1", "Deutsche Sprache " +
            "schwere Sprache" },
        { "key2", "Düsseldorf" }
    };
}
```

ICU4J can also access the ICU4C resource bundles described in the next section,
using the API described in the [UResourceBundle](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/UResourceBundle.html) documentation.

## ICU4C

### .txt resource bundles

ICU4C natively uses a plain text source format with a nested structure that was
derived from Java `ListResourceBundle` .java files when the original ICU Java
class files were ported to C++. The ICU4C bundle format can of course contain
only data, not code, unlike .java files. Resource bundle source files are
compiled with the `genrb` tool into a binary runtime form (`.res` files) that is
portable among platforms with the same charset family (ASCII vs.
EBCDIC) and
endianness.

Features:

1. Key-value pairs. Keys are strings of "invariant characters", a portable subset of the ASCII graphic character repertoire. For details on "invariant characters", see the definition of the .txt file format (URL below) or [icu/source/common/unicode/utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html)

2. Values can be Unicode strings, integers, binaries (BLOBs), integer arrays (vectors), and nested structures. Nested structures are either arrays (position-indexed vectors) of values or "tables" of key-value pairs.

3. Values inside nested structures can be of any type that is allowed at the top level, arbitrarily deeply nested via arrays and tables.

4. Long strings can be split across lines: Adjacent strings separated only by whitespace (including line breaks) are automatically concatenated at build time.

5. At runtime, when a top-level item is not found, ICU looks up the same key in the parent bundle as determined by the locale ID.

6. A value can also be an "alias", which is simply a reference to another bundle's item. This is to save space by storing large data pieces only once when they cannot be inherited along the locale ID hierarchy (e.g., collation data in ICU shared among zh_HK and zh_TW).

7. Source files can be in any charset. Unicode signature byte sequences are recognized automatically (UTF-8/16, SCSU, ...), otherwise the tool takes a charset name on the command line.

Defined at: [icu-docs/main/design/bnf_rb.txt](https://raw.githubusercontent.com/unicode-org/icu-docs/main/design/bnf_rb.txt)

To use with ICU4C, see the [Resource Bundle APIs](resources#resource-bundle-apis) section of this user guide.

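The parent lookup described in feature 5 can be sketched with Java's own `java.util.ResourceBundle`, which resolves missing keys through a parent chain in a comparable way. This is only an analogy and not ICU4C's implementation; the class names `ParentFallbackDemo`, `Root` and `German`, the helper `lookup`, and the key `key3` are hypothetical and chosen for this illustration.

```java
import java.util.ListResourceBundle;

// Hypothetical demo class; it imitates locale fallback with plain Java bundles.
class ParentFallbackDemo {
    // Stands in for the root bundle: it defines every key.
    static class Root extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                { "key1", "root value" },
                { "key3", "root-only value" },
            };
        }
    }

    // Stands in for the "de" bundle: it overrides key1 only.
    static class German extends ListResourceBundle {
        German() {
            // Wire the parent by hand; a real loader derives it from the locale ID.
            setParent(new Root());
        }

        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                { "key1", "Deutsche Sprache schwere Sprache" },
            };
        }
    }

    // Looks up a key in the German bundle, falling back to Root when missing.
    static String lookup(String key) {
        return new German().getString(key);
    }

    public static void main(String[] args) {
        System.out.println(lookup("key1")); // defined in the German bundle
        System.out.println(lookup("key3")); // missing there, taken from Root
    }
}
```

A translated bundle therefore only needs to contain the items that differ from its parent; everything else is resolved further up the chain.
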

Example: (`de.txt`)

```
de {
    key1 { "Deutsche Sprache "
           "schwere Sprache" }
    key2 { "Düsseldorf" }
}
```

### ICU4C XML resource bundles

The ICU4C XML resource bundle format was defined simply to express the same
capabilities as the .txt and binary ICU4C resource bundles in XML form. However,
we have decided to drop the format for lack of use and instead adopt standard
XLIFF format for localization. For more information on XLIFF format, see the
following section. For examples of using ICU tools to produce and read XLIFF
format see the XLIFF Usage section in the [resource management chapter](resources#using-xliff-for-localization).

## XLIFF

The XML Localization Interchange File Format (XLIFF) is an emerging industry
standard "for the interchange of localization information". Version 1.1 is
available (2003-Oct-31), and 1.2 is almost complete (2007-Jan-20).

This is the result of a quick review of XLIFF and may need to be improved.

Features:

1. Multiple resource bundles per XLIFF file are supported.

2. Multiple languages per XLIFF file are supported.

3. XLIFF provides a rich set of ways to communicate intent, types of items,
   etc. all the way from content creation to all stages and phases of
   translation.

4. Nesting of values does not appear to be supported.

5. XLIFF is independent of actual build-time or runtime resource bundle
   formats. .xlf files must be converted to native formats at build time.

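The build-time conversion mentioned in feature 5 can be sketched as follows. This is a minimal illustration that uses only the JDK's DOM parser on a small inline XLIFF 1.1 fragment; it is not an ICU tool, and the class name `XliffToPairsDemo`, the constant `SAMPLE` and the method `extractTargets` are hypothetical. It assumes that each `trans-unit` carries a `resname` attribute and an optional `target` element.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: flattens XLIFF trans-units into key-value pairs.
class XliffToPairsDemo {
    // Small inline sample; a real build step would read an .xlf file instead.
    static final String SAMPLE =
          "<xliff version='1.1' xmlns='urn:oasis:names:tc:xliff:document:1.1'>"
        + "<file source-language='en' target-language='sh'"
        + " datatype='x-icu-resource-bundle' original='root.txt'>"
        + "<body><group id='root' restype='x-icu-table'>"
        + "<trans-unit id='usage' resname='usage'>"
        + "<source>usage: ufortune [-v] [-l locale]</source>"
        + "<target>upotreba: ufortune [-v] [-l lokal]</target>"
        + "</trans-unit>"
        + "</group></body></file></xliff>";

    // Returns resname -> translated text, in document order.
    static Map<String, String> extractTargets(String xliff) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xliff)));
            Map<String, String> pairs = new LinkedHashMap<>();
            NodeList units = doc.getElementsByTagName("trans-unit");
            for (int i = 0; i < units.getLength(); i++) {
                Element unit = (Element) units.item(i);
                NodeList targets = unit.getElementsByTagName("target");
                if (targets.getLength() > 0) { // skip untranslated units
                    pairs.put(unit.getAttribute("resname"),
                              targets.item(0).getTextContent());
                }
            }
            return pairs;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XLIFF", e);
        }
    }

    public static void main(String[] args) {
        // Emit the pairs in .properties-like form.
        extractTargets(SAMPLE).forEach((key, value) ->
                System.out.println(key + "=" + value));
    }
}
```

A production converter would additionally have to handle groups, non-string resource types and escaping for the chosen output format; the sketch covers plain strings only.
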

Defined at: http://www.oasis-open.org/committees/xliff/

Example: (`example.xlf`)

```xml
<?xml version="1.0" encoding="utf-8"?>
<xliff version = "1.1" xmlns='urn:oasis:names:tc:xliff:document:1.1'
xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xsi:schemaLocation='urn:oasis:names:tc:xliff:document:1.1
http://www.oasis-open.org/committees/xliff/documents/xliff-core-1.1.xsd'>
    <file xml:space = "preserve" source-language = "en" target-language = "sh"
    datatype = "x-icu-resource-bundle" original = "root.txt"
    date = "2007-08-17T21:17:08Z">
        <header>
            <tool tool-id = "genrb-3.3-icu-3.8" tool-name = "genrb"/>
        </header>
        <body>
            <group id = "root" restype = "x-icu-table">
                <trans-unit id = "optionMessage" resname = "optionMessage">
                    <source>unrecognized command line option:</source>
                    <target>nepoznata opcija na komandnoj liniji:</target>
                </trans-unit>
                <trans-unit id = "usage" resname = "usage">
                    <source>usage: ufortune [-v] [-l locale]</source>
                    <target>upotreba: ufortune [-v] [-l lokal]</target>
                </trans-unit>
            </group>
        </body>
    </file>
</xliff>
```

For examples of using ICU tools to produce and read XLIFF format see the XLIFF
Usage (§) section in the [resource management chapter](resources#using-xliff-for-localization).

## DITA

The Darwin Information Typing Architecture (DITA) is "IBM's XML architecture for
topic-oriented information". It is a family of XML formats for several types of
publications including manuals and resource bundles. It is extensible. For
example, subformats can be defined by refining DTDs. One design feature is to
provide cross-document references for reuse of existing contents.
For more
information see http://www.ibm.com/developerworks/xml/library/x-dita4/index.html

While it is certainly possible to define resource bundle formats via DTDs in the
DITA framework, there currently (2002-Nov-27) do not appear to be resource
bundle formats actually defined, or tools available specifically for them.

## Linux/gettext

The OpenI18N specification requires support for message handling functions
(mostly variants of `gettext()`) as defined in `libintl.h`. See Tables 3-5 and 3-6
and Annex C in http://www.openi18n.org/docs/html/LI18NUX-2000-amd4.htm

Resource bundles ("portable object files", extension .po) are plain text files
with key-value pairs for string values. The format and functions support a
simple selection of plural forms by associating integer values (via C language
expressions) with indexes of strings.

The `msgfmt` utility compiles .po files into "message object files" (extension
.mo). The charset is determined from the locale ID in `LC_CTYPE`. There are
additional supporting tools for .po files.

*Note: The OpenI18N specification also requires POSIX `gencat`/`catgets` support. See the [POSIX](#posixcatgets) (§) section below.*

Defined at: Annex C of the Li18nux-2000 specification, see above.

Example: (`example.po`)

```
domain "example_domain"
msgid "key1"
msgstr "Deutsche Sprache schwere Sprache"
msgid "key2"
msgstr "Düsseldorf"
```

## POSIX/catgets

POSIX (The Open Group specification) defines message catalogs with the `catgets()`
C function and the `gencat` build-time tool. Message catalogs contain key-value
pairs where the keys are integers `1`..`NL_MSGMAX` (see `limits.h`), and the values
are strings. Strings can span multiple lines. The charset is determined from the
locale ID in `LC_CTYPE`.

Defined at:
https://pubs.opengroup.org/onlinepubs/009695399/utilities/gencat.html and
https://pubs.opengroup.org/onlinepubs/009695399/functions/catgets.html

Example: (`example.txt`)

```
1 Deutsche Sprache \
schwere Sprache
2 Düsseldorf
```

## Windows

Windows uses a number of file formats depending on the language environment --
MSVC 6, Visual Basic, or Visual Studio .NET. The most well-known source formats
are the [.rc Resource](https://docs.microsoft.com/windows/win32/menurc/about-resource-files)
and [.mc Message](https://docs.microsoft.com/en-us/windows/win32/eventlog/message-files)
file formats. They both get compiled into .res files that are linked into
special sections of executables. Source formats can be UTF-16, while compiled
strings are (almost) always UTF-16 from .rc files (except for predefined
ComboBox strings) and can optionally be UTF-16 from .mc files.

.rc files carry key-value pairs where the keys are usually numeric but can be
strings. Values can be strings, string tables, or one of many Windows
GUI-specific structured types that compile directly into binary formats that the
GUI system interprets at runtime. .rc files can include C #include files for
#defined numeric keys. .mc files contain string values preceded by per-message
headers similar to the Linux/gettext() format. There is a special format of
messages with positional arguments, with printf-style formatting per argument.
In both .rc and .mc formats, Windows LCID values are defined to be set on the
compiled resources.

Developers and translators usually overlook the fact that binary resources are
included, and include them in each translation. This happens even though
Windows, like Java and ICU, uses locale ID fallback at runtime.

.rc and .mc files are tightly integrated with Microsoft C/C++, Visual Studio and
the Windows platform, but are not used on any other platforms.

A [sample Windows .rc file](#sample-windows-rc-file) (§) is at the end of this document.

## ICU tools

ICU 2.4 provides tools for conversion between resource bundle formats:

1. ICU4C .txt -> ICU4C .res: Default operation of genrb (ICU 2.0 and before).

2. ICU4C .txt -> ICU4C .xml: Option with genrb (ICU 2.4).

3. ICU4C .txt -> Java ListResourceBundle .java format: Option with genrb (ICU
   2.2).
   Generates subclasses of ICUListResourceBundle to support non-string types.

4. Java ListResourceBundle .java format -> ICU4C .txt: Use ICU4J 2.4's
   src/com/ibm/icu/dev/tool/localeconverter

5. ICU4C .xml -> ICU4C .txt: There is a tool for this conversion, but it is not
   fully tested or documented. Please see the
   [XLIFF2ICUConverter](https://icu-project.org/download/xliff2icuconverter.html)
   tool.

There are currently no ICU tools for XLIFF.

### Converting de.txt to a ListResourceBundle

The following genrb invocation generates a ListResourceBundle from `de.txt` (see
the example file `de.txt` above):

`genrb -j -b TestName -p com.example de.txt`

The `-j` option causes .java output, `-b` is an arbitrary bundle name prefix, and
`-p` is an arbitrary package name. "Arbitrary" means "depends on your product" and
may be truly arbitrary if the generated .java files are not actually used in a
Java application. genrb auto-detects .txt files encoded in Unicode charsets like
UTF-8 or UTF-16 if they have a signature byte sequence ("BOM"). The .java output
file is in native2ascii format, i.e., it is encoded in US-ASCII with \u
escapes.

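The native2ascii form mentioned above can be approximated with a few lines of Java. This is a minimal sketch of the escaping rule only (every UTF-16 code unit outside US-ASCII becomes a \u escape with four uppercase hex digits); it is neither genrb's nor native2ascii's actual code, and the names `AsciiEscapeDemo` and `toAsciiEscapes` are hypothetical.

```java
// Hypothetical demo class: rewrites text so that it contains only US-ASCII.
class AsciiEscapeDemo {
    // Replaces every UTF-16 code unit above 0x7F with a four-digit hex escape.
    static String toAsciiEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c); // plain ASCII passes through unchanged
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input contains U+00FC (u with diaeresis) as its second character.
        System.out.println(toAsciiEscapes("D\u00FCsseldorf"));
    }
}
```

For the city name used throughout this page, the result is the ASCII-only text `D\u00FCsseldorf`. Characters outside the Basic Multilingual Plane come out as two escapes, one per surrogate.
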

The output of the above genrb invocation is `TestName_de.java`:

```java
package com.example;
import java.util.ListResourceBundle;
import com.ibm.icu.impl.ICUListResourceBundle;
public class TestName_de extends ICUListResourceBundle {
    public TestName_de () {
        super.contents = data;
    }
    static final Object[][] data = new Object[][] {
        {
            "key1",
            "Deutsche Sprache schwere Sprache",
        },
        {
            "key2",
            "D\u00FCsseldorf",
        },
    };
}
```

### Converting a ListResourceBundle back to .txt

An ICUListResourceBundle .java file as generated in the previous example can be
converted to an ICU4C .txt file with the following steps:

1. Compile the .java file, e.g. with `javac -d . TestName_de.java`. ICU4J needs
   to be on the classpath (or use the -classpath option). If the .java file is
   not in `native2ascii` format, then use the -encoding option (e.g. -encoding
   UTF-8). The -d option (specifying an output directory, in this example the
   current folder) is required. Without it, the Java compiler would not
   generate the com/example folder hierarchy that is required in the next step.

2. You now have a .class file `com/example/TestName_de.class`.

3. Invoke the ICU4J locale converter tool to generate ICU4C .txt format output for
   this .class file:

   `java -cp ;(folder to ICU4J)/icu4j.jar;(working folder for the previous steps); com.ibm.icu.dev.tool.localeconverter.ConvertICUListResourceBundle -icu -package com.example -bundle-name TestName de > de.txt`

   Note that the classpath must include the working folder for the previous
   steps (the folder that contains "com"). The package name (com.example),
   bundle name (TestName) and locale ID (de) must match the .java/.class files.
   Note also that the locale converter writes to the standard output; the
   command line above includes a redirection to de.txt.

The last step generates a new de.txt in `native2ascii` format:

```
de {
    key2{"D\u00FCsseldorf"}
    key1{"Deutsche Sprache schwere Sprache"}
}
```

## Further information

1. TMX: "The purpose of TMX is to allow easier exchange of translation memory
   data between tools and/or translation vendors with little or no loss of
   critical data during the process."
   http://www.lisa.org/tmx/

2. LISA: Localisation Industry Standards Association
   http://www.lisa.org/

## Sample Windows .rc file

This file (`winrc.rc`) was generated with MSVC 6, using the New Project wizard to
generate a simple "Hello World!" application, changing the LCIDs to German, then
adding the two example strings as above.

```
//Microsoft Developer Studio generated resource script.
//
#include "resource.h"
#define APSTUDIO_READONLY_SYMBOLS
/////////////////////////////////////////////////////////////////////////////
//
// Generated from the TEXTINCLUDE 2 resource.
//
#define APSTUDIO_HIDDEN_SYMBOLS
#include "windows.h"
#undef APSTUDIO_HIDDEN_SYMBOLS
#include "resource.h"
/////////////////////////////////////////////////////////////////////////////
#undef APSTUDIO_READONLY_SYMBOLS
/////////////////////////////////////////////////////////////////////////////
// German (Germany) resources
#if !defined(AFX_RESOURCE_DLL) || defined(AFX_TARG_DEU)
#ifdef _WIN32
LANGUAGE LANG_GERMAN, SUBLANG_GERMAN
#pragma code_page(1252)
#endif //_WIN32
/////////////////////////////////////////////////////////////////////////////
//
// Icon
//
// Icon with lowest ID value placed first to ensure application icon
// remains consistent on all systems.
IDI_WINRC ICON DISCARDABLE "winrc.ICO"
IDI_SMALL ICON DISCARDABLE "SMALL.ICO"
/////////////////////////////////////////////////////////////////////////////
//
// Menu
//
IDC_WINRC MENU DISCARDABLE
BEGIN
    POPUP "&File"
    BEGIN
        MENUITEM "E&xit", IDM_EXIT
    END
    POPUP "&Help"
    BEGIN
        MENUITEM "&About ...", IDM_ABOUT
    END
END
/////////////////////////////////////////////////////////////////////////////
//
// Accelerator
//
IDC_WINRC ACCELERATORS MOVEABLE PURE
BEGIN
    "?", IDM_ABOUT, ASCII, ALT
    "/", IDM_ABOUT, ASCII, ALT
END
/////////////////////////////////////////////////////////////////////////////
//
// Dialog
//
IDD_ABOUTBOX DIALOG DISCARDABLE 22, 17, 230, 75
STYLE DS_MODALFRAME | WS_CAPTION | WS_SYSMENU
CAPTION "About"
FONT 8, "System"
BEGIN
    ICON IDI_WINRC,IDC_MYICON,14,9,16,16
    LTEXT "winrc Version 1.0",IDC_STATIC,49,10,119,8,SS_NOPREFIX
    LTEXT "Copyright (C) 2002",IDC_STATIC,49,20,119,8
    DEFPUSHBUTTON "OK",IDOK,195,6,30,11,WS_GROUP
END
/////////////////////////////////////////////////////////////////////////////
//
// String Table
//
STRINGTABLE DISCARDABLE
BEGIN
IDS_APP_TITLE "winrc"
IDS_HELLO "Hello World!"
IDC_WINRC "WINRC"
IDS_SENTENCE "Deutsche Sprache schwere Sprache"
IDS_CITY "Düsseldorf"
END
#endif // German (Germany) resources
/////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////
// English (U.S.) resources
#if !defined(AFX_RESOURCE_DLL) || defined(AFX_TARG_ENU)
#ifdef _WIN32
LANGUAGE LANG_ENGLISH, SUBLANG_ENGLISH_US
#pragma code_page(1252)
#endif //_WIN32
#ifdef APSTUDIO_INVOKED
/////////////////////////////////////////////////////////////////////////////
//
// TEXTINCLUDE
//
2 TEXTINCLUDE DISCARDABLE
BEGIN
    "#define APSTUDIO_HIDDEN_SYMBOLS\r\n"
    "#include ""windows.h""\r\n"
    "#undef APSTUDIO_HIDDEN_SYMBOLS\r\n"
    "#include ""resource.h""\r\n"
    "\0"
END
3 TEXTINCLUDE DISCARDABLE
BEGIN
    "\r\n"
    "\0"
END
1 TEXTINCLUDE DISCARDABLE
BEGIN
    "resource.h\0"
END
#endif // APSTUDIO_INVOKED
#endif // English (U.S.) resources
/////////////////////////////////////////////////////////////////////////////
#ifndef APSTUDIO_INVOKED
/////////////////////////////////////////////////////////////////////////////
//
// Generated from the TEXTINCLUDE 3 resource.
//
/////////////////////////////////////////////////////////////////////////////
#endif // not APSTUDIO_INVOKED
```