---
layout: default
title: Localizing with ICU
nav_order: 3
parent: Locales and Resources
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Localizing with ICU
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

There are many different formats for software localization, i.e., for resource
bundles. The most important file format feature for translation of text elements
is the ability to represent key-value pairs where the values are strings.

Each format was designed for a certain purpose. Many but not all formats are
recognized by translation tools. For localization it is best to use a source
format that is optimized for translation, and to convert from it to the
platform-specific formats at build time.

This overview concentrates on the formats that are relevant for working with
ICU. The examples below show only lists of strings, which is the lowest common
denominator for resource bundles.

## Recommendation

The most promising long-term approach is to author localizable data in XLIFF
format (see the [XLIFF](#xliff) (§) section below) and to convert it to native,
platform/tool-specific formats at build time.

Short-term, due to the lack of ICU tools for XLIFF, either custom tools must be
used to convert from some authoring/translation format to Java/ICU formats, or
one of the Java/ICU formats needs to be used for authoring and translation.

## Java and ICU4J

### .properties files

Java `PropertyResourceBundle` uses runtime-parsed .properties files. They contain
key-value pairs where both keys and values are Unicode strings. No other native
data types (e.g., integers or binaries) are supported. There is no way to
specify a charset; therefore, .properties files must be in ISO 8859-1 with \u
escape sequences (see the Java `native2ascii` tool).

Defined at: http://java.sun.com/j2se/1.4/docs/api/java/util/PropertyResourceBundle.html

Example: (`example_de.properties`)

```properties
key1=Deutsche Sprache schwere Sprache
key2=Düsseldorf
```
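
To illustrate how the \u escape sequences mentioned above are decoded at load time, here is a minimal sketch. It assumes the file content has already been passed through `native2ascii`; the class name `PropertiesEscapeDemo` and the helper `parse` are hypothetical names chosen for this example, and the text is read from a string rather than a real file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesEscapeDemo {
    // Parses .properties text. Properties.load decodes backslash-u escapes
    // into the corresponding Unicode characters.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical file content in escaped (native2ascii) form.
        String fileContent =
            "key1=Deutsche Sprache schwere Sprache\n" +
            "key2=D\\u00FCsseldorf\n";
        Properties props = parse(fileContent);
        System.out.println(props.getProperty("key1"));
        System.out.println(props.getProperty("key2"));
    }
}
```

In an application the same data would normally be reached through `ResourceBundle.getBundle` rather than by loading the text directly; the direct load is used here only to keep the sketch self-contained.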

### .java ListResourceBundle files

Java `ListResourceBundle` files provide implementation subclasses of the
`ListResourceBundle` abstract base class. **They are Java code!** Source files are
.java files that are compiled as usual with the javac compiler. Syntactic rules
of Java apply. As Java source code, they can contain arbitrary Java objects and
can be nested.

Although the Java compiler allows specifying a charset on the command line, this
is uncommon, and .java resource bundle files are therefore usually encoded in
ISO 8859-1 with \u escapes like .properties files.

Defined at: http://java.sun.com/j2se/1.4/docs/api/java/util/ListResourceBundle.html

Example: (`example_de.java`)

```java
import java.util.ListResourceBundle;

public class example_de extends ListResourceBundle {
    // Returns the key-value pairs of this bundle.
    public Object[][] getContents() {
        return contents;
    }
    static final Object[][] contents={
        { "key1", "Deutsche Sprache " +
            "schwere Sprache" },
        { "key2", "Düsseldorf" }
    };
}
```
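
As a rough sketch of how such a bundle is read, the following self-contained example defines an equivalent bundle as a nested class and looks up a key. The names `ListBundleDemo`, `ExampleDe` and `lookup` are hypothetical; instantiating the bundle directly is a shortcut for illustration, whereas an application would typically go through `ResourceBundle.getBundle`.

```java
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

class ListBundleDemo {
    // Same contents as example_de above, nested here so the sketch fits in one file.
    static class ExampleDe extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                { "key1", "Deutsche Sprache " + "schwere Sprache" },
                { "key2", "D\u00FCsseldorf" }
            };
        }
    }

    // Looks up a string value by key; throws MissingResourceException if absent.
    static String lookup(String key) {
        ResourceBundle bundle = new ExampleDe();
        return bundle.getString(key);
    }

    public static void main(String[] args) {
        System.out.println(lookup("key1"));
        System.out.println(lookup("key2"));
    }
}
```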

ICU4J can also access the ICU4C resource bundles described in the next section,
using the API described in the [UResourceBundle](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/UResourceBundle.html) documentation.

## ICU4C

### .txt resource bundles

ICU4C natively uses a plain text source format with a nested structure that was
derived from Java `ListResourceBundle` .java files when the original ICU Java
class files were ported to C++. The ICU4C bundle format can of course contain
only data, not code, unlike .java files. Resource bundle source files are
compiled with the `genrb` tool into a binary runtime form (`.res` files) that is
portable among platforms with the same charset family (ASCII vs. EBCDIC) and
endianness.

Features:

1. Key-value pairs. Keys are strings of "invariant characters" - a portable subset of the ASCII graphic character repertoire. For details on "invariant characters", see the definition of the .txt file format (URL below) or [icu/source/common/unicode/utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html).

2. Values can be Unicode strings, integers, binaries (BLOBs), integer arrays (vectors), and nested structures. Nested structures are either arrays (position-indexed vectors) of values or "tables" of key-value pairs.

3. Values inside nested structures can be of any type allowed at the top level, arbitrarily deeply nested via arrays and tables.

4. Long strings can be split across lines: Adjacent strings (separated only by whitespace, including line breaks) are automatically concatenated at build time.

5. At runtime, when a top-level item is not found, ICU looks up the same key in the parent bundle as determined by the locale ID.

6. A value can also be an "alias", which is simply a reference to another bundle's item. This is to save space by storing large data pieces only once when they cannot be inherited along the locale ID hierarchy (e.g., collation data in ICU shared among zh_HK and zh_TW).

7. Source files can be in any charset. Unicode signature byte sequences are recognized automatically (UTF-8/16, SCSU, ...), otherwise the tool takes a charset name on the command line.

Defined at: [icu-docs/main/design/bnf_rb.txt](https://raw.githubusercontent.com/unicode-org/icu-docs/main/design/bnf_rb.txt)

To use with ICU4C, see the [Resource Bundle APIs](resources#resource-bundle-apis) section of this userguide.

Example: (`de.txt`)

```
de {
    key1 { "Deutsche Sprache "
            "schwere Sprache" }
    key2 { "Düsseldorf" }
}
```
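
The runtime fallback in feature 5 can be pictured as walking a chain of bundle names derived from the locale ID. The sketch below is a simplified illustration only: it assumes the parent is found by truncating the ID at its last underscore and that the chain ends at `root`. ICU's actual lookup may apply further rules that this sketch ignores, and `FallbackChainDemo` and `fallbackChain` are hypothetical names, not ICU API.

```java
import java.util.ArrayList;
import java.util.List;

class FallbackChainDemo {
    // Returns the bundle names that would be consulted for a locale ID,
    // most specific first, under the simplified truncation rule.
    static List<String> fallbackChain(String localeId) {
        List<String> chain = new ArrayList<>();
        // "root" has no parent, so it contributes only the final entry.
        String current = "root".equals(localeId) ? "" : localeId;
        while (!current.isEmpty()) {
            chain.add(current);
            int cut = current.lastIndexOf('_');
            current = (cut < 0) ? "" : current.substring(0, cut);
        }
        chain.add("root");
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(fallbackChain("de_DE")); // prints [de_DE, de, root]
    }
}
```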

### ICU4C XML resource bundles

The ICU4C XML resource bundle format was defined simply to express the same
capabilities of the .txt and binary ICU4C resource bundles in XML form. However,
we have decided to drop the format for lack of use and instead adopt standard
XLIFF format for localization. For more information on XLIFF format, see the
following section. For examples on using ICU tools to produce and read XLIFF
format, see the XLIFF Usage section in the [resource management chapter](resources#using-xliff-for-localization).

## XLIFF

The XML Localization Interchange File Format (XLIFF) is an emerging industry
standard "for the interchange of localization information". Version 1.1 is
available (2003-Oct-31), and 1.2 is almost complete (2007-Jan-20).

This is the result of a quick review of XLIFF and may need to be improved.

Features:

1.  Multiple resource bundles per XLIFF file are supported.

2.  Multiple languages per XLIFF file are supported.

3.  XLIFF provides a rich set of ways to communicate intent, types of items,
    etc. all the way from content creation to all stages and phases of
    translation.

4.  Nesting of values appears to not be supported.

5.  XLIFF is independent of actual build-time or runtime resource bundle
    formats. .xlf files must be converted to native formats at build time.

Defined at: http://www.oasis-open.org/committees/xliff/

Example: (`example.xlf`)

```xml
<?xml version="1.0" encoding="utf-8"?>
<xliff version = "1.1" xmlns='urn:oasis:names:tc:xliff:document:1.1'
xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xsi:schemaLocation='urn:oasis:names:tc:xliff:document:1.1
http://www.oasis-open.org/committees/xliff/documents/xliff-core-1.1.xsd'>
    <file xml:space = "preserve" source-language = "en" target-language = "sh"
    datatype = "x-icu-resource-bundle" original = "root.txt"
    date = "2007-08-17T21:17:08Z">
        <header>
            <tool tool-id = "genrb-3.3-icu-3.8" tool-name = "genrb"/>
        </header>
        <body>
            <group id = "root" restype = "x-icu-table">
                <trans-unit id = "optionMessage" resname = "optionMessage">
                    <source>unrecognized command line option:</source>
                    <target>nepoznata opcija na komandnoj liniji:</target>
                </trans-unit>
                <trans-unit id = "usage" resname = "usage">
                    <source>usage: ufortune [-v] [-l locale]</source>
                    <target>upotreba: ufortune [-v] [-l lokal]</target>
                </trans-unit>
            </group>
        </body>
    </file>
</xliff>
```
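
Because .xlf files must be converted to native formats at build time, a custom converter would first need to pull the key-value pairs out of the XML. The following is a minimal sketch of that extraction step using the JDK's DOM parser. It is not an ICU tool; `XliffTargetsDemo` and `readTargets` are hypothetical names, and the sketch assumes each `trans-unit` carries a `resname` attribute and at most one `target`, as in the example above.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XliffTargetsDemo {
    // Maps each trans-unit's resname to the text of its <target> element,
    // preserving document order. Units without a target are skipped.
    static Map<String, String> readTargets(String xliff) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xliff)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not parseable as XML", e);
        }
        Map<String, String> targets = new LinkedHashMap<>();
        NodeList units = doc.getElementsByTagName("trans-unit");
        for (int i = 0; i < units.getLength(); i++) {
            Element unit = (Element) units.item(i);
            NodeList target = unit.getElementsByTagName("target");
            if (target.getLength() > 0) {
                targets.put(unit.getAttribute("resname"),
                            target.item(0).getTextContent());
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Abbreviated version of example.xlf, inlined to keep the sketch self-contained.
        String xliff =
            "<xliff version='1.1' xmlns='urn:oasis:names:tc:xliff:document:1.1'>" +
            "<file source-language='en' target-language='sh'" +
            " datatype='x-icu-resource-bundle' original='root.txt'>" +
            "<body><group id='root' restype='x-icu-table'>" +
            "<trans-unit id='usage' resname='usage'>" +
            "<source>usage: ufortune [-v] [-l locale]</source>" +
            "<target>upotreba: ufortune [-v] [-l lokal]</target>" +
            "</trans-unit></group></body></file></xliff>";
        System.out.println(readTargets(xliff));
    }
}
```

The resulting map could then be written out in whichever native format the build needs, for example as .properties lines or as an ICU4C .txt table.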

For examples on using ICU tools to produce and read XLIFF format see the XLIFF
Usage (§) section in the [resource management chapter](resources#using-xliff-for-localization).

## DITA

The Darwin Information Typing Architecture (DITA) is "IBM's XML architecture for
topic-oriented information". It is a family of XML formats for several types of
publications including manuals and resource bundles. It is extensible. For
example, subformats can be defined by refining DTDs. One design feature is to
provide cross-document references for reuse of existing contents. For more
information see http://www.ibm.com/developerworks/xml/library/x-dita4/index.html

While it is certainly possible to define resource bundle formats via DTDs in the
DITA framework, there currently (2002-Nov-27) do not appear to be resource
bundle formats actually defined, or tools available specifically for them.

## Linux/gettext

The OpenI18N specification requires support for message handling functions
(mostly variants of `gettext()`) as defined in `libintl.h`. See Tables 3-5 and 3-6
and Annex C in http://www.openi18n.org/docs/html/LI18NUX-2000-amd4.htm

Resource bundles ("portable object files", extension .po) are plain text files
with key-value pairs for string values. The format and functions support a
simple selection of plural forms by associating integer values (via C language
expressions) with indexes of strings.

The `msgfmt` utility compiles .po files into "message object files" (extension
.mo). The charset is determined from the locale ID in `LC_CTYPE`. There are
additional supporting tools for .po files.

*Note: The OpenI18N specification also requires POSIX `gencat`/`catgets` support. See the [POSIX](#posixcatgets) (§) section below.*

Defined at: Annex C of the Li18nux-2000 specification, see above.

Example: (`example.po`)

```
domain "example_domain"
msgid "key1"
msgstr "Deutsche Sprache schwere Sprache"
msgid "key2"
msgstr "Düsseldorf"
```
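
The plural-form selection described above maps a count to an index into a list of strings. As a rough illustration in Java (not the gettext API itself), the sketch below hard-codes the two-form rule `n != 1`, which is an assumption for this example; in practice each catalog declares its own expression. The class name, method names and message strings are all hypothetical.

```java
class PluralIndexDemo {
    // Java equivalent of the C expression "n != 1": index 0 for exactly one,
    // index 1 for every other count.
    static int pluralIndex(long n) {
        return (n != 1) ? 1 : 0;
    }

    // Picks the string whose position matches the computed plural index.
    static String select(String[] forms, long n) {
        return forms[pluralIndex(n)];
    }

    public static void main(String[] args) {
        // Hypothetical message forms, singular first.
        String[] forms = { "%d file", "%d files" };
        for (long n = 0; n <= 2; n++) {
            System.out.println(String.format(select(forms, n), n));
        }
    }
}
```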

## POSIX/catgets

POSIX (The Open Group specification) defines message catalogs with the `catgets()`
C function and the `gencat` build-time tool. Message catalogs contain key-value
pairs where the keys are integers `1`..`NL_MSGMAX` (see `limits.h`), and the values
are strings. Strings can span multiple lines. The charset is determined from the
locale ID in `LC_CTYPE`.

Defined at:
https://pubs.opengroup.org/onlinepubs/009695399/utilities/gencat.html and
https://pubs.opengroup.org/onlinepubs/009695399/functions/catgets.html

Example: (`example.txt`)

```
1 Deutsche Sprache \
schwere Sprache
2 Düsseldorf
```
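
To make the integer keys and the line continuation in the example concrete, here is a small sketch that reads such a source text into a map. It is an illustration under simplifying assumptions, not a reimplementation of `gencat`: it handles only message lines of the form "number text" with a trailing backslash as continuation, and ignores directives and comments. `CatalogSourceDemo` and `parse` are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

class CatalogSourceDemo {
    // Parses message lines into a map from message number to text.
    // A line ending in a backslash is joined with the following line.
    static Map<Integer, String> parse(String source) {
        Map<Integer, String> messages = new TreeMap<>();
        StringBuilder pending = new StringBuilder();
        for (String line : source.split("\n")) {
            if (line.endsWith("\\")) {
                // Drop the backslash and keep collecting.
                pending.append(line, 0, line.length() - 1);
                continue;
            }
            pending.append(line);
            String full = pending.toString();
            pending.setLength(0);
            int space = full.indexOf(' ');
            if (space > 0) {
                messages.put(Integer.parseInt(full.substring(0, space)),
                             full.substring(space + 1));
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        String source =
            "1 Deutsche Sprache \\\n" +
            "schwere Sprache\n" +
            "2 D\u00FCsseldorf\n";
        System.out.println(parse(source));
    }
}
```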

## Windows

Windows uses a number of file formats depending on the language environment --
MSVC 6, Visual Basic, or Visual Studio .NET. The most well-known source formats
are the [.rc Resource](https://docs.microsoft.com/windows/win32/menurc/about-resource-files)
and [.mc Message](https://docs.microsoft.com/en-us/windows/win32/eventlog/message-files)
file formats. They both get compiled into .res files that are linked into
special sections of executables. Source formats can be UTF-16, while compiled
strings are (almost) always UTF-16 from .rc files (except for predefined
ComboBox strings) and can optionally be UTF-16 from .mc files.

.rc files carry key-value pairs where the keys are usually numeric but can be
strings. Values can be strings, string tables, or one of many Windows
GUI-specific structured types that compile directly into binary formats that the
GUI system interprets at runtime. .rc files can include C #include files for
#defined numeric keys. .mc files contain string values preceded by per-message
headers similar to the Linux/gettext() format. There is a special format of
messages with positional arguments, with printf-style formatting per argument.
In both .rc and .mc formats, Windows LCID values are defined to be set on the
compiled resources.

Developers and translators usually overlook the fact that binary resources are
included, and copy them into each translation, even though Windows, like
Java and ICU, uses locale ID fallback at runtime.

.rc and .mc files are tightly integrated with Microsoft C/C++, Visual Studio and
the Windows platform, but are not used on any other platforms.

A [sample Windows .rc file](#sample-windows-rc-file) (§) is at the end of this document.

## ICU tools

ICU 2.4 provides tools for conversion between resource bundle formats:

1.  ICU4C .txt -> ICU4C .res: Default operation of genrb (ICU 2.0 and before).

2.  ICU4C .txt -> ICU4C .xml: Option with genrb (ICU 2.4).

3.  ICU4C .txt -> Java ListResourceBundle .java format: Option with genrb (ICU
    2.2).
    Generates subclasses of ICUListResourceBundle to support non-string types.

4.  Java ListResourceBundle .java format -> ICU4C .txt: Use ICU4J 2.4's
    src/com/ibm/icu/dev/tools/localeconverter

5.  ICU4C .xml -> ICU4C .txt: There is a tool for this conversion, but it is not
    fully tested or documented. Please see the
    [XLIFF2ICUConverter](https://icu-project.org/download/xliff2icuconverter.html)
    tool.

There are currently no ICU tools for XLIFF.

### Converting de.txt to a ListResourceBundle

The following genrb invocation generates a ListResourceBundle from `de.txt` (see
the example file `de.txt` above):

`genrb -j -b TestName -p com.example de.txt`

The -j option causes .java output, -b is an arbitrary bundle name prefix, and -p
is an arbitrary package name. "Arbitrary" means "depends on your product" and
may be truly arbitrary if the generated .java files are not actually used in a
Java application. genrb auto-detects .txt files encoded in Unicode charsets like
UTF-8 or UTF-16 if they have a signature byte sequence ("BOM"). The .java output
file is in native2ascii format, i.e., it is encoded in US-ASCII with \u
escapes.

The output of the above genrb invocation is `TestName_de.java`:

```java
package com.example;
import java.util.ListResourceBundle;
import com.ibm.icu.impl.ICUListResourceBundle;
public class TestName_de extends ICUListResourceBundle {
    public TestName_de () {
        super.contents = data;
    }
    static final Object[][] data = new Object[][] {
        {
            "key1",
            "Deutsche Sprache schwere Sprache",
        },
        {
            "key2",
            "D\u00FCsseldorf",
        },
    };
}
```
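
The "native2ascii format" seen in the generated file, where `ü` appears as an escape, can be approximated with a few lines of Java. This is only a sketch of the escaping idea, assuming every UTF-16 code unit above U+007F is replaced by a four-digit uppercase hexadecimal escape; it is not the code that genrb or `native2ascii` actually run, and `Native2AsciiDemo` and `escape` are hypothetical names.

```java
class Native2AsciiDemo {
    // Replaces each char above U+007F with a backslash-u escape of its
    // UTF-16 code unit; ASCII characters pass through unchanged.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input string contains U+00FC (u with diaeresis).
        System.out.println(escape("D\u00FCsseldorf"));
    }
}
```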

### Converting a ListResourceBundle back to .txt

An ICUListResourceBundle .java file as generated in the previous example can be
converted to an ICU4C .txt file with the following steps:

1.  Compile the .java file, e.g. with `javac -d . TestName_de.java`. ICU4J needs
    to be on the classpath (or use the -classpath option). If the .java file is
    not in `native2ascii` format, then use the -encoding option (e.g. -encoding
    UTF-8). The -d option (specifying an output directory, in this example the
    current folder) is required. Without it, the Java compiler would not
    generate the com/example folder hierarchy that is required in the next step.

2.  You now have a .class file `com/example/TestName_de.class`.

3.  Invoke the ICU4J locale converter tool to generate ICU4C .txt format output for
    this .class file:

    `java -cp ;(folder to ICU4J)/icu4j.jar;(working folder for the previous steps); com.ibm.icu.dev.tool.localeconverter.ConvertICUListResourceBundle -icu -package com.example -bundle-name TestName de > de.txt`

    Note that the classpath must include the working folder for the previous
    steps (the folder that contains "com"). The package name (com.example),
    bundle name (TestName) and locale ID (de) must match the .java/.class files.
    Note also that the locale converter writes to the standard output; the
    command line above includes a redirection to de.txt.

The last step generates a new de.txt in `native2ascii` format:

```
de {
    key2{"D\u00FCsseldorf"}
    key1{"Deutsche Sprache schwere Sprache"}
}
```

## Further information

1.  TMX: "The purpose of TMX is to allow easier exchange of translation memory
    data between tools and/or translation vendors with little or no loss of
    critical data during the process."
    http://www.lisa.org/tmx/

2.  LISA: Localisation Industry Standards Association
    http://www.lisa.org/

## Sample Windows .rc file

This file (`winrc.rc`) was generated with MSVC 6, using the New Project wizard to
generate a simple "Hello World!" application, changing the LCIDs to German, then
adding the two example strings as above.

```
//Microsoft Developer Studio generated resource script.
//
#include "resource.h"
#define APSTUDIO_READONLY_SYMBOLS
/////////////////////////////////////////////////////////////////////////////
//
// Generated from the TEXTINCLUDE 2 resource.
//
#define APSTUDIO_HIDDEN_SYMBOLS
#include "windows.h"
#undef APSTUDIO_HIDDEN_SYMBOLS
#include "resource.h"
/////////////////////////////////////////////////////////////////////////////
#undef APSTUDIO_READONLY_SYMBOLS
/////////////////////////////////////////////////////////////////////////////
// German (Germany) resources
#if !defined(AFX_RESOURCE_DLL) || defined(AFX_TARG_DEU)
#ifdef _WIN32
LANGUAGE LANG_GERMAN, SUBLANG_GERMAN
#pragma code_page(1252)
#endif //_WIN32
/////////////////////////////////////////////////////////////////////////////
//
// Icon
//
// Icon with lowest ID value placed first to ensure application icon
// remains consistent on all systems.
IDI_WINRC ICON DISCARDABLE "winrc.ICO"
IDI_SMALL ICON DISCARDABLE "SMALL.ICO"
/////////////////////////////////////////////////////////////////////////////
//
// Menu
//
IDC_WINRC MENU DISCARDABLE
BEGIN
    POPUP "&File"
    BEGIN
        MENUITEM "E&xit", IDM_EXIT
    END
    POPUP "&Help"
    BEGIN
        MENUITEM "&About ...", IDM_ABOUT
    END
END
/////////////////////////////////////////////////////////////////////////////
//
// Accelerator
//
IDC_WINRC ACCELERATORS MOVEABLE PURE
BEGIN
    "?", IDM_ABOUT, ASCII, ALT
    "/", IDM_ABOUT, ASCII, ALT
END
/////////////////////////////////////////////////////////////////////////////
//
// Dialog
//
IDD_ABOUTBOX DIALOG DISCARDABLE 22, 17, 230, 75
STYLE DS_MODALFRAME | WS_CAPTION | WS_SYSMENU
CAPTION "About"
FONT 8, "System"
BEGIN
    ICON IDI_WINRC,IDC_MYICON,14,9,16,16
    LTEXT "winrc Version 1.0",IDC_STATIC,49,10,119,8,SS_NOPREFIX
    LTEXT "Copyright (C) 2002",IDC_STATIC,49,20,119,8
    DEFPUSHBUTTON "OK",IDOK,195,6,30,11,WS_GROUP
END
/////////////////////////////////////////////////////////////////////////////
//
// String Table
//
STRINGTABLE DISCARDABLE
BEGIN
IDS_APP_TITLE "winrc"
IDS_HELLO "Hello World!"
IDC_WINRC "WINRC"
IDS_SENTENCE "Deutsche Sprache schwere Sprache"
IDS_CITY "Düsseldorf"
END
#endif // German (Germany) resources
/////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////
// English (U.S.) resources
#if !defined(AFX_RESOURCE_DLL) || defined(AFX_TARG_ENU)
#ifdef _WIN32
LANGUAGE LANG_ENGLISH, SUBLANG_ENGLISH_US
#pragma code_page(1252)
#endif //_WIN32
#ifdef APSTUDIO_INVOKED
/////////////////////////////////////////////////////////////////////////////
//
// TEXTINCLUDE
//
2 TEXTINCLUDE DISCARDABLE
BEGIN
    "#define APSTUDIO_HIDDEN_SYMBOLS\r\n"
    "#include ""windows.h""\r\n"
    "#undef APSTUDIO_HIDDEN_SYMBOLS\r\n"
    "#include ""resource.h""\r\n"
    "\0"
END
3 TEXTINCLUDE DISCARDABLE
BEGIN
    "\r\n"
    "\0"
END
1 TEXTINCLUDE DISCARDABLE
BEGIN
    "resource.h\0"
END
#endif // APSTUDIO_INVOKED
#endif // English (U.S.) resources
/////////////////////////////////////////////////////////////////////////////
#ifndef APSTUDIO_INVOKED
/////////////////////////////////////////////////////////////////////////////
//
// Generated from the TEXTINCLUDE 3 resource.
//
/////////////////////////////////////////////////////////////////////////////
#endif // not APSTUDIO_INVOKED
```