---
layout: default
title: Transforms
nav_order: 4
parent: Transforms
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# General Transforms
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

General transforms provide a general-purpose package for processing Unicode
text. They are a powerful and flexible mechanism for handling a variety of
different tasks, including:

1.  Uppercase, Lowercase, Titlecase, Full/Halfwidth conversions
2.  Normalization
3.  Hex and Character Name conversions
4.  Script to Script conversion

Originally, Transforms were designed to convert characters from one script to
another (for example, from Greek to Latin, or Japanese Katakana to Latin). This
is still reflected in the class name, which remains **Transliterator**. However,
the services performed by that class now represent a much more general mechanism
capable of handling a much broader range of tasks. In particular, the Transforms
include pre-built transformations for case conversions, for normalization
conversions, for the removal of given characters, and also for a variety of
language and script transliterations. Transforms can be chained together to
perform a series of operations and each step of the process can use a UnicodeSet
to restrict the characters that are affected.

For example, to remove accents from characters, use the following transform:

```
NFD; [:Nonspacing Mark:] Remove; NFC
```

This transform separates accents from their base characters, removes the
accents, and then recomposes the remaining text into an unaccented form.

A transliteration can be applied either to a complete string of text or
incrementally, for typing or buffering input. In the latter case, the
transform delays the conversion of characters until their mapping is
unambiguous. Transliterators can also be used with more complex text,
such as styled text, to maintain the style information where possible. For
example, "~~Αλφαβ~~ητικός" will retain the strikethrough in transliterating to
"~~Alphab~~ētikós".

> :point_right: **Note**: *The transliteration process retains not only font size, but also other
characteristics such as font type and color.*

For an online demonstration of ICU transliteration, see
<http://demo.icu-project.org/icu-bin/translit>.

## Script Transliteration

Script Transliteration is the general process of converting characters from one
script to another. For example, it can convert characters from Greek to Latin,
or Japanese katakana to Latin. The user must understand that script
transliteration is not translation. Rather, script transliteration is the
conversion of letters from one script to another without translating the
underlying words. The following shows a sample of script transliteration:

| Source | Transliteration |
|---|---|
| キャンパス | kyanpasu |
| Αλφαβητικός Κατάλογος | Alphabētikós Katálogos |
| биологическом | biologichyeskom |

> :point_right: **Note**: *Some of the characters may not be
visible on the screen unless you have a Unicode font with all the Greek letters.
If you have a licensed copy of Microsoft® Office, you can use the "Arial Unicode
MS" font, or you can download the [CODE2000](http://www.code2000.net/) font for
free. For more information, see [Display
Problems?](http://www.unicode.org/help/display_problems.html) on the Unicode web
site.*

While the user may not recognize that the Japanese word "kyanpasu" is equivalent
to the English word "campus," it is easier to recognize and interpret the word
in text than if the letters were left in the original script. There are several
situations where this transliteration is especially useful. For example, when a
user views names that are entered in a world-wide database, it is extremely
helpful to view and refer to the names in the user's native script. It is also
useful for product support. For example, if a service engineer is sent a program
dump that is filled with characters from foreign scripts, it is much easier to
diagnose the problem when the text is transliterated and the service engineer
can recognize the characters. Also, when the user performs searching and
indexing tasks, transliteration can retrieve information in a different script.
The following shows these retrieval capabilities:

| Source | Transliteration |
|---|---|
| 김, 국삼 | Gim, Gugsam |
| 김, 명희 | Gim, Myeonghyi |
| 정, 병호 | Jeong, Byeongho |
| ... | ... |
| たけだ, まさゆき | Takeda, Masayuki |
| ますだ, よしひこ | Masuda, Yoshihiko |
| やまもと, のぼる | Yamamoto, Noboru |
| ... | ... |
| Ρούτση, Άννα | Roútsē, Ánna |
| Καλούδης, Χρήστος | Kaloúdēs, Chrḗstos |
| Θεοδωράτου, Ελένη | Theodōrátou, Elénē |

Transliteration can also be used to convert unfamiliar letters within the same
script, such as converting Icelandic THORN (þ) to th.

## Transliterator Identifiers

Transliterators are not created directly using C++ or Java constructors.
Instead, they are created by giving an **identifier** (a name string in a
specific format) to one of the Transliterator factory methods, such as
`Transliterator.getInstance()` (Java) or `Transliterator::createInstance()`
(C++). The following are some examples of identifiers:

1.  `Latin-Cyrillic`
2.  `[:Lu:] Latin-Greek (Greek-Latin/UNGEGN)`
3.  `[A-Za-z]; Lower(); Latin-Katakana; Katakana-Hiragana; ([:Hiragana:])`

It is important to understand identifiers and their syntax, since it is through
the use of identifiers that one creates transforms, restricts their effective
range, and combines them together. This section describes transform identifiers
in detail. Throughout this section, it is important to distinguish between
**identifiers** and the **actual transforms** that they refer to. All actual
transforms are named by well-formed identifiers, but not all well-formed
identifiers refer to actual transforms. An analogy is C++ method names. I can
write the syntactically well-formed method name "void
Cursor::getPosition(Position& pos)", but whether or not this refers to an actual
method in an actual class is a different matter.

### Basic IDs

The simplest identifier is a 'basic ID'.

```
basicID := (<source> "-")? <target> ("/" <variant>)?
```

A basic ID typically names a source and target. In "Katakana-Latin", "Katakana"
is the source and "Latin" is the target. The source specifier describes the
characters or strings that the transform will modify. The target specifier
describes the result of the modification. If the source is not given, then the
source is "Any", the set of all characters. Source and Target specifiers can be
[Script IDs](http://www.unicode.org/cldr/utility/properties.jsp#Script) (long like
"Latin" or short like "Latn"), [Unicode language
Identifiers](http://www.unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers)
(like fr, en_US, or zh_Hant), or special tags (like Any or Hex). For example:

1.  Katakana-Latin
2.  Null
3.  Hex-Any/Perl
4.  Latin-el
5.  Greek-en_US/UNGEGN
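
The basic ID grammar above can be sketched as a small parser. This is only an illustration using the C++ standard library, not ICU's own parser; the names `BasicId` and `parseBasicId` are hypothetical, and the sketch assumes the source and target themselves contain no `-` or `/`.

```cpp
#include <string>

// Parts of a basic ID: (<source> "-")? <target> ("/" <variant>)?
struct BasicId {
    std::string source;   // "Any" when the ID names no source
    std::string target;
    std::string variant;  // empty when the ID names no variant
};

// Splits a basic ID such as "Greek-en_US/UNGEGN" into its parts.
BasicId parseBasicId(const std::string& id) {
    BasicId parts;
    std::string rest = id;

    // Everything after the first '/' is the variant.
    const std::size_t slash = rest.find('/');
    if (slash != std::string::npos) {
        parts.variant = rest.substr(slash + 1);
        rest = rest.substr(0, slash);
    }

    // A '-' separates source from target; without it the source is "Any".
    const std::size_t dash = rest.find('-');
    if (dash != std::string::npos) {
        parts.source = rest.substr(0, dash);
        parts.target = rest.substr(dash + 1);
    } else {
        parts.source = "Any";
        parts.target = rest;
    }
    return parts;
}
```

For instance, `parseBasicId("Null")` yields the source "Any", the target "Null", and an empty variant.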

Some basic IDs contain a further specifier following a forward slash. This is
the variant, and it further specifies the transform when several versions of a
single transformation are possible. For example, ICU provides several transforms
that convert from Unicode characters to escaped representations. These include
standard Unicode syntax "U+4E01", Perl syntax "\\x{4E01}", XML syntax
"\&#x4E01;", and others. The transforms for these operations are named
"Any-Hex/Unicode", "Any-Hex/Perl", and "Any-Hex/XML", respectively. If no
variant is specified, then the default variant is selected. In the example of
"Any-Hex", this is the Java variant (for historical reasons), so "Any-Hex" is
equivalent to "Any-Hex/Java".
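
To make the role of the variant concrete, the following sketch formats one code point in the escape styles named above. It is a standard-library illustration rather than ICU code: `escapeCodePoint` is a hypothetical name, only code points in the Basic Multilingual Plane are considered, and an empty variant is treated as "Java", as the text describes.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a BMP code point in the style of the named Any-Hex variant.
// An unrecognized or empty variant falls back to the Java style.
std::string escapeCodePoint(std::uint32_t cp, const std::string& variant) {
    char buf[32];
    const unsigned value = static_cast<unsigned>(cp);
    if (variant == "Unicode") {
        std::snprintf(buf, sizeof buf, "U+%04X", value);    // U+4E01
    } else if (variant == "Perl") {
        std::snprintf(buf, sizeof buf, "\\x{%X}", value);   // \x{4E01}
    } else if (variant == "XML") {
        std::snprintf(buf, sizeof buf, "&#x%X;", value);    // &#x4E01;
    } else {
        std::snprintf(buf, sizeof buf, "\\u%04X", value);   // \u4E01
    }
    return buf;
}
```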

### Filtered IDs

A filtered ID is a basic ID constrained by a filter. For example, to specify a
transform that converts only ASCII vowels to uppercase, use the ID:

```
[aeiou] Upper
```

The filter is a valid UnicodeSet pattern prefixed to the basic ID. Only
characters within the set will be modified by the transform. Some transforms are
only useful with filters, for example, the Remove transform, which deletes all
input characters. Specifying `"[:Nonspacing Mark:] Remove"` gives a transform
that removes non-spacing marks from input text.
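
The effect of a filter can be modeled in a few lines. The sketch below is an ASCII-only illustration built on the standard library, not ICU's UnicodeSet machinery: the filter is a plain string of allowed characters, and `applyFiltered`, `upperChar`, and `removeChar` are hypothetical helper names.

```cpp
#include <cctype>
#include <functional>
#include <string>

// Applies `transform` only to characters contained in `filter`; every other
// character passes through unchanged. Models an ID such as "[aeiou] Upper".
std::string applyFiltered(const std::string& text, const std::string& filter,
                          const std::function<std::string(char)>& transform) {
    std::string out;
    for (char c : text) {
        if (filter.find(c) != std::string::npos) {
            out += transform(c);   // inside the filter: transform it
        } else {
            out += c;              // outside the filter: copy as is
        }
    }
    return out;
}

// Toy stand-in for "Upper" on a single ASCII character.
std::string upperChar(char c) {
    return std::string(1, static_cast<char>(std::toupper(static_cast<unsigned char>(c))));
}

// Toy stand-in for "Remove": every character it sees is deleted.
std::string removeChar(char) {
    return std::string();
}
```

With these helpers, `applyFiltered("hello there", "aeiou", upperChar)` uppercases only the vowels, and pairing `removeChar` with a narrow filter deletes just the filtered characters.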

> :point_right: **Note**: *As of ICU 2.0, the filter pattern must be enclosed in brackets. Perl-syntax
patterns of the form `"\p{Lu}"` cannot be used directly; instead they must be
enclosed, e.g. `"[\p{Lu}]"`.*

### Inverses

Any transform ID can be modified to form an "inverse" ID. This is the ID of a
related transform that performs an inverse operation. For basic IDs, this is
done by exchanging the source and target names. For example, the inverse of
"Latin-Greek/UNGEGN" is "Greek-Latin/UNGEGN", and vice versa. The variant, if
any, is unaffected.

If there is no named source, the same rule still applies, using the implicit
source "Any". So the inverse of "Hex/Perl" is "Hex-Any/Perl", since the former
is really shorthand for "Any-Hex/Perl".
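
The source/target exchange described above can be written down directly. This is a minimal standard-library sketch, not ICU's implementation; `invertBasicId` is a hypothetical name, and the function deliberately knows nothing about special inverse pairs such as "Lower" and "Upper".

```cpp
#include <string>

// Inverts a basic ID by swapping source and target and keeping the variant.
// A missing source is treated as the implicit "Any".
std::string invertBasicId(const std::string& id) {
    std::string body = id;
    std::string variant;   // kept with its leading '/'

    const std::size_t slash = body.find('/');
    if (slash != std::string::npos) {
        variant = body.substr(slash);
        body = body.substr(0, slash);
    }

    std::string source = "Any";
    std::string target = body;
    const std::size_t dash = body.find('-');
    if (dash != std::string::npos) {
        source = body.substr(0, dash);
        target = body.substr(dash + 1);
    }
    return target + "-" + source + variant;
}
```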

The notion of inverses carries two important caveats. The first involves the
semantics of inverses. Consider a transform "A-B". Its inverse, "B-A", is
thought of as reversing the transformation accomplished by "A-B". The degree and
completeness of the reversal, however, is not guaranteed.

For example, consider the "Lower" transform. It has an inverse of "Upper" (this
is a special, non-standard inverse relationship that the transliteration service
knows about). Applying "Lower" to the string "Hello There" yields the string
"hello there". Applying "Upper" to this result then yields "HELLO THERE", which
is not the same as the original string.

Complete and exact reversal **is** possible if the transform has been explicitly
designed to support this. Examples of transforms that support this are "Any-Hex"
and "SCRIPT-Latin", where SCRIPT is a supported transliteration script. The
"SCRIPT-Latin" transforms support exact reversal of well-formed text in SCRIPT
to Latin (via "SCRIPT-Latin") and back to SCRIPT (via "Latin-SCRIPT"). This is
called **round-trip integrity**. They do not, however, support round-trip
integrity from Latin to SCRIPT and back to Latin.

> :point_right: **Note**: *Do not assume that a transform's inverse will provide a complete or exact
reversal.*

The second caveat with inverses has to do with existence. Although any ID can be
inverted, this does not guarantee that the inverse ID actually exists. For
example, if I create a custom transliterator `Latin-Antarean` and register it with
the system, I can then pass the string "Latin-Antarean" to `createInstance()` or
`getInstance()` to get that transform. If I then ask for its inverse, however, the
request will fail, since I have not created and registered "Antarean-Latin" with
the system.

> :point_right: **Note**: *Any transform ID can be inverted, but the inverse ID may not name an actual
registered transform.*

### Custom Inverses

Consider the transforms "Any-Lower" and "Any-Upper": It is convenient to
associate these as inverses of one another. However, using the standard
procedure for ID inversion on "Any-Lower" yields "Lower-Any", which is not what
we want. To override the standard ID inversion, the inverse ID can be explicitly
stated within the ID string as follows:

`"Any-Lower (Any-Upper)"` or equivalently `"Lower (Upper)"`

When this ID is inverted, the result is "Any-Upper (Any-Lower)". Using this
mechanism, the user can form arbitrary inverse relations when necessary.

When using custom inverses of the form "A-B (C-D)", either "A-B" or "C-D" may be
empty. An empty element is the same as "Null". That is, "A-B ( )" is the same as
"A-B (Null)", and it inverts to the null transform (which does nothing). The
null transform it inverts to has the ID "(A-B)", also written "Null (A-B)", and
inverts back to "A-B ( )". Note that "A-B ( )" is very different from both "A-B"
and "(A-B)":

| ID | Inverse of ID |
|---|---|
| A-B | B-A |
| A-B ( ) | (A-B) |
| (A-B) | A-B ( ) |

For some system transforms, special inverse mappings exist automatically. These
mappings are symmetrical, that is, the right column is the inverse of the left
column, and vice versa. The mappings are:

| | |
|---|---|
| Any-Null | Any-Null |
| Any-NFD | Any-NFC |
| Any-NFKD | Any-NFKC |
| Any-Lower | Any-Upper |

In other words, writing "Any-NFD" is exactly equivalent to writing "Any-NFD
(Any-NFC)" since the system maps the former to the latter internally. However,
one can still alter the mapping of these transforms by specifying an explicit
custom inverse, e.g. "NFD (Lower)".
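
The custom-inverse notation can be sketched as follows. This is a standard-library illustration of the inversion rule only, not ICU code: `invertSingleId` and its helpers are hypothetical names, the automatic special mappings (such as NFD to NFC) are deliberately left out, and empty parentheses are written `()` without an inner space, as in `Lower()`.

```cpp
#include <string>

// Removes leading and trailing spaces.
static std::string trim(const std::string& s) {
    const std::size_t first = s.find_first_not_of(' ');
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

// Standard inversion of a basic ID: swap source and target, keep the variant.
static std::string swapSourceAndTarget(const std::string& id) {
    std::string body = id;
    std::string variant;
    const std::size_t slash = body.find('/');
    if (slash != std::string::npos) {
        variant = body.substr(slash);
        body.erase(slash);
    }
    const std::size_t dash = body.find('-');
    if (dash == std::string::npos) return body + "-Any" + variant;
    return body.substr(dash + 1) + "-" + body.substr(0, dash) + variant;
}

// Inverts a single ID, honoring an explicit custom inverse in parentheses.
std::string invertSingleId(const std::string& id) {
    const std::size_t open = id.find('(');
    if (open == std::string::npos) return swapSourceAndTarget(trim(id));

    const std::size_t close = id.find(')', open);
    const std::string forward = trim(id.substr(0, open));
    const std::string inverse = trim(id.substr(open + 1, close - open - 1));

    if (inverse.empty()) return "(" + forward + ")";   // "A-B ( )" -> "(A-B)"
    if (forward.empty()) return inverse + "()";        // "(A-B)"   -> "A-B()"
    return inverse + " (" + forward + ")";             // swap the two halves
}
```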

### Compound IDs

Transliterators are often combined in sequence to achieve a desired
transformation. This is analogous to the composition of mathematical functions.
For example, given a rule set that converts lowercase ASCII characters from Latin
script to Katakana script, it is convenient to first (1) separate input base
characters and accents, and then (2) convert uppercase to lowercase. (Katakana
is caseless, so it is best to write rules that operate only on the lowercase
Latin base characters and produce corresponding Katakana.) To achieve this, a
**compound transform** can be specified as follows:

```
NFKD; Lower; Latin-Katakana;
```

(In real life, we would probably use "NFD", but we use "NFKD" for explanatory
purposes here.) It is also desirable to modify only Latin script characters. To
do so, a filter may be prefixed to the entire compound transform. This is called
a **global filter** to distinguish it from filters on the individual transforms
within the compound:

```
[:Latin:]; NFKD; Lower; Latin-Katakana;
```

The inverse of such a transform is formed by reversing the list and inverting
each element. In this example, this would be:

```
Katakana-Latin; Upper; NFKC; ([:Latin:]);
```

Note that two special mappings take effect: "Lower" to "Upper" and "NFKD" to
"NFKC". Note also that the global filter is enclosed in parentheses, rendering
it inoperative in the reverse direction.

In this example we probably don't really want to map Latin characters to
uppercase in the reverse direction, so we need to modify the original transform
as follows:

```
[:Latin:]; NFKD; Lower(); Latin-Katakana;
```

Recall that the empty parentheses in "Lower ( )" are shorthand for "Lower
(Null)" where "Null" is the null transform, that is, the transform that leaves
text unchanged. The inverse of this is "Null (Lower)", also written "(Lower)".
Now the inverse of the entire compound is:

```
Katakana-Latin; (Lower); NFKC; ([:Latin:]);
```

This still isn't quite right, since we really want to recompose our output, in
both directions. We also want to only touch Katakana characters in the reverse
direction. Our final example, modified to address these two concerns, is as
follows:

```
[:Latin:]; NFKD; Lower(); Latin-Katakana; NFC; ([:Katakana:]);
```

This inverts to:

```
[:Katakana:]; NFD; Katakana-Latin; (Lower); NFKC; ([:Latin:]);
```

(In real life, we would probably use only "NFD" and "NFC", but we use the
compatibility normalizers in this example so they can be distinguished.)
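
The inversion procedure for compound IDs (reverse the list, invert each element, and exchange the global filter with the parenthesized reverse filter) can be sketched as below. This is an illustration with the standard library, not ICU's implementation. `invertCompoundId` and its helpers are hypothetical names, and the sketch assumes that elements carry no per-element filters and that filter patterns contain no semicolons.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Removes leading and trailing spaces.
static std::string trim(const std::string& s) {
    const std::size_t first = s.find_first_not_of(' ');
    if (first == std::string::npos) return "";
    return s.substr(first, s.find_last_not_of(' ') - first + 1);
}

// Inverts one element: explicit custom inverse first, then the special
// pairs, then the plain source/target swap.
static std::string invertElement(const std::string& id) {
    static const std::map<std::string, std::string> special = {
        {"Null", "Null"}, {"NFD", "NFC"},     {"NFC", "NFD"},    {"NFKD", "NFKC"},
        {"NFKC", "NFKD"}, {"Lower", "Upper"}, {"Upper", "Lower"}};
    const std::size_t open = id.find('(');
    if (open != std::string::npos) {
        const std::size_t close = id.find(')', open);
        const std::string forward = trim(id.substr(0, open));
        const std::string inverse = trim(id.substr(open + 1, close - open - 1));
        if (inverse.empty()) return "(" + forward + ")";
        if (forward.empty()) return inverse + "()";
        return inverse + " (" + forward + ")";
    }
    const std::string bare = id.rfind("Any-", 0) == 0 ? id.substr(4) : id;
    const auto hit = special.find(bare);
    if (hit != special.end()) return hit->second;
    const std::size_t slash = id.find('/');
    const std::string variant = slash == std::string::npos ? "" : id.substr(slash);
    const std::string body = id.substr(0, slash);
    const std::size_t dash = body.find('-');
    if (dash == std::string::npos) return body + "-Any" + variant;
    return body.substr(dash + 1) + "-" + body.substr(0, dash) + variant;
}

// Inverts a compound ID written as "element; element; ...;".
std::string invertCompoundId(const std::string& id) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (start < id.size()) {
        std::size_t end = id.find(';', start);
        if (end == std::string::npos) end = id.size();
        const std::string part = trim(id.substr(start, end - start));
        if (!part.empty()) parts.push_back(part);
        start = end + 1;
    }
    std::string forwardFilter, reverseFilter;
    if (!parts.empty() && parts.front().front() == '[' && parts.front().back() == ']') {
        forwardFilter = parts.front();   // global filter, e.g. [:Latin:]
        parts.erase(parts.begin());
    }
    if (!parts.empty() && parts.back().rfind("([", 0) == 0 && parts.back().back() == ')') {
        reverseFilter = parts.back().substr(1, parts.back().size() - 2);
        parts.pop_back();
    }
    std::reverse(parts.begin(), parts.end());
    std::string out;
    if (!reverseFilter.empty()) out += reverseFilter + "; ";
    for (const std::string& part : parts) out += invertElement(part) + "; ";
    if (!forwardFilter.empty()) out += "(" + forwardFilter + "); ";
    if (!out.empty()) out.pop_back();    // drop the trailing space
    return out;
}
```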

Compound IDs are the most complex identifiers that can be formed. Many system
transforms are actually compound transforms that have been aliased to basic IDs.
It is also possible to write a transform rule with embedded instructions for
generating a compound transform; system transforms use this approach as well.

### Formal ID Syntax

Here is a formal description of the identifier syntax. The 'ID' entity can be
passed to `getInstance()` or `createInstance()`.

| ID | := Single_ID \| Compound_ID |
|---|---|
| Single_ID | := filter? Basic_ID ( '(' Basic_ID? ')' )? \| filter? '(' Basic_ID ')' |
| Compound_ID | := ( filter ';' )? ( Single_ID ';' )+ ( '(' filter ');' )? |
| Basic_ID | := Spec \| Spec '-' Spec \| Spec '/' Identifier \| Spec '-' Spec '/' Identifier |
| Spec | := script-name \| locale-name \| Identifier |
| Identifier | := identifier-start identifier-part\* |

Elements enclosed in single quotes are literals. Parentheses group elements.
Vertical bars represent exclusive alternatives. The '?' suffix repeats the
preceding element zero or one times. The '+' suffix repeats the preceding
element one or more times.

A 'script-name' is a string acceptable to the UScript API that specifies a
script. It may be a full script name such as "Latin" or a script abbreviation
such as "Latn". A 'locale-name' is a standard locale name such as "hi_IN". The
'identifier-start' and 'identifier-part' elements are characters defined by the
UCharacter API to start and continue identifier names. Finally, 'filter' is a
valid UnicodeSet pattern.

> :point_right: **Note**: *As of ICU 2.0, the filter must be enclosed in brackets. Top-level Perl-style
patterns are unsupported in 2.0.*

## ICU Transliterators

Currently, there are a number of basic transliterations supplied with ICU. The
following table shows these basic transforms:

### General

| | |
|---|---|
| → Any-Null | Has no effect; leaves input text unchanged. |
| → Any-Remove | Deletes input characters. This is useful when combined with a filter that restricts the characters to be deleted. |
| → Any-Lower, Any-Upper, Any-Title | Converts to the specified case. See [Case Mappings](../casemappings.md) for more information. |
| → Any-NFD, Any-NFC, Any-NFKD, Any-NFKC, Any-FCD, Any-FCC | Converts to the specified normalized form. See [Normalization](../normalization/index.md) for more information. |
| Any-Name | Converts between characters and their Unicode names in curly braces using Perl syntax. For example: ., ↔ \\N{FULL STOP}\\N{COMMA} |
| Any-Hex | Converts between characters and their Unicode code point values. For example: ., ↔ \\u002E\\u002C. Any-Hex/XML uses the `&#xXXXX;` format. For example: ., ↔ `&#x2E;&#x2C;`. Variants include Any-Hex/C, Any-Hex/Java, Any-Hex/Perl, Any-Hex/XML, and Any-Hex/XML10. Any-Hex, with no variant, is equivalent to Any-Hex/Java, for historical reasons. |
| → Any-Accents | Lets you type e- for e-macron, etc. For example: o' → ó |
| Any-Publishing | Converts between real punctuation and typewriter punctuation. For example: “a” — ‘b’ ↔ "a" -- 'b' |
| → Latin-ASCII | Converts non-ASCII-range punctuation, symbols, and Latin letters to an approximate ASCII-range equivalent. For example: « → '<<', © → '(C)', Æ → AE. Can be combined with Any-Latin to produce a transform that will convert as much as possible to an ASCII-range representation: “Any-Latin; Latin-ASCII”. |
| IPA-XSampa | Converts between IPA characters and the XSampa ASCII-range representation of IPA characters. |
| Fullwidth-Halfwidth | Converts between narrow or half-width characters and full-width. For example: ｱﾙｱﾉﾘｳ tech ↔ アルアノリウ ｔｅｃｈ |
| Latin-NumericPinyin | Converts between a Pinyin Latin representation using tone marks and one using numeric tone indicators. |

### Script/Language

The ICU script/language transforms are based on common standards for the
particular scripts, where possible. In some cases, the transforms are augmented
to support reversibility.

> :point_right: **Note**: *Standard transliteration methods often do not follow the pronunciation rules of
any particular language in the target script. For more information on the design
of transliterations, see the Guidelines for Script/Language Transliterations
section.*

The built-in script transforms are:

| | |
|---|---|
| Latin | Arabic, Armenian, Bopomofo, Cyrillic, Georgian, Greek (with UNGEGN variant), Han (with Names variant → Latin), Hangul, Hebrew, Hiragana, Indic, Jamo, Katakana, Syriac, Thaana, Thai |
| Indic | Indic |
| Hiragana | Katakana |
| Simplified (Hans) | Traditional (Hant) |

Indic includes Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil,
and Telugu. ICU can transliterate from Latin to any of these scripts and back,
and from any Indic script to any other Indic script. For example, you can
transliterate from Kannada to Gujarati, or from Latin to Oriya.

In addition, ICU may supply transliterations that are specific to language
pairs, or between a language and a script. For example, ICU could have a ru-en
(Russian-English) transform.

As with locales, there is a fallback mechanism. If the Russian-English transform
is requested and is not available, then ICU will search for a Russian-Latin
transform. If the Russian-Latin transform is not available, ICU will search for
a Cyrillic-Latin transform.
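
The fallback order just described can be sketched as a list of candidate IDs to try in turn. This standard-library sketch only mirrors the three steps named in the text and is not ICU's lookup code; `Side` and `fallbackCandidates` are hypothetical names, and the caller is assumed to supply the script that each language is written in.

```cpp
#include <string>
#include <vector>

// One side of a language-pair request: a language code plus its script.
struct Side {
    std::string language;   // e.g. "ru"
    std::string script;     // e.g. "Cyrillic"
};

// Lists the transform IDs to try, most specific first.
std::vector<std::string> fallbackCandidates(const Side& source, const Side& target) {
    return {
        source.language + "-" + target.language,   // e.g. ru-en
        source.language + "-" + target.script,     // e.g. ru-Latin
        source.script + "-" + target.script,       // e.g. Cyrillic-Latin
    };
}
```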

For information on the precise makeup of each of the script transforms, see
the Script Transliterator Sources section below.

## Guidelines for Script/Language Transliterations

There are a number of generally desirable guidelines for script
transliterations. These guidelines are rarely satisfied simultaneously, so
constructing a reasonable transliteration is always a process of balancing
different requirements. These requirements are most important for people who are
building transliterations, but are also useful as background information for
users. The following lists the general guidelines for transliterations:

1.  complete: every well-formed sequence of characters in the source script
    should transliterate to a sequence of characters from the target script.

2.  predictable: the letters themselves (without any knowledge of the languages
    written in that script) should be sufficient for the transliteration, based
    on a relatively small number of rules. This allows the transliteration to be
    performed mechanically.

3.  pronounceable: transliteration is not as useful if the process simply maps
    the characters without any regard to their pronunciation. Simply mapping
    "αβγδεζηθ..." to "abcdefgh..." would yield strings that might be complete
    and unambiguous, but cannot be pronounced.

4.  unambiguous: it is always possible to recover the text in the source script
    from the transliteration in the target script. Someone who knows the
    transliteration rules will be able to recover the precise spelling of the
    original source text (for example, it is possible to go from Elláda back to
    the original Ελλάδα). It is possible to define a reverse (or inverse)
    mapping. Thus, this property is sometimes called reversibility (or
    invertibility).

### Ambiguity

In transliteration, multiple characters may produce ambiguities unless the rules
are carefully designed. For example, the Greek character PSI (ψ) maps to ps, but
ps could also (theoretically) result from the sequence PI, SIGMA (πσ) since PI
(π) maps to p and SIGMA (σ) maps to s.

The Japanese transliteration standards provide a good mechanism for handling
similar ambiguities. Using the Japanese transliteration standards, whenever an
ambiguous sequence in the target script does not result from a single letter,
the transform uses an apostrophe to disambiguate it. For example, it uses that
procedure to distinguish between man'ichi and manichi. Using this procedure, the
Greek sequence PI SIGMA (πσ) maps to p's. This method is recommended for all
script transliteration methods.
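
The apostrophe technique can be illustrated with a toy transliterator that knows only the three Greek letters discussed above. It is a standard-library sketch, not the ICU Greek-Latin rules; `greekToLatin` is a hypothetical name, and any other character is simply written as `?`.

```cpp
#include <string>

// Transliterates a tiny Greek subset (π, σ, ψ) to Latin, inserting an
// apostrophe wherever two letters would otherwise read back as one.
std::string greekToLatin(const std::u32string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char32_t c = text[i];
        if (c == U'\u03C8') {            // ψ PSI
            out += "ps";
        } else if (c == U'\u03C0') {     // π PI
            out += "p";
            // "p" followed by "s" would be read back as ψ, so mark the boundary.
            if (i + 1 < text.size() && text[i + 1] == U'\u03C3') out += "'";
        } else if (c == U'\u03C3') {     // σ SIGMA
            out += "s";
        } else {
            out += '?';                  // outside this toy subset
        }
    }
    return out;
}
```

Because ψ yields "ps" while πσ yields "p's", the Latin output can be mapped back to exactly one Greek spelling.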

> :point_right: **Note**: *Some characters in a target script are not normally found outside of certain
contexts. For example, the small Japanese "ya" character, as in "kya" (キャ), is
not normally found in isolation. To handle such characters, ICU uses a tilde.
For example, to display an isolated small "ya", type "~ya". To represent a
non-final Greek sigma (ασ) at the end of a word, use "a~s". To represent a final
sigma in a non-final position (ςα), type "~sa".*

For the general script transforms, a common technique for reversibility is to
use extra accents to distinguish between letters that may not be otherwise
distinguished. For example, the following shows Greek text that is mapped to
fully reversible Latin:

> **`Greek-Latin`**
> | | |
> |---|---|
> | τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. | tí phḗis; graphḕn sé tis, hōs éoike, gégraptai: ou gàr ekeînó ge katagnṓsomai, hōs sỳ héteron. |

If the user wants a version without certain accents, then a transform can be
used to remove the accents. For example, the following transliterates to Latin
but removes the macron accents on the long vowels.

> **`Greek-Latin; nfd; [\u0304] remove; nfc`**
> | | |
> |---|---|
> | τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. | tí phéis; graphèn sé tis, hos éoike, gégraptai: ou gàr ekeînó ge katagnósomai, hos sỳ héteron. |

The following transliterates to Latin but removes all accents:

> **`Greek-Latin; nfd; [:nonspacing mark:] remove; nfc`**
> | | |
> |---|---|
> | τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. | ti pheis; graphen se tis, hos eoike, gegraptai: ou gar ekeino ge katagnosomai, hos sy heteron. |

### Pronunciation

Standard transliteration methods often do not follow the pronunciation rules of
any particular language in the target script. For example, the Japanese Hepburn
system uses a "j" that has the English phonetic value (as opposed to French,
German, or Spanish), but uses vowels that do not have the standard English
sounds. A transliteration method might also require some special knowledge to
have the correct pronunciation. For example, in the Japanese kunrei-siki system,
"tu" is pronounced as "tsu". This is similar to situations where there are
different languages within the same script. For example, knowing that the word
Gewalt comes from German allows a knowledgeable reader to pronounce the "w" as a
"v".

In some cases, transliteration may be heavily influenced by tradition. For
example, the modern Greek letter beta (β) sounds like a "v", but a transform may
continue to use a b (as in biology). In that case, the user would need to know
that a "b" in the transliterated word corresponded to beta (β) and is to be
pronounced as a "v" in modern Greek. Letters may also be transliterated
differently according to their context to make the pronunciation more
predictable. For example, since the Greek sequence GAMMA GAMMA (γγ) is
pronounced as "ng", the first GAMMA can be transcribed as an "n".

> :point_right: **Note**: *In general, predictability means that when transliterating Latin script to
other scripts, English text will not produce phonetic results. This is because
the pronunciation of English cannot be predicted easily from the letters in a
word: e.g. grove, move, and love all end with "ove", but are pronounced very
differently.*

### Cautions

Reversibility may require modifications of traditional transcription methods.
For example, there are two standard methods for transliterating Japanese
katakana and hiragana into Latin letters. The kunrei-siki method is unambiguous.
The Hepburn method can be more easily pronounced by foreigners but is ambiguous.
In the Hepburn method, both ZI (ジ) and DI (ヂ) are represented by "ji" and both
ZU (ズ) and DU (ヅ) are represented by "zu". A slightly amended version of
Hepburn, that uses "dji" for DI and "dzu" for DU, is unambiguous.

When a sequence of two letters map to one, case mappings (uppercase and
lowercase) must be handled carefully to ensure reversibility. For cased scripts,
the two letters may need to have different cases, depending on the next letter.
For example, the Greek letter PHI (Φ) maps to PH in Latin, but Φο maps to Pho,
and not to PHo.
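
The PHI example amounts to looking one letter ahead before choosing the case of the second Latin letter. The sketch below shows that single decision with the standard library; it is not taken from the ICU rules, `transliteratePhi` and `isLowerGreek` are hypothetical names, and the caller is assumed to have already checked that `text[i]` is a capital PHI.

```cpp
#include <string>

// True for the lowercase Greek letters α..ω (U+03B1..U+03C9).
static bool isLowerGreek(char32_t c) {
    return c >= U'\u03B1' && c <= U'\u03C9';
}

// Latin rendering of the capital PHI at position i: "Ph" when a lowercase
// letter follows (title case), otherwise "PH".
std::string transliteratePhi(const std::u32string& text, std::size_t i) {
    const bool nextIsLower = i + 1 < text.size() && isLowerGreek(text[i + 1]);
    return nextIsLower ? "Ph" : "PH";
}
```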

Some scripts have characters that take on different shapes depending on their
context. Usually, this is done at the display level (such as with Arabic) and
does not require special transliteration support. However, in a few cases this
is represented with different character codes, such as in Greek and Hebrew. For
example, a Greek SIGMA is written in a final form (ς) at the end of words, and a
non-final form (σ) in other locations. This requires the transform to map
different characters based on the context.
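
The context test for sigma can be sketched for the Latin-to-Greek direction, where a Latin "s" must become either σ or ς. This is a standard-library illustration with ASCII input only, not ICU's rule; `sigmaFor` is a hypothetical name, and "end of word" is approximated as "not followed by an ASCII letter".

```cpp
#include <cctype>
#include <string>

// Chooses the Greek sigma for the Latin "s" at position i: the final form
// ς (U+03C2) at the end of a word, the non-final form σ (U+03C3) elsewhere.
char32_t sigmaFor(const std::string& latin, std::size_t i) {
    const bool atWordEnd =
        i + 1 >= latin.size() ||
        !std::isalpha(static_cast<unsigned char>(latin[i + 1]));
    return atWordEnd ? U'\u03C2' : U'\u03C3';
}
```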
576
577> :point_right: **Note**: *It is useful for the reverse mapping to be complete so that arbitrary strings
578in the target script can be reasonably mapped back to the source script.
579Complete reverse mapping makes it much easier to do mechanical quality checks
580and so on. For example, even though the letter "q" might not be necessary in a
581transliteration of Greek, it can be mapped to a KAPPA (κ). Such reverse mappings
582will not, in general, be unambiguous.*
583
584## Using Transliterators
585
586Transliterators have APIs in C, C++, and Java™. Only the C++ APIs are listed
587here. For more information on the C, Java, and other APIs, see the relevant API
588docs.
589
590To list the available Transliterators, use code like the following:
591
592```
593count = Transliterator:: countAvailableIDs();
594myID = Transliterator::getAvailableID(n);
595```
596
597The ID should not be displayed to users as it is for internal use only. A
598separate string, one that can be localized to different languages, is obtained
599with a static method. (This method is static to allow the translated names to be
600augmented without changing the code.) To get a localized name for use in a GUI,
601use the following:
602
603```
604Transliterator::getDisplayName(myID, france, nameForUser);
605```
606
607To create a Transliterator, use the following:
608
609```
610UErrorCode status = U_ZERO_ERROR;
611Transliterator *myTrans = Transliterator::createInstance("Latin-Greek",
612UTRANS_FORWARD, status);
613```
614
615To get a pre-made compound transform, use a series of IDs separated by ";". For
616example:
617
618```
619myTrans = Transliterator::createInstance(
620    "any-NFD; [:nonspacing mark:] any-remove; any-NFC", UTRANS_FORWARD, status);
621```
622
623To convert an entire string, use the following:
624
625```
626myTrans.transliterate(myString);
627```
628
629For more complex cases, such a keyboard input, the following full method
630provides more control:
631
632```
633myTrans.transliterate(replaceable, positions, complete);
634```
635
The Replaceable interface (an abstract class in C++) allows more complex text,
such as styled text, to be used with Transliterators. In ICU4J, a wrapper is
supplied for StringBuffer. A wrapper is an interface to text that handles very
few operations: for example, it can access characters and replace one
substring with another. By using this interface, replacement text can take on
the same style as the text it is replacing, so that style information is not
lost. With a replaceable interface to HTML or XML, even higher level structure
can be preserved.
644
645The positions parameter contains information about the range of text that should
646be transliterated, plus the possibly larger range of text that can serve as
647context.
648
The `complete` parameter indicates whether the text up to the limit should be
considered complete. For keyboard input, `complete` should normally be false;
it is set to true only when the conversion is finished. For example, suppose
that a transform converts "sh" to X, and "s" in other cases to Y. If `complete`
is true, then a dangling "s" converts to Y; when `complete` is false, the
dangling "s" is not converted, since there is more text to come.
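
The effect of `complete` on a dangling "s" can be illustrated with a small standalone sketch. This is a toy model written against the C++ standard library only, not ICU code; the function name `convertSketch` is hypothetical, and the rules ("sh" to X, other "s" to Y) are the ones from the paragraph above.

```cpp
#include <cstddef>
#include <string>

// Toy model (not ICU code): converts "sh" to "X" and any other "s" to "Y".
// When `complete` is false, a trailing "s" is left untouched because the
// next typed character might turn it into "sh".
std::string convertSketch(const std::string &text, bool complete) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text.compare(i, 2, "sh") == 0) {
            out += 'X';
            i += 2;
        } else if (text[i] == 's') {
            bool dangling = (i + 1 == text.size());
            if (dangling && !complete) {
                out += 's';  // more input may follow, so do not convert yet
            } else {
                out += 'Y';
            }
            i += 1;
        } else {
            out += text[i];
            i += 1;
        }
    }
    return out;
}
```

With this sketch, `convertSketch("ashes", false)` leaves the final "s" alone, while `convertSketch("ashes", true)` converts it to Y.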
656
657In keyboard input, normally start/cursor and limit/end are set to the selection
658at the time the transform is chosen. The following shows how the selection is
659chosen:
660
661```
662positions.start = positions.cursor = selection.getStart();
663positions.limit = positions.end = selection.getEnd();
664```
665
666As the user types or inserts `inputChars`, call the following:
667
```
replaceable.replace(positions.limit, positions.limit, inputChars); // update the text
positions.limit += inputChars.length(); // update the positions
myTrans->transliterate(replaceable, positions, false);
```
674
If the user performs an action that indicates they are done with the text,
call transliterate one last time with `complete` set to true:

```
myTrans->transliterate(replaceable, positions, true);
```
681
682Transliterator objects are stateless. They retain no information between calls
683to `transliterate()`.
684
685The statelessness might seem to limit the complexity of the operations that can
686be performed. In practice, complex transliterations happen by delaying the
687replacement of text until it is known that no other replacements are possible.
688In other words, although the Transliterator objects are stateless, the source
689text itself embodies all the needed information and delayed operation allows
690arbitrary complexity.
691
692## Designing Transliterators
693
Many people use only the supplied transforms. For those who need their own,
there are two ways of designing transforms. Many transforms can be produced
without subclassing, simply by designing rules for a RuleBasedTransliterator.
If a conversion can be done algorithmically much more compactly than with a
long list of rules, then consider subclassing Transliterator directly. For
example, ICU itself supplies specialized subclasses for the following:
700
7011.  Hangul Jamo
702
7032.  Any Hex
704
7053.  Wrapping the string functions for normalization, case mapping, etc.
706
707### Subclassing Transliterators
708
Subclassers must override `handleTransliterate(Replaceable text, Positions
positions, boolean complete)`. They can override some of the other methods for
efficiency, but must ensure that the results are identical. In `handleTransliterate`,
convert the text from `positions.cursor` up to `positions.limit`. The text from
`positions.start` to `positions.end` may be taken into account as context when doing
this conversion, but the context itself should not be converted. Never look at any
characters before `positions.start` or after `positions.end`.
716
717The `complete` parameter indicates whether or not the text up to limit is
718complete. For example, suppose that you would convert "sh" to X, and "s" in
719other cases to Y. If the complete parameter is true, then a dangling "s"
720converts to Y; when the complete parameter is false, then the dangling "s"
721should not be converted. When you return from the method, `positions.cursor`
722should be set to the furthest position you processed. Typically this will be up
723to `limit`; in case there was an incomplete sequence at the end, `cursor` should be
724set to the position just before that sequence.
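
The cursor handling described above can be sketched as follows. This is a standalone toy, not the ICU base class: `PositionsSketch` and `handleTransliterateSketch` are hypothetical names, the text is a plain `std::string` rather than a Replaceable, and the conversion is the same "sh" to X, "s" to Y example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the positions structure described above.
struct PositionsSketch {
    std::size_t start;   // beginning of the context
    std::size_t cursor;  // next character to convert
    std::size_t limit;   // end of the text to convert
    std::size_t end;     // end of the context
};

// Toy conversion: "sh" -> "X", any other "s" -> "Y". On return, pos.cursor
// is the furthest position processed. If the text ends in a dangling "s" and
// `complete` is false, the cursor is left just before that "s".
void handleTransliterateSketch(std::string &text, PositionsSketch &pos,
                               bool complete) {
    while (pos.cursor < pos.limit) {
        std::size_t remaining = pos.limit - pos.cursor;
        if (text[pos.cursor] != 's') {
            ++pos.cursor;                      // nothing to convert here
        } else if (remaining >= 2 && text[pos.cursor + 1] == 'h') {
            text.replace(pos.cursor, 2, "X");  // text shrinks by one char
            pos.limit -= 1;
            pos.end -= 1;
            ++pos.cursor;
        } else if (remaining == 1 && !complete) {
            break;                             // incomplete sequence: stop
        } else {
            text[pos.cursor] = 'Y';
            ++pos.cursor;
        }
    }
}
```

Calling the function again later with `complete` set to true picks up at the saved cursor and converts the remaining "s".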
725
726### Rule-Based Transliterators
727
728ICU supplies the foundation for producing well-behaved transliterations and
729supplies a number of typing transliterations for different scripts. The simplest
730mechanism for producing transliterations is called a RuleBasedTransliterator.
731The RuleBasedTransliterator is a data-based class that allows transliterations
732to be built up with a series of rules. These rules provide a specialized set of
733context-sensitive matching operations. The operations are similar to
734regular-expression rules, but adapted to the specific domain of
735transliterations.
736
737The simplest rule is a conversion rule, which replaces one string of characters
738with another. The conversion rule takes the following form:
739
740```
741xy > z ;
742```
743
744This converts any substring "xy" into "z". Rules are executed in order, so:
745
746```
747sch > sh ;
748ss > z ;
749```
750
These rules transform "bass school" into "baz shool". The transform
752walks through the string from start to finish. Thus given the rules above
753"bassch" will convert to "bazch", because the "ss" rule is found before the
754"sch" rule in the string (later, we'll see a way to override this behavior). If
755two rules can both apply at a given point in the string, then the transform
756applies the first rule in the list.
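
The walk described above (left to right through the string, trying the rules in list order at each position) can be modeled in a few lines. This is a simplified sketch using only the C++ standard library; it handles literal strings only, with no contexts, sets, or revisiting, and the names `RuleSketch` and `applyRulesSketch` are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A literal conversion rule: {text to match, replacement}.
using RuleSketch = std::pair<std::string, std::string>;

// Walks the input from start to finish. At each position the rules are tried
// in list order and the first one that matches is applied; if none matches,
// the character is copied unchanged.
std::string applyRulesSketch(const std::vector<RuleSketch> &rules,
                             const std::string &input) {
    std::string out;
    std::size_t i = 0;
    while (i < input.size()) {
        bool matched = false;
        for (const RuleSketch &rule : rules) {
            const std::string &from = rule.first;
            if (!from.empty() && input.compare(i, from.size(), from) == 0) {
                out += rule.second;
                i += from.size();
                matched = true;
                break;
            }
        }
        if (!matched) {
            out += input[i];
            ++i;
        }
    }
    return out;
}
```

Given the two rules `sch > sh` and `ss > z`, the sketch turns "bass school" into "baz shool" and "bassch" into "bazch", because "ss" is reached in the string before "sch" can match.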
757
758All of the ASCII characters except numbers and letters are reserved for use in
759the rule syntax. Normally, these characters do not need to be converted.
760However, to convert them use either a pair of single quotes or a slash. The pair
761of single quotes can be used to surround a whole string of text. The slash
762affects only the character immediately after it. For example, to convert from
763two less-than signs to the word "much less than", use one of the following
764rules:
765
766```
767\<\< > much\ less\ than ;
768'<<' > 'much less than' ;
769'<<' > much' 'less\ than ;
770```
771
*Spaces may be inserted anywhere without any effect on the rules. Use extra space to separate items for clarity without worrying about the effects. This feature is particularly useful with combining marks; it is handy to put spaces around a combining mark to separate it from the surrounding text. The following is an example:*
773
774```
775 ͅ> i ; # an iota-subscript diacritic turns into an i.
776```
777
*For a real space in the rules, place quotes around it. For a real backslash,
either double it (`\\`) or quote it (`'\'`). For a real single quote, either
double it (`''`) or place a backslash before it (`\'`). Each of the following
means the same thing:*
781
782```
783'can''t go'
784'can\'t go'
785can\'t\ go
786can''t' 'go
787```
788
789*Any text that starts with a hash mark and concludes a line is a comment. Comments help document how the rules work. The following shows a comment in a rule:*
790
791```
792x > ks ; # change every x into ks
793```
794
We can use "\\u" notation instead of any letter. For instance, instead of using
the Greek π, we could write:
797
798```
799\u03C0 > p ;
800```
801
802We can also define and use variables, such as:
803
804```
805$pi = \u03C0 ; $pi > p ;
806```
807
808#### Dual Rules
809
Rules can also specify what happens when an inverse transform is formed. To do
this, we reverse the direction of the ">" sign. Thus the above example becomes:
812
813```
814$pi < p ;
815```
816
With the inverse transform, "p" will convert to the Greek π. These two
818directions can be combined together into a dual conversion rule by using the
819"<>" operator, yielding:
820
821```
822$pi <> p ;
823```
824
825#### Context
826
827Context can be used to have the results of a transformation be different
828depending on the characters before or after. The following means "Remove
829hyphens, but only when they follow lowercase letters":
830
```
[:lowercase letter:] { '-' > '' ;
```
834
> :point_right: **Note**: *The context itself (`[:lowercase letter:]`) is unaffected by the replacement;
only the text being replaced (the hyphen) is changed.*
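
As a rough illustration of what "remove hyphens, but only when they follow lowercase letters" means, here is a standalone sketch. It is not ICU code: the name `removeHyphenAfterLowercaseSketch` is hypothetical, it recognizes ASCII lowercase letters only, and it assumes the context is checked against the text already produced.

```cpp
#include <cctype>
#include <string>

// Drops every '-' whose preceding character (in the output so far) is an
// ASCII lowercase letter. The preceding letter is only context and is never
// modified.
std::string removeHyphenAfterLowercaseSketch(const std::string &input) {
    std::string out;
    for (char ch : input) {
        bool afterLower =
            !out.empty() &&
            std::islower(static_cast<unsigned char>(out.back())) != 0;
        if (ch == '-' && afterLower) {
            continue;  // hyphen removed; context left untouched
        }
        out += ch;
    }
    return out;
}
```

For example, "co-op" becomes "coop", while "A-B" is unchanged because the hyphen follows an uppercase letter.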
837
838#### Revisiting
839
840If the resulting text contains a vertical bar "|", then that means that
841processing will proceed from that point and that the transform will revisit part
842of the resulting text. For example, if we have:
843
844```
845x > y | z ;
846z a > w;
847```
848
849then the string "xa" will convert to "yw". First, "xa" is converted to "yza".
850Then the processing will continue from after the character "y", pick up the
851"za", and convert it. Had we not had the "|", the result would have been simply
852"yza".
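
The revisiting behavior can be modeled by splitting each rule's output into a part placed before the cursor and a part placed after it. The following is a toy sketch with literal rules only; `RevisitRuleSketch` and `applyWithRevisitSketch` are hypothetical names, and the sketch makes no attempt to guard against rule sets that would loop forever.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A literal rule whose output is split at the "|" revisit mark.
struct RevisitRuleSketch {
    std::string match;      // text to replace
    std::string completed;  // output left of "|": not looked at again
    std::string revisit;    // output right of "|": processed again
};

// Rewrites `text` in place. After a rule fires, the cursor advances only past
// the completed part, so the revisit part (plus whatever follows it) is
// matched against the rules again.
std::string applyWithRevisitSketch(const std::vector<RevisitRuleSketch> &rules,
                                   std::string text) {
    std::size_t cursor = 0;
    while (cursor < text.size()) {
        bool matched = false;
        for (const RevisitRuleSketch &rule : rules) {
            if (!rule.match.empty() &&
                text.compare(cursor, rule.match.size(), rule.match) == 0) {
                text.replace(cursor, rule.match.size(),
                             rule.completed + rule.revisit);
                cursor += rule.completed.size();
                matched = true;
                break;
            }
        }
        if (!matched) {
            ++cursor;
        }
    }
    return text;
}
```

With the rules `x > y | z` and `z a > w`, the sketch converts "xa" to "yw"; if the first rule is written as `x > y z` instead, the result is "yza".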
853
854#### Example
855
856The following shows how these features are combined together in the
857Transliterator "Any-Publishing". This transform converts the ASCII typewriter
858conventions into text more suitable for desktop publishing (in English). It
859turns straight quotation marks or UNIX style quotation marks into curly
860quotation marks, fixes multiple spaces, and converts double-hyphens into a dash.
861
862```
863# Variables
864$single = \' ;
865$space = ' ' ;
866$double = \" ;
867$back = \` ;
$tab = '\u0009' ;
869
870# the following is for spaces, line ends, (, [, {, ...
871$makeRight = [[:separator:][:start punctuation:][:initial punctuation:]] ;
872
873# fix UNIX quotes
$back $back > “ ; # generate left d.q.m. (double quotation mark)
$back > ‘ ;

# fix typewriter quotes, by context
$makeRight { $double <> “ ; # convert a double to left d.q.m. after certain chars
^ { $double > “ ; # convert a double at the start of the line.
$double <> ” ; # otherwise convert to a right d.q.m.
881
882$makeRight {$single} <> ‘ ; # do the same for s.q.m.s
883^ {$single} > ‘ ;
884$single <> ’;
885
886# fix multiple spaces and hyphens
887$space {$space} > ; # collapse multiple spaces
888'--' <> — ; # convert fake dash into real one
889```
890
891### Rule Syntax
892
893The following describes the full format of the list of rules used to create a
894RuleBasedTransliterator. Each rule in the list is terminated by a semicolon. The
895list consists of the following:
896
8971.  an optional filter rule
8982.  zero or more transform rules
8993.  zero or more variable-definition rules
9004.  zero or more conversion rules
9015.  an optional inverse filter rule
902
903The filter rule, if present, must appear at the beginning of the list, before
904any of the other rules. The inverse filter rule, if present, must appear at the
905end of the list, after all of the other rules. The other rules may occur in any
906order and be freely intermixed.
907
908The rule list can also generate the inverse of the transform. In that case, the
909inverse of each of the rules is used, as described below.
910
911#### Transform Rules
912
913Each transform rule consists of two colons followed by a transform name. For
914example:
915
916```
917:: NFD ;
918```
919
920The inverse of a transform rule follows the same conventions as when we create a
921transform by name. For example:
922
923```
924:: lower () ; # only executed for the normal
925:: (lower) ; # only executed for the inverse
926:: lower ; # executed for both the normal and the inverse
927```
928
929#### Variable Definition Rules
930
931Each variable definition is of the following form:
932
933```
934$variableName = contents ;
935```
936
937The variable name can contain letters and digits, but must start with a letter.
938More precisely, the variable names use Unicode identifiers as defined by the
939identifier properties in ICU. The identifier properties allow for the use of
940foreign letters and numbers. See the Unicode class for C++ and the UCharacter
941class for Java.
942
The contents of a variable definition is any sequence of Unicode sets and
characters. For example:
945
946```
947$mac = M [aA] [cC] ;
948```
949
Variables are only replaced within other variable definition rules and within
conversion rules. They have no effect on transform rules.
952
953#### Filter Rules
954
955A filter rule consists of two colons followed by a UnicodeSet. This filter is
956global in that only the characters matching the filter will be affected by any
957transform rules or conversion rules. The inverse filter rule consists of two
958colons followed by a UnicodeSet in parentheses. This filter is also global for
959the inverse transform.
960
961For example, the Hiragana-Latin transform can be implemented by "pivoting"
962through the Katakana converter, as follows:
963
964```
965# don't touch any katakana that was in the text!
966:: [:^Katakana:] ;
967
968:: Hiragana-Katakana;
969:: Katakana-Latin;
970
971# don't touch any katakana that was in the text
972# for the inverse either!
973:: ([:^Katakana:]) ;
974```
975
976The filters keep the transform from mistakenly converting any of the "pivot"
977characters. Note that this is a case where a rule list contains no conversion
978rules at all, just transform rules and filters.
979
980#### Conversion Rules
981
982Conversion rules can be forward, backward, or double. The complete conversion
983rule syntax is described below:
984
985##### Forward
986
987A forward conversion rule is of the following form:
988
989```
990before_context { text_to_replace } after_context > completed_result | result_to_revisit ;
991```
992
993If there is no before_context, then the "{" can be omitted. If there is no
994after_context, then the "}" can be omitted. If there is no result_to_revisit,
995then the "|" can be omitted. A forward conversion rule is only executed for the
996normal transform and is ignored when generating the inverse transform.
997
998##### Backward
999
1000A backward conversion rule is of the following form:
1001
1002```
1003completed_result | result_to_revisit < before_context { text_to_replace } after_context ;
1004```
1005
1006The same omission rules apply as in the case of forward conversion rules. A
1007backward conversion rule is only executed for the inverse transform and is
1008ignored when generating the normal transform.
1009
1010##### Dual
1011
1012A dual conversion rule combines a forward conversion rule and a backward
1013conversion rule into one, as discussed above. It is of the form:
1014
1015```
1016a { b | c } d <> e { f | g } h ;
1017```
1018
1019When generating the normal transform and the inverse, the revisit mark "|" and
1020the before and after contexts are ignored on the sides where they don't belong.
1021Thus, the above is exactly equivalent to the sequence of the following two
1022rules:
1023
1024```
1025a { b c } d > f | g ;
1026b | c < e { f g } h ;
1027```
1028
1029#### Intermixing Transform Rules and Conversion Rules
1030
1031Starting in ICU 3.4, transform rules and conversion rules may be freely
1032intermixed. (In earlier versions of ICU, transform rules were only allowed at
1033the beginning or end of the rule set, immediately after the global filter or
1034immediately before the reverse global filter.) Inserting a transform rule into
1035the middle of a set of conversion rules has an important side effect.
1036
1037Normally, conversion rules are considered together as a group. The only time
1038their order in the rule set is important is when more than one rule matches at
1039the same point in the string. In that case, the one that occurs earlier in the
1040rule set wins. In all other situations, when multiple rules match overlapping
1041parts of the string, the one that matches earlier wins.
1042
1043Transform rules apply to the whole string. If you have several transform rules
1044in a row, the first one is applied to the whole string, then the second one is
1045applied to the whole string, and so on. To reconcile this behavior with the
1046behavior of conversion rules, transform rules have the side effect of breaking a
1047surrounding set of conversion rules into two groups: First all of the conversion
1048rules before the transform rule are applied as a group to the whole string in
1049the usual way, then the transform rule is applied to the whole string, and then
1050the conversion rules after the transform rule are applied as a group to the
1051whole string. For example, consider the following rules:
1052
1053```
1054abc > xyz;
1055xyz > def;
1056::Upper;
1057```
1058
1059If you apply these rules to “abcxyz”, you get “XYZDEF”. If you move the
1060“::Upper;” to the middle of the rule set and change the cases accordingly...
1061
1062```
1063abc > xyz;
1064::Upper;
1065XYZ > DEF;
1066```
1067
1068...applying this to “abcxyz” produces “DEFDEF”. This is because “::Upper;”
1069causes the transliterator to reset to the beginning of the string: The first
1070rule turns the string into “xyzxyz”, the second rule uppercases the whole thing
1071to “XYZXYZ”, and the third rule turns this into “DEFDEF”.
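
The pass-splitting effect can be reproduced with a small model in which a rule list is a sequence of steps: each step is either a group of conversion rules applied together, or a whole-string operation standing in for `::Upper;`. This is a sketch using the C++ standard library only; `StepSketch`, `applyGroupSketch`, and `runStepsSketch` are hypothetical names, the rules are literal strings, and uppercasing is ASCII only.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using RuleSketch = std::pair<std::string, std::string>;  // {match, replacement}

// One step of a rule list: either a whole-string uppercase (standing in for
// "::Upper;") or a group of conversion rules that are applied together.
struct StepSketch {
    bool wholeStringUpper;
    std::vector<RuleSketch> rules;
};

// Applies one group: walk the string once, first matching rule wins.
std::string applyGroupSketch(const std::vector<RuleSketch> &rules,
                             const std::string &input) {
    std::string out;
    std::size_t i = 0;
    while (i < input.size()) {
        bool matched = false;
        for (const RuleSketch &rule : rules) {
            const std::string &from = rule.first;
            if (!from.empty() && input.compare(i, from.size(), from) == 0) {
                out += rule.second;
                i += from.size();
                matched = true;
                break;
            }
        }
        if (!matched) {
            out += input[i];
            ++i;
        }
    }
    return out;
}

// Runs the steps in order; each step starts over at the beginning of the
// string produced by the previous step.
std::string runStepsSketch(const std::vector<StepSketch> &steps,
                           std::string text) {
    for (const StepSketch &step : steps) {
        if (step.wholeStringUpper) {
            for (char &c : text) {
                c = static_cast<char>(
                    std::toupper(static_cast<unsigned char>(c)));
            }
        } else {
            text = applyGroupSketch(step.rules, text);
        }
    }
    return text;
}
```

Running the steps {`abc > xyz; xyz > def;`}, then uppercase, on "abcxyz" gives "XYZDEF"; running {`abc > xyz;`}, uppercase, {`XYZ > DEF;`} gives "DEFDEF", matching the two results described above.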
1072
1073This can be useful when a transform naturally occurs in multiple “passes.”
1074Consider this rule set:
1075
1076```
1077[:Separator:]* > ' ';
1078'high school' > 'H.S.';
1079'middle school' > 'M.S.';
1080'elementary school' > 'E.S.';
1081```
1082
1083If you apply this rule to “high school”, you get “H.S.”, but if you apply it to
1084“high school” (with two spaces), you just get “high school” (with one space). To
1085have “high school” (with two spaces) turn into “H.S.”, you'd either have to have
1086the first rule back up some arbitrary distance (far enough to see “elementary”,
1087if you want all the rules to work), or you have to include the whole left-hand
1088side of the first rule in the other rules, which can make them hard to read and
1089maintain:
1090
1091```
1092$space = [:Separator:]*;
1093high $space school > 'H.S.';
1094middle $space school > 'M.S.';
1095elementary $space school > 'E.S.';
1096```
1097
1098Instead, you can simply insert “::Null;” in order to get things to work right:
1099
1100```
1101[:Separator:]* > ' ';
1102::Null;
1103'high school' > 'H.S.';
1104'middle school' > 'M.S.';
1105'elementary school' > 'E.S.';
1106```
1107
1108The “::Null;” has no effect of its own (the null transliterator, by definition,
1109doesn't do anything), but it splits the other rules into two “passes”: The first
1110rule is applied to the whole string, normalizing all runs of whitespace into
1111single spaces, and then we start over at the beginning of the string to look for
the phrases. “high school” (with four spaces) gets correctly converted to
“H.S.”.
1114
1115This can also sometimes be useful with rules that have overlapping domains.
1116Consider this rule set from before:
1117
1118```
1119sch > sh ;
1120ss > z ;
1121```
1122
Applying these rules to “bassch” results in “bazch” because “ss” matches earlier
in the string than “sch”. If you really wanted “bassh” (that is, if you wanted
the first rule to win even when the second rule matches earlier in the string),
you'd either have to add another rule for this special case...
1127
1128```
1129sch > sh ;
1130ssch > ssh;
1131ss > z ;
1132```
1133
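...or you could use a transform rule to apply the conversions in two passes.
Every “sch” is then converted before any “ss” is considered, although the
second pass still sees the “ss” left over from the first, so “bassch” ends up
as “bazh”: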
1135
1136```
1137sch > sh ;
1138::Null;
1139ss > z ;
1140```
1141
1142#### Masking
1143
When transforms are built, a warning is returned if rules are masked. A rule is
masked when it can never be executed because an earlier rule always matches
first.
1147
1148```
1149a > b ;
1150ac > d ; # masked!
1151```
1152
1153In this case, for example, every string that could have a match for "ac" will
1154already match "a", because the rules are executed in order. However, the
1155transform compiler will not currently catch cases that would be masked because
1156of the use of UnicodeSets or regular expression operators, such as the
1157following:
1158
1159```
1160a } [:L:] > b ;
1161ac > d ; # masked, but not caught by the compiler
1162```
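
For purely literal rules, the masking check amounts to a prefix test: a rule is masked if some earlier rule's pattern is a prefix of its own pattern. The sketch below shows that idea only; it is not how the ICU rule compiler is implemented, `findMaskedSketch` is a hypothetical name, and (like the compiler, as noted above) it does not handle sets, contexts, or quantifiers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the indices of literal patterns that can never fire because an
// earlier pattern in the list is a prefix of them.
std::vector<std::size_t> findMaskedSketch(
        const std::vector<std::string> &patterns) {
    std::vector<std::size_t> masked;
    for (std::size_t j = 0; j < patterns.size(); ++j) {
        for (std::size_t i = 0; i < j; ++i) {
            const std::string &earlier = patterns[i];
            if (!earlier.empty() &&
                patterns[j].compare(0, earlier.size(), earlier) == 0) {
                masked.push_back(j);  // wherever j matches, i matches first
                break;
            }
        }
    }
    return masked;
}
```

For the patterns "a" followed by "ac", the sketch reports index 1 as masked; with the order reversed, nothing is masked.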
1163
1164#### Inverse Summary
1165
1166The following table shows how the same rule list generates two different
1167transforms, where the inverse is restated in terms of forward rules (this is a
1168contrived example, simply to show the reordering):
1169
1170##### Original Rules
1171
1172```
1173:: [:Uppercase Letter:] ;
1174:: latin-greek ;
1175:: greek-japanese ;
1176x <> y ;
1177z > w ;
1178r < m ;
1179:: upper;
1180a > b ;
1181c <> d ;
1182:: any-publishing ;
1183:: ([:Number:]) ;
1184```
1185
1186##### Forward
1187
1188```
1189:: [:Uppercase Letter:] ;
1190:: latin-greek ;
1191:: greek-japanese ;
1192x > y ;
1193z > w ;
1194:: upper ;
1195a > b ;
1196c > d ;
1197:: any-publishing ;
1198```
1199
1200##### Inverse
1201
1202```
1203:: [:Number:] ;
1204:: publishing-any ;
1205d > c ;
1206:: lower ;
1207y > x ;
1208m > r ;
1209:: japanese-greek ;
1210:: greek-latin ;
1211```
1212
1213> :point_right: **Note**: *Note how the irrelevant rules (the inverse filter rule and the rules containing
1214<) are omitted (ignored, actually) in the forward direction, and notice how
1215things are reversed: the transform rules are inverted and happen in the opposite
1216order, and the groups of conversion rules are also executed in the opposite
1217relative order (although the rules within each group are executed in the same
1218order).*
1219
1220#### Function Calls
1221
1222As of ICU 2.1, rule-based transforms can invoke other transforms. The transform
1223being invoked must be registered with the system before it can be used in a
1224rule. The syntax for a function call resembles a Perl subroutine call:
1225
1226```
1227( [a-zA-Z] ) ( [a-zA-Z]* ) > &Any-Upper($1) &Any-Lower($2) ;
1228```
1229
1230This example transforms strings of ASCII letters to have an initial uppercase
letter followed by lowercase letters. (In practice, you would use `Any-Title`
to do proper titlecasing.)
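
The net effect of that rule on ASCII text can be approximated in plain C++, which may help when checking expected output by hand. This is a sketch, not ICU code: `capitalizeRunsSketch` is a hypothetical name, and it assumes the default "C" locale so that only ASCII letters count as letters.

```cpp
#include <cctype>
#include <string>

// For each run of ASCII letters, uppercases the first letter and lowercases
// the rest, mirroring &Any-Upper($1) &Any-Lower($2) on ASCII input.
std::string capitalizeRunsSketch(const std::string &input) {
    std::string out;
    bool inRun = false;  // true while inside a run of letters
    for (char ch : input) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (std::isalpha(c)) {
            out += static_cast<char>(inRun ? std::tolower(c)
                                           : std::toupper(c));
            inRun = true;
        } else {
            out += ch;
            inRun = false;
        }
    }
    return out;
}
```

For example, "hELLO wORLD" becomes "Hello World".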
1233
1234The formal syntax is:
1235
1236```
1237'&' Basic-id '(' Text-arg ')'
1238```
1239
1240Elements in single quotes are literals. Basic-id is a basic ID, as described
1241earlier. It specifies a source, target, and optional variant, but does not
1242include a filter, explicit reverse, or compound elements. Text-arg is any text
1243that may appear on the output side of a rule. This means nested function calls
1244are supported.
1245
1246For more information on the use of rules, and more examples of the syntax in
1247use, see the [tutorial](./rules.md).
1248
1249### Regular Expression
1250
1251The rules are similar to Regular Expressions in offering: Variables, Property
1252matches, Contextual matches, Rearrangement ($1, $2…), and Quantifiers (\*, +,
1253?). They are more powerful in offering: Ordered Rules, Cursor Backup,
1254Buffered/Keyboard support. They are less powerful in that they have only greedy
1255quantifiers, no backup (so no X | Y), and no input-side back references.
1256
1257Here is a simple example that shows the difference between a set of
1258Transliterator rules, and successively applying regular expression replacements.
1259
1260Since the transform processes each of its rules at each point, it catches the yx
1261before the xy in the second case. Since each of the regular expressions is
1262evaluated over the whole string, that isn't possible. Simply using multiple
1263regular expressions can't account for the interaction and ordering of characters
1264and rules. (You can, however, simulate the regex behavior with transform rules
1265by using a transform rule to split the conversion rules into passes.)
1266
For more details on constructing rules, see the [Transliterator Rule
Tutorial](rules.md).
1269
1270## Script Transliterator Sources
1271
1272Currently ICU offers script transliterations between Latin and certain other
1273scripts (such script transliterations are called romanizations), plus
1274transliterations between the Indic scripts (excluding Urdu). Additional
1275romanizations and other script transliterations will be added in the future. In
1276general, ICU follows the [UNGEGN: Working Group on Romanization
1277Systems](http://www.eki.ee/wgrs/) where possible. The following describes the
1278sources used.
1279
1280Except where otherwise noted, all of these systems are designed to be
1281reversible. For bicameral scripts (those with upper and lower case), case may
1282not be completely preserved. The transliterations are also designed to be
1283complete for the letters a-z. A fallback is used for a letter that is not used
1284in the transliteration.
1285
1286### Korean
1287
1288There are many romanizations of Korean. The default transliteration follows the
1289[Korean Ministry of Culture & Tourism
1290Transliteration](http://www.korea.net/korea/kor_loca.asp?code=A020303)
1291regulations with the clause 8 variant for reversibility:
1292
8. When it is necessary to convert Romanized Korean back to Hangul in special
cases such as in academic articles, Romanization is done according to Hangul
spelling and not pronunciation. Each Hangul letter is Romanized as explained in
section 2 except that ㄱ, ㄷ, ㅂ, ㄹ are always written as g, d, b, l. When ㅇ has no
sound value, it is replaced by a hyphen. A hyphen may also be used when it is
necessary to distinguish between syllables.
1299
There is one other variation: an apostrophe is used instead of a hyphen, since
it has better title casing behavior. To change this, see the Modifications
section below.
1303
1304### Japanese
1305
The default transliteration for Japanese uses a slight variant of the
Hepburn system. With the Hepburn system, both ZI (ジ) and DI (ヂ) are represented
by "ji" and both ZU (ズ) and DU (ヅ) are represented by "zu". This is amended
slightly for reversibility by using "dji" for DI and "dzu" for DU.
1310
The Katakana transliteration is reversible. Hiragana-Katakana transliteration is
not completely reversible since there are several Katakana letters that do not
have corresponding Hiragana equivalents. Also, the length mark is not used with
Hiragana. The Hiragana-Latin transliteration is also not reversible since
internally it is a combination of Hiragana-Katakana and Katakana-Latin.
1316
1317### Greek
1318
The default transliteration uses a standard transcription for Greek that aims
to preserve etymology. The ISO 843 variant has the following differences:
1322
1323| Greek | Default | ISO 843 |
1324|---|---|---|
1325| β  | b  | v |
1326| γ\* | n | g |
1327| η | ē | ī |
1328| ̔ | h | (omitted) |
1329| ̀ | ̀ | (omitted) |
1330| ~ | ~ | (omitted) |
1331
1332\* before γ, κ, ξ, χ
1333
1334### Cyrillic
1335
1336Cyrillic generally follows ISO 9 for the base Cyrillic set. There are tentative
1337plans to add extended Cyrillic characters in the future, plus variants for GOST
1338and other national standards.
1339
1340### Indic
1341
The default romanization uses the ISCII standard with some minor modifications
for reversibility. Internally, all Indic scripts are transliterated by
converting first to an internal form, called Inter-Indic, then from Inter-Indic
to the target script.

Transliteration of Indic scripts in ICU follows the ISO 15919 standard for
romanization of Indic scripts using diacritics. ISO 15919 differs from ISCII 91
in its application of diacritics to certain characters. These differences are
shown in the following example (illustrated with Devanagari, although the same
principles apply to the other Indic scripts):
1354
1355| Devanagari | ISCII 91 | ISO 15919 |
1356|---|---|---|
1357| ऋ | ṛ | r̥ |
1358| ऌ | ḻ | l̥ |
1359| ॠ | ṝ | r̥̄ |
1360| ॡ | ḻ̄ | l̥̄ |
1361| ढ़ | d̂ha | ṛha |
1362| ड़ |d̂a | ṛa |
1363
1364> :point_right: **Note**: *With some fonts the diacritics will not be correctly placed on the base
1365letters. The macron on a lowercase L may look particularly bad.*
1366
1367Transliteration rules in Indic are reversible with the exception of the ZWJ and
1368ZWNJ used to request explicit rendering effects. For example:
1369
1370| Devanagari | Romanization | Note |
1371|---|---|---|
1372| क्ष | kṣa | normal |
1373| क्‍ष | kṣa | explicit halant requested |
1374| क्‌ष | kṣa | half-consonant requested |
1375
1376There are two particular instances where transliterations may produce unexpected
1377results: (1) where a halant after a consonant is implied by the romanization (in
1378such cases the vowel needs to be explicitly written out), and (2) with the
1379transliteration of 'c'.
1380
1381For example:
1382
1383| Devanagari | Romanization |
1384|---|---|
1385| सेन्गुप्त | Sēngupta |
1386| सेनगुप्त | Sēnagupta |
1387| मोनिच | Monica |
1388| मोनिक | Monika |
1389
1390### Modifications
1391
It is easy to create variants of the defaults using transforms. For example, to
create a variant of Korean that uses hyphens instead of apostrophes, use the
following rules:
1395
1396```
1397:: Latin-Hangul ;
1398'' <> '-' ;
1399```
1400
1401### More Information
1402
1403For more information, see:
1404
14051.  [UNGEGN: Working Group on Romanization Systems](http://www.eki.ee/wgrs/)
1406
14072.  [Transliteration of Non-Roman Alphabets and Scripts (Søren
1408    Binks)](http://transliteration.eki.ee/)
1409
14103.  [Standards for Archival Description:
1411    Romanization](http://www.archivists.org/catalog/stds99/chapter8.html)
1412
4.  [ISO 15919
    (Hindi)](http://transliteration.eki.ee/pdf/Hindi-Marathi-Nepali.pdf)

5.  [ISO 15919 (Gujarati)](http://transliteration.eki.ee/pdf/Gujarati.pdf)

6.  [ISO 15919 (Kannada)](http://transliteration.eki.ee/pdf/Kannada.pdf)
1419
14207.  [ISCII-91](http://www.cdacindia.com/html/gist/down/iscii_d.asp)
1421