---
layout: default
title: Transforms
nav_order: 4
parent: Transforms
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# General Transforms
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

General transforms provide a general-purpose package for processing Unicode
text. They are a powerful and flexible mechanism for handling a variety of
different tasks, including:

1. Uppercase, Lowercase, Titlecase, Full/Halfwidth conversions
2. Normalization
3. Hex and Character Name conversions
4. Script to Script conversion

Originally, Transforms were designed to convert characters from one script to
another (for example, from Greek to Latin, or Japanese Katakana to Latin). This
is still reflected in the class name, which remains **Transliterator**. However,
the services performed by that class now represent a much more general mechanism
capable of handling a much broader range of tasks. In particular, the Transforms
include pre-built transformations for case conversions, for normalization
conversions, for the removal of given characters, and also for a variety of
language and script transliterations. Transforms can be chained together to
perform a series of operations, and each step of the process can use a
UnicodeSet to restrict the characters that are affected.

For example, to remove accents from characters, use the following transform:

```
NFD; [:Nonspacing Mark:] Remove; NFC
```

This transform separates accents from their base characters, removes the
accents, and then recomposes the remaining text into an unaccented form.

A transliteration can either be applied to a complete string of text or be
used incrementally for typing or buffering input.
In the latter case, the
transform delays processing characters until the mapping is unambiguous.
Transliterators can also be used with more complex text, such as styled text,
to maintain the style information where possible. For example,
"~~Αλφαβ~~ητικός" will retain the strikethrough in transliterating to
"~~Alphab~~ētikós".

> :point_right: **Note**: *The transliteration process not only retains font size, but also other
characteristics such as font type and color.*

For an online demonstration of ICU transliteration, see
<http://demo.icu-project.org/icu-bin/translit>.

## Script Transliteration

Script Transliteration is the general process of converting characters from one
script to another. For example, it can convert characters from Greek to Latin,
or Japanese katakana to Latin. The user must understand that script
transliteration is not translation. Rather, script transliteration is the
conversion of letters from one script to another without translating the
underlying words. The following shows a sample of script transliteration:

| Source | Transliteration |
|---|---|
| キャンパス | kyanpasu |
| Αλφαβητικός Κατάλογος | Alphabētikós Katálogos |
| биологическом | biologichyeskom |

> :point_right: **Note**: *Some of the characters may not be
visible on the screen unless you have a Unicode font with all the Greek letters.
If you have a licensed copy of Microsoft® Office, you can use the "Arial Unicode
MS" font, or you can download the [CODE2000](http://www.code2000.net/) font for
free. For more information, see [Display
Problems?](http://www.unicode.org/help/display_problems.html) on the Unicode web
site.*

While the user may not recognize that the Japanese word "kyanpasu" is equivalent
to the English word "campus," it is easier to recognize and interpret the word
in text than if the letters were left in the original script.
There are several
situations where this transliteration is especially useful. For example, when a
user views names that are entered in a world-wide database, it is extremely
helpful to view and refer to the names in the user's native script. It is also
useful for product support. For example, if a service engineer is sent a program
dump that is filled with characters from foreign scripts, it is much easier to
diagnose the problem when the text is transliterated and the service engineer
can recognize the characters. Also, when the user performs searching and
indexing tasks, transliteration can retrieve information in a different script.
The following shows these retrieval capabilities:

| Source | Transliteration |
|---|---|
| 김, 국삼 | Gim, Gugsam |
| 김, 명희 | Gim, Myeonghyi |
| 정, 병호 | Jeong, Byeongho |
| ... | ... |
| たけだ, まさゆき | Takeda, Masayuki |
| ますだ, よしひこ | Masuda, Yoshihiko |
| やまもと, のぼる | Yamamoto, Noboru |
| ... | ... |
| Ρούτση, Άννα | Roútsē, Ánna |
| Καλούδης, Χρήστος | Kaloúdēs, Chrḗstos |
| Θεοδωράτου, Ελένη | Theodōrátou, Elénē |

Transliteration can also be used to convert unfamiliar letters within the same
script, such as converting Icelandic THORN (þ) to th.

## Transliterator Identifiers

Transliterators are not created directly using C++ or Java constructors.
Instead, they are created by giving an **identifier**, a name string in a
specific format, to one of the Transliterator factory methods, such as
`Transliterator.getInstance()` (Java) or `Transliterator::createInstance()`
(C++). The following are some examples of identifiers:

1. `Latin-Cyrillic`
2. `[:Lu:] Latin-Greek (Greek-Latin/UNGEGN)`
3. `[A-Za-z]; Lower(); Latin-Katakana; Katakana-Hiragana; ([:Hiragana:])`

It is important to understand identifiers and their syntax, since it is through
the use of identifiers that one creates transforms, restricts their effective
range, and combines them together. This section describes transform identifiers
in detail. Throughout this section, it is important to distinguish between
**identifiers** and the **actual transforms** that they refer to. All actual
transforms are named by well-formed identifiers, but not all well-formed
identifiers refer to actual transforms. An analogy is C++ method names. I can
write the syntactically well-formed method name "void
Cursor::getPosition(Position& pos)", but whether or not this refers to an actual
method in an actual class is a different matter.

### Basic IDs

The simplest identifier is a 'basic ID'.

```
basicID := (<source> "-")? <target> ("/" <variant>)?
```

A basic ID typically names a source and target. In "Katakana-Latin", "Katakana"
is the source and "Latin" is the target. The source specifier describes the
characters or strings that the transform will modify. The target specifier
describes the result of the modification. If the source is not given, then the
source is "Any", the set of all characters. Source and Target specifiers can be
[Script IDs](http://www.unicode.org/cldr/utility/properties.jsp#Script) (long like
"Latin" or short like "Latn"), [Unicode language
Identifiers](http://www.unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers)
(like fr, en_US, or zh_Hant), or special tags (like Any or Hex). For example:

1. Katakana-Latin
2. Null
3. Hex-Any/Perl
4. Latin-el
5. Greek-en_US/UNGEGN

Some basic IDs contain a further specifier following a forward slash. This is
the variant, and it further specifies the transform when several versions of a
single transformation are possible.
For example, ICU provides several transforms
that convert from Unicode characters to escaped representations. These include
standard Unicode syntax "U+4E01", Perl syntax "\\x{4E01}", XML syntax
"`&#x4E01;`", and others. The transforms for these operations are named
"Any-Hex/Unicode", "Any-Hex/Perl", and "Any-Hex/XML", respectively. If no
variant is specified, then the default variant is selected. In the example of
"Any-Hex", this is the Java variant (for historical reasons), so "Any-Hex" is
equivalent to "Any-Hex/Java".

### Filtered IDs

A filtered ID is a basic ID constrained by a filter. For example, to specify a
transform that converts only ASCII vowels to uppercase, use the ID:

```
[aeiou] Upper
```

The filter is a valid UnicodeSet pattern prefixed to the basic ID. Only
characters within the set will be modified by the transform. Some transforms are
only useful with filters, for example, the Remove transform, which deletes all
input characters. Specifying `"[:Nonspacing Mark:] Remove"` gives a transform
that removes non-spacing marks from input text.

> :point_right: **Note**: *As of ICU 2.0, the filter pattern must be enclosed in brackets. Perl-syntax
patterns of the form `"\p{Lu}"` cannot be used directly; instead they must be
enclosed, e.g. `"[\p{Lu}]"`.*

### Inverses

Any transform ID can be modified to form an "inverse" ID. This is the ID of a
related transform that performs an inverse operation. For basic IDs, this is
done by exchanging the source and target names. For example, the inverse of
"Latin-Greek/UNGEGN" is "Greek-Latin/UNGEGN", and vice versa. The variant, if
any, is unaffected.

If there is no named source, the same rule still applies, using the implicit
source "Any". So the inverse of "Hex/Perl" is "Hex-Any/Perl", since the former
is really shorthand for "Any-Hex/Perl".

The notion of inverses carries two important caveats.
The first involves the
semantics of inverses. Consider a transform "A-B". Its inverse, "B-A", is
thought of as reversing the transformation accomplished by "A-B". The degree and
completeness of the reversal, however, is not guaranteed.

For example, consider the "Lower" transform. It has an inverse of "Upper" (this
is a special, non-standard inverse relationship that the transliteration service
knows about). Applying "Lower" to the string "Hello There" yields the string
"hello there". Applying "Upper" to this result then yields "HELLO THERE", which
is not the same as the original string.

Complete and exact reversal **is** possible if the transform has been explicitly
designed to support this. Examples of transforms that support this are "Any-Hex"
and "SCRIPT-Latin", where SCRIPT is a supported transliteration script. The
"SCRIPT-Latin" transforms support exact reversal of well-formed text in SCRIPT
to Latin (via "SCRIPT-Latin") and back to SCRIPT (via "Latin-SCRIPT"). This is
called **round-trip integrity**. They do not, however, support round-trip
integrity from Latin to SCRIPT and back to Latin.

> :point_right: **Note**: *Do not assume that a transform's inverse will provide a complete or exact
reversal.*

The second caveat with inverses has to do with existence. Although any ID can be
inverted, this does not guarantee that the inverse ID actually exists. For
example, if I create a custom transliterator `Latin-Antarean` and register it with
the system, I can then pass the string "Latin-Antarean" to `createInstance()` or
`getInstance()` to get that transform. If I then ask for its inverse, however, the
request will fail, since I have not created and registered "Antarean-Latin" with
the system.
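
Because inversion of a basic ID operates on the ID string alone, the rule can be sketched in a few lines of plain C++. This is only an illustration and not ICU's implementation: the function name `invertBasicId` is hypothetical, the input is assumed to be a well-formed basic ID with no filter and no parenthesized inverse, and the special inverse relationships (such as "Lower" to "Upper") are deliberately not handled.

```cpp
#include <string>

// Hypothetical helper, not part of ICU: inverts a basic ID of the form
// [source "-"] target ["/" variant] by exchanging source and target.
// A missing source is treated as the implicit source "Any".
std::string invertBasicId(const std::string& id) {
    // Split off the optional "/variant" suffix; it is carried over unchanged.
    const std::string::size_type slash = id.find('/');
    const std::string variant =
        (slash == std::string::npos) ? "" : id.substr(slash);
    const std::string pair = id.substr(0, slash);

    // Split "source-target"; with no dash the whole string is the target.
    const std::string::size_type dash = pair.find('-');
    const std::string source =
        (dash == std::string::npos) ? "Any" : pair.substr(0, dash);
    const std::string target =
        (dash == std::string::npos) ? pair : pair.substr(dash + 1);

    return target + "-" + source + variant;
}
```

With this sketch, `invertBasicId("Latin-Greek/UNGEGN")` gives `"Greek-Latin/UNGEGN"` and `invertBasicId("Hex/Perl")` gives `"Hex-Any/Perl"`. It also returns `"Antarean-Latin"` for `"Latin-Antarean"`, which illustrates the existence caveat: a well-formed inverse ID says nothing about whether a transform is registered under that name.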

> :point_right: **Note**: *Any transform ID can be inverted, but the inverse ID may not name an actual
registered transform.*

### Custom Inverses

Consider the transforms "Any-Lower" and "Any-Upper": It is convenient to
associate these as inverses of one another. However, using the standard
procedure for ID inversion on "Any-Lower" yields "Lower-Any", which is not what
we want. To override the standard ID inversion, the inverse ID can be explicitly
stated within the ID string as follows:

`"Any-Lower (Any-Upper)"` or equivalently `"Lower (Upper)"`

When this ID is inverted, the result is "Any-Upper (Any-Lower)". Using this
mechanism, the user can form arbitrary inverse relations when necessary.

When using custom inverses of the form "A-B (C-D)", either "A-B" or "C-D" may be
empty. An empty element is the same as "Null". That is, "A-B ( )" is the same as
"A-B (Null)", and it inverts to the null transform (which does nothing). The
null transform it inverts to has the ID "(A-B)", also written "Null (A-B)", and
inverts back to "A-B ( )". Note that "A-B ( )" is very different from both "A-B"
and "(A-B)":

| ID | Inverse of ID |
|---|---|
| A-B | B-A |
| A-B ( ) | (A-B) |
| (A-B) | A-B ( ) |

For some system transforms, special inverse mappings exist automatically. These
mappings are symmetrical, that is, the right column is the inverse of the left
column, and vice versa. The mappings are:

| | |
|---|---|
| Any-Null | Any-Null |
| Any-NFD | Any-NFC |
| Any-NFKD | Any-NFKC |
| Any-Lower | Any-Upper |

In other words, writing "Any-NFD" is exactly equivalent to writing "Any-NFD
(Any-NFC)" since the system maps the former to the latter internally. However,
one can still alter the mapping of these transforms by specifying an explicit
custom inverse, e.g. "NFD (Lower)".
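
The custom-inverse rule is likewise a rearrangement of the ID string. The sketch below, in plain C++ with no ICU dependency, swaps the two halves of an ID of the form "A-B (C-D)" and treats an empty half as the null transform. The names `trimSpaces` and `invertCustomId` are hypothetical, filters are assumed to be absent, and the empty inverse is written as `()` rather than `( )`; ICU's own ID parser is more general than this.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: strips ASCII spaces from both ends of a string.
static std::string trimSpaces(const std::string& s) {
    const std::string::size_type first = s.find_first_not_of(' ');
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

// Hypothetical helper, not part of ICU: inverts an ID written with an
// explicit inverse, "forward (reverse)", where either half may be empty.
std::string invertCustomId(const std::string& id) {
    const std::string::size_type open = id.find('(');
    const std::string::size_type close = id.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close < open) {
        throw std::invalid_argument("ID has no parenthesized inverse");
    }
    const std::string forward = trimSpaces(id.substr(0, open));
    const std::string reverse = trimSpaces(id.substr(open + 1, close - open - 1));

    if (reverse.empty()) return "(" + forward + ")";  // "A-B ( )" -> "(A-B)"
    if (forward.empty()) return reverse + " ()";      // "(A-B)"   -> "A-B ()"
    return reverse + " (" + forward + ")";            // swap the two halves
}
```

Under these assumptions, `invertCustomId("Any-Lower (Any-Upper)")` gives `"Any-Upper (Any-Lower)"`, and the two rows of the table that involve an empty element round-trip: `"A-B ( )"` becomes `"(A-B)"`, which in turn becomes `"A-B ()"`.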

### Compound IDs

Transliterators are often combined in sequence to achieve a desired
transformation. This is analogous to the composition of mathematical functions.
For example, given a script that converts lowercase ASCII characters from Latin
script to Katakana script, it is convenient to first (1) separate input base
characters and accents, and then (2) convert uppercase to lowercase. (Katakana
is caseless, so it is best to write rules that operate only on the lowercase
Latin base characters and produce corresponding Katakana.) To achieve this, a
**compound transform** can be specified as follows:

```
NFKD; Lower; Latin-Katakana;
```

(In real life, we would probably use "NFD", but we use "NFKD" for explanatory
purposes here.) It is also desirable to modify only Latin script characters. To
do so, a filter may be prefixed to the entire compound transform. This is called
a **global filter** to distinguish it from filters on the individual transforms
within the compound:

```
[:Latin:]; NFKD; Lower; Latin-Katakana;
```

The inverse of such a transform is formed by reversing the list and inverting
each element. In this example, this would be:

```
Katakana-Latin; Upper; NFKC; ([:Latin:]);
```

Note that two special mappings take effect: "Lower" to "Upper" and "NFKD" to
"NFKC". Note also that the global filter is enclosed in parentheses, rendering
it inoperative in the reverse direction.

In this example we probably don't really want to map Latin characters to
uppercase in the reverse direction, so we need to modify the original transform
as follows:

```
[:Latin:]; NFKD; Lower(); Latin-Katakana;
```

Recall that the empty parentheses in "Lower ( )" are shorthand for "Lower
(Null)" where "Null" is the null transform, that is, the transform that leaves
text unchanged.
The inverse of this is "Null (Lower)", also written "(Lower)".
Now the inverse of the entire compound is:

```
Katakana-Latin; (Lower); NFKC; ([:Latin:]);
```

This still isn't quite right, since we really want to recompose our output, in
both directions. We also want to only touch Katakana characters in the reverse
direction. Our final example, modified to address these two concerns, is as
follows:

```
[:Latin:]; NFKD; Lower(); Latin-Katakana; NFC; ([:Katakana:]);
```

This inverts to:

```
[:Katakana:]; NFD; Katakana-Latin; (Lower); NFKC; ([:Latin:]);
```

(In real life, we would probably use only "NFD" and "NFC", but we use the
compatibility normalizers in this example so they can be distinguished.)

Compound IDs are the most complex identifiers that can be formed. Many system
transforms are actually compound transforms that have been aliased to basic IDs.
It is also possible to write a transform rule with embedded instructions for
generating a compound transform; system transforms use this approach as well.

### Formal ID Syntax

Here is a formal description of the identifier syntax. The 'ID' entity can be
passed to `getInstance()` or `createInstance()`.

| ID | := Single_ID \| Compound_ID |
|---|---|
| Single_ID | := filter? Basic_ID ( '(' Basic_ID? ')' )? \| filter? '(' Basic_ID ')' |
| Compound_ID | := ( filter ';' )? ( Single_ID ';' )+ ( '(' filter ');' )? |
| Basic_ID | := Spec \| Spec '-' Spec \| Spec '/' Identifier \| Spec '-' Spec '/' Identifier |
| Spec | := script-name \| locale-name \| Identifier |
| Identifier | := identifier-start identifier-part\* |

Elements enclosed in single quotes are literals. Parentheses group elements.
Vertical bars represent exclusive alternatives. The '?' suffix repeats the
preceding element zero or one times. The '+' suffix repeats the preceding
element one or more times.

A 'script-name' is a string acceptable to the UScript API that specifies a
script. It may be a full script name such as "Latin" or a script abbreviation
such as "Latn". A 'locale-name' is a standard locale name such as "hi_IN". The
'identifier-start' and 'identifier-part' elements are characters defined by the
UCharacter API to start and continue identifier names. Finally, 'filter' is a
valid UnicodeSet pattern.

> :point_right: **Note**: *As of ICU 2.0, the filter must be enclosed in brackets. Top-level Perl-style
patterns are unsupported in 2.0.*

## ICU Transliterators

Currently, there are a number of basic transliterations supplied with ICU. The
following table shows these basic transforms:

### General

| | |
|---|---|
| → Any-Null | Has no effect; leaves input text unchanged. |
| → Any-Remove | Deletes input characters. This is useful when combined with a filter that restricts the characters to be deleted. |
| → Any-Lower, Any-Upper, Any-Title | Converts to the specified case. See [Case Mappings](../casemappings.md) for more information. |
| → Any-NFD, Any-NFC, Any-NFKD, Any-NFKC, Any-FCD, Any-FCC | Converts to the specified normalized form. See [Normalization](../normalization/index.md) for more information. |
| Any-Name | Converts between characters and their Unicode names in curly braces using Perl syntax. For example: ., ↔ \\N{FULL STOP}\\N{COMMA} |
| Any-Hex | Converts between characters and their Unicode code point values. For example: ., ↔ \\u002E\\u002C. Any-Hex/XML uses the &#xXXXX; format. For example: ., ↔ `&#x2E;&#x2C;`. Variants include Any-Hex/C, Any-Hex/Java, Any-Hex/Perl, Any-Hex/XML, and Any-Hex/XML10. Any-Hex, with no variant, is equivalent to Any-Hex/Java, for historical reasons. |
| → Any-Accents | Lets you type e- for e-macron, etc. For example: o' → ó |
| Any-Publishing | Converts between real punctuation and typewriter punctuation. For example: “a” — ‘b’ ↔ "a" -- 'b' |
| → Latin-ASCII | Converts non-ASCII-range punctuation, symbols, and Latin letters to an approximate ASCII-range equivalent. For example: « → '<<', © → '(C)', Æ → AE. Can be combined with Any-Latin to produce a transform that will convert as much as possible to an ASCII-range representation: “Any-Latin; Latin-ASCII”. |
| IPA-XSampa | Converts between IPA characters and the XSampa ASCII-range representation of IPA characters. |
| Fullwidth-Halfwidth | Converts between narrow or half-width characters and full-width. For example: ｱﾙｱﾉﾘｳ tech ↔ アルアノリウ ｔｅｃｈ |
| Latin-NumericPinyin | Converts between a Pinyin Latin representation using tone marks and one using numeric tone indicators. |

### Script/Language

The ICU script/language transforms are based on common standards for the
particular scripts, where possible. In some cases, the transforms are augmented
to support reversibility.

> :point_right: **Note**: *Standard transliteration methods often do not follow the pronunciation rules of
any particular language in the target script. For more information on the design
of transliterations, see the Guidelines for Script/Language Transliterations
section below.*

The built-in script transforms are:

| | |
|---|---|
| Latin | Arabic, Armenian, Bopomofo, Cyrillic, Georgian, Greek (with UNGEGN variant), Han (with Names variant → Latin), Hangul, Hebrew, Hiragana, Indic, Jamo, Katakana, Syriac, Thaana, Thai |
| Indic | Indic |
| Hiragana | Katakana |
| Simplified (Hans) | Traditional (Hant) |

Indic includes Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil,
and Telugu. ICU can transliterate from Latin to any of these scripts and back,
and from any Indic script to any other Indic script. For example, you can
transliterate from Kannada to Gujarati, or from Latin to Oriya.

In addition, ICU may supply transliterations that are specific to language
pairs, or between a language and a script. For example, ICU could have a ru-en
(Russian-English) transform.

As with locales, there is a fallback mechanism. If the Russian-English transform
is requested and is not available, then ICU will search for a Russian-Latin
transform. If the Russian-Latin transform is not available, ICU will search for
a Cyrillic-Latin transform.

For information on the precise makeup of each of the script transforms, see
the Script Transliterator Sources section below.

## Guidelines for Script/Language Transliterations

There are a number of generally desirable guidelines for script
transliterations. These guidelines are rarely satisfied simultaneously, so
constructing a reasonable transliteration is always a process of balancing
different requirements. These requirements are most important for people who are
building transliterations, but are also useful as background information for
users. The following lists the general guidelines for transliterations:

1. complete: every well-formed sequence of characters in the source script
   should transliterate to a sequence of characters from the target script.

2. predictable: the letters themselves (without any knowledge of the languages
   written in that script) should be sufficient for the transliteration, based
   on a relatively small number of rules. This allows the transliteration to be
   performed mechanically.

3. pronounceable: transliteration is not as useful if the process simply maps
   the characters without any regard to their pronunciation. Simply mapping
   "αβγδεζηθ..." to "abcdefgh..." would yield strings that might be complete
   and unambiguous, but cannot be pronounced.

4. unambiguous: it is always possible to recover the text in the source script
   from the transliteration in the target script.
   Someone who knows the
   transliteration rules will be able to recover the precise spelling of the
   original source text (for example, it is possible to go from Elláda back to
   the original Ελλάδα). It is possible to define a reverse (or inverse)
   mapping. Thus, this property is sometimes called reversibility (or
   invertibility).

### Ambiguity

In transliteration, multiple characters may produce ambiguities unless the rules
are carefully designed. For example, the Greek character PSI (ψ) maps to ps, but
ps could also (theoretically) result from the sequence PI, SIGMA (πσ) since PI
(π) maps to p and SIGMA (σ) maps to s.

The Japanese transliteration standards provide a good mechanism for handling
similar ambiguities. Using the Japanese transliteration standards, whenever an
ambiguous sequence in the target script does not result from a single letter,
the transform uses an apostrophe to disambiguate it. For example, it uses that
procedure to distinguish between man'ichi and manichi. Using this procedure, the
Greek sequence PI SIGMA (πσ) maps to p's. This method is recommended for all
script transliteration methods.

> :point_right: **Note**: *Some characters in a target script are not normally found outside of certain
contexts. For example, the small Japanese "ya" character, as in "kya" (キャ), is
not normally found in isolation. To handle such characters, ICU uses a tilde.
For example, to display an isolated small "ya", type "~ya". To represent a
non-final Greek sigma (ασ) at the end of a word, use "a~s". To represent a final
sigma in a non-final position (ςα), type "~sa".*

For the general script transforms, a common technique for reversibility is to
use extra accents to distinguish between letters that may not be otherwise
distinguished.
For example, the following shows Greek text that is mapped to
fully reversible Latin:

> **`Greek-Latin`**
> | | |
> |---|---|
> | τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. | tí phḗis; graphḕn sé tis, hōs éoike, gégraptai: ou gàr ekeînó ge katagnṓsomai, hōs sỳ héteron. |

If the user wants a version without certain accents, then a transform can be
used to remove the accents. For example, the following transliterates to Latin
but removes the macron accents on the long vowels.

> **`Greek-Latin; nfd; [\u0304] remove; nfc`**
> | | |
> |---|---|
> | τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. | tí phéis; graphèn sé tis, hos éoike, gégraptai: ou gàr ekeînó ge katagnósomai, hos sỳ héteron. |

The following transliterates to Latin but removes all accents:

> **`Greek-Latin; nfd; [:nonspacing mark:] remove; nfc`**
> | | |
> |---|---|
> | τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι, ὡς σὺ ἕτερον. | ti pheis; graphen se tis, hos eoike, gegraptai: ou gar ekeino ge katagnosomai, hos sy heteron. |

### Pronunciation

Standard transliteration methods often do not follow the pronunciation rules of
any particular language in the target script. For example, the Japanese Hepburn
system uses a "j" that has the English phonetic value (as opposed to French,
German, or Spanish), but uses vowels that do not have the standard English
sounds. A transliteration method might also require some special knowledge to
have the correct pronunciation. For example, in the Japanese kunrei-siki system,
"tu" is pronounced as "tsu". This is similar to situations where there are
different languages within the same script. For example, knowing that the word
Gewalt comes from German allows a knowledgeable reader to pronounce the "w" as a
"v".

In some cases, transliteration may be heavily influenced by tradition. For
example, the modern Greek letter beta (β) sounds like a "v", but a transform may
continue to use a b (as in biology). In that case, the user would need to know
that a "b" in the transliterated word corresponds to beta (β) and is to be
pronounced as a "v" in modern Greek. Letters may also be transliterated
differently according to their context to make the pronunciation more
predictable. For example, since the Greek sequence GAMMA GAMMA (γγ) is
pronounced as "ng", the first GAMMA can be transcribed as an "n".

> :point_right: **Note**: *In general, predictability means that when transliterating Latin script to
other scripts, English text will not produce phonetic results. This is because
the pronunciation of English cannot be predicted easily from the letters in a
word: e.g. grove, move, and love all end with "ove", but are pronounced very
differently.*

### Cautions

Reversibility may require modifications of traditional transcription methods.
For example, there are two standard methods for transliterating Japanese
katakana and hiragana into Latin letters. The kunrei-siki method is unambiguous.
The Hepburn method can be more easily pronounced by foreigners but is ambiguous.
In the Hepburn method, both ZI (ジ) and DI (ヂ) are represented by "ji" and both
ZU (ズ) and DU (ヅ) are represented by "zu". A slightly amended version of
Hepburn that uses "dji" for DI and "dzu" for DU is unambiguous.

When a sequence of two letters maps to one, case mappings (uppercase and
lowercase) must be handled carefully to ensure reversibility. For cased scripts,
the two letters may need to have different cases, depending on the next letter.
For example, the Greek letter PHI (Φ) maps to PH in Latin, but Φο maps to Pho,
and not to PHo.

Some scripts have characters that take on different shapes depending on their
context. Usually, this is done at the display level (such as with Arabic) and
does not require special transliteration support. However, in a few cases this
is represented with different character codes, such as in Greek and Hebrew. For
example, a Greek SIGMA is written in a final form (ς) at the end of words, and a
non-final form (σ) in other locations. This requires the transform to map
different characters based on the context.

> :point_right: **Note**: *It is useful for the reverse mapping to be complete so that arbitrary strings
in the target script can be reasonably mapped back to the source script.
Complete reverse mapping makes it much easier to do mechanical quality checks
and so on. For example, even though the letter "q" might not be necessary in a
transliteration of Greek, it can be mapped to a KAPPA (κ). Such reverse mappings
will not, in general, be unambiguous.*

## Using Transliterators

Transliterators have APIs in C, C++, and Java™. Only the C++ APIs are listed
here. For more information on the C, Java, and other APIs, see the relevant API
docs.

To list the available Transliterators, use code like the following:

```
count = Transliterator::countAvailableIDs();
myID = Transliterator::getAvailableID(n);
```

The ID should not be displayed to users as it is for internal use only. A
separate string, one that can be localized to different languages, is obtained
with a static method. (This method is static to allow the translated names to be
augmented without changing the code.)
To get a localized name for use in a GUI,
use the following:

```
Transliterator::getDisplayName(myID, france, nameForUser);
```

To create a Transliterator, use the following:

```
UErrorCode status = U_ZERO_ERROR;
Transliterator *myTrans = Transliterator::createInstance("Latin-Greek",
    UTRANS_FORWARD, status);
```

To get a pre-made compound transform, use a series of IDs separated by ";". For
example:

```
myTrans = Transliterator::createInstance(
    "any-NFD; [:nonspacing mark:] any-remove; any-NFC", UTRANS_FORWARD, status);
```

To convert an entire string, use the following:

```
myTrans->transliterate(myString);
```

For more complex cases, such as keyboard input, the following full method
provides more control:

```
myTrans->transliterate(replaceable, positions, complete);
```

The Replaceable interface (or abstract class in C++) allows more complex text to
be used with Transliterators, such as styled text. In ICU4J, a wrapper is
supplied for StringBuffer. A wrapper is an interface to text that handles very
few operations. For example, the interface can access characters and replace one
substring with another. By using this interface, replacement text can take on
the same style as the text it is replacing, so that style information is not
lost. With a replaceable interface to HTML or XML, even higher level structure
can be preserved.

The positions parameter contains information about the range of text that should
be transliterated, plus the possibly larger range of text that can serve as
context.

The `complete` parameter indicates whether the text up to the limit should be
considered complete. For keyboard input, the `complete` parameter should
normally be false. It is set to true only when the conversion is complete.
For example, suppose that a transform converts "sh" to X, and "s"
in other cases to Y. If the complete parameter is true, then a dangling "s"
converts to Y; when the complete parameter is false, then the dangling "s"
should not be converted, since there is more text to come.

In keyboard input, normally start/cursor and limit/end are set to the selection
at the time the transform is chosen. The following shows how the selection is
chosen:

```
positions.start = positions.cursor = selection.getStart();
positions.limit = positions.end = selection.getEnd();
```

As the user types or inserts `inputChars`, call the following:

```
replaceable.replace(positions.limit, positions.limit, inputChars); // update the text
positions.limit += inputChars.length(); // update the positions
myTrans->transliterate(replaceable, positions, false);
```

If the user performs an action that indicates he or she is done with the text,
then transliterate is called one last time, with `complete` set to true:

```
myTrans->transliterate(replaceable, positions, true);
```

Transliterator objects are stateless. They retain no information between calls
to `transliterate()`.

The statelessness might seem to limit the complexity of the operations that can
be performed. In practice, complex transliterations happen by delaying the
replacement of text until it is known that no other replacements are possible.
In other words, although the Transliterator objects are stateless, the source
text itself embodies all the needed information and delayed operation allows
arbitrary complexity.

## Designing Transliterators

Many people use only the supplied transforms. However, there are two different
ways of designing new transforms. Many transforms can be produced without subclassing,
simply by designing rules for a RuleBasedTransliterator.
If conversions can be
done algorithmically much more compactly than with a long list of rules, then
consider subclassing Transliterator directly. For example, ICU itself supplies
specialized subclasses for the following:

1. Hangul Jamo
2. Any Hex
3. Wrapping the string functions for normalization, case mapping, etc.

### Subclassing Transliterators

Subclassers must override `handleTransliterate(Replaceable text, Positions
positions, boolean complete)`. They can override some of the other methods for
efficiency, but must ensure that the results are identical. In `handleTransliterate`,
convert the text from `positions.cursor` up to `positions.limit`. The surrounding
text from `positions.start` to `positions.end` may be taken into account as
context when doing this conversion, but text outside the range from cursor to
limit should not itself be converted. Never look at any
characters before `positions.start` or after `positions.end`.

The `complete` parameter indicates whether or not the text up to limit is
complete. For example, suppose that you would convert "sh" to X, and "s" in
other cases to Y. If the complete parameter is true, then a dangling "s"
converts to Y; when the complete parameter is false, then the dangling "s"
should not be converted. When you return from the method, `positions.cursor`
should be set to the furthest position you processed. Typically this will be up
to `limit`; if there was an incomplete sequence at the end, `cursor` should be
set to the position just before that sequence.

### Rule-Based Transliterators

ICU supplies the foundation for producing well-behaved transliterations and
supplies a number of typing transliterations for different scripts. The simplest
mechanism for producing transliterations is called a RuleBasedTransliterator.
The RuleBasedTransliterator is a data-based class that allows transliterations
to be built up with a series of rules.
These rules provide a specialized set of
context-sensitive matching operations. The operations are similar to
regular-expression rules, but adapted to the specific domain of
transliterations.

The simplest rule is a conversion rule, which replaces one string of characters
with another. The conversion rule takes the following form:

```
xy > z ;
```

This converts any substring "xy" into "z". Rules are executed in order, so:

```
sch > sh ;
ss > z ;
```

These rules transform "bass school" into "baz shool". The transform
walks through the string from start to finish. Thus given the rules above
"bassch" will convert to "bazch", because the "ss" rule matches earlier in the
string than the "sch" rule (later, we'll see a way to override this behavior). If
two rules can both apply at a given point in the string, then the transform
applies the first rule in the list.

All of the ASCII characters except numbers and letters are reserved for use in
the rule syntax. Normally, these characters do not need to be converted.
However, to convert them use either a pair of single quotes or a backslash. The pair
of single quotes can be used to surround a whole string of text. The backslash
affects only the character immediately after it. For example, to convert from
two less-than signs to the phrase "much less than", use one of the following
rules:

```
\<\< > much\ less\ than ;
'<<' > 'much less than' ;
'<<' > much' 'less\ than ;
```

*Spaces may be inserted anywhere without any effect on the rules. Use extra space to separate items out for clarity without worrying about the effects. This feature is particularly useful with combining marks; it is handy to put some spaces around them to separate them from the surrounding text. The following is an example:*

```
 ͅ> i ; # an iota-subscript diacritic turns into an i.
```

*For a real space in the rules, place quotes around it. For a real backslash,
either double it (`\\`) or quote it (`'\'`). For a real single quote, double it (`''`)
or place a backslash before it (`\'`). Each of the following means the same thing:*

```
'can''t go'
'can\'t go'
can\'t\ go
can''t' 'go
```

*Any text that starts with a hash mark and concludes a line is a comment. Comments help document how the rules work. The following shows a comment in a rule:*

```
x > ks ; # change every x into ks
```

We can use `\u` notation instead of any letter. For instance, instead of using
the Greek π, we could write:

```
\u03C0 > p ;
```

We can also define and use variables, such as:

```
$pi = \u03C0 ; $pi > p ;
```

#### Dual Rules

Rules can also specify what happens when an inverse transform is formed. To do
this, we reverse the direction of the ">" sign. Thus the above example becomes:

```
$pi < p ;
```

With the inverse transform, "p" will convert to the Greek π. These two
directions can be combined together into a dual conversion rule by using the
"<>" operator, yielding:

```
$pi <> p ;
```

#### Context

Context can be used to have the results of a transformation be different
depending on the characters before or after. The following means "Remove
hyphens, but only when they follow lowercase letters":

```
[:lowercase letter:] { '-' > '' ;
```

> :point_right: **Note**: *The context itself (`[:lowercase letter:]`) is unaffected by the replacement;
only the text between the curly braces is changed.*

#### Revisiting

If the resulting text contains a vertical bar "|", then that means that
processing will proceed from that point and that the transform will revisit part
of the resulting text.
For example, if we have:

```
x > y | z ;
z a > w;
```

then the string "xa" will convert to "yw". First, "xa" is converted to "yza".
Then the processing will continue from after the character "y", pick up the
"za", and convert it. Had we not had the "|", the result would have been simply
"yza".

#### Example

The following shows how these features are combined together in the
Transliterator "Any-Publishing". This transform converts the ASCII typewriter
conventions into text more suitable for desktop publishing (in English). It
turns straight quotation marks or UNIX style quotation marks into curly
quotation marks, fixes multiple spaces, and converts double-hyphens into a dash.

```
# Variables
$single = \' ;
$space = ' ' ;
$double = \" ;
$back = \` ;
$tab = '\u0009' ;

# the following is for spaces, line ends, (, [, {, ...
$makeRight = [[:separator:][:start punctuation:][:initial punctuation:]] ;

# fix UNIX quotes
$back $back > “ ; # generate left d.q.m. (double quotation mark)
$back > ‘ ;

# fix typewriter quotes, by context
$makeRight { $double <> “ ; # convert a double to left d.q.m. after certain chars
^ { $double > “ ; # convert a double at the start of the line.
$double <> ” ; # otherwise convert to a right q.m.

$makeRight {$single} <> ‘ ; # do the same for s.q.m.s
^ {$single} > ‘ ;
$single <> ’;

# fix multiple spaces and hyphens
$space {$space} > ; # collapse multiple spaces
'--' <> — ; # convert fake dash into real one
```

### Rule Syntax

The following describes the full format of the list of rules used to create a
RuleBasedTransliterator. Each rule in the list is terminated by a semicolon. The
list consists of the following:

1. an optional filter rule
2. zero or more transform rules
3. zero or more variable-definition rules
4. zero or more conversion rules
5. an optional inverse filter rule

The filter rule, if present, must appear at the beginning of the list, before
any of the other rules. The inverse filter rule, if present, must appear at the
end of the list, after all of the other rules. The other rules may occur in any
order and be freely intermixed.

The rule list can also generate the inverse of the transform. In that case, the
inverse of each of the rules is used, as described below.

#### Transform Rules

Each transform rule consists of two colons followed by a transform name. For
example:

```
:: NFD ;
```

The inverse of a transform rule follows the same conventions as when we create a
transform by name. For example:

```
:: lower () ; # only executed for the normal
:: (lower) ; # only executed for the inverse
:: lower ; # executed for both the normal and the inverse
```

#### Variable Definition Rules

Each variable definition is of the following form:

```
$variableName = contents ;
```

The variable name can contain letters and digits, but must start with a letter.
More precisely, the variable names use Unicode identifiers as defined by the
identifier properties in ICU. The identifier properties allow for the use of
letters and digits from any script. See the Unicode class for C++ and the UCharacter
class for Java.

The contents of a variable definition is any sequence of Unicode sets and
characters. For example:

```
$mac = M [aA] [cC] ;
```

Variables are only replaced within other variable definition rules and within
conversion rules. They have no effect on transform rules.

#### Filter Rules

A filter rule consists of two colons followed by a UnicodeSet. This filter is
global in that only the characters matching the filter will be affected by any
transform rules or conversion rules.
The inverse filter rule consists of two
colons followed by a UnicodeSet in parentheses. This filter is also global for
the inverse transform.

For example, the Hiragana-Latin transform can be implemented by "pivoting"
through the Katakana converter, as follows:

```
# don't touch any katakana that was in the text!
:: [:^Katakana:] ;

:: Hiragana-Katakana;
:: Katakana-Latin;

# don't touch any katakana that was in the text
# for the inverse either!
:: ([:^Katakana:]) ;
```

The filters keep the transform from mistakenly converting any of the "pivot"
characters. Note that this is a case where a rule list contains no conversion
rules at all, just transform rules and filters.

#### Conversion Rules

Conversion rules can be forward, backward, or dual. The complete conversion
rule syntax is described below:

##### Forward

A forward conversion rule is of the following form:

```
before_context { text_to_replace } after_context > completed_result | result_to_revisit ;
```

If there is no before_context, then the "{" can be omitted. If there is no
after_context, then the "}" can be omitted. If there is no result_to_revisit,
then the "|" can be omitted. A forward conversion rule is only executed for the
normal transform and is ignored when generating the inverse transform.

##### Backward

A backward conversion rule is of the following form:

```
completed_result | result_to_revisit < before_context { text_to_replace } after_context ;
```

The same omission rules apply as in the case of forward conversion rules. A
backward conversion rule is only executed for the inverse transform and is
ignored when generating the normal transform.

##### Dual

A dual conversion rule combines a forward conversion rule and a backward
conversion rule into one, as discussed above.
It is of the form:

```
a { b | c } d <> e { f | g } h ;
```

When generating the normal transform and the inverse, the revisit mark "|" and
the before and after contexts are ignored on the sides where they don't belong.
Thus, the above is exactly equivalent to the sequence of the following two
rules:

```
a { b c } d > f | g ;
b | c < e { f g } h ;
```

#### Intermixing Transform Rules and Conversion Rules

Starting in ICU 3.4, transform rules and conversion rules may be freely
intermixed. (In earlier versions of ICU, transform rules were only allowed at
the beginning or end of the rule set, immediately after the global filter or
immediately before the reverse global filter.) Inserting a transform rule into
the middle of a set of conversion rules has an important side effect.

Normally, conversion rules are considered together as a group. The only time
their order in the rule set is important is when more than one rule matches at
the same point in the string. In that case, the one that occurs earlier in the
rule set wins. In all other situations, when multiple rules match overlapping
parts of the string, the one that matches earlier wins.

Transform rules apply to the whole string. If you have several transform rules
in a row, the first one is applied to the whole string, then the second one is
applied to the whole string, and so on. To reconcile this behavior with the
behavior of conversion rules, transform rules have the side effect of breaking a
surrounding set of conversion rules into two groups: First all of the conversion
rules before the transform rule are applied as a group to the whole string in
the usual way, then the transform rule is applied to the whole string, and then
the conversion rules after the transform rule are applied as a group to the
whole string.
For example, consider the following rules:

```
abc > xyz;
xyz > def;
::Upper;
```

If you apply these rules to “abcxyz”, you get “XYZDEF”. If you move the
“::Upper;” to the middle of the rule set and change the cases accordingly...

```
abc > xyz;
::Upper;
XYZ > DEF;
```

...applying this to “abcxyz” produces “DEFDEF”. This is because “::Upper;”
causes the transliterator to reset to the beginning of the string: The first
rule turns the string into “xyzxyz”, the second rule uppercases the whole thing
to “XYZXYZ”, and the third rule turns this into “DEFDEF”.

This can be useful when a transform naturally occurs in multiple “passes.”
Consider this rule set:

```
[:Separator:]* > ' ';
'high school' > 'H.S.';
'middle school' > 'M.S.';
'elementary school' > 'E.S.';
```

If you apply these rules to “high school”, you get “H.S.”, but if you apply them to
“high school” (with two spaces), you just get “high school” (with one space).
To
have “high school” (with two spaces) turn into “H.S.”, you'd either have to have
the first rule back up some arbitrary distance (far enough to see “elementary”,
if you want all the rules to work), or you'd have to include the whole left-hand
side of the first rule in the other rules, which can make them hard to read and
maintain:

```
$space = [:Separator:]*;
high $space school > 'H.S.';
middle $space school > 'M.S.';
elementary $space school > 'E.S.';
```

Instead, you can simply insert “::Null;” in order to get things to work right:

```
[:Separator:]* > ' ';
::Null;
'high school' > 'H.S.';
'middle school' > 'M.S.';
'elementary school' > 'E.S.';
```

The “::Null;” has no effect of its own (the null transliterator, by definition,
doesn't do anything), but it splits the other rules into two “passes”: The first
rule is applied to the whole string, normalizing all runs of whitespace into
single spaces, and then we start over at the beginning of the string to look for
the phrases. “high school” (with four spaces) gets correctly converted to
“H.S.”.

This can also sometimes be useful with rules that have overlapping domains.
Consider this rule set from before:

```
sch > sh ;
ss > z ;
```

Applying these rules to “bassch” results in “bazch” because “ss” matches earlier in
the string than “sch”. If you really wanted “bassh” (that is, if you wanted the
first rule to win even when the second rule matches earlier in the string), you'd
either have to add another rule for this special case...

```
sch > sh ;
ssch > ssh;
ss > z ;
```

...or you could use a transform rule to apply the conversions in two passes:

```
sch > sh ;
::Null;
ss > z ;
```

#### Masking

When transforms are built, a warning is returned if rules are masked.
This
happens when a rule can never be executed because an earlier one would always
match first.

```
a > b ;
ac > d ; # masked!
```

In this case, for example, every string that could have a match for "ac" will
already match "a", because the rules are executed in order. However, the
transform compiler will not currently catch cases that would be masked because
of the use of UnicodeSets or regular expression operators, such as the
following:

```
a } [:L:] > b ;
ac > d ; # masked, but not caught by the compiler
```

#### Inverse Summary

The following listings show how the same rule list generates two different
transforms, where the inverse is restated in terms of forward rules (this is a
contrived example, simply to show the reordering):

##### Original Rules

```
:: [:Uppercase Letter:] ;
:: latin-greek ;
:: greek-japanese ;
x <> y ;
z > w ;
r < m ;
:: upper;
a > b ;
c <> d ;
:: any-publishing ;
:: ([:Number:]) ;
```

##### Forward

```
:: [:Uppercase Letter:] ;
:: latin-greek ;
:: greek-japanese ;
x > y ;
z > w ;
:: upper ;
a > b ;
c > d ;
:: any-publishing ;
```

##### Inverse

```
:: [:Number:] ;
:: publishing-any ;
d > c ;
:: lower ;
y > x ;
m > r ;
:: japanese-greek ;
:: greek-latin ;
```

> :point_right: **Note**: *The irrelevant rules (the inverse filter rule and the rules containing
<) are omitted (ignored, actually) in the forward direction. Notice also how
things are reversed: the transform rules are inverted and happen in the opposite
order, and the groups of conversion rules are also executed in the opposite
relative order (although the rules within each group are executed in the same
order).*

#### Function Calls

As of ICU 2.1, rule-based
transforms can invoke other transforms. The transform
being invoked must be registered with the system before it can be used in a
rule. The syntax for a function call resembles a Perl subroutine call:

```
( [a-zA-Z] ) ( [a-zA-Z]* ) > &Any-Upper($1) &Any-Lower($2) ;
```

This example transforms strings of ASCII letters to have an initial uppercase
letter followed by lowercase letters. (In practice, you would use the `Any-Title`
transform to do proper titlecasing.)

The formal syntax is:

```
'&' Basic-id '(' Text-arg ')'
```

Elements in single quotes are literals. Basic-id is a basic ID, as described
earlier. It specifies a source, target, and optional variant, but does not
include a filter, explicit reverse, or compound elements. Text-arg is any text
that may appear on the output side of a rule. This means nested function calls
are supported.

For more information on the use of rules, and more examples of the syntax in
use, see the [tutorial](./rules.md).

### Regular Expression

The rules are similar to Regular Expressions in offering: Variables, Property
matches, Contextual matches, Rearrangement ($1, $2…), and Quantifiers (\*, +,
?). They are more powerful in offering: Ordered Rules, Cursor Backup,
Buffered/Keyboard support. They are less powerful in that they have only greedy
quantifiers, no backup (so no X | Y), and no input-side back references.

Here is a simple example that shows the difference between a set of
Transliterator rules, and successively applying regular expression replacements.

Since the transform processes each of its rules at each point, it catches the yx
before the xy in the second case. Since each of the regular expressions is
evaluated over the whole string, that isn't possible.
Simply using multiple
regular expressions can't account for the interaction and ordering of characters
and rules. (You can, however, simulate the regex behavior with transform rules
by using a transform rule to split the conversion rules into passes.)

For more details on constructing rules, see the [Transliterator Rule
Tutorial](rules.md).

## Script Transliterator Sources

Currently ICU offers script transliterations between Latin and certain other
scripts (such script transliterations are called romanizations), plus
transliterations between the Indic scripts (excluding Urdu). Additional
romanizations and other script transliterations will be added in the future. In
general, ICU follows the [UNGEGN: Working Group on Romanization
Systems](http://www.eki.ee/wgrs/) where possible. The following describes the
sources used.

Except where otherwise noted, all of these systems are designed to be
reversible. For bicameral scripts (those with upper and lower case), case may
not be completely preserved. The transliterations are also designed to be
complete for the letters a-z. A fallback is used for a letter that is not used
in the transliteration.

### Korean

There are many romanizations of Korean. The default transliteration follows the
[Korean Ministry of Culture & Tourism
Transliteration](http://www.korea.net/korea/kor_loca.asp?code=A020303)
regulations with the clause 8 variant for reversibility:

8. When it is necessary to convert Romanized Korean back to Hangul in special
cases such as in academic articles, Romanization is done according to Hangul
spelling and not pronunciation. Each Hangul letter is Romanized as explained in
section 2 except that ㄱ, ㄷ, ㅂ, ㄹ are always written as g, d, b, l. When ㅇ has no
sound value, it is replaced by a hyphen. A hyphen may also be used when it is
necessary to distinguish between syllables.

There is one other variation: an apostrophe is used instead of a hyphen, since
it has better title casing behavior. To change this, see the
[Modifications](#modifications) section below.

### Japanese

The default transliteration for Japanese uses a slight variant of the
Hepburn system. With the Hepburn system, both ZI (ジ) and DI (ヂ) are represented by
"ji" and both ZU (ズ) and DU (ヅ) are represented by "zu". This is amended
slightly for reversibility by using "dji" for DI and "dzu" for DU.

The Katakana transliteration is reversible. Hiragana-Katakana transliteration is
not completely reversible since there are several Katakana letters that do not
have corresponding Hiragana equivalents. Also, the length mark is not used with
Hiragana. The Hiragana-Latin transliteration is also not reversible since
internally it is a combination of Hiragana-Katakana and Katakana-Latin.

### Greek

The default transliteration uses a standard transcription for Greek. The
transliteration is aimed at preserving etymology. The ISO 843
variant has the following differences:

| Greek | Default | ISO 843 |
|---|---|---|
| β | b | v |
| γ\* | n | g |
| η | ē | ī |
| ̔ | h | (omitted) |
| ̀ | ̀ | (omitted) |
| ~ | ~ | (omitted) |

\* before γ, κ, ξ, χ

### Cyrillic

Cyrillic generally follows ISO 9 for the base Cyrillic set. There are tentative
plans to add extended Cyrillic characters in the future, plus variants for GOST
and other national standards.

### Indic

The default romanization uses the ISCII standard with some minor modifications
for reversibility. Internally, all Indic scripts are transliterated by
converting first to an internal form, called Interindic, then from Interindic to
the target script.

Transliteration of Indic scripts in ICU follows the ISO 15919 standard for
Romanization of Indic scripts using diacritics. Internally, all Indic scripts
are transliterated by converting first to an internal form, called Inter-Indic,
then from Inter-Indic to the target script. ISO 15919 differs from ISCII 91 in
application of diacritics for certain characters. These differences are shown in
the following example (illustrated with Devanagari, although the same principles
apply to the other Indic scripts):

| Devanagari | ISCII 91 | ISO 15919 |
|---|---|---|
| ऋ | ṛ | r̥ |
| ऌ | ḻ | l̥ |
| ॠ | ṝ | r̥̄ |
| ॡ | ḻ̄ | l̥̄ |
| ढ़ | d̂ha | ṛha |
| ड़ | d̂a | ṛa |

> :point_right: **Note**: *With some fonts the diacritics will not be correctly placed on the base
letters. The macron on a lowercase L may look particularly bad.*

Transliteration rules in Indic are reversible with the exception of the ZWJ and
ZWNJ used to request explicit rendering effects. For example:

| Devanagari | Romanization | Note |
|---|---|---|
| क्ष | kṣa | normal |
| क्‌ष | kṣa | explicit halant requested |
| क्‍ष | kṣa | half-consonant requested |

There are two particular instances where transliterations may produce unexpected
results: (1) where a halant after a consonant is implied by the romanization (in
such cases the vowel needs to be explicitly written out), and (2) with the
transliteration of 'c'.

For example:

| Devanagari | Romanization |
|---|---|
| सेन्गुप्त | Sēngupta |
| सेनगुप्त | Sēnagupta |
| मोनिच | Monica |
| मोनिक | Monika |

### Modifications

It is easy to use transforms to create variants of the defaults.
For example, to
create a variant of Korean that uses hyphens instead of apostrophes, use the
following rules:

```
:: Latin-Hangul ;
'' <> '-' ;
```

### More Information

For more information, see:

1. [UNGEGN: Working Group on Romanization Systems](http://www.eki.ee/wgrs/)

2. [Transliteration of Non-Roman Alphabets and Scripts (Søren
   Binks)](http://transliteration.eki.ee/)

3. [Standards for Archival Description:
   Romanization](http://www.archivists.org/catalog/stds99/chapter8.html)

4. [ISO-15919
   (Hindi)](http://transliteration.eki.ee/pdf/Hindi-Marathi-Nepali.pdf)

5. [ISO-15919 (Gujarati)](http://transliteration.eki.ee/pdf/Gujarati.pdf)

6. [ISO-15919 (Kannada)](http://transliteration.eki.ee/pdf/Kannada.pdf)

7. [ISCII-91](http://www.cdacindia.com/html/gist/down/iscii_d.asp)