Unicode Transliteration Guidelines
This document describes guidelines for the creation and use of CLDR
transliterations. Preliminary
charts
are available for the existing transliterations; be sure to read the known issues
there. Please file any feedback
on this document or those charts at
Locale Bugs.
Display. Some of the characters in this
document may not be visible in your browser, and with some fonts the diacritics
will not be correctly placed on the base letters. See
Display Problems.
Transliteration is the general process of converting characters from one script
to another, where the result is roughly phonetic for languages in the target script.
For example, "Phobos" and "Deimos" are transliterations of Greek mythological "Φόβος"
and "Δεῖμος" into Latin letters, used to name the moons of Mars.
Transliteration is not translation. Rather, transliteration is the conversion
of letters from one script to another without translating the underlying words.
The following shows a sample of transliteration systems:
Sample Transliteration Systems
Source |
Translation |
Transliteration |
System |
Αλφαβητικός |
Alphabetic |
Alphabētikós |
Classic |
|
Alfavi̱tikós |
UNGEGN |
しんばし |
new bridge (district in Tokyo) |
shimbashi |
Hepburn |
sinbasi |
Kunrei |
яйца Фаберже |
Fabergé eggs |
yaytsa Faberzhe |
BGN/PCGN |
jajca Faberže |
Scholarly |
âjca Faberže |
ISO |
While an English speaker may
not recognize that the Japanese word kyanpasu is equivalent
to the English word campus, the word kyanpasu is still far easier
to recognize and interpret than if the letters were left in the original script.
There are several situations where this transliteration is especially useful,
such as the following. See the sidebar for examples.
Sample Transliterations
Source |
Transliteration |
김,
국삼 |
Gim,
Gugsam |
김,
명희 |
Gim,
Myeonghyi |
정,
병호 |
Jeong, Byeongho
|
... |
... |
たけだ, まさゆき |
Takeda, Masayuki
|
ますだ, よしひこ |
Masuda, Yoshihiko
|
やまもと, のぼる
|
Yamamoto, Noboru
|
... |
... |
Ρούτση, Άννα |
Roútsē, Ánna |
Καλούδης, Χρήστος |
Kaloúdēs, Chrḗstos |
Θεοδωράτου, Ελένη |
Theodōrátou, Elénē |
- When a user views names that are entered in a world-wide database, it is
extremely helpful to view and refer to the names in the user's native script.
- When the user performs searching and indexing tasks, transliteration can
retrieve information in a different script.
- When a service engineer is sent a program dump that is filled with characters
from foreign scripts, it is much easier to diagnose the problem when the text
is transliterated and the service engineer can recognize the characters.
The term transliteration
is sometimes given a narrow meaning, implying that the transformation is reversible (sometimes called
lossless). In CLDR this is not the case;
the term transliteration
is interpreted broadly to mean both reversible and non-reversible transforms of
text. (Note that even if theoretically a transliteration
system is supposed to be reversible, in source standards it is often not
specified in sufficient detail in the edge cases to actually be reversible.) A
non-reversible transliteration is often called a transcription, or called
a lossy or ambiguous transcription.
Note that reversibility is generally
only in one direction, so for native to Latin a transliteration may be reversible, but not the
contrary. For example, Hangul is reversible, in that any Hangul to Latin to
Hangul should provide the same Hangul as the input. Thus we have the following:
갗
→
gach →
갗
However, for completeness, many Latin
characters have fallbacks. This means that more than one Latin character may map to the same
Hangul. Thus from Latin we don't have reversibility, because two
different Latin source strings round-trip back to the same Latin string.
gach
→
갗 →
gach
gac
→
갗 →
gach
Transliteration can also be used to convert unfamiliar letters within the same
script, such as converting Icelandic THORN (þ) to th. These are not typically reversible.
There is an online demo using released CLDR data
at ICU Transform Demo.
There are many systems for transliteration between languages: the same text can
be transliterated in many different ways. For example, for the Greek example above,
the transliteration is classical, while the UNGEGN
alternate has different correspondences, such as φ → f instead of φ → ph.
CLDR provides for generic mappings from script to script (such as Cyrillic-Latin),
and also language-specific variants (Russian-French, or Serbian-German). There can
also be semi-generic mappings, such as Russian-Latin or Cyrillic-French. These can
be referred to, respectively, as script transliterations, language-specific transliterations, or
script-language transliterations. Transliterations from other scripts to Latin are also called
Romanizations.
Even within particular languages, there can be variant systems according to different
authorities, or even varying across time (if the authority for a system changes its recommendation).
The canonical identifier that CLDR uses for these has the form:
source-target/variant
The source (and target) can be a language or script, either using the English
name or a locale code. The variant should specify the authority for the system, and if necessary
for disambiguation,
the year. For example, the identifier for the Russian to Latin transliteration according
to the UNGEGN system would be:
- ru-und_Latn/UNGEGN, or
- Russian-Latin/UNGEGN
If there were multiple versions of these over time, the variant would be, say,
UNGEGN2006.
The assumption is that implementations will allow the use of fallbacks, if the
exact transliteration specified is unavailable. For example, the following would
be the fallback chain for the identifier Russian-English/UNGEGN. This is similar
to the Lookup Fallback Pattern used in
BCP 47 Tags for Identifying Languages,
except that it uses a "stepladder approach" to progressively handle the fallback
among source, target, and variant, with priorities being the target, source, and
variant, in that order.
- Russian-English/UNGEGN
- Russian-English
- Cyrillic-English/UNGEGN
- Cyrillic-English
- Russian-Latin/UNGEGN
- Russian-Latin
- Cyrillic-Latin/UNGEGN
- Cyrillic-Latin
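The stepladder fallback above can be sketched in Python as follows. This is only an illustration, not CLDR or ICU code: the function name fallback_chain and the small SCRIPT_OF table (mapping a language name to its script) are assumptions made for the example.

```python
# Hypothetical lookup from a language name to the script it is written in.
SCRIPT_OF = {"Russian": "Cyrillic", "English": "Latin"}


def fallback_chain(identifier):
    """Return the identifiers to try, in order, for source-target/variant."""
    pair, _, variant = identifier.partition("/")
    source, target = pair.split("-", 1)
    # Each ladder lists a name followed by its script-level fallback, if any.
    sources = [source] + ([SCRIPT_OF[source]] if source in SCRIPT_OF else [])
    targets = [target] + ([SCRIPT_OF[target]] if target in SCRIPT_OF else [])
    chain = []
    for t in targets:          # the target is relaxed last
        for s in sources:      # the source is relaxed before the target
            if variant:
                chain.append(f"{s}-{t}/{variant}")
            chain.append(f"{s}-{t}")  # the variant is dropped first
    return chain


for candidate in fallback_chain("Russian-English/UNGEGN"):
    print(candidate)
```

An implementation would walk this list and use the first identifier for which transliteration data is available.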
There are a number of generally desirable guidelines for script transliterations.
These guidelines are rarely satisfied simultaneously, so constructing a reasonable
transliteration is always a process of balancing different requirements. These requirements
are most important for people who are building transliterations, but are also useful
as background information for users.
The following lists the general guidelines
for Unicode CLDR transliterations:
- standard: follow established systems (standards, authorities, or
de facto practice) where possible, deviating sometimes where necessary for reversibility.
In CLDR, the systems are generally described in the comments in the XML data files found in
the
transforms
folder online. For example, the system for Arabic transliteration in CLDR is
described in the comments in
Arabic-Latin.xml;
there is a reference to the
UNGEGN Arabic Tables.
The same holds for Hebrew, which follows the
Hebrew UNGEGN Tables.
- complete: every well-formed sequence of characters in the source
script should transliterate to a sequence of characters from the target script,
and vice versa.
- predictable: the letters themselves (without any knowledge of the
languages written in that script) should be sufficient for the transliteration,
based on a relatively small number of rules. This allows the transliteration
to be performed mechanically.
- pronounceable: the resulting characters have reasonable
pronunciations in the target script. Transliteration is not as useful if the process simply
maps the characters without any regard to their pronunciation. Simply mapping
by alphabetic order ("αβγδεζηθ..." to "abcdefgh...") could yield strings that
might be complete and unambiguous, but the pronunciation would be completely
unexpected.
- reversible: it is possible to recover the text in the source script
from the transliteration in the target script. That is, someone who knows the transliteration
rules would be able to recover the precise spelling of the original source text.
For example, it is possible to go from Elláda back to the original Ελλάδα,
while if the transliteration were Ellada (with no accent), it would
not be possible.
Some of these principles may not be achievable simultaneously, in particular
adherence to a standard system and reversibility. Often small changes in
existing systems can be made to accommodate reversibility. However, where a particular
system specifies a fundamentally non-reversible transliteration, that transliteration
as represented in CLDR may not be reversible.
In transliteration, multiple characters may produce ambiguities
(non-reversible mappings) unless the rules
are carefully designed. For example, the Greek character PSI (ψ) maps to ps,
but ps could also result from the sequence PI, SIGMA (πσ) since PI (π) maps
to p and SIGMA (σ) maps to s.
The Japanese transliteration standards provide a good mechanism for handling
these kinds of ambiguities. Using the Japanese transliteration standards, whenever
an ambiguous sequence in the target script does not result from a single letter,
the transform uses an apostrophe to disambiguate it. For example, it uses that procedure
to distinguish between man'ichi and manichi. Using this procedure,
the Greek sequence PI SIGMA (πσ) maps to p's. This method is recommended
for all script transliteration methods, although sometimes the character may vary:
for example, "-" is used in Korean.
Note: We've had a recent proposal to consistently use the hyphenation dot
for this purpose, thus we'd have πσ → p‧s.
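The apostrophe convention can be illustrated with a small Python sketch. The four-letter table and the function names are assumptions made for the example; the actual CLDR rules cover the whole script and also handle case and final sigma.

```python
# Toy subset of a Greek-to-Latin table; unknown letters raise KeyError.
TO_LATIN = {"α": "a", "π": "p", "σ": "s", "ψ": "ps"}
FROM_LATIN = {latin: greek for greek, latin in TO_LATIN.items()}
SEPARATOR = "'"  # other systems may use another character, e.g. "-" for Korean


def greek_to_latin(text):
    out = []
    prev = ""
    for ch in text:
        # "ps" must only ever come from ψ, so π followed by σ gets a separator.
        if prev == "π" and ch == "σ":
            out.append(SEPARATOR)
        out.append(TO_LATIN[ch])
        prev = ch
    return "".join(out)


def latin_to_greek(text):
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if text[i] == SEPARATOR:
            i += 1  # boundary marker only; it produces no Greek letter
        elif len(pair) == 2 and pair in FROM_LATIN:
            out.append(FROM_LATIN[pair])  # longest match first: "ps" is ψ
            i += 2
        else:
            out.append(FROM_LATIN[text[i]])
            i += 1
    return "".join(out)


for word in ("ψα", "πσα"):
    latin = greek_to_latin(word)
    print(word, "->", latin, "->", latin_to_greek(latin))
```

Because the separator appears only where the two-letter reading would otherwise win, both ψα and πσα survive the round trip.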
A second problem is that some characters in a target script are not normally
found outside of certain contexts. For example, the small Japanese "ya" character,
as in "kya" (キャ), is not normally found in isolation. To handle such characters,
the Unicode transliterations currently use different conventions.
- Tilde: "ャ" in isolation is represented as "~ya"
- Diacritics: Greek "ς" in isolation is represented as s̱
Note: The CLDR committee is considering converging on a common representation for
this. The advantage of a common representation is that it allows for easy filtering.
For the default script transforms, the goal is to have unambiguous mappings,
with variants for any common use mappings that are ambiguous (non-reversible). In
some cases, however, case may not be preserved. For example,
Latin |
Greek |
Latin |
ps PS |
ψ Ψ |
ps PS |
psa Psa PsA |
ψα Ψα ΨΑ |
psa Psa PSA |
psA PSA PSa |
ψΑ ΨΑ Ψα |
psA PSA Psa |
The following shows Greek text that is mapped to fully reversible Latin:
Greek-Latin |
|
τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι,
ὡς σὺ ἕτερον. |
tí phḗis; graphḕn sé tis, hōs éoike, gégraptai: ou gàr ekeînó ge katagnṓsomai,
hōs sỳ héteron. |
If the user wants a version without certain accents, then CLDR's chaining
rules can be
used to remove the accents. For example, the following transliterates to Latin but
removes the macron accents on the long vowels.
Greek-Latin; nfd; [\u0304] remove; nfc |
|
τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι,
ὡς σὺ ἕτερον. |
tí phéis; graphèn sé tis, hos éoike, gégraptai: ou gàr ekeînó ge katagnósomai,
hos sỳ héteron. |
The above chaining rules, separated by semi-colons, perform the following
commands in order:
Rule |
Description |
Greek-Latin |
transliterate Greek to Latin |
nfd
|
convert to Unicode NFD format (separating accents from base characters) |
[\u0304] remove |
remove accents, but filter the command to only apply to a single
character:
U+0304 ( ̄ ) COMBINING MACRON |
nfc |
convert to Unicode NFC format (rejoining accents to base characters) |
The following transliterates to Latin but removes all accents. Note
that the only change is to expand the filter for the remove command.
Greek-Latin; nfd; [:nonspacing mark:] remove; nfc |
|
τί φῄς; γραφὴν σέ τις, ὡς ἔοικε, γέγραπται: οὐ γὰρ ἐκεῖνό γε καταγνώσομαι,
ὡς σὺ ἕτερον. |
ti pheis; graphen se tis, hos eoike, gegraptai: ou gar ekeino ge katagnosomai,
hos sy heteron. |
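The effect of the "nfd; [filter] remove; nfc" portion of these chains can be approximated with Python's standard unicodedata module. This sketch does not call ICU or use CLDR data: it starts from text that has already been transliterated to Latin, and the function names are assumptions made for the example.

```python
import unicodedata


def remove_marks(text, should_remove):
    """Decompose, drop the characters selected by the filter, recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not should_remove(ch))
    return unicodedata.normalize("NFC", kept)


def is_macron(ch):
    # Counterpart of the [\u0304] filter: U+0304 COMBINING MACRON only.
    return ch == "\u0304"


def is_nonspacing_mark(ch):
    # Counterpart of a filter on general category Mn (nonspacing marks).
    return unicodedata.category(ch) == "Mn"


latin = "tí phḗis; graphḕn sé tis, hōs éoike"
print(remove_marks(latin, is_macron))
print(remove_marks(latin, is_nonspacing_mark))
```

As in the chained rules, only the filter differs between the two calls; the decompose, remove, and recompose steps stay the same.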
Standard transliteration methods often do not follow the pronunciation rules
of any particular language in the target script. For example, the Japanese Hepburn
system uses a "j" that has the English phonetic value (as opposed to French, German,
or Spanish), but uses vowels that do not have the standard English sounds. A transliteration
method might also require some special knowledge to have the correct pronunciation.
For example, in the Japanese kunrei-siki system, "ti" is pronounced as English "chee".
This is similar to situations where there are different languages within the same
script. For example, knowing that the word Gewalt comes from German allows
a knowledgeable reader to pronounce the "w" as a "v".
When encountering a
foreign word like jawa, there is little assurance how it is to be
pronounced even when it is not a transliteration (it is just from another Latin-script language). The j could be
pronounced (for an English speaker) as in jump,
or Junker,
or jour;
and so on. Transcriptions are only roughly phonetic, and only so when the
specific pronunciation rules are understood.
The pronunciation of the characters
in the original script may also be influenced by context, which may be
particularly misleading in transliteration. For example, in the Bengali নিঃশব,
transliterated as niḥśaba, the visarga ḥ
is not pronounced itself (whereas elsewhere it may be) but lengthens the
ś
sound, and the final inherent a is pronounced (whereas it
commonly is not), and the two inherent a's are pronounced as ɔ
and ô,
respectively.
In some cases, transliteration may be heavily influenced by tradition. For example,
the modern Greek letter beta (β) sounds like a "v", but a
transliteration may use a b (as in biology). In that case, the user would need to know
that a "b" in the transliterated word corresponded to beta (β) and is to be pronounced
as a v in modern Greek.
Letters may also be transliterated differently according
to their context to make the pronunciation more predictable. For example, since
the Greek sequence GAMMA GAMMA (γγ) is pronounced as ng, the first GAMMA
can be transcribed as an "n" in that context.
Similarly, the transliteration can give other guidance to the pronunciation in the
source language, for example, using "n" or "m" for the same Japanese character
(ん) depending on context, even though there is no distinction in the source
script.
In general, predictability means that when transliterating Latin script to other
scripts using reversible transliterations, English text will not produce phonetic
results. This is because the pronunciation of English cannot be predicted easily
from the letters in a word: e.g. grove, move, and love all end with "ove", but are
pronounced very differently.
Reversibility may require modifications of traditional transcription methods.
For example, there are two standard methods for transliterating Japanese katakana
and hiragana into Latin letters. The kunrei-siki method is unambiguous. The Hepburn
method can be more easily pronounced by foreigners but is ambiguous. In the Hepburn
method, both ZI (ジ) and DI (ヂ) are represented by "ji" and both ZU (ズ) and DU (ヅ)
are represented by "zu". A slightly amended version of Hepburn, that uses "dji"
for DI and "dzu" for DU, is unambiguous.
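A quick mechanical check for this kind of ambiguity is to test whether a mapping table is one-to-one. The Python sketch below uses only the four kana discussed above; the table names and the function name are assumptions made for the example.

```python
def is_one_to_one(table):
    """True if no two source strings share the same target string."""
    return len(set(table.values())) == len(table)


HEPBURN = {"ジ": "ji", "ヂ": "ji", "ズ": "zu", "ヅ": "zu"}
AMENDED = {"ジ": "ji", "ヂ": "dji", "ズ": "zu", "ヅ": "dzu"}

print(is_one_to_one(HEPBURN))  # the standard Hepburn entries collide
print(is_one_to_one(AMENDED))  # the amended entries do not
```

A one-to-one table is necessary for reversibility but not sufficient, since ambiguities can also arise from sequences of letters.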
When a single letter corresponds to a sequence of two letters, case mappings (uppercase and lowercase)
must be handled carefully to ensure reversibility. For cased scripts, the two letters
may need to have different cases, depending on the next letter. For example, the
Greek letter PHI (Φ) maps to PH in Latin, but Φο maps to Pho, and not to PHo.
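The context-dependent casing of PHI can be sketched in Python as follows. The function name is an assumption made for the example, and only Φ and φ are handled; all other characters pass through unchanged.

```python
def phi_to_latin(text):
    """Replace Φ and φ, choosing the case of the second Latin letter from context."""
    out = []
    for i, ch in enumerate(text):
        if ch == "φ":
            out.append("ph")
        elif ch == "Φ":
            following = text[i + 1:i + 2]
            # Title case before a lowercase letter, full uppercase otherwise.
            out.append("Ph" if following.islower() else "PH")
        else:
            out.append(ch)  # other letters are left untouched in this sketch
    return "".join(out)


for sample in ("Φ", "Φο", "ΦΟ"):
    print(sample, "->", phi_to_latin(sample))
```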
Some scripts have characters that take on different shapes depending on their
context. Usually, this is done at the display level (such as with Arabic) and does
not require special transliteration support. However, in a few cases this is represented
with different character codes, such as in Greek and Hebrew. For example, a Greek
SIGMA is written in a final form (ς) at the end of words, and a non-final form (σ)
in other locations. This also requires the transform to map different characters based
on the context.
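For the reverse direction, choosing between the two sigma forms might look like the following Python sketch. The function name is an assumption made for the example, only the letter s is converted, and a sigma that stands in isolation is not treated specially.

```python
def latin_s_to_sigma(text):
    """Map Latin s to σ, or to final ς when no letter follows it."""
    out = []
    for i, ch in enumerate(text):
        if ch == "s":
            following = text[i + 1:i + 2]
            out.append("σ" if following.isalpha() else "ς")
        else:
            out.append(ch)  # other letters are left untouched in this sketch
    return "".join(out)


print(latin_s_to_sigma("sé tis, "))
```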
Another thing to look out for when
dealing with cased scripts is that some of the characters in the target script may
not be able to represent case distinctions, such as some of the IPA characters
in the Latin script.
It is useful for the reverse mapping to be complete so that arbitrary strings
in the target script can be reasonably mapped back to the source script. Complete
reverse mapping makes it much easier to do mechanical quality checks and so on.
For example, even though the letter "q" might not be necessary in a transliteration
of Greek, it can be mapped to a KAPPA (κ). Such reverse mappings will not, in general,
be unambiguous.
Currently Unicode CLDR offers Romanizations for certain scripts, plus transliterations
between the Indic scripts (excluding Urdu). Additional script transliterations will
be added in the future.
Except where otherwise noted, all of these systems are designed to be reversible.
For bicameral scripts (those with uppercase and lowercase), however, case may not
be completely preserved.
The transliterations are also designed to be complete for any sequence of the
Latin letters a-z. A fallback is used for a letter that is not covered by
the transliteration, and default letters may be inserted as required. For example,
in the Hangul transliteration, rink → 린크 → linkeu. That is, "r" is
mapped to the closest other letter, and a default vowel is inserted at the end (since
"nk" cannot end a syllable).
Preliminary
charts
are available for the existing transliterations. Be sure to read the known issues
described there.
There are many Romanizations of Korean. The default transliteration in Unicode
CLDR follows the Korean
Ministry of Culture & Tourism Transliteration regulations (see also
English summary).
There is an optional clause 8 variant for reversibility:
"제 8 항 학술 연구 논문 등 특수 분야에서 한글 복원을 전제로 표기할 경우에는 한글 표기를 대상으로 적는다. 이때 글자 대응은
제2장을 따르되 'ㄱ, ㄷ, ㅂ, ㄹ'은 'g, d, b, l'로만 적는다. 음가 없는 'ㅇ'은 붙임표(-)로 표기하되 어두에서는 생략하는
것을 원칙으로 한다. 기타 분절의 필요가 있을 때에도 붙임표(-)를 쓴다."
Translation: "Clause 8: When it is required to recover the original
Hangul representation faithfully, as in scholarly articles, 'ㄱ, ㄷ, ㅂ, ㄹ' must
always be romanized as 'g, d, b, l', while the mapping for the rest of the letters
remains the same as specified in clause 2. The placeholder 'ㅇ' at the beginning
of a syllable should be represented with '-', but should be omitted at the beginning
of a word. In addition, '-' should be used in other cases where a syllable boundary
needs to be explicitly marked (disambiguated)."
There are a number of cases where this Romanization may be ambiguous, because
sometimes multiple Latin letters map to a single entity (jamo) in Hangul. This happens
with vowels and consonants, the latter being slightly more complicated because there
are both initial and final consonants:
Type |
Multi-Character Consonants |
Initial-Only |
tt pp jj |
Initial-or-Final |
kk ch ss |
Final-Only |
gs nj nh lg lm lb ls lt lp lh bs ng |
CLDR uses the following rules for disambiguation of the possible boundaries
between letters, in order. The first rule comes from Clause 8.
- Don't break so as to require an implicit vowel or null consonant (if possible)
- Don't break within Initial-Only or Initial-or-Final sequences (if possible)
- Favor longest match first.
If there is a single consonant between vowels, then Rule #1 will group it with
the following vowel if there is one (this is the same as the first part of Clause
8). If there is a sequence of four consonants between vowels, then there is only
one possible break (with well-formed text). So the only ambiguities lie with two
or three consonants between vowels, where there are possible multi-character consonants
involved. Even there, in most cases the resolution is simple, because there isn't
a possible multi-character consonant in the case of two, or two possible multi-character
consonants in the case of 3. For example, in the following cases, the left side
is unambiguous:
angda = ang-da → 앙다
apda = ap-da → 앞다
There are a relatively small number of possible ambiguities, listed below using
"a" as a sample vowel.
No. of Cons. |
Latin |
CLDR Disambiguation |
Hangul |
Comments |
2 |
atta |
= a-tta |
아따 |
Rule 1, then 2 |
appa |
= a-ppa |
아빠 |
ajja |
= a-jja |
아짜 |
akka |
= a-kka |
아까 |
Rule 1, then 2 |
assa |
= a-ssa |
아싸 |
acha |
= a-cha |
아차 |
agsa |
= ag-sa |
악사 |
Rule 1 |
anja |
= an-ja |
안자 |
anha |
= an-ha |
안하 |
alga |
= al-ga |
알가 |
alma |
= al-ma |
알마 |
alba |
= al-ba |
알바 |
alsa |
= al-sa |
알사 |
alta |
= al-ta |
알타 |
alpa |
= al-pa |
알파 |
alha |
= al-ha |
알하 |
absa |
= ab-sa |
압사 |
anga |
= an-ga |
안가 |
3 |
agssa |
= ag-ssa |
악싸 |
Rule 1, then 2 |
anjja |
= an-jja |
안짜 |
alssa |
= al-ssa |
알싸 |
abssa |
= ab-ssa |
압싸 |
akkka |
= akk-ka |
앆카 |
Rule 1, then 2, then 3 |
asssa |
= ass-sa |
았사 |
Known bugs* |
Should be Rule 1, then 2 |
altta |
= alt-ta |
앑타 |
= al-tta |
알따 |
alppa |
= alp-pa |
앒파 |
= al-ppa |
알빠 |
* There is one other known bug in CLDR 1.5.1, where "ch"
transliterates incorrectly in the degenerate case — when not followed by a vowel.
For vowel sequences, the situation is simpler. Only Rule #3 applies, so aeo =
ae-o → 애오.
The default transliteration for Japanese uses a slight variant of the Hepburn
system. With the Hepburn system, both ZI (ジ) and DI (ヂ) are represented by "ji" and
both ZU (ズ) and DU (ヅ) are represented by "zu". This is amended slightly for reversibility
by using "dji" for DI and "dzu" for DU.
The default transliteration uses a standard transcription for Greek which is
aimed at preserving etymology. The ISO 843 variant includes the following differences:
Greek |
Default |
ISO 843 |
β |
b |
v |
γ* |
n |
g |
η |
ē |
ī |
̔ |
h |
(omitted) |
̀ |
̀ |
(omitted) |
~ |
~ |
(omitted) |
* before γ, κ, ξ, χ
Cyrillic generally follows ISO 9 for the base Cyrillic set. There are tentative
plans to add extended Cyrillic characters in the future, plus variants for GOST
and other national standards.
Transliteration of Indic scripts follows the ISO 15919
Transliteration of Devanagari and related Indic
scripts into Latin characters. Internally, all Indic scripts are transliterated
by converting first to an internal form, called Inter-Indic, then from Inter-Indic
to the target script. Inter-Indic thus provides a pivot between the different
scripts, and contains a superset of correspondences for all of them.
ISO 15919 differs from ISCII 91 in the application of diacritics
for certain characters. These differences are shown in the following example (illustrated
with Devanagari, although the same principles apply to the other Indic scripts):
Devanagari |
ISCII 91 |
ISO 15919 |
ऋ |
ṛ |
r̥ |
ऌ |
ḻ |
l̥ |
ॠ |
ṝ |
r̥̄ |
ॡ |
ḻ̄ |
l̥̄ |
ढ़ |
d̂ha |
ṛha |
ड़ |
d̂a |
ṛa |
Transliteration rules from Indic to Latin are reversible with the exception of
the ZWJ and ZWNJ used to request explicit rendering effects. For example:
Devanagari |
Romanization |
Note |
क्ष |
kṣa |
normal |
क्ष |
kṣa |
explicit halant requested |
क्ष |
kṣa |
half-consonant requested |
Transliterations between Indic scripts round-trip where there are corresponding
letters. Otherwise, there may be fallbacks.
There are two particular instances where transliterations may produce unexpected
results: (1) where the final vowel is suppressed in speech, and (2) with the transliteration
of 'c'.
For example:
Devanagari |
Romanization |
Notes |
सेन्गुप्त |
Sēngupta |
|
सेनगुप्त |
Sēnagupta |
The final 'a' is not pronounced |
मोनिक |
Monika |
|
मोनिच |
Monica |
The 'c' is pronounced "ch" |
Unicode CLDR provides other transliterations based on the
U.S. Board on Geographic Names (BGN) transliterations.
These are currently unidirectional — to Latin only. The goal is to make them bidirectional
in future versions of CLDR.
Other transliterations are generally based on the
UNGEGN: Working Group on Romanization Systems
transliterations. These systems are more
widely implemented than most ISO-standardized transliterations, and are
published freely on the web (http://www.eki.ee/wgrs/)
and are thus easily accessible to all.
The UNGEGN also has good documentation. For example, the
UNGEGN Arabic Tables
not only present the UN system, but also compare it with the BGN/PCGN 1956 system,
the I.G.N. System 1973, ISO 233:1984, the Royal Jordanian Geographic Centre
System, and the Survey of Egypt System.
If you are interested in providing transliterations for one or
more scripts, file an initial bug report at
Locale Bugs. The initial bug should contain the scripts and/or languages
involved, the system being followed (with a link to a full description of
the proposed transliteration system), and a brief example. The proposed data can
also be in that bug, or be added in a Reply to that bug.
You can also file a bug in
Locale Bugs if
you find a problem in an existing transliteration.
For submission to CLDR, the data
needs to be supplied in the correct XML format, and should follow an accepted
standard. It is best if the results are
tested using the ICU
Transform Demo first, since data that doesn't validate will
not be accepted into CLDR. As mentioned above, even if a transliteration is only used
in certain countries or contexts, CLDR can provide for it with a distinct
variant tag.
The format for rules is specified in
Transform_Rules.
The XML format is just a series of rules and comments. For comparison, you can see what is
currently in CLDR in the
transforms
folder online. For example, see
Hebrew-Latin.xml.
For more information, see: