---
layout: default
title: Transform Rule Tutorial
nav_order: 5
parent: Transforms
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Transform Rule Tutorial
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

This tutorial describes the process of building a custom transform based on a
set of rules. The tutorial does not describe, in detail, the features of
transforms; instead, it explains the process of building rules and describes the
features needed to perform different tasks. The focus is on building a script
transform since this process provides concrete examples that incorporate most
of the rules.

## Script Transliterators

The first task in building a script transform is to determine which system of
transliteration to use as a model. There are dozens of different systems for
each language and script.

The International Organization for Standardization
([ISO](http://www.elot.gr/tc46sc2/)) uses a strict definition of
transliteration, which requires it to be reversible. Although the goal for ICU
script transforms is to be reversible, they do not have to adhere to this
definition. In general, most transliteration systems in use are not reversible.
This tutorial will describe the process for building a reversible transform
since it illustrates more of the issues involved in the rules. (For guidelines
in building transforms, see "Guidelines for Designing Script Transliterations"
(§) in the [General Transforms](index.md) chapter. For external sources for
script transforms, see "Script Transliterator Sources" (§) in that same chapter.)

> :point_right: **Note**: *See [Properties and ICU Rule Syntax](../../strings/properties.md) for
information regarding syntax characters.*

In this example, we start with a set of rules for Greek since they provide a
real example based on mathematics. We will use the rules that do not involve the
pronunciation of Modern Greek; instead, we will use rules that correspond to the
way that Greek words were incorporated into the English language. For example,
we will transliterate "Βιολογία-Φυσιολογία" as "Biología-Physiología", not as
"Violohía-Fisiolohía". To illustrate some of the trickier cases, we will also
transliterate the Greek accents that are no longer in use in modern Greek.

> :point_right: **Note**: *Some of the characters may not be visible on the screen unless you have a
Unicode font with all the Greek letters. If you have a licensed copy of
Microsoft® Office, you can use the "Arial Unicode MS" font, or you can download
the [CODE2000](http://www.code2000.net/) font for free. For more information,
see [Display Problems?](http://www.unicode.org/help/display_problems.html) on
the Unicode web site.*

We will also verify that every Latin letter maps to a Greek letter. This ensures
that the reverse transliteration can handle all the Latin letters.

> :point_right: **Note**: *The Latin-to-Greek direction is not necessarily reversible. The following
table illustrates this situation:*

| Direction | Reversibility | Example |
|---------------|------------|------------|
| Source→Target | Reversible | φ → ph → φ |
| Target→Source | Not (Necessarily) Reversible | f → φ → ph |

## Basics

In non-complex cases, we have a one-to-one relationship between letters in both
Greek and Latin. These rules map between a source string and a target string.
The following shows this relationship:

```
π <> p;
```

This rule states that when you transliterate from Greek to Latin, convert π to p
and when you transliterate from Latin to Greek, convert p to π. The syntax is:

```
string1 <> string2 ;
```

We will start by adding a whole batch of simple mappings. These mappings will
not work yet, but we will start with them. For now, we will not use the
uppercase versions of characters.

    # One to One Mappings
    α <> a;
    β <> b;
    γ <> g;
    δ <> d;
    ε <> e;

We will also add rules for completeness. These provide fallback mappings for
Latin characters that do not normally result from transliterating Greek
characters.

    # Completeness Mappings
    κ < c;
    κ < q;
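
The difference between two-way (`<>`) and reverse-only (`<`) rules can be sketched in plain Python. This is not ICU code: it only emulates the mappings above with dictionaries, and the names `TWO_WAY`, `REVERSE_ONLY`, `FORWARD`, `REVERSE`, and `transliterate` are hypothetical.

```python
# Plain-Python emulation of the "<>" and "<" rules above (not the ICU API).
TWO_WAY = [("α", "a"), ("β", "b"), ("γ", "g"), ("δ", "d"), ("ε", "e")]
REVERSE_ONLY = [("κ", "c"), ("κ", "q")]  # κ < c;  κ < q;

# Greek -> Latin uses only the two-way rules.
FORWARD = {greek: latin for greek, latin in TWO_WAY}
# Latin -> Greek uses the two-way rules plus the completeness mappings.
REVERSE = {latin: greek for greek, latin in TWO_WAY + REVERSE_ONLY}


def transliterate(text, table):
    """Map each character through the table; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


print(transliterate("βαδε", FORWARD))  # bade
print(transliterate("cab", REVERSE))   # καβ
```

The completeness mappings appear only in the reverse table, so the forward direction never produces "c" or "q".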

## Context and Range

We have completed the simple one-to-one mappings and the rules for completeness.
The next step is to look at the characters in context. In Greek, for example,
the transform converts a "γ" to an "n" if it is before any of the following
characters: γ, κ, ξ, or χ. Otherwise the transform converts it to a "g". The
following lists all of the possibilities:

    γγ > ng;
    γκ > nk;
    γξ > nx;
    γχ > nch;
    γ > g;

All the rules are evaluated in the order they are listed. The transform will
first try to match the first four rules. If all of these rules fail, it will use
the last one.
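
The "first matching rule wins" behavior can be illustrated with a small emulation. This sketch assumes a simplified model in which each rule is a plain (source, target) string pair tried in listed order at the current position; `apply_rules` is a hypothetical helper, not part of ICU.

```python
# Ordered rules: the first rule whose source matches at the cursor is applied.
RULES = [("γγ", "ng"), ("γκ", "nk"), ("γξ", "nx"), ("γχ", "nch"), ("γ", "g")]


def apply_rules(text, rules):
    out = []
    i = 0
    while i < len(text):
        for source, target in rules:
            if text.startswith(source, i):
                out.append(target)
                i += len(source)
                break
        else:
            # No rule matched: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(apply_rules("γγ", RULES))                     # ng
print(apply_rules("γγ", [("γ", "g")] + RULES[:4]))  # gg (fallback listed first)
```

Listing the general rule `γ > g` first means the more specific rules never get a chance to match, which is why the fallback goes last.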

However, this method quickly becomes tiresome when you consider all the possible
uppercase and lowercase combinations. An alternative is to use two additional
features: context and range.

### Context

First, we will consider the impact of context on a transform. We already have
rules for converting γ, κ, ξ, and χ. We must consider how to convert the γ
character when it is followed by γ, κ, ξ, or χ, while still permitting those
following characters to be converted using their own rules. This is done with
the following:

    γ } γ > n;
    γ } κ > n;
    γ } ξ > n;
    γ } χ > n;
    γ > g;

A right curly brace marks the start of the trailing context. The context is
matched against the source text along with the rest of the rule, but the context
itself is not converted. For example, if we had the sequence γκ, the transform
converts the γ into an "n" using the second rule, and the κ is unaffected by
that rule. The κ then matches a "κ" rule and is converted into a "k". The result
is "nk".

### Range

Using context, we have the same number of rules. But, by using range, we can
collapse the first four rules into one. The following shows how we can use
range:

    {γ}[γκξχ] > n;
    γ > g;

Any list of characters within square brackets will match any one of the
characters. We can then add the uppercase variants for completeness, to get:

    γ } [ΓΚΞΧγκξχ] > n;
    γ > g;

Remember that we can use spaces for clarity. We can also write this rule as the
following:

    γ } [ Γ Κ Ξ Χ γ κ ξ χ ] > n ;
    γ > g ;

If a range of characters happens to have adjacent code numbers, we can just use
a hyphen to abbreviate it. For example, instead of writing `[a b c d e f g m n o]`,
we can simplify the range by writing `[a-g m-o]`.
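
As a rough analogy, a trailing context combined with a range behaves like a regular-expression lookahead over a character class: the following character must be present, but it is not consumed. The sketch below uses Python's `re` module to emulate `γ } [ΓΚΞΧγκξχ] > n; γ > g;`. It is only an emulation under that assumption, and `convert_gamma` is a hypothetical name.

```python
import re

# "γ } [ΓΚΞΧγκξχ] > n" : γ followed by one of the listed letters (not consumed).
GAMMA_BEFORE_VELAR = re.compile("γ(?=[ΓΚΞΧγκξχ])")


def convert_gamma(text):
    text = GAMMA_BEFORE_VELAR.sub("n", text)
    # "γ > g" : every remaining γ falls through to the general rule.
    return text.replace("γ", "g")


print(convert_gamma("γγ"))  # ng
print(convert_gamma("γκ"))  # nκ  (κ is left for its own rule)
```

Because the context is not consumed, the κ in the second example is still available to whatever rule handles κ.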

## Styled Text

Another reason to use context is that transforms can convert styled text. When
transforms convert styled text, they copy the style from the source text to the
target text. However, the transforms are limited in that they can only apply a
style to a whole replacement, since it is impossible to know how any boundaries
within the source text will correspond to the target text. Thus the two types
of rules, with and without context, have different effects on styled text.

For example, suppose that we were to convert "γγ" to "ng". By using context, if
there is a different style on the first gamma than on the second (such as font,
size, color, etc.), then that style difference is preserved in the resulting two
characters. That is, the "n" will have the style of the first gamma, while the
"g" will have the style of the second gamma.

> :point_right: **Note**: *Contexts preserve the styles at a much finer granularity.*

## Case

When converting from Greek to Latin, we can just convert "θ" to and from "th".
But what happens with the uppercase theta (Θ)? Sometimes we need to convert it
to uppercase "TH", and sometimes to uppercase "T" and lowercase "h". We can
choose between these based on the letters before and afterwards. If a lowercase
letter follows the uppercase theta, we choose "Th"; otherwise we use "TH".

We could manually list all the lowercase letters, but we also can use ranges.
Ranges not only list characters explicitly, but they also give you access to all
the characters that have a given Unicode property. Although the abbreviations
are a bit arcane, we can specify common sets of characters such as all the
uppercase letters. The following example shows how case and range can be used
together:

    Θ } [:LowercaseLetter:] <> Th;
    Θ <> TH;

The example allows words like Θεολογικές to map to Theologikés and not
THeologikés.

> :point_right: **Note**: *You can specify properties either with the POSIX-style syntax, such as
[:LowercaseLetter:], or with the Perl-style syntax, such as
\\p{LowercaseLetter}.*
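
The case decision for theta can be sketched in Python. This emulation assumes that `[:LowercaseLetter:]` corresponds to the Unicode general category `Ll`, which the standard `unicodedata` module exposes; `theta_to_latin` is a hypothetical helper that handles only theta and leaves all other characters alone.

```python
import unicodedata


def theta_to_latin(text):
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == "Θ":
            # Θ } [:LowercaseLetter:] > Th ;  otherwise  Θ > TH ;
            followed_by_lower = bool(nxt) and unicodedata.category(nxt) == "Ll"
            out.append("Th" if followed_by_lower else "TH")
        elif ch == "θ":
            out.append("th")
        else:
            out.append(ch)
    return "".join(out)


print(theta_to_latin("Θεός"))  # Thεός
print(theta_to_latin("ΘΕΟΣ"))  # THΕΟΣ
```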

## Properties and Values

A Greek sigma is written as "ς" if it is at the end of a word (but not
completely separate) and as "σ" otherwise. When we convert characters from Greek
to Latin, this is not a problem. However, it is a problem when we convert the
character back to Greek from Latin. We need to convert an "s" depending on the
context. While we could list all the possible letters in a range, we can also
use a character property. Although the range `[:Letter:]` stands for all
letters, we really want all the characters that aren't letters. To accomplish
this, we can use a negated range: `[:^Letter:]`. The following shows a negated
range:

    σ < [:^Letter:] { s } [:^Letter:] ;
    ς < s } [:^Letter:] ;
    σ < s ;

These rules state that if an "s" is surrounded by non-letters, convert it to
"σ". Otherwise, if the "s" is followed by a non-letter, convert it to "ς". If
all else fails, convert it to "σ".

> :point_right: **Note**: *Negated ranges [^...] will match at the beginning and the end of a string.
This makes the rules much easier to write.*

To make the rules clearer, you can use variables. Instead of the example above,
we can write the following:

    $nonletter = [:^Letter:] ;
    σ < $nonletter { s } $nonletter ;
    ς < s } $nonletter ;
    σ < s ;
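
The three sigma rules can be traced with a small Python emulation. It assumes that `[:^Letter:]` means "general category does not start with `L`" and that the start and end of the string count as non-letters, as the note above says; `s_to_sigma` and `is_letter` are hypothetical names.

```python
import unicodedata


def is_letter(ch):
    return unicodedata.category(ch).startswith("L")


def s_to_sigma(text):
    out = []
    for i, ch in enumerate(text):
        if ch != "s":
            out.append(ch)
            continue
        # String boundaries are treated like non-letters.
        before_nonletter = i == 0 or not is_letter(text[i - 1])
        after_nonletter = i == len(text) - 1 or not is_letter(text[i + 1])
        if before_nonletter and after_nonletter:
            out.append("σ")   # σ < $nonletter { s } $nonletter ;
        elif after_nonletter:
            out.append("ς")   # ς < s } $nonletter ;
        else:
            out.append("σ")   # σ < s ;
    return "".join(out)


print(s_to_sigma("sos"))    # σoς
print(s_to_sigma("a s b"))  # a σ b
```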

There are many more properties available that can be used in combination. The
following table lists some examples:

| Combination | Example | Description: All code points that are: |
|----------------|--------------------------|--------------------------------------------|
| Union | [[:Greek:] [:letter:]] | either in the Greek script, or are letters |
| Intersection | [[:Greek:] & [:letter:]] | are both Greek and letters |
| Set Difference | [[:Greek:] - [:letter:]] | are Greek but not letters |
| Complement | [^[:Greek:] [:letter:]] | are neither Greek nor letters |

For more on properties, see the [UnicodeSet](../../strings/unicodeset.md) and
[Properties](../../strings/properties.md) chapters.
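
The four combinations in the table correspond to ordinary set operations. The sketch below illustrates them on a tiny four-character universe. It makes two loud assumptions: `[:Greek:]` is approximated by the code point range U+0370 to U+03FF (the Python standard library has no script property), and `[:letter:]` is taken to mean a general category starting with `L`.

```python
import unicodedata


def is_greek(ch):
    # Assumption: approximate the Greek script by the Greek and Coptic block.
    return 0x0370 <= ord(ch) <= 0x03FF


def is_letter(ch):
    return unicodedata.category(ch).startswith("L")


# U+0384 GREEK TONOS is in the Greek block but is a symbol, not a letter.
UNIVERSE = {"α", "\u0384", "a", "1"}
greek = {ch for ch in UNIVERSE if is_greek(ch)}
letters = {ch for ch in UNIVERSE if is_letter(ch)}

union = greek | letters          # [[:Greek:] [:letter:]]
intersection = greek & letters   # [[:Greek:] & [:letter:]]
difference = greek - letters     # [[:Greek:] - [:letter:]]
complement = UNIVERSE - union    # [^[:Greek:] [:letter:]]
```

Within this universe, the union holds three characters, the intersection only "α", the difference only the tonos, and the complement only "1".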

## Repetition

Elements in a rule can also repeat. For example, in the following rules, the
transform converts an iota-subscript into a capital I if the preceding base
letter is an uppercase character. Otherwise, the transform converts the
iota-subscript into a lowercase i.

    [:Uppercase Letter:] { ͅ } > I;
    ͅ > i;

However, this is not sufficient, since the base letter may be optionally
followed by non-spacing marks. To capture that, we can use the `*` syntax, which
means repeat zero or more times. The following shows this syntax:

    [:Uppercase Letter:] [:Nonspacing Mark:] * { ͅ } > I ;
    ͅ > i ;
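
The effect of the repeated `[:Nonspacing Mark:] *` context can be emulated by walking backward over combining marks until the base letter is found. This sketch assumes decomposed (NFD) input, maps `[:Uppercase Letter:]` to general category `Lu` and `[:Nonspacing Mark:]` to `Mn`, and uses the hypothetical name `convert_iota_subscript`.

```python
import unicodedata

IOTA_SUBSCRIPT = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI


def convert_iota_subscript(text):
    text = unicodedata.normalize("NFD", text)
    out = []
    for i, ch in enumerate(text):
        if ch != IOTA_SUBSCRIPT:
            out.append(ch)
            continue
        # Skip zero or more non-spacing marks between the base and the iota.
        j = i - 1
        while j >= 0 and unicodedata.category(text[j]) == "Mn":
            j -= 1
        base_is_upper = j >= 0 and unicodedata.category(text[j]) == "Lu"
        out.append("I" if base_is_upper else "i")
    return "".join(out)
```

For example, a capital alpha followed by a smooth-breathing mark and an iota-subscript yields a capital "I", even though the mark sits between the base letter and the subscript.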

The following operators can be used for repetition:

| Operator | Meaning |
|----------------------|------------------|
| X* | zero or more X's |
| X+ | one or more X's |
| X? | zero or one X |

We can also use these operators as sequences with parentheses for grouping. For
example, "a ( b c ) \* d" will match against "ad" or "abcd" or "abcbcd".

*Currently, any repetition will cause the sequence to match as many times as allowed even if that causes the rest of the rule to fail. For example, suppose we have the following (contrived) rules:*
*The intent was to transform a sequence like "able blue" into "ablæ blué". The rule does not work as it produces "ablé blué". The problem is that when the left side is matched against the text in the first rule, the `[:Letter:]*` matches all the way back through the "al" characters. Then there is no "a" left to match. To have it match properly, we must subtract the 'a' as in the following example:*

## Æther

The start and end of a string are treated specially. Essentially, characters off
the end of the string are handled as if they were the noncharacter \\uFFFF,
which is called "æther". (The code point \\uFFFF will never occur in any valid
Unicode text.) In particular, a negated Unicode set will generally also match
against the start/end of a string. For example, the following rule will execute
on the first **a** in a string, as well as an **a** that is actually preceded by
a non-letter.

| Rule | [:^L:] { a > b ; |
|---------|------------------|
| Source | a xa a |
| Results | b xa b |

This is because \\uFFFF is an element of `[:^L:]`, which includes all code points
that do not represent letters. To refer explicitly to æther, you can use a **$**
at the end of a range, such as in the following rules:

| Rules | [0-9$] { a > b ; a } [0-9$] > b ;|
|------------------|------------------|
| Source | a 5a a |
| Results | b 5b b |

In these rules, an **a** before or after a number -- or at the start or end of a
string -- will be matched. (You could also use \\uFFFF explicitly, but the $ is
recommended.)

Thus to disallow a match against æther in a negation, you need to add the $ to
the list of negated items. For example, the first rule and results from above
would change to the following (notice that the first a is not replaced):

| Rule | [^[:L:]$] { a > b ; |
|---------|---------------------|
| Source | a xa a |
| Results | a xa b |
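
The two negated-set tables can be reproduced with a short emulation that treats the position before the string as the æther character. This is a sketch of the idea rather than ICU's matching code; `replace_a` and its `ether_counts_as_nonletter` flag are hypothetical.

```python
import unicodedata

ETHER = "\uFFFF"


def is_letter(ch):
    return unicodedata.category(ch).startswith("L")


def replace_a(text, ether_counts_as_nonletter=True):
    """Emulate '[:^L:] { a > b' (True) or '[^[:L:]$] { a > b' (False)."""
    padded = ETHER + text  # padded[i] is the character before text[i]
    out = []
    for i, ch in enumerate(text):
        prev = padded[i]
        if prev == ETHER:
            context_ok = ether_counts_as_nonletter
        else:
            context_ok = not is_letter(prev)
        out.append("b" if ch == "a" and context_ok else ch)
    return "".join(out)


print(replace_a("a xa a"))                                   # b xa b
print(replace_a("a xa a", ether_counts_as_nonletter=False))  # a xa b
```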

> :point_right: **Note**: *Characters that are outside the context limits -- contextStart to contextEnd -- are also treated as
æther.*

The property `[:any:]` can be used to match all code points, including æther.
Thus the following are equivalent:

| Rule1 | [\u0000-\U0010FFFF] { a > A ; |
|-------|-------------------------------|
| Rule2 | [:any:] { a > A ; |

However, since the transform is always greedy with no backup, this property is
not very useful in practice. What is more often required is dealing with the
ends of lines. If you want to match the start or end of a line, then you can define a
variable that includes all the line separator characters, and then use it in the
context of your rules. For example:

| Rules | $break = [[:Zp:][:Zl:] \u000A-\u000D \u0085 $] ; $break { a > A ;|
|------------------|--------------------------------------------------|
| Source | a a a a |
| Results | A a A a |

There is also a special character, the period (.), that is equivalent to the
**negation** of the $break variable we defined above. It can be used to match
any characters excluding those for linebreaks or æther. However, it cannot be
used within a range: you can't have `[[.] - \u000A]`, for example. If you
want to have different behavior you can define your own variables and use them
instead of the period.

> :point_right: **Note**: *There are a few other special escapes that can be used in ranges. These are
listed in the table below. However, instead of the latter two it is safest to
use the above $break definition since it works for line endings across different
platforms.*

| Escape | Meaning | Code |
|--------|-----------------|--------|
| \t | Tab | \u0009 |
| \n | Linefeed | \u000A |
| \r | Carriage Return | \u000D |

## Accents

We could handle each accented character by itself with rules such as the
following:

    ά > á;
    έ > é;
    ...

This procedure is very complicated when we consider all the possible
combinations of accents and the fact that the text might not be normalized. In
ICU 1.8, we can add other transforms as rules either before or after all the
other rules. We then can modify the rules to the following:

    :: NFD (NFC) ;
    α <> a;
    ...
    ω <> ō;
    :: NFC (NFD);

These modified rules first separate accents from their base characters and then
put them in a canonical order. We can then deal with the individual components,
as desired. We use `:: NFC (NFD);` at the end to put the entire result into
standard canonical form. The inverse uses the transform rules in reverse order,
so the (NFD) goes at the bottom and (NFC) at the top.
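
The decompose, map, recompose pipeline can be tried out with Python's standard `unicodedata` module. This is only an emulation of the forward direction with a three-entry base table; `BASE` and `greek_to_latin` are hypothetical names, and the table entries are written in decomposed form so they line up with NFD text.

```python
import unicodedata

# Base-letter mappings only; accents are handled by normalization.
BASE = {"α": "a", "ε": "e", "ω": "o\u0304"}  # ω <> ō (o + combining macron)


def greek_to_latin(text):
    decomposed = unicodedata.normalize("NFD", text)         # :: NFD ;
    mapped = "".join(BASE.get(ch, ch) for ch in decomposed)  # base-letter rules
    return unicodedata.normalize("NFC", mapped)             # :: NFC ;


print(greek_to_latin("ά"))  # á
print(greek_to_latin("έ"))  # é
```

No rule mentions an accented letter, yet accented input is handled, because the combining accent passes through the mapping untouched and is recomposed onto the Latin base letter.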

A global filter can also be used with the transform rules. The following example
shows a filter used in the rules:

    :: [[:Greek:][:Inherited:]];
    :: NFD (NFC) ;
    α <> a;
    ...
    ω <> ō;
    :: NFC (NFD);
    :: ([[:Latin:][:Inherited:]]);

The global filter will cause any other characters to be unaffected. In
particular, the NFD then only applies to Greek characters and accents, leaving
all other characters unchanged.

## Disambiguation

If the transliteration is to be completely reversible, what would happen if we
happened to have the Greek combination νγ? Because ν converts to n, both νγ and
γγ convert to "ng" and we have an ambiguity. Normally, this sequence does not
occur in the Greek language. However, for consistency -- and especially to aid
in mechanical testing -- we must consider this situation. (There are other cases
in this and other languages where both sequences occur.)

To resolve this ambiguity, use the mechanism recommended by the Japanese and
Korean transliteration standards by inserting an apostrophe or hyphen to
disambiguate the results. We can add a rule like the following that inserts an
apostrophe after an "n" if we need to reverse the transliteration process:

    ν } [ΓΚΞΧγκξχ] > n\';

In ICU, there are several of these mechanisms for the Greek rules. The ICU rules
undergo some fairly rigorous mechanical testing to ensure reversibility. Adding
these disambiguation rules ensures that the rules can pass these tests and handle
all possible sequences of characters correctly.
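
The ambiguity, and how the apostrophe rule removes it, can be seen in a small emulation of the forward direction. The function `forward` and its `disambiguate` flag are hypothetical; only the ν and γ rules discussed here are modeled.

```python
VELARS = set("ΓΚΞΧγκξχ")


def forward(text, disambiguate=True):
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        before_velar = nxt in VELARS
        if ch == "ν":
            # ν } [ΓΚΞΧγκξχ] > n' ;  otherwise ν > n ;
            out.append("n'" if disambiguate and before_velar else "n")
        elif ch == "γ":
            # γ } [ΓΚΞΧγκξχ] > n ;  otherwise γ > g ;
            out.append("n" if before_velar else "g")
        else:
            out.append(ch)
    return "".join(out)


print(forward("γγ"))                      # ng
print(forward("νγ"))                      # n'g
print(forward("νγ", disambiguate=False))  # ng  (collides with γγ)
```

With the apostrophe rule, the two Greek sequences produce distinct Latin strings, so a reverse transform has enough information to tell them apart.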

There are some character forms that never occur in normal context. By
convention, we use tilde (\~) for such cases to allow for reverse
transliteration. Thus, if you had the text "Θεολογικές (ς)", it would
transliterate to "Theologikés (\~s)". Using the tilde allows the reverse
transliteration to detect the character and convert correctly back to the
original: "Θεολογικές (ς)". Similarly, if we had the phrase "Θεολογικέσ", it
would transliterate to "Theologiké~s". These are called anomalous characters.

## Revisiting

Rules allow for characters to be revisited after they are replaced. For example,
the following converts "C" back to "S" in front of "E", "I" or "Y". The vertical
bar means that the character will be revisited, so that the "S" or "K" in a
Greek transform will be applied to the result and will eventually produce a
sigma (Σ, σ, or ς) or kappa (Κ or κ).

    $softener = [eiyEIY] ;
    | S < C } $softener ;
    | K < C ;
    | s < c } $softener ;
    | k < c ;

The ability to revisit is particularly useful in reducing the number of rules
required for a given language. For example, in Japanese there are a large number
of cases that follow the same pattern: "kyo" maps to a large hiragana for "ki"
(き) followed by a small hiragana for "yo" (ょ). This can be done with a small
number of rules with the following pattern:

First, the ASCII punctuation mark, tilde "~", represents characters that never
normally occur in isolation. This is a general convention for anomalous
characters within the ICU rules in any event.

    '~yu' > ゅ;
    '~ye' > ぇ;
    '~yo' > ょ;

Second, any syllables that use this pattern are broken into the first hiragana
and are followed by letters that will form the small hiragana.

    by > び|'~y';
    ch > ち|'~y';
    dj > ぢ|'~y';
    gy > ぎ|'~y';
    j > じ|'~y';
    ky > き|'~y';
    my > み|'~y';
    ny > に|'~y';
    py > ぴ|'~y';
    ry > り|'~y';
    sh > し|'~y';

Using these rules, "kyo" is first converted into "き~yo". Since the "~yo" is then
revisited, this produces the desired final result, "きょ". Thus, a small number of
rules (3 + 11 = 14) provide for a large number of cases. If all of the
combinations of rules were used instead, it would require 3 x 11 = 33 rules.
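
The revisit mechanism can be traced with a small cursor-based emulation. The sketch assumes a simplified engine in which each rule is a (source, output, revisit text) triple, the output is placed before the cursor, and the revisit text is placed after it so that later rules can match it. `RULES` and `to_hiragana` are hypothetical names.

```python
SMALL = {"~yu": "ゅ", "~ye": "ぇ", "~yo": "ょ"}
FIRST = {"by": "び", "ch": "ち", "dj": "ぢ", "gy": "ぎ", "j": "じ", "ky": "き",
         "my": "み", "ny": "に", "py": "ぴ", "ry": "り", "sh": "し"}

# (source, output before the cursor, text left after the cursor for revisiting)
RULES = ([(src, kana, "~y") for src, kana in FIRST.items()]
         + [(src, kana, "") for src, kana in SMALL.items()])


def to_hiragana(text):
    buf, pos = text, 0
    while pos < len(buf):
        for src, out, revisit in RULES:
            if buf.startswith(src, pos):
                buf = buf[:pos] + out + revisit + buf[pos + len(src):]
                pos += len(out)  # the cursor stops before the revisit text
                break
        else:
            pos += 1
    return buf


print(to_hiragana("kyo"))  # きょ
print(to_hiragana("chu"))  # ちゅ
```

The table holds 11 + 3 = 14 rules, yet it covers all 33 combinations of a first syllable with "u", "e", or "o".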

You can set the new revisit point (called the cursor) anywhere in the
replacement text. You can even set the revisit point before or after the target
text. The at-sign, as in the following example, is used as a filler to indicate
the position, for those cases:

    [aeiou] { x > | @ ks ;
    ak > ack ;

The first rule will convert "x", when preceded by a vowel, into "ks". The
transform will then back up to the position before the vowel and continue. In the
next pass, the "ak" will match and be invoked. Thus, if the source text is "ax",
the result will be "acks".
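
The backward cursor movement can be followed step by step in an emulation. This sketch hard-codes the two rules above in a hypothetical `transform` function; the single `@` is modeled by moving the cursor one position to the left of the replacement. Tracing "ax" gives "aks" after the first rule and then "acks" after the second, since the "s" produced by the first rule remains in the text.

```python
VOWELS = set("aeiou")


def transform(text):
    buf, pos = text, 0
    while pos < len(buf):
        if buf[pos] == "x" and pos > 0 and buf[pos - 1] in VOWELS:
            # [aeiou] { x > | @ ks ;  replace x, then back up before the vowel
            buf = buf[:pos] + "ks" + buf[pos + 1:]
            pos -= 1
        elif buf.startswith("ak", pos):
            # ak > ack ;  the cursor moves past the replacement
            buf = buf[:pos] + "ack" + buf[pos + 2:]
            pos += 3
        else:
            pos += 1
    return buf


print(transform("ax"))  # acks
print(transform("bx"))  # bx  (no vowel before the x)
```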

> :point_right: **Note**: *Although you can move the cursor forward or backward, it is limited in two
ways: (a) to the text that is matched, (b) within the original substring that is
to be converted. For example, if we have the rule "a b\* {x} > |@@@@@y" and it
matches in the text "mabbx", the result will be "m|abby" (| represents the
cursor position). Even though there are five @ signs, the cursor will only
back up to the first character that is matched.*

## Copying

We can copy part of the matched string to the target text. Use parentheses to
group the text to copy and use "$n" (where n is a number from 1 to 99) to
indicate which group. For example, in Korean, any vowel that does not have a
consonant before it gets the null consonant (?) inserted before it. The
following example shows this rule:

    ([aeiouwy]) > ?| $1 ;

To revisit the vowel again, insert the null consonant, insert the vowel, and
then back up before the vowel to reconsider it. Similarly, we have a following
rule that inserts a null vowel (?), if no real vowel is found after a consonant:

    ([b-dg-hj-km-npr-t]) > | $1 eu;

In this case, since we are going to reconsider the text again, we put in the
Latin equivalent of the Korean null vowel, which is "eu".

## Order Matters

Two rules overlap when there is a string that both rules could match at the
start. For example, the first of the following rules does not overlap with the
others, but the last two do overlap:

    β > b;
    γ } [ Γ Κ Ξ Χ γ κ ξ χ ] > n ;
    γ > g ;

When rules do not overlap, they will produce the same result no matter what
order they are in. It does not matter whether we have either of the following:

    β > b;
    γ > g ;

or

    γ > g ;
    β > b;

When rules do overlap, order is important. In fact, a rule could be rendered
completely useless. Suppose we have:

    β } [aeiou] > b;
    β } [^aeiou] > v;
    β > p;

In this case, the last rule is masked, as all of the text that would match the
rule will already be matched by previous rules. If a rule is masked, then a
warning will be issued when you attempt to build a transform with the rules.
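
Masking can be observed by recording which rules actually fire. The sketch below emulates the three β rules with predicates on the following character, treating the end of the text as æther, which a negated set also matches. `RULES` and `convert` are hypothetical names.

```python
VOWELS = set("aeiou")
ETHER = "\uFFFF"

# (rule text, predicate on the following character, replacement)
RULES = [
    ("β } [aeiou] > b", lambda nxt: nxt in VOWELS, "b"),
    ("β } [^aeiou] > v", lambda nxt: nxt not in VOWELS, "v"),
    ("β > p", lambda nxt: True, "p"),
]


def convert(text):
    out, fired = [], set()
    for i, ch in enumerate(text):
        if ch != "β":
            out.append(ch)
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ETHER
        for name, matches, target in RULES:
            if matches(nxt):
                out.append(target)
                fired.add(name)
                break
    return "".join(out), fired


result, fired = convert("βa ββ")
print(result)            # ba vv
print("β > p" in fired)  # False
```

Every possible following character is either a vowel or not a vowel, so one of the first two rules always matches and the third can never fire.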

## Combinations

In Greek, a rough breathing mark on one of the first two vowels in a word
represents an "H". This mark is invalid anywhere else in the language. In the
normalized (NFD) form, the rough-breathing mark will be the first accent after
the vowel (with perhaps other accents following). So, we will start with the
following variables and rule. The rule transforms a rough breathing mark into an
"H", and moves it to before the vowels.

    $gvowel = [ΑΕΗΙΟΥΩαεηιουω];
    ($gvowel + ) ̔ > H | $1;

A word like "ὍΤΑΝ" is transformed into "HOTAN". This transformation does not work
with a lowercase word like "ὅταν". To handle lowercase words, we insert another
rule that moves the "H" over lowercase vowels and changes it to lowercase. The
following shows this rule:

    $gvowel = [ΑΕΗΙΟΥΩαεηιουω];
    $lcgvowel = [αεηιουω];
    ($lcgvowel +) ̔ > h | $1; # fix lowercase
    ($gvowel + ) ̔ > H | $1;

This rule provides the correct results as the lowercase word "ὅταν" is
transformed into "hotan".
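
The "move the breathing mark in front of the vowels" step can be emulated with regular expressions over decomposed text. This sketch covers only that step: it does not transliterate the remaining Greek letters, it returns text in NFD form, and `move_rough_breathing` is a hypothetical name.

```python
import re
import unicodedata

ROUGH = "\u0314"  # COMBINING REVERSED COMMA ABOVE (rough breathing)
LC_VOWELS = "αεηιουω"
ALL_VOWELS = "ΑΕΗΙΟΥΩ" + LC_VOWELS


def move_rough_breathing(text):
    text = unicodedata.normalize("NFD", text)
    # ($lcgvowel +) ̔ > h | $1;   lowercase rule listed first
    text = re.sub(f"([{LC_VOWELS}]+){ROUGH}", r"h\1", text)
    # ($gvowel +) ̔ > H | $1;
    text = re.sub(f"([{ALL_VOWELS}]+){ROUGH}", r"H\1", text)
    return text
```

For "ὅταν" this yields "h" followed by the omicron, its remaining acute accent, and "ταν". For the all-uppercase "ὍΤΑΝ" it yields "H" followed by the capital omicron and the rest of the word.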

There are also titlecase words such as "Ὅταν". For this situation, we need to
lowercase the uppercase letters as the transform passes over them. We need to do
that in two circumstances: (a) the breathing mark is on a capital letter
followed by a lowercase, or (b) the breathing mark is on a lowercase vowel. The
following shows how to write a rule for this situation:

    $gvowel = [ΑΕΗΙΟΥΩαεηιουω];
    $lcgvowel = [αεηιουω];

    # fix Titlecase
    {Ο ̔ } [:Nonspacing Mark:]* [:Ll:] > H | ο;

    # fix Titlecase
    {Ο ( $lcgvowel * ) ̔ } > H | ο $1;

    # fix lowercase
    ( $lcgvowel + ) ̔ > h | $1 ;
    ($gvowel + ) ̔ > H | $1 ;

This rule gives the correct results for titlecase as "Ὅταν" is transformed into
"Hotan". We must copy the above insertion and modify it for each of the vowels
since each has a different lowercase.

We must also write a rule to handle a single letter word like "ὃ". In that case,
we would need to look beyond the word, either forward or backward, to know
whether to transform it to "HO" or to transform it to "Ho". Unlike the case of a
capital theta (Θ), there are cases in the Greek language where single-vowel
words have rough breathing marks. In this case, we would use several rules to
match either before or after the word and ignore certain characters like
punctuation and space (watch out for combining marks).

## Pitfalls

1.  **Case** When executing script conversions, if the source script has
    uppercase and lowercase characters, and the target is lowercase, then
    lowercase everything before your first rule. For example:
    ```
    # lowercase target before applying forward rules
    :: [:Latin:] lower ();
    ```
    This will allow the rules to work even when they are given a mixture of
    uppercase and lowercase characters. This procedure is done in the following ICU
    transforms:
    -  Latin-Hangul
    -  Latin-Greek
    -  Latin-Cyrillic
    -  Latin-Devanagari
    -  Latin-Gujarati
    -  etc.

2.  **Punctuation** When executing script conversions, remember that scripts
    have different punctuation conventions. For example, in the Greek language,
    the ";" means a question mark. Generally, these punctuation marks also
    should be converted when transliterating scripts.

3.  **Normalization** Always design transform rules so that they work no matter
    whether the source is normalized or not. (This is also true for the target,
    in the case of backwards rules.) Generally, the best way to do this is to
    have `:: NFD (NFC);` as the first line of the rules, and `:: NFC (NFD);` as the
    last line. To supply filters, as described above, break each of these lines
    into two separate lines. Then, apply the filter to either the normal or
    inverse direction. Each of the accents then can be manipulated as separate
    items that are always in a canonical order. If we are not using any accent
    manipulation, we could use `:: NFC (NFC) ;` at the top of the rules instead.

4.  **Ignorable Characters** Letters may be followed by accents. Consider the
    following rule:
    ```
    # convert z after letters into s
    [:lowercase letter:] { z > s ;
    ```
    Normally, we want to ignore any accents on the preceding letter when
    performing the rule. To do that, restate the rule as:
    ```
    # convert z after letters into s
    [:lowercase letter:] [:mark:]* { z > s ;
    ```
    Even if we are not using NFD, this is still a good idea since some languages
    use separate accents that cannot be combined.
    Moreover, some languages may have embedded format codes, such as a
    Left-Right Mark, or a Non-Joiner. Because of that, it is even safer to use
    the following:
    ```
    # define at the top of your file
    $ignore = [ [:mark:] [:format:] ] * ;
    ...
    # convert z after letters into s
    [:letter:] $ignore { z > s ;
    ```
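
The idea of skipping ignorable characters when checking context, as in the last pitfall, can be sketched in Python. The emulation assumes that `[:mark:]` means a general category starting with `M`, that `[:format:]` means category `Cf`, and that `[:letter:]` means a category starting with `L`; the function names are hypothetical.

```python
import unicodedata


def is_ignorable(ch):
    category = unicodedata.category(ch)
    return category.startswith("M") or category == "Cf"


def z_after_letter_to_s(text):
    out = []
    for i, ch in enumerate(text):
        if ch == "z":
            # Walk back over marks and format characters to find the base.
            j = i - 1
            while j >= 0 and is_ignorable(text[j]):
                j -= 1
            if j >= 0 and unicodedata.category(text[j]).startswith("L"):
                out.append("s")
                continue
        out.append(ch)
    return "".join(out)


print(z_after_letter_to_s("az"))   # as
print(z_after_letter_to_s(" z"))   # " z" is unchanged: no letter before the z
```

A combining accent or a zero-width non-joiner between the letter and the "z" does not prevent the conversion in this sketch.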

> :point_right: **Note**: *Remember that the rules themselves must be in the same normalization format.
Otherwise, nothing will match. To do this, run NFD on the rules themselves. In
some cases, we must rearrange the order of the rules because of masking. For
example, consider the following rules:*

*If these rules are put in normalized form, then the second rule will mask the first. To avoid this, exchange the order because the NFD representation has the accents separate from the base character. We will not be able to see this on the screen if accents are rendered correctly. The following shows the NFD representation:*