# Copyright (C) 2016 and later: Unicode, Inc. and others.
# License & terms of use: http://www.unicode.org/copyright.html#License
#
# Corporation and others. All Rights Reserved.
# Copyright (c) 2012-2015 International Business Machines
# Corporation and others. All Rights Reserved.
#
# This file should be in UTF-8 with a signature byte sequence ("BOM").
#
# collationtest.txt: Collation test data.
#
# created on: 2012apr13
# created by: Markus W. Scherer

# A line with "** test: description" is used for verbose and error output.

# A collator can be set with "@ root" or "@ locale language-tag",
# for example "@ locale de-u-co-phonebk".
# An old-style locale ID can also be used, for example "@ locale de@collation=phonebook".

# A collator can be built with "@ rules".
# An "@ rules" line is followed by the lines with the tailoring rules
# (possibly none, as in the "empty rules" test below).

# A collator can be modified with "% attribute=value".

# "* compare" tests the order (= or <) of the following strings.
# The relation can be "=" or "<" (the level of the difference is not specified)
# or "<1", "<2", "<c", "<3", "<4", "<i" (indicating the level of the difference;
# "<i" is used below for differences on the identical level).

# Test sections ("* compare") are terminated by
# definitions of new collators, changing attributes, or new test sections.

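# For illustration only: a minimal sketch of how the directives described above
# fit together. The test name and the rule are hypothetical and are not part of
# the original test data; the lines are commented out so that they are not run.
# As in the real tests, the first string of a "* compare" section is compared
# against the empty string.
#
#   ** test: illustrative example
#   @ root
#   % strength=tertiary
#   * compare
#   <1 a
#   <3 A
#   <1 b
#   <3 B
#
#   @ rules
#   &b<a
#   * compare
#   <1 b
#   <1 a
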
** test: simple CEs & expansions
# Many types of mappings are tested elsewhere, including via the UCA conformance tests.
# Here we mostly cover a few unusual mappings.
@ rules
&\x01                           # most control codes are ignorable
<<<\u0300                       # tertiary CE
&9<\x00                         # NUL not ignorable
&\uA00A\uA00B=\uA002            # two long-primary CEs
&\uA00A\uA00B\U00050005=\uA003  # three CEs, require 64 bits

* compare
=  \x01
=  \x02
<3 \u0300
<1 9
<1 \x00
=  \x01\x00\x02
<1 a
<3 a\u0300
<2 a\u0308
=  ä
<1 b
<1 か        # Hiragana Ka (U+304B)
<2 か\u3099  # plus voiced sound mark
=  が        # Hiragana Ga (U+304C)
<1 \uA00A\uA00B
=  \uA002
<1 \uA00A\uA00B\U00050004
<1 \uA00A\uA00B\U00050005
=  \uA003
<1 \uA00A\uA00B\U00050006
64
65** test: contractions
66# Create some interesting mappings, and map some normalization-inert characters
67# (which are not subject to canonical reordering)
68# to some of the same CEs to check the sequence of CEs.
69@ rules
70
71# Contractions starting with 'a' should not continue with any character < U+0300
72# so that we can test a shortcut for that.
73&a=ⓐ
74&b<bz=ⓑ
75&d<dz\u0301=ⓓ           # d+z+acute
76&z
77<a\u0301=Ⓐ              # a+acute sorts after z
78<a\u0301\u0301=Ⓑ        # a+acute+acute
79<a\u0301\u0301\u0358=Ⓒ  # a+acute+acute+dot above right
80<a\u030a=Ⓓ              # a+ring
81<a\u0323=Ⓔ              # a+dot below
82<a\u0323\u0358=Ⓕ        # a+dot below+dot above right
83<a\u0327\u0323\u030a=Ⓖ  # a+cedilla+dot below+ring
84<a\u0327\u0323bz=Ⓗ      # a+cedilla+dot below+b+z
85
86&\U0001D158=⁰           # musical notehead black (has a symbol primary)
87<\U0001D158\U0001D165=¼ # musical quarter note
88
89# deliberately missing prefix contractions:
90# dz
91# a\u0327
92# a\u0327\u0323
93# a\u0327\u0323b
94
95&\x01
96<<<\U0001D165=¹         # musical stem (ccc=216)
97<<<\U0001D16D=²         # musical augmentation dot (ccc=226)
98<<<\U0001D165\U0001D16D=³  # stem+dot (ccc=216 226)
99&\u0301=❶               # acute (ccc=230)
100&\u030a=❷               # ring (ccc=230)
101&\u0308=❸               # diaeresis (ccc=230)
102<<\u0308\u0301=❹        # diaeresis+acute (=dialytika tonos) (ccc=230 230)
103&\u0327=❺               # cedilla (ccc=202)
104&\u0323=❻               # dot below (ccc=220)
105&\u0331=❼               # macron below (ccc=220)
106<<\u0331\u0358=❽        # macron below+dot above right (ccc=220 232)
107&\u0334=❾               # tilde overlay (ccc=1)
108&\u0358=❿               # dot above right (ccc=232)
109
110&\u0f71=①               # tibetan vowel sign aa
111&\u0f72=②               # tibetan vowel sign i
112#  \u0f71\u0f72         # tibetan vowel sign aa + i = ii = U+0F73
113&\u0f73=③               # tibetan vowel sign ii (ccc=0 but lccc=129)
114
115** test: simple contractions
116
117# Some strings are chosen to cause incremental contiguous contraction matching to
118# go into partial matches for prefixes of contractions
119# (where the prefixes are deliberately not also contractions).
120# When there is no complete match, then the matching code must back out of those
121# so that discontiguous contractions work as specified.
122
123* compare
124# contraction starter with no following text, or mismatch, or blocked
125<1 a
126=  ⓐ
127<1 aa
128=  ⓐⓐ
129<1 ab
130=  ⓐb
131<1 az
132=  ⓐz
133
134* compare
135<1 a
136<2 a\u0308\u030a  # ring blocked by diaeresis
137=  ⓐ❸❷
138<2 a\u0327
139=  ⓐ❺
140
141* compare
142<2 \u0308
143=  ❸
144<2 \u0308\u030a\u0301  # acute blocked by ring
145=  ❸❷❶
146
147* compare
148<1 \U0001D158
149=  ⁰
150<1 \U0001D158\U0001D165
151=  ¼
152
153# no discontiguous contraction because of missing prefix contraction d+z,
154# and a starter ('z') after the 'd'
155* compare
156<1 dz\u0323\u0301
157=  dz❻❶
158
159# contiguous contractions
160* compare
161<1 abz
162=  ⓐⓑ
163<1 abzz
164=  ⓐⓑz
165
166* compare
167<1 a
168<1 z
169<1 a\u0301
170=  Ⓐ
171<1 a\u0301\u0301
172=  Ⓑ
173<1 a\u0301\u0301\u0358
174=  Ⓒ
175<1 a\u030a
176=  Ⓓ
177<1 a\u0323\u0358
178=  Ⓕ
179<1 a\u0327\u0323\u030a  # match despite missing prefix
180=  Ⓖ
181<1 a\u0327\u0323bz
182=  Ⓗ
183
184* compare
185<2 \u0308\u0308\u0301  # acute blocked from first diaeresis, contracts with second
186=  ❸❹
187
188* compare
189<1 \U0001D158\U0001D165
190=  ¼
191
192* compare
193<3 \U0001D165\U0001D16D
194=  ³
195
196** test: discontiguous contractions
197* compare
198<1 a\u0327\u030a                # a+ring skips cedilla
199=  Ⓓ❺
200<2 a\u0327\u0327\u030a          # a+ring skips 2 cedillas
201=  Ⓓ❺❺
202<2 a\u0327\u0327\u0327\u030a    # a+ring skips 3 cedillas
203=  Ⓓ❺❺❺
204<2 a\u0334\u0327\u0327\u030a    # a+ring skips tilde overlay & 2 cedillas
205=  Ⓓ❾❺❺
206<1 a\u0327\u0323                # a+dot below skips cedilla
207=  Ⓔ❺
208<1 a\u0323\u0301\u0358          # a+dot below+dot ab.r.: 2-char match, then skips acute
209=  Ⓕ❶
210<2 a\u0334\u0323\u0358          # a+dot below skips tilde overlay
211=  Ⓕ❾
212
213* compare
214<2 \u0331\u0331\u0358           # macron below+dot ab.r. skips the second macron below
215=  ❽❼
216
217* compare
218<1 a\u0327\u0331\u0323\u030a    # a+ring skips cedilla, macron below, dot below (dot blocked by macron)
219=  Ⓓ❺❼❻
220<1 a\u0327\u0323\U0001D16D\u030a  # a+dot below skips cedilla
221=  Ⓔ❺²❷
222<2 a\u0327\u0327\u0323\u030a    # a+dot below skips 2 cedillas
223=  Ⓔ❺❺❷
224<2 a\u0327\u0323\u0323\u030a    # a+dot below skips cedilla
225=  Ⓔ❺❻❷
226<2 a\u0334\u0327\u0323\u030a    # a+dot below skips tilde overlay & cedilla
227=  Ⓔ❾❺❷
228
229* compare
230<1 \U0001D158\u0327\U0001D165   # quarter note skips cedilla
231=  ¼❺
232<1 a\U0001D165\u0323            # a+dot below skips stem
233=  Ⓔ¹
234
235# partial contiguous match, backs up, matches discontiguous contraction
236<1 a\u0327\u0323b
237=  Ⓔ❺b
238<1 a\u0327\u0323ba
239=  Ⓔ❺bⓐ
240
241# a+acute+acute+dot above right skips cedilla, continues matching 2 same-ccc combining marks
242* compare
243<1 a\u0327\u0301\u0301\u0358
244=  Ⓒ❺
245
246# FCD but not NFD
247* compare
248<1 a\u0f73\u0301                # a+acute skips tibetan ii
249=  Ⓐ③
250
251# FCD but the 0f71 inside the 0f73 must be skipped
252# to match the discontiguous contraction of the first 0f71 with the trailing 0f72 inside the 0f73
253* compare
254<1 \u0f71\u0f73                 # == \u0f73\u0f71 == \u0f71\u0f71\u0f72
255=  ③①
256
257** test: discontiguous contractions with nested contractions
258* compare
259<1 a\u0323\u0308\u0301\u0358
260=  Ⓕ❹
261<2 a\u0323\u0308\u0301\u0308\u0301\u0358
262=  Ⓕ❹❹
263
264** test: discontiguous contractions with interleaved contractions
265* compare
266# a+ring & cedilla & macron below+dot above right
267<1 a\u0327\u0331\u030a\u0358
268=  Ⓓ❺❽
269
270# a+ring & 1x..3x macron below+dot above right
271<2 a\u0331\u030a\u0358
272=  Ⓓ❽
273<2 a\u0331\u0331\u030a\u0358\u0358
274=  Ⓓ❽❽
275# also skips acute
276<2 a\u0331\u0331\u0331\u030a\u0301\u0358\u0358\u0358
277=  Ⓓ❽❽❽❶
278
279# a+dot below & stem+augmentation dot, followed by contiguous d+z+acute
280<1 a\U0001D165\u0323\U0001D16Ddz\u0301
281=  Ⓔ³ⓓ
282
283** test: some simple string comparisons
284@ root
285* compare
286# first string compares against ""
287= \u0000
288< a
289<1 b
290<3 B
291= \u0000B\u0000
292
293** test: compare with strength=primary
294% strength=primary
295* compare
296<1 a
297<1 b
298= B
299
300** test: compare with strength=secondary
301% strength=secondary
302* compare
303<1 a
304<1 b
305= B
306
307** test: compare with strength=tertiary
308% strength=tertiary
309* compare
310<1 a
311<1 b
312<3 B
313
314** test: compare with strength=quaternary
315% strength=quaternary
316* compare
317<1 a
318<1 b
319<3 B
320
321** test: compare with strength=identical
322% strength=identical
323* compare
324<1 a
325<1 b
326<3 B
327
328** test: côté with forwards secondary
329@ root
330* compare
331<1 cote
332<2 coté
333<2 côte
334<2 côté
335
336** test: côté with forwards secondary vs. U+FFFE merge separator
337# Merged sort keys: On each level, any difference in the first segment
338# must trump any further difference.
339* compare
340<1 cote\uFFFEcôté
341<2 coté\uFFFEcôte
342<2 côte\uFFFEcoté
343<2 côté\uFFFEcote
344
345** test: côté with backwards secondary
346% backwards=on
347* compare
348<1 cote
349<2 côte
350<2 coté
351<2 côté
352
353** test: côté with backwards secondary vs. U+FFFE merge separator
354# Merged sort keys: On each level, any difference in the first segment
355# must trump any further difference.
356* compare
357<1 cote\uFFFEcôté
358<2 côte\uFFFEcoté
359<2 coté\uFFFEcôte
360<2 côté\uFFFEcote
361
362** test: U+FFFE on identical level
363@ root
364% strength=identical
365* compare
366# All of these control codes are completely-ignorable, so that
367# their low code points are compared with the merge separator.
368# The merge separator must compare less than any other character.
369<1 \uFFFE\u0001\u0002\u0003
370<i \u0001\uFFFE\u0002\u0003
371<i \u0001\u0002\uFFFE\u0003
372<i \u0001\u0002\u0003\uFFFE
373
374* compare
375# The merge separator must even compare less than U+0000.
376<1 \uFFFE\u0000\u0000
377<i \u0000\uFFFE\u0000
378<i \u0000\u0000\uFFFE
379
380** test: Hani < surrogates < U+FFFD
381# Note: compareUTF8() treats unpaired surrogates like U+FFFD,
382# so with that the strings with surrogates will compare equal to each other
383# and equal to the string with U+FFFD.
384@ root
385% strength=identical
386* compare
387<1 abz
388<1 a\u4e00z
389<1 a\U00020000z
390<1 a\ud800z
391<1 a\udbffz
392<1 a\udc00z
393<1 a\udfffz
394<1 a\ufffdz
395
396** test: script reordering
397@ root
398% reorder Hani Zzzz digit
399* compare
400<1 ?
401<1 +
402<1 丂
403<1 a
404<1 α
405<1 5
406
407% reorder default
408* compare
409<1 ?
410<1 +
411<1 5
412<1 a
413<1 α
414<1 丂
415
416** test: empty rules
417@ rules
418* compare
419<1 a
420<2 ä
421<3 Ä
422<1 b
423
424** test: very simple rules
425@ rules
426&a=e<<<<q<<<<r<x<<<X<<y<<<Y;z,Z
427% strength=quaternary
428* compare
429<1 a
430=  e
431<4 q
432<4 r
433<1 x
434<3 X
435<2 y
436<3 Y
437<2 z
438<3 Z
439
440** test: tailoring twice before a root position: primary
441@ rules
442&[before 1]b<p
443&[before 1]b<q
444* compare
445<1 a
446<1 p
447<1 q
448<1 b
449
450** test: tailoring twice before a root position: secondary
451@ rules
452&[before 2]ſ<<p
453&[before 2]ſ<<q
454* compare
455<1 s
456<2 p
457<2 q
458<2 ſ
459
460# secondary-before common weight
461@ rules
462&[before 2]b<<p
463&[before 2]b<<q
464* compare
465<1 a
466<1 p
467<2 q
468<2 b
469
470** test: tailoring twice before a root position: tertiary
471@ rules
472&[before 3]B<<<p
473&[before 3]B<<<q
474* compare
475<1 b
476<3 p
477<3 q
478<3 B
479
480# tertiary-before common weight
481@ rules
482&[before 3]b<<<p
483&[before 3]b<<<q
484* compare
485<1 a
486<1 p
487<3 q
488<3 b
489
490@ rules
491&[before 2]b<<s
492&[before 3]s<<<p
493&[before 3]s<<<q
494* compare
495<1 a
496<1 p
497<3 q
498<3 s
499<2 b
500
501** test: tailor after completely ignorable
502@ rules
503&\x00<<<x<<y
504* compare
505= \x00
506= \x1F
507<3 x
508<2 y
509
510** test: secondary tailoring gaps, ICU ticket 9362
511@ rules
512&[before 2]s<<'_'
513&s<<r  # secondary between s and ſ (long s)
514&ſ<<*a-q  # more than 15 between ſ and secondary CE boundary
515&[before 2][first primary ignorable]<<u<<v  # between secondary CE boundary & lowest secondary CE
516&[last primary ignorable]<<y<<z
517
518* compare
519<2 u
520<2 v
521<2 \u0332  # lowest secondary CE
522<2 \u0308
523<2 y
524<2 z
525<1 s_
526<2 ss
527<2 sr
528<2 sſ
529<2 sa
530<2 sb
531<2 sp
532<2 sq
533<2 sus
534<2 svs
535<2 rs
536
** test: tertiary tailoring gaps, ICU ticket 9362
@ rules
&[before 3]t<<<'_'
&t<<<r  # tertiary between t and fullwidth t
&ᵀ<<<*a-q  # more than 15 between ᵀ (modifier letter T) and tertiary CE boundary
&[before 3][first secondary ignorable]<<<u<<<v  # between tertiary CE boundary & lowest tertiary CE
&[last secondary ignorable]<<<y<<<z

* compare
<3 u
<3 v
# Note: The root collator currently does not map any characters to tertiary CEs.
<3 y
<3 z
<1 t_
<3 tt
<3 tr
<3 t\uFF54  # t + fullwidth t (U+FF54)
<3 tᵀ
<3 ta
<3 tb
<3 tp
<3 tq
<3 tut
<3 tvt
<3 rt
563
564** test: secondary & tertiary around root character
565@ rules
566&[before 2]m<<r
567&m<<s
568&[before 3]m<<<u
569&m<<<v
570* compare
571<1 l
572<1 r
573<2 u
574<3 m
575<3 v
576<2 s
577<1 n
578
579** test: secondary & tertiary around tailored item
580@ rules
581&m<x
582&[before 2]x<<r
583&x<<s
584&[before 3]x<<<u
585&x<<<v
586* compare
587<1 m
588<1 r
589<2 u
590<3 x
591<3 v
592<2 s
593<1 n
594
595** test: more nesting of secondary & tertiary before
596@ rules
597&[before 3]m<<<u
598&[before 2]m<<r
599&[before 3]r<<<q
600&m<<<w
601&m<<t
602&[before 3]w<<<v
603&w<<<x
604&w<<s
605* compare
606<1 l
607<1 q
608<3 r
609<2 u
610<3 m
611<3 v
612<3 w
613<3 x
614<2 s
615<2 t
616<1 n
617
618** test: case bits
619@ rules
620&w<x  # tailored CE getting case bits
621  =uv=uV=Uv=UV  # 2 chars -> 1 CE
622&ae=ch=cH=Ch=CH  # 2 chars -> 2 CEs
623&rst=yz=yZ=Yz=YZ  # 2 chars -> 3 CEs
624% caseFirst=lower
625* compare
626<1 ae
627=  ch
628<3 cH
629<3 Ch
630<3 CH
631<1 rst
632=  yz
633<3 yZ
634<3 Yz
635<3 YZ
636<1 w
637<1 x
638=  uv
639<3 uV
640=  Uv  # mixed case on single CE cannot distinguish variations
641<3 UV
642
643** test: tertiary CEs, tertiary, caseLevel=off, caseFirst=lower
644@ rules
645&\u0001<<<t<<<T  # tertiary CEs
646% caseFirst=lower
647* compare
648<1 aa
649<3 aat
650<3 aaT
651<3 aA
652<3 aAt
653<3 ata
654<3 aTa
655
656** test: tertiary CEs, tertiary, caseLevel=off, caseFirst=upper
657% caseFirst=upper
658* compare
659<1 aA
660<3 aAt
661<3 aa
662<3 aat
663<3 aaT
664<3 ata
665<3 aTa
666
667** test: reset on expansion, ICU tickets 9415 & 9593
668@ rules
669&æ<x    # tailor the last primary CE so that x sorts between ae and af
670&æb=bæ  # copy all reset CEs to make bæ sort the same
671&각<h    # copy/tailor 3 CEs to make h sort before the next Hangul syllable 갂
672&⒀<<y   # copy/tailor 4 CEs to make y sort with only a secondary difference
673&l·=z   # handle the pre-context for · when fetching reset CEs
674   <<u  # copy/tailor 2 CEs
675
676* compare
677<1 ae
678<2 æ
679<1 x
680<1 af
681
682* compare
683<1 aeb
684<2 æb
685=  bæ
686
687* compare
688<1 각
689<1 h
690<1 갂
691<1 갃
692
693* compare
694<1 ·    # by itself: primary CE
695<1 l
696<2 l·   # l+middle dot has only a secondary difference from l
697=  z
698<2 u
699
700* compare
701<1 (13)
702<3 ⒀  # DUCET sets special tertiary weights in all CEs
703<2 y
704<1 (13[
705
706% alternate=shifted
707* compare
708<1 (13)
709=  13
710<3 ⒀
711=  y  # alternate=shifted removes the tailoring difference on the last CE
712<1 14
713
714** test: contraction inside extension, ICU ticket 9378
715@ rules
716&а<<х/й     # all letters are Cyrillic
717* compare
718<1 ай
719<2 х
720
721** test: no duplicate tailored CEs for different reset positions with same CEs, ICU ticket 10104
722@ rules
723&t<x &ᵀ<y           # same primary weights
724&q<u &[before 1]ꝗ<v # q and ꝗ are primary adjacent
725* compare
726<1 q
727<1 u
728<1 v
729<1 ꝗ
730<1 t
731<3 ᵀ
732<1 y
733<1 x
734
735# Principle: Each rule builds on the state of preceding rules and ignores following rules.
736
737** test: later rule does not affect earlier reset position, ICU ticket 10105
738@ rules
739&a < u < v < w  &ov < x  &b < v
740* compare
741<1 oa
742<1 ou
743<1 x    # CE(o) followed by CE between u and w
744<1 ow
745<1 ob
746<1 ov
747
748** test: later rule does not affect earlier extension (1), ICU ticket 10105
749@ rules
750&a=x/b &v=b
751% strength=secondary
752* compare
753<1 B
754<1 c
755<1 v
756=  b
757* compare
758<1 AB
759=  x
760<1 ac
761<1 av
762=  ab
763
764** test: later rule does not affect earlier extension (2), ICU ticket 10105
765@ rules
766&a <<< c / e &g <<< e / l
767% strength=secondary
768* compare
769<1 AE
770=  c
771<2 æ
772<1 agl
773=  ae
774
775** test: later rule does not affect earlier extension (3), ICU ticket 10105
776@ rules
777&a = b / c  &d = c / e
778% strength=secondary
779* compare
780<1 AC  # C is still only tertiary different from the original c
781=  b
782<1 ade
783=  ac
784
785** test: extension contains tailored character, ICU ticket 10105
786@ rules
787&a=e &b=u/e
788* compare
789<1 a
790=  e
791<1 ba
792=  be
793=  u
794
795** test: add simple mappings for characters with root context
796@ rules
797&z=·    # middle dot has a prefix mapping in the CLDR root
798&n=и    # и (U+0438) has contractions in the root
799* compare
800<1 l
801<2 l·   # root mapping for l|· still works
802<1 z
803=  ·
804* compare
805<1 n
806=  и
807<1 И
808<1 и\u0306  # root mapping for й=и\u0306 still works
809=  й
810<3 Й
811
812** test: add context mappings around characters with root context
813@ rules
814&z=·h   # middle dot has a prefix mapping in the CLDR root
815&n=ә|и  # и (U+0438) has contractions in the root
816* compare
817<1 l
818<2 l·   # root mapping for l|· still works
819<1 z
820=  ·h
821* compare
822<1 и
823<3 И
824<1 и\u0306  # root mapping for й=и\u0306 still works
825=  й
826* compare
827<1 әn
828=  әи
829<1 әo
830
831** test: many secondary CEs at the top of their range
832@ rules
833&[last primary ignorable]<<*\u2801-\u28ff
834* compare
835<2 \u0308
836<2 \u2801
837<2 \u2802
838<2 \u2803
839<2 \u2804
840<2 \u28fd
841<2 \u28fe
842<2 \u28ff
843<1 \x20
844
845** test: many tertiary CEs at the top of their range
846@ rules
847&[last secondary ignorable]<<<*a-z
848* compare
849<3 a
850<3 b
851<3 c
852<3 d
853# e..w
854<3 x
855<3 y
856<3 z
857<2 \u0308
858
859** test: tailor contraction together with nearly equivalent prefix, ICU ticket 10101
860@ rules
861&a=p|x &b=px &c=op
862* compare
863<1 b
864=  px
865<3 B
866<1 c
867=  op
868<3 C
869* compare
870<1 ca
871=  opx  # first contraction op, then prefix p|x
872<3 cA
873<3 Ca
874
875** test: reset position with prefix (pre-context), ICU ticket 10102
876@ rules
877&a=p|x &px=y
878* compare
879<1 pa
880=  px
881=  y
882<3 pA
883<1 q
884<1 x
885
886** test: prefix+contraction together (1), ICU ticket 10071
887@ rules
888&x=a|bc
889* compare
890<1 ab
891<1 Abc
892<1 abd
893<1 ac
894<1 aw
895<1 ax
896=  abc
897<3 aX
898<3 Ax
899<1 b
900<1 bb
901<1 bc
902<3 bC
903<3 Bc
904<1 bd
905
906** test: prefix+contraction together (2), ICU ticket 10071
907@ rules
908&w=bc &x=a|b
909* compare
910<1 w
911=  bc
912<3 W
913* compare
914<1 aw
915<1 ax
916=  ab
917<3 aX
918<1 axb
919<1 axc
920=  abc  # prefix match a|b takes precedence over contraction match bc
921<3 abC
922<1 abd
923<1 ay
924
925** test: prefix+contraction together (3), ICU ticket 10071
926@ rules
927&x=a|b &w=bc    # reverse order of rules as previous test, order should not matter here
928* compare       # same "compare" sequences as previous test
929<1 w
930=  bc
931<3 W
932* compare
933<1 aw
934<1 ax
935=  ab
936<3 aX
937<1 axb
938<1 axc
939=  abc  # prefix match a|b takes precedence over contraction match bc
940<3 abC
941<1 abd
942<1 ay
943
944** test: no mapping p|c, falls back to contraction ch, CLDR ticket 5962
945@ rules
946&d=ch &v=p|ci
947* compare
948<1 pc
949<3 pC
950<1 pcH
951<1 pcI
952<1 pd
953=  pch  # no-prefix contraction ch matches
954<3 pD
955<1 pv
956=  pci  # prefix+contraction p|ci matches
957<3 pV
958
959** test: tailor in & around compact ranges of root primaries
960# The Ogham characters U+1681..U+169A are in simple ascending order of primary CEs
961# which should be reliably encoded as one range in the root elements data.
962@ rules
963&[before 1]ᚁ<a
964&ᚁ<b
965&[before 1]ᚂ<c
966&ᚂ<d
967&[before 1]ᚚ<y
968&ᚚ<z
969&[before 2]ᚁ<<r
970&ᚁ<<s
971&[before 3]ᚚ<<<t
972&ᚚ<<<u
973* compare
974<1 ᣵ    # U+18F5 last Canadian Aboriginal
975<1 a
976<1 r
977<2 ᚁ
978<2 s
979<1 b
980<1 c
981<1 ᚂ
982<1 d
983<1 ᚃ
984<1 ᚙ
985<1 y
986<1 t
987<3 ᚚ
988<3 u
989<1 z
990<1 ᚠ    # U+16A0 first Runic
991
992** test: suppressContractions
993@ rules
994&z<ch<әж [suppressContractions [·cә]]
995* compare
996<1 ch
997<3 cH   # ch was suppressed
998<1 l
999<1 l·   # primary difference, not secondary, because l|· was suppressed
1000<1 ә
1001<2 ә\u0308  # secondary difference, not primary, because contractions for ә were suppressed
1002<1 әж
1003<3 әЖ
1004
1005** test: Hangul & Jamo
1006@ rules
1007&L=\u1100  # first Jamo L
1008&V=\u1161  # first Jamo V
1009&T=\u11A8  # first Jamo T
1010&\uAC01<<*\u4E00-\u4EFF  # first Hangul LVT syllable & lots of secondary diffs
1011* compare
1012<1 Lv
1013<3 LV
1014=  \u1100\u1161
1015=  \uAC00
1016<1 LVt
1017<3 LVT
1018=  \u1100\u1161\u11A8
1019=  \uAC00\u11A8
1020=  \uAC01
1021<2 LVT\u0308
1022<2 \u4E00
1023<2 \u4E01
1024<2 \u4E80
1025<2 \u4EFF
1026<2 LV\u0308T
1027<1 \uAC02
1028
1029** test: adjust special reset positions according to previous rules, CLDR ticket 6070
1030@ rules
1031&[last variable]<x
1032[maxVariable space]  # has effect only after building, no effect on following rules
1033&[last variable]<y
1034&[before 1][first regular]<z
1035* compare
1036<1 ?  # some punctuation
1037<1 x
1038<1 y
1039<1 z
1040<1 $  # some symbol
1041
1042@ rules
1043&[last primary ignorable]<<x<<<y
1044&[last primary ignorable]<<z
1045* compare
1046<2 \u0358
1047<2 x
1048<3 y
1049<2 z
1050<1 \x20
1051
1052@ rules
1053&[last secondary ignorable]<<<x
1054&[last secondary ignorable]<<<y
1055* compare
1056<3 x
1057<3 y
1058<2 \u0358
1059
1060@ rules
1061&[before 2][first variable]<<z
1062&[before 2][first variable]<<y
1063&[before 3][first variable]<<<x
1064&[before 3][first variable]<<<w
1065&[before 1][first variable]<v
1066&[before 2][first variable]<<u
1067&[before 3][first variable]<<<t
1068&[before 2]\uFDD1\xA0<<s  # FractionalUCA.txt: FDD1 00A0, SPACE first primary
1069* compare
1070<2 \u0358
1071<1 s
1072<2 \uFDD1\xA0
1073<1 t
1074<3 u
1075<2 v
1076<1 w
1077<3 x
1078<3 y
1079<2 z
1080<2 \t
1081
1082@ rules
1083&[before 2][first regular]<<z
1084&[before 3][first regular]<<<y
1085&[before 1][first regular]<x
1086&[before 3][first regular]<<<w
1087&[before 2]\uFDD1\u263A<<v  # FractionalUCA.txt: FDD1 263A, SYMBOL first primary
1088&[before 3][first regular]<<<u
1089&[before 1][first regular]<p  # primary before the boundary: becomes variable
1090&[before 3][first regular]<<<t  # not affected by p
1091&[last variable]<q              # after p!
1092* compare
1093<1 ?
1094<1 p
1095<1 q
1096<1 t
1097<3 u
1098<3 v
1099<1 w
1100<3 x
1101<1 y
1102<3 z
1103<1 $
1104
1105# check that p & q are indeed variable
1106% alternate=shifted
1107* compare
1108=  ?
1109=  p
1110=  q
1111<1 t
1112<3 u
1113<3 v
1114<1 w
1115<3 x
1116<1 y
1117<3 z
1118<1 $
1119
1120@ rules
1121&[before 2][first trailing]<<z
1122&[before 1][first trailing]<y
1123&[before 3][first trailing]<<<x
1124* compare
1125<1 \u4E00  # first Han, first implicit
1126<1 \uFDD1\uFDD0  # FractionalUCA.txt: unassigned first primary
1127# Note: The root collator currently does not map any characters to the trailing first boundary primary.
1128<1 x
1129<3 y
1130<1 z
1131<2 \uFFFD  # The root collator currently maps U+FFFD to the first real trailing primary.
1132
1133@ rules
1134&[before 2][first primary ignorable]<<z
1135&[before 2][first primary ignorable]<<y
1136&[before 3][first primary ignorable]<<<x
1137&[before 3][first primary ignorable]<<<w
1138* compare
1139=  \x01
1140<2 w
1141<3 x
1142<3 y
1143<2 z
1144<2 \u0301
1145
1146@ rules
1147&[before 3][first secondary ignorable]<<<y
1148&[before 3][first secondary ignorable]<<<x
1149* compare
1150=  \x01
1151<3 x
1152<3 y
1153<2 \u0301
1154
1155** test: canonical closure
1156@ rules
1157&X=A &U=Â
1158* compare
1159<1 U
1160=  Â
1161=  A\u0302
1162<2 Ú  # U with acute
1163=  U\u0301
1164=  Ấ  # A with circumflex & acute
1165=  Â\u0301
1166=  A\u0302\u0301
1167<1 X
1168=  A
1169<2 X\u030A  # with ring above
1170=  Å
1171=  A\u030A
1172=  \u212B  # Angstrom sign
1173
1174@ rules
1175&x=\u5140\u55C0
1176* compare
1177<1 x
1178=  \u5140\u55C0
1179=  \u5140\uFA0D
1180=  \uFA0C\u55C0
1181=  \uFA0C\uFA0D  # CJK compatibility characters
1182<3 X
1183
1184# canonical closure on prefix rules, ICU ticket 9444
1185@ rules
1186&x=ä|ŝ
1187* compare
1188<1 äs  # not tailored
1189<1 äx
1190=  äŝ
1191=  a\u0308s\u0302
1192=  a\u0308ŝ
1193=  äs\u0302
1194<3 äX
1195
1196** test: conjoining Jamo map to expansions
1197@ rules
1198&gg=\u1101  # Jamo Lead consonant GG
1199&nj=\u11AC  # Jamo Trail consonant NJ
1200* compare
1201<1 gg\u1161nj
1202=  \u1101\u1161\u11AC
1203=  \uAE4C\u11AC
1204=  \uAE51
1205<3 gg\u1161nJ
1206<1 \u1100\u1100
1207
1208** test: canonical tail closure, ICU ticket 5913
1209@ rules
1210&a<â
1211* compare
1212<1 a
1213<1 â              # tailored
1214=  a\u0302
1215<2 a\u0323\u0302  # discontiguous contraction
1216=  ạ\u0302        # equivalent
1217=  ậ              # equivalent
1218<1 b
1219
1220@ rules
1221&a<ạ
1222* compare
1223<1 a
1224<1 ạ              # tailored
1225=  a\u0323
1226<2 a\u0323\u0302  # contiguous contraction plus extra diacritic
1227=  ạ\u0302        # equivalent
1228=  ậ              # equivalent
1229<1 b
1230
1231# Tail closure should work even if there is a prefix and/or contraction.
1232@ rules
1233&a<\u5140|câ
1234# In order to find discontiguous contractions for \u5140|câ
1235# there must exist a mapping for \u5140|ca, regardless of what it maps to.
1236# (This follows from the UCA spec.)
1237&x=\u5140|ca
1238* compare
1239<1 \u5140a
1240=  \uFA0Ca
1241<1 \u5140câ              # tailored
1242=  \uFA0Ccâ
1243=  \u5140ca\u0302
1244=  \uFA0Cca\u0302
1245<2 \u5140ca\u0323\u0302  # discontiguous contraction
1246=  \uFA0Cca\u0323\u0302
1247=  \u5140cạ\u0302
1248=  \uFA0Ccạ\u0302
1249=  \u5140cậ
1250=  \uFA0Ccậ
1251<1 \u5140b
1252=  \uFA0Cb
1253<1 \u5140x
1254=  \u5140ca
1255
1256# Double-check that without the extra mapping there will be no discontiguous match.
1257@ rules
1258&a<\u5140|câ
1259* compare
1260<1 \u5140a
1261=  \uFA0Ca
1262<1 \u5140câ              # tailored
1263=  \uFA0Ccâ
1264=  \u5140ca\u0302
1265=  \uFA0Cca\u0302
1266<1 \u5140b
1267=  \uFA0Cb
1268<1 \u5140ca\u0323\u0302  # no discontiguous contraction
1269=  \uFA0Cca\u0323\u0302
1270=  \u5140cạ\u0302
1271=  \uFA0Ccạ\u0302
1272=  \u5140cậ
1273=  \uFA0Ccậ
1274
1275@ rules
1276&a<cạ
1277* compare
1278<1 a
1279<1 cạ              # tailored
1280=  ca\u0323
1281<2 ca\u0323\u0302  # contiguous contraction plus extra diacritic
1282=  cạ\u0302        # equivalent
1283=  cậ              # equivalent
1284<1 b
1285
1286# ᾢ = U+1FA2 GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI
1287#   = 03C9 0313 0300 0345
1288# ccc = 0, 230, 230, 240
1289@ rules
1290&δ=αῳ
1291# In order to find discontiguous contractions for αῳ
1292# there must exist a mapping for αω, regardless of what it maps to.
1293# (This follows from the UCA spec.)
1294&ε=αω
1295* compare
1296<1 δ
1297=  αῳ
1298=  αω\u0345
1299<2 αω\u0313\u0300\u0345  # discontiguous contraction
1300=  αὠ\u0300\u0345
1301=  αὢ\u0345
1302=  αᾢ
1303<2 αω\u0300\u0313\u0345
1304=  αὼ\u0313\u0345
1305=  αῲ\u0313  # not FCD
1306<1 ε
1307=  αω
1308
1309# Double-check that without the extra mapping there will be no discontiguous match.
1310@ rules
1311&δ=αῳ
1312* compare
1313<1 αω\u0313\u0300\u0345  # no discontiguous contraction
1314=  αὠ\u0300\u0345
1315=  αὢ\u0345
1316=  αᾢ
1317<2 αω\u0300\u0313\u0345
1318=  αὼ\u0313\u0345
1319=  αῲ\u0313  # not FCD
1320<1 δ
1321=  αῳ
1322=  αω\u0345
1323
1324# Add U+0315 COMBINING COMMA ABOVE RIGHT which has ccc=232.
1325# Tests code paths where the tailored string has a combining mark
1326# that does not occur in any composite's decomposition.
1327@ rules
1328&δ=αὼ\u0315
1329* compare
1330<1 αω\u0313\u0300\u0315  # Not tailored: The grave accent blocks the comma above.
1331=  αὠ\u0300\u0315
1332=  αὢ\u0315
1333<1 δ
1334=  αὼ\u0315
1335=  αω\u0300\u0315
1336<2 αω\u0300\u0315\u0345
1337=  αὼ\u0315\u0345
1338=  αῲ\u0315  # not FCD
1339
** test: Danish a+a vs. a-umlaut, ICU ticket 9319
1341@ rules
1342&z<aa
1343* compare
1344<1 z
1345<1 aa
1346<2 aa\u0308
1347=  aä
1348
1349** test: Jamo L with and in prefix
1350# Useful for the Korean "searchjl" tailoring (instead of contractions of pairs of Jamo L).
1351@ rules
1352# Jamo Lead consonant G after G or GG
1353&[last primary ignorable]<<\u1100|\u1100=\u1101|\u1100
1354# Jamo Lead consonant GG sorts like G+G
1355&\u1100\u1100=\u1101
1356# Note: Making G|GG and GG|GG sort the same as G|G+G
1357# would require the ability to reset on G|G+G,
1358# or we could make G-after-G equal to some secondary-CE character,
1359# and reset on a pair of those.
1360# (It does not matter much if there are at most two G in a row in real text.)
1361* compare
1362<1 \u1100
1363<2 \u1100\u1100  # only one primary from a sequence of G lead consonants
1364=  \u1101
1365<2 \u1100\u1100\u1100
1366=  \u1101\u1100
1367# but not = \u1100\u1101, see above
1368<1 \u1100\u1161
1369=  \uAC00
1370<2 \u1100\u1100\u1161
1371=  \u1100\uAC00  # prefix match from the L of the LV syllable
1372=  \u1101\u1161
1373=  \uAE4C
1374
1375** test: proposed Korean "searchjl" tailoring with prefixes, CLDR ticket 6546
1376@ rules
1377# Low secondary CEs for Jamo V & T.
1378# Note: T should sort before V for proper syllable order.
1379&\u0332  # COMBINING LOW LINE (first primary ignorable)
1380<<\u1161<<\u1162
1381
1382# Korean Jamo lead consonant search rules, part 2:
1383# Make modern compound L jamo primary equivalent to non-compound forms.
1384
1385# Secondary CEs for Jamo L-after-L, greater than Jamo V & T.
1386&\u0313  # COMBINING COMMA ABOVE (second primary ignorable)
1387=\u1100|\u1100
1388=\u1103|\u1103
1389=\u1107|\u1107
1390=\u1109|\u1109
1391=\u110C|\u110C
1392
1393# Compound L Jamo map to equivalent expansions of primary+secondary CE.
1394&\u1100\u0313=\u1101<<<\u3132  # HANGUL CHOSEONG SSANGKIYEOK, HANGUL LETTER SSANGKIYEOK
1395&\u1103\u0313=\u1104<<<\u3138  # HANGUL CHOSEONG SSANGTIKEUT, HANGUL LETTER SSANGTIKEUT
1396&\u1107\u0313=\u1108<<<\u3143  # HANGUL CHOSEONG SSANGPIEUP, HANGUL LETTER SSANGPIEUP
1397&\u1109\u0313=\u110A<<<\u3146  # HANGUL CHOSEONG SSANGSIOS, HANGUL LETTER SSANGSIOS
1398&\u110C\u0313=\u110D<<<\u3149  # HANGUL CHOSEONG SSANGCIEUC, HANGUL LETTER SSANGCIEUC
1399
1400* compare
1401<1 \u1100\u1161
1402=  \uAC00
1403<2 \u1100\u1162
1404=  \uAC1C
1405<2 \u1100\u1100\u1161
1406=  \u1100\uAC00
1407=  \u1101\u1161
1408=  \uAE4C
1409<3 \u3132\u1161
1410
1411** test: Hangul syllables in prefix & in the interior of a contraction
1412@ rules
1413&x=\u1100\u1161|a\u1102\u1162z
1414* compare
1415<1 \u1100\u1161x
1416=  \u1100\u1161a\u1102\u1162z
1417=  \u1100\u1161a\uB0B4z
1418=  \uAC00a\u1102\u1162z
1419=  \uAC00a\uB0B4z
1420
1421** test: digits are unsafe-backwards when numeric=on
1422@ root
1423% numeric=on
1424* compare
1425# If digits are not unsafe, then numeric collation sees "1"=="01" and "b">"a".
1426# We need to back up before the identical prefix "1" and compare the full numbers.
1427<1 11b
1428<1 101a
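# Illustrative addition, not part of the original ICU data: a minimal sketch that
# assumes the same root collator with numeric=on as above. Digit sequences compare
# by numeric value, so "2" sorts before "12" although "1" < "2" code point by code point.
* compare
<1 2
<1 12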
1429
1430** test: simple locale data test
1431@ locale de
1432* compare
1433<1 a
1434<2 ä
1435<1 ae
1436<2 æ
1437
1438@ locale de-u-co-phonebk
1439* compare
1440<1 a
1441<1 ae
1442<2 ä
1443<2 æ
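# Illustrative addition, not part of the original ICU data: the header of this file
# says an old-style locale ID may be used instead of a language tag. Assuming both
# IDs select the same German phonebook collator, the order matches the block above.
@ locale de@collation=phonebook
* compare
<1 a
<1 ae
<2 ä
<2 æ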
1444
1445# The following test cases were moved here from ICU 52's DataDrivenCollationTest.txt.
1446
1447** test: DataDrivenCollationTest/TestMorePinyin
1448# Testing the primary strength.
1449@ locale zh
1450% strength=primary
1451* compare
1452< lā
1453= lĀ
1454= Lā
1455= LĀ
1456< lān
1457= lĀn
1458< lē
1459= lĒ
1460= Lē
1461= LĒ
1462< lēn
1463= lĒn
1464
1465** test: DataDrivenCollationTest/TestLithuanian
1466# Lithuanian sort order.
1467@ locale lt
1468* compare
1469< cz
1470< č
1471< d
1472< iz
1473< j
1474< sz
1475< š
1476< t
1477< zz
1478< ž
1479
1480** test: DataDrivenCollationTest/TestLatvian
1481# Latvian sort order.
1482@ locale lv
1483* compare
1484< cz
1485< č
1486< d
1487< gz
1488< ģ
1489< h
1490< iz
1491< j
1492< kz
1493< ķ
1494< l
1495< lz
1496< ļ
1497< m
1498< nz
1499< ņ
1500< o
1501< rz
1502< ŗ
1503< s
1504< sz
1505< š
1506< t
1507< zz
1508< ž
1509
1510** test: DataDrivenCollationTest/TestEstonian
1511# Estonian sort order.
1512@ locale et
1513* compare
1514< sy
1515< š
1516< šy
1517< z
1518< zy
1519< ž
1520< v
1521< va
1522< w
1523< õ
1524< õy
1525< ä
1526< äy
1527< ö
1528< öy
1529< ü
1530< üy
1531< x
1532
1533** test: DataDrivenCollationTest/TestAlbanian
1534# Albanian sort order.
1535@ locale sq
1536* compare
1537< cz
1538< ç
1539< d
1540< dz
1541< dh
1542< e
1543< ez
1544< ë
1545< f
1546< gz
1547< gj
1548< h
1549< lz
1550< ll
1551< m
1552< nz
1553< nj
1554< o
1555< rz
1556< rr
1557< s
1558< sz
1559< sh
1560< t
1561< tz
1562< th
1563< u
1564< xz
1565< xh
1566< y
1567< zz
1568< zh
1569
1570** test: DataDrivenCollationTest/TestSimplifiedChineseOrder
1571# Sorted file has different order.
1572@ root
1573# normalization=on turned on & off automatically.
1574* compare
1575< \u5F20
1576< \u5F20\u4E00\u8E3F
1577
1578** test: DataDrivenCollationTest/TestTibetanNormalizedIterativeCrash
1579# Regression test: as the test name says, comparing these strings used to crash.
1580@ root
1581* compare
1582< \u0f71\u0f72\u0f80\u0f71\u0f72
1583< \u0f80
1584
1585** test: DataDrivenCollationTest/TestThaiPartialSortKeyProblems
1586# These are examples of strings that caused trouble in partial sort key testing.
1587@ locale th-TH
1588* compare
1589< \u0E01\u0E01\u0E38\u0E18\u0E20\u0E31\u0E13\u0E11\u0E4C
1590< \u0E01\u0E01\u0E38\u0E2A\u0E31\u0E19\u0E42\u0E18
1591* compare
1592< \u0E01\u0E07\u0E01\u0E32\u0E23
1593< \u0E01\u0E07\u0E42\u0E01\u0E49
1594* compare
1595< \u0E01\u0E23\u0E19\u0E17\u0E32
1596< \u0E01\u0E23\u0E19\u0E19\u0E40\u0E0A\u0E49\u0E32
1597* compare
1598< \u0E01\u0E23\u0E30\u0E40\u0E08\u0E35\u0E22\u0E27
1599< \u0E01\u0E23\u0E30\u0E40\u0E08\u0E35\u0E4A\u0E22\u0E27
1600* compare
1601< \u0E01\u0E23\u0E23\u0E40\u0E0A\u0E2D
1602< \u0E01\u0E23\u0E23\u0E40\u0E0A\u0E49\u0E32
1603
1604** test: DataDrivenCollationTest/TestJavaStyleRule
1605# java.text allows rules to start with a relation, as in '<<<x<<<y...'.
1606# We emulate this by assuming a leading &[first tertiary ignorable] in this case.
1607@ rules
1608&\u0001=equal<<<z<<x<<<w &[first tertiary ignorable]=a &[first primary ignorable]=b
1609* compare
1610= a
1611= equal
1612< z
1613< x
1614= b  # x had become the new first primary ignorable
1615< w
1616
1617** test: DataDrivenCollationTest/TestShiftedIgnorable
1618# The UCA states that primary ignorables should be completely
1619# ignorable when following a shifted code point.
1620@ root
1621% alternate=shifted
1622% strength=quaternary
1623* compare
1624< a\u0020b
1625= a\u0020\u0300b
1626= a\u0020\u0301b
1627< a_b
1628= a_\u0300b
1629= a_\u0301b
1630< A\u0020b
1631= A\u0020\u0300b
1632= A\u0020\u0301b
1633< A_b
1634= A_\u0300b
1635= A_\u0301b
1636< a\u0301b
1637< A\u0301b
1638< a\u0300b
1639< A\u0300b
1640
1641** test: DataDrivenCollationTest/TestNShiftedIgnorable
1642# Counterpart to TestShiftedIgnorable: with alternate=non-ignorable nothing is shifted,
1643# so primary ignorables after a space or underscore are not ignored and still make a difference.
1644@ root
1645% alternate=non-ignorable
1646% strength=tertiary
1647* compare
1648< a\u0020b
1649< A\u0020b
1650< a\u0020\u0301b
1651< A\u0020\u0301b
1652< a\u0020\u0300b
1653< A\u0020\u0300b
1654< a_b
1655< A_b
1656< a_\u0301b
1657< A_\u0301b
1658< a_\u0300b
1659< A_\u0300b
1660< a\u0301b
1661< A\u0301b
1662< a\u0300b
1663< A\u0300b
1664
1665** test: DataDrivenCollationTest/TestSafeSurrogates
1666# It turned out that surrogates were not skipped properly
1667# when iterating backwards if they were in the middle of a
1668# contraction. This test ensures that this stays fixed.
1669@ rules
1670&a < x\ud800\udc00b
1671* compare
1672< a
1673< x\ud800\udc00b
1674
1675** test: DataDrivenCollationTest/da_TestPrimary
1676# This test goes through primary strength cases
1677@ locale da
1678% strength=primary
1679* compare
1680< Lvi
1681< Lwi
1682* compare
1683< L\u00e4vi
1684< L\u00f6wi
1685* compare
1686< L\u00fcbeck
1687= Lybeck
1688
1689** test: DataDrivenCollationTest/da_TestTertiary
1690# This test goes through tertiary strength cases
1691@ locale da
1692% strength=tertiary
1693* compare
1694< Luc
1695< luck
1696* compare
1697< luck
1698< L\u00fcbeck
1699* compare
1700< lybeck
1701< L\u00fcbeck
1702* compare
1703< L\u00e4vi
1704< L\u00f6we
1705* compare
1706< L\u00f6ww
1707< mast
1708
1709* compare
1710< A/S
1711< ANDRE
1712< ANDR\u00c9
1713< ANDREAS
1714< AS
1715< CA
1716< \u00c7A
1717< CB
1718< \u00c7C
1719< D.S.B.
1720< DA
1721< \u00d0A
1722< DB
1723< \u00d0C
1724< DSB
1725< DSC
1726< EKSTRA_ARBEJDE
1727< EKSTRABUD0
1728< H\u00d8ST
1729< HAAG
1730< H\u00c5NDBOG
1731< HAANDV\u00c6RKSBANKEN
1732< Karl
1733< karl
1734< NIELS\u0020J\u00d8RGEN
1735< NIELS-J\u00d8RGEN
1736< NIELSEN
1737< R\u00c9E,\u0020A
1738< REE,\u0020B
1739< R\u00c9E,\u0020L
1740< REE,\u0020V
1741< SCHYTT,\u0020B
1742< SCHYTT,\u0020H
1743< SCH\u00dcTT,\u0020H
1744< SCHYTT,\u0020L
1745< SCH\u00dcTT,\u0020M
1746< SS
1747< \u00df
1748< SSA
1749< STORE\u0020VILDMOSE
1750< STOREK\u00c6R0
1751< STORM\u0020PETERSEN
1752< STORMLY
1753< THORVALD
1754< THORVARDUR
1755< \u00feORVAR\u00d0UR
1756< THYGESEN
1757< VESTERG\u00c5RD,\u0020A
1758< VESTERGAARD,\u0020A
1759< VESTERG\u00c5RD,\u0020B
1760< \u00c6BLE
1761< \u00c4BLE
1762< \u00d8BERG
1763< \u00d6BERG
1764
1765* compare
1766< andere
1767< chaque
1768< chemin
1769< cote
1770< cot\u00e9
1771< c\u00f4te
1772< c\u00f4t\u00e9
1773< \u010du\u010d\u0113t
1774< Czech
1775< hi\u0161a
1776< irdisch
1777< lie
1778< lire
1779< llama
1780< l\u00f5ug
1781< l\u00f2za
1782< lu\u010d
1783< luck
1784< L\u00fcbeck
1785< lye
1786< l\u00e4vi
1787< L\u00f6wen
1788< m\u00e0\u0161ta
1789< m\u00eer
1790< myndig
1791< M\u00e4nner
1792< m\u00f6chten
1793< pi\u00f1a
1794< pint
1795< pylon
1796< \u0161\u00e0ran
1797< savoir
1798< \u0160erb\u016bra
1799< Sietla
1800< \u015blub
1801< subtle
1802< symbol
1803< s\u00e4mtlich
1804< verkehrt
1805< vox
1806< v\u00e4ga
1807< waffle
1808< wood
1809< yen
1810< yuan
1811< yucca
1812< \u017eal
1813< \u017eena
1814< \u017den\u0113va
1815< zoo0
1816< Zviedrija
1817< Z\u00fcrich
1818< zysk0
1819< \u00e4ndere
1820
1821** test: DataDrivenCollationTest/hi_TestNewRules
1822# This test goes through new rules and tests against old rules
1823@ locale hi
1824* compare
1825< कॐ
1826< कं
1827< कँ
1828< कः
1829
1830** test: DataDrivenCollationTest/ro_TestNewRules
1831# This test goes through new rules and tests against old rules
1832@ locale ro
1833* compare
1834< xAx
1835< xă
1836< xĂ
1837< Xă
1838< XĂ
1839< xăx
1840< xĂx
1841< xâ
1842< xÂ
1843< Xâ
1844< XÂ
1845< xâx
1846< xÂx
1847< xb
1848< xIx
1849< xî
1850< xÎ
1851< Xî
1852< XÎ
1853< xîx
1854< xÎx
1855< xj
1856< xSx
1857< xș
1858= xş
1859< xȘ
1860= xŞ
1861< Xș
1862= Xş
1863< XȘ
1864= XŞ
1865< xșx
1866= xşx
1867< xȘx
1868= xŞx
1869< xT
1870< xTx
1871< xț
1872= xţ
1873< xȚ
1874= xŢ
1875< Xț
1876= Xţ
1877< XȚ
1878= XŢ
1879< xțx
1880= xţx
1881< xȚx
1882= xŢx
1883< xU
1884
1885** test: DataDrivenCollationTest/testOffsets
1886# This tests cases where forwards and backwards iteration get different offsets
1887@ locale en
1888% strength=tertiary
1889* compare
1890< a\uD800\uDC00\uDC00
1891< b\uD800\uDC00\uDC00
1892* compare
1893< \u0301A\u0301\u0301
1894< \u0301B\u0301\u0301
1895* compare
1896< abcd\r\u0301
1897< abce\r\u0301
1898# TODO: test offsets in new CollationTest
1899
1900# End of test cases moved here from ICU 52's DataDrivenCollationTest.txt.
1901
1902** test: was ICU 52 cmsccoll/TestRedundantRules
1903@ rules
1904& a < b < c < d& [before 1] c < m
1905* compare
1906<1 a
1907<1 b
1908<1 m
1909<1 c
1910<1 d
1911
1912@ rules
1913& a < b <<< c << d <<< e& [before 3] e <<< x
1914* compare
1915<1 a
1916<1 b
1917<3 c
1918<2 d
1919<3 x
1920<3 e
1921
1922@ rules
1923& a < b <<< c << d <<< e <<< f < g& [before 1] g < x
1924* compare
1925<1 a
1926<1 b
1927<3 c
1928<2 d
1929<3 e
1930<3 f
1931<1 x
1932<1 g
1933
1934@ rules
1935& a <<< b << c < d& a < m
1936* compare
1937<1 a
1938<3 b
1939<2 c
1940<1 m
1941<1 d
1942
1943@ rules
1944&a<b<<b\u0301 &z<b
1945* compare
1946<1 a
1947<1 b\u0301
1948<1 z
1949<1 b
1950
1951@ rules
1952&z<m<<<q<<<m
1953* compare
1954<1 z
1955<1 q
1956<3 m
1957
1958@ rules
1959&z<<<m<q<<<m
1960* compare
1961<1 z
1962<1 q
1963<3 m
1964
1965@ rules
1966& a < b < c < d& r < c
1967* compare
1968<1 a
1969<1 b
1970<1 d
1971<1 r
1972<1 c
1973
1974@ rules
1975& a < b < c < d& c < m
1976* compare
1977<1 a
1978<1 b
1979<1 c
1980<1 m
1981<1 d
1982
1983@ rules
1984& a < b < c < d& a < m
1985* compare
1986<1 a
1987<1 m
1988<1 b
1989<1 c
1990<1 d
1991
1992** test: was ICU 52 cmsccoll/TestExpansionSyntax
1993# The following two rules should sort the particular list of strings the same.
1994@ rules
1995&AE <<< a << b <<< c &d <<< f
1996* compare
1997<1 AE
1998<3 a
1999<2 b
2000<3 c
2001<1 d
2002<3 f
2003
2004@ rules
2005&A <<< a / E << b / E <<< c /E  &d <<< f
2006* compare
2007<1 AE
2008<3 a
2009<2 b
2010<3 c
2011<1 d
2012<3 f
2013
2014# The following two rules should sort the particular list of strings the same.
2015@ rules
2016&AE <<< a <<< b << c << d < e < f <<< g
2017* compare
2018<1 AE
2019<3 a
2020<3 b
2021<2 c
2022<2 d
2023<1 e
2024<1 f
2025<3 g
2026
2027@ rules
2028&A <<< a / E <<< b / E << c / E << d / E < e < f <<< g
2029* compare
2030<1 AE
2031<3 a
2032<3 b
2033<2 c
2034<2 d
2035<1 e
2036<1 f
2037<3 g
2038
2039# The following two rules should sort the particular list of strings the same.
2040@ rules
2041&AE <<< B <<< C / D <<< F
2042* compare
2043<1 AE
2044<3 B
2045<3 F
2046<1 AED
2047<3 C
2048
2049@ rules
2050&A <<< B / E <<< C / ED <<< F / E
2051* compare
2052<1 AE
2053<3 B
2054<3 F
2055<1 AED
2056<3 C
2057
2058** test: never reorder trailing primaries
2059@ root
2060% reorder Zzzz Grek
2061* compare
2062<1 L
2063<1 字
2064<1 Ω
2065<1 \uFFFD
2066<1 \uFFFF
2067
2068** test: fall back to mappings with shorter prefixes, not immediately to ones with no prefixes
2069@ rules
2070&u=ab|cd
2071&v=b|ce
2072* compare
2073<1 abc
2074<1 abcc
2075<1 abcf
2076<1 abcd
2077=  abu
2078<1 abce
2079=  abv
2080
2081# With the following rules, there is only one prefix per composite ĉ or ç,
2082# but both prefixes apply to just c in NFD form.
2083# We would get different results for composed vs. NFD input
2084# if we fell back directly from longest-prefix mappings to no-prefix mappings.
2085@ rules
2086&x=op|ĉ
2087&y=p|ç
2088* compare
2089<1 opc
2090<2 opć
2091<1 opcz
2092<1 opd
2093<1 opĉ
2094=  opc\u0302
2095=  opx
2096<1 opç
2097=  opc\u0327
2098=  opy
2099
2100# The mapping used is the one with the longest matching prefix for which
2101# there is also a suffix match; among several mappings for that prefix, the longest suffix match wins.
2102@ rules
2103&❶=d
2104&❷=de
2105&❸=def
2106&①=c|d
2107&②=c|de
2108&③=c|def
2109&④=bc|d
2110&⑤=bc|de
2111&⑥=bc|def
2112&⑦=abc|d
2113&⑧=abc|de
2114&⑨=abc|def
2115* compare
2116<1 9aadzz
2117=  9aa❶zz
2118<1 9aadez
2119=  9aa❷z
2120<1 9aadef
2121=  9aa❸
2122<1 9acdzz
2123=  9ac①zz
2124<1 9acdez
2125=  9ac②z
2126<1 9acdef
2127=  9ac③
2128<1 9bcdzz
2129=  9bc④zz
2130<1 9bcdez
2131=  9bc⑤z
2132<1 9bcdef
2133=  9bc⑥
2134<1 abcdzz
2135=  abc⑦zz
2136<1 abcdez
2137=  abc⑧z
2138<1 abcdef
2139=  abc⑨
2140
2141** test: prefix + discontiguous contraction with missing prefix contraction
2142# Unfortunate terminology: The first "prefix" here is the pre-context,
2143# the second "prefix" refers to the contraction/relation string that is
2144# one shorter than the one being tested.
2145@ rules
2146&x=p|e
2147&y=p|ê
2148&z=op|ê
2149# No mapping for op|e:
2150# Discontiguous contraction matching should not match op|ê in opệ
2151# because it would have to skip the dot below and extend a match on op|e by the circumflex,
2152# but there is no match on op|e.
2153* compare
2154<1 oPe
2155<1 ope
2156=  opx
2157<1 opệ
2158=  opy\u0323  # y not z
2159<1 opê
2160=  opz
2161
2162# We cannot test for fallback by whether the contraction default CE32
2163# is for another contraction. With the following rules, there is no mapping for op|e,
2164# and the fallback to prefix p has no contractions.
2165@ rules
2166&x=p|e
2167&z=op|ê
2168* compare
2169<1 oPe
2170<1 ope
2171=  opx
2172<2 opệ
2173=  opx\u0323\u0302  # x not z
2174<1 opê
2175=  opz
2176
2177# One more variation: Fallback to the simple code point, no shorter non-empty prefix.
2178@ rules
2179&x=e
2180&z=op|ê
2181* compare
2182<1 ope
2183=  opx
2184<3 oPe
2185=  oPx
2186<2 opệ
2187=  opx\u0323\u0302  # x not z
2188<1 opê
2189=  opz
2190
2191** test: maxVariable via rules
2192@ rules
2193[maxVariable space][alternate shifted]
2194* compare
2195=  \u0020
2196=  \u000A
2197<1 .
2198<1 °  # degree sign
2199<1 $
2200<1 0
2201
2202** test: maxVariable via setting
2203@ root
2204% maxVariable=currency
2205% alternate=shifted
2206* compare
2207=  \u0020
2208=  \u000A
2209=  .
2210=  °  # degree sign
2211=  $
2212<1 0
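
** test: maxVariable=punct (illustrative)
# Illustrative addition, not part of the original ICU data: a sketch of the setting
# in between, assuming "punct" is accepted as a maxVariable value and that the degree
# sign belongs to the symbol group. Spaces and punctuation are shifted; symbols,
# currency symbols and digits keep their primary weights.
@ root
% maxVariable=punct
% alternate=shifted
* compare
=  \u0020
=  \u000A
=  .
<1 °  # degree sign
<1 $
<1 0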
2213
2214** test: ICU4J CollationMiscTest/TestContractionClosure (ää)
2215# This tests canonical closure, but it also tests that CollationFastLatin
2216# bails out properly for contractions with combining marks.
2217# For that we need pairs of strings that remain in the Latin fastpath
2218# long enough, hence the extra "= b" lines.
2219@ rules
2220&b=\u00e4\u00e4
2221* compare
2222<1 b
2223=  \u00e4\u00e4
2224=  b
2225=  a\u0308a\u0308
2226=  b
2227=  \u00e4a\u0308
2228=  b
2229=  a\u0308\u00e4
2230
2231** test: ICU4J CollationMiscTest/TestContractionClosure (Å)
2232@ rules
2233&b=\u00C5
2234* compare
2235<1 b
2236=  \u00C5
2237=  b
2238=  A\u030A
2239=  b
2240=  \u212B
2241
2242** test: reset-before on already-tailored characters, ICU ticket 10108
2243@ rules
2244&a<w<<x &[before 2]x<<y
2245* compare
2246<1 a
2247<1 w
2248<2 y
2249<2 x
2250
2251@ rules
2252&a<<w<<<x &[before 2]x<<y
2253* compare
2254<1 a
2255<2 y
2256<2 w
2257<3 x
2258
2259@ rules
2260&a<w<x &[before 2]x<<y
2261* compare
2262<1 a
2263<1 w
2264<1 y
2265<2 x
2266
2267@ rules
2268&a<w<<<x &[before 2]x<<y
2269* compare
2270<1 a
2271<1 y
2272<2 w
2273<3 x
2274
2275** test: numeric collation with other settings, ICU ticket 9092
2276@ root
2277% strength=identical
2278% caseFirst=upper
2279% numeric=on
2280* compare
2281<1 100\u0020a
2282<1 101
2283
2284** test: collation type fallback from unsupported type, ICU ticket 10149
2285@ locale fr-CA-u-co-phonebk
2286# Expect the same result as with fr-CA, using backwards-secondary order.
2287# That is, we should fall back from the unsupported collation type
2288# to the locale's default collation type.
2289* compare
2290<1 cote
2291<2 côte
2292<2 coté
2293<2 côté
2294
2295** test: @ is equivalent to [backwards 2], ICU ticket 9956
2296@ rules
2297&b<a @ &v<<w
2298* compare
2299<1 b
2300<1 a
2301<1 cote
2302<2 côte
2303<2 coté
2304<2 côté
2305<1 v
2306<2 w
2307<1 x
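
** test: [backwards 2] spelled out (illustrative)
# Illustrative addition, not part of the original ICU data: a sketch assuming that
# the [backwards 2] setting may appear between rule chains, where "@" stands above.
# If the two forms are equivalent, the expected order is the same as above.
@ rules
&b<a [backwards 2] &v<<w
* compare
<1 b
<1 a
<1 cote
<2 côte
<2 coté
<2 côté
<1 v
<2 w
<1 x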
2308
2309** test: shifted+reordering, ICU ticket 9507
2310@ root
2311% reorder Grek punct space
2312% alternate=shifted
2313% strength=quaternary
2314# Which primaries are "variable" should be determined without script reordering,
2315# and then primaries should be reordered whether they are shifted to quaternary or not.
2316* compare
2317<4 (  # punctuation
2318<4 )
2319<4 \u0020  # space
2320<1 `  # symbol
2321<1 ^
2322<1 $  # currency symbol
2323<1 €
2324<1 0  # numbers
2325<1 ε  # Greek
2326<1 e  # Latin
2327<1 e(e
2328<4 e)e
2329<4 e\u0020e
2330<4 ee
2331<3 e(E
2332<4 e)E
2333<4 e\u0020E
2334<4 eE
2335
2336** test: "uppercase first" could sort a string before its prefix, ICU ticket 9351
2337@ rules
2338&\u0001<<<b<<<B
2339% caseFirst=upper
2340* compare
2341<1 aaa
2342<3 aaaB
2343
2344** test: secondary+case ignores secondary ignorables, ICU ticket 9355
2345@ rules
2346&\u0001<<<b<<<B
2347% strength=secondary
2348% caseLevel=on
2349* compare
2350<1 a
2351=  ab
2352=  aB
2353
2354** test: custom collation rules involving tail of a contraction in Malayalam, ICU ticket 6328
2355@ rules
2356&[before 2] ൌ << ൗ  # U+0D57 << U+0D4C == 0D46+0D57
2357* compare
2358<1 ൗx
2359<2 ൌx
2360<1 ൗy
2361<2 ൌy
2362
2363** test: quoted apostrophe in compact syntax, ICU ticket 8204
2364@ rules
2365&q<<*a''c
2366* compare
2367<1 d
2368<1 p
2369<1 q
2370<2 a
2371<2 \u0027
2372<2 c
2373<1 r
2374
2375# ICU ticket #8260 "Support all collation-related keywords in Collator.getInstance()"
2376** test: locale -u- with collation keywords, ICU ticket 8260
2377@ locale de-u-kv-sPace-ka-shifTed-kn-kk-falsE-kf-Upper-kc-tRue-ks-leVel4
2378* compare
2379<4 \u0020  # space is shifted, strength=quaternary
2380<1 !  # punctuation is regular
2381<1 2
2382<1 12  # numeric sorting
2383<1 B
2384<c b  # uppercase first on case level
2385<1 x\u0301\u0308
2386<2 x\u0308\u0301  # normalization off
2387
2388** test: locale @ with collation keywords, ICU ticket 8260
2389@ locale fr@colbAckwards=yes;ColStrength=Quaternary;kv=currencY;colalternate=shifted
2390* compare
2391<4 $  # currency symbols are shifted, strength=quaternary
2392<1 àla
2393<2 alà  # backwards secondary level
2394
2395** test: locale -u- with script reordering, ICU ticket 8260
2396@ locale el-u-kr-kana-SYMBOL-Grek-hani-cyrl-latn-digit-armn-deva-ethi-thai
2397* compare
2398<1 \u0020
2399<1 あ
2400<1 ☂
2401<1 Ω
2402<1 丂
2403<1 ж
2404<1 L
2405<1 4
2406<1 Ձ
2407<1 अ
2408<1 ሄ
2409<1 ฉ
2410
2411** test: locale @collation=type should be case-insensitive
2412@ locale de@coLLation=PhoneBook
2413* compare
2414<1 ae
2415<2 ä
2416<3 Ä
2417
2418** test: import root search rules plus German phonebook rules, ICU ticket 8962
2419@ locale de-u-co-search
2420* compare
2421<1 =
2422<1 ≠
2423<1 a
2424<1 ae
2425<2 ä
2426
2427# Once more, but with runtime builder.
2428@ rules
2429[import und-u-co-search][import de-u-co-phonebk]
2430* compare
2431<1 =
2432<1 ≠
2433<1 a
2434<1 ae
2435<2 ä
2436
2437# Once again, with import from "root" not "und" (as in a proper language tag).
2438@ rules
2439[import root-u-co-search][import de-u-co-phonebk]
2440* compare
2441<1 =
2442<1 ≠
2443<1 a
2444<1 ae
2445<2 ä
2446
2447** test: import rules from a language with non-Latin native script, and reset the reordering, ICU ticket 10998
2448# Greek should sort Greek first.
2449@ rules
2450[import el]
2451* compare
2452<1 4
2453<1 Ω
2454<1 L
2455
2456# Import Greek, and then reset the reordering.
2457@ rules
2458[import el][reorder Zzzz]
2459* compare
2460<1 4
2461<1 L
2462<1 Ω
2463
2464# "others" is a synonym for Zzzz.
2465@ rules
2466[import el][reorder others]
2467* compare
2468<1 4
2469<1 L
2470<1 Ω
2471
2472** test: regression test for CollationFastLatinBuilder, ICU ticket 11388
2473@ rules
2474&x<<aa<<<Aa<<<AA
2475% strength=secondary
2476* compare
2477<1 AA
2478<2 Aẩ
2479<2 aą
2480* compare
2481<1 AA
2482<2 aą
2483
2484** test: tailor tertiary-after a common tertiary where there is a lower one
2485# Assume that Hiragana small A has a below-common tertiary, and Hiragana A has a common one.
2486# See ICU ticket 11448 & CLDR ticket 7222.
2487@ rules
2488&あ<<<x<<<y<<<z
2489* compare
2490<1 ぁ
2491<3 あ
2492<3 x
2493<3 y
2494<3 z
2495<3 ァ
2496<1 い
2497
2498** test: tailor tertiary-after a below-common tertiary
2499@ rules
2500&ぁ<<<x<<<y<<<z
2501* compare
2502<1 ぁ
2503<3 x
2504<3 y
2505<3 z
2506<3 あ
2507<3 ァ
2508<1 い
2509
2510** test: tailor tertiary-before a common tertiary where there is a lower one
2511@ rules
2512&[before 3]あ<<<x<<<y<<<z
2513* compare
2514<1 ぁ
2515<3 x
2516<3 y
2517<3 z
2518<3 あ
2519<3 ァ
2520<1 い
2521
2522** test: tailor tertiary-before a below-common tertiary
2523@ rules
2524&[before 3]ぁ<<<x<<<y<<<z
2525* compare
2526<1 x
2527<3 y
2528<3 z
2529<3 ぁ
2530<3 あ
2531<3 ァ
2532<1 い
2533
2534** test: reorder single scripts not groups, ICU ticket 11449
2535@ root
2536% reorder Goth Latn
2537* compare
2538<1 4
2539<1 𐌰  # Gothic
2540<1 L
2541<1 Ω
2542# Before ICU 55, the following reordered together with Gothic.
2543<1 𐌀  # Old Italic
2544<1 𐑐  # Shavian
2545