# This file is derived from
#
#    http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
#
# which was created by Markus Kuhn <mkuhn@acm.org> - 2000-09-02
#
# Lines beginning with # and blank lines are ignored.
#
# Beyond that, this file consists of a series of test cases. Each test case
# consists of 2 or 3 lines:
#
#  1. A UTF-8 string
#  2. A status
#      VALID      : The string is a valid UTF-8 representation of valid Unicode
#      INCOMPLETE : The string has a partial character at the end
#      NOTUNICODE : The string is valid UTF-8, but the characters represented
#                   are not valid Unicode
#      OVERLONG   : The string includes overlong sequences
#      MALFORMED  : The string is not valid UTF-8
#  3. If the status is VALID or NOTUNICODE, the UCS-4 representation of the
#     string, as a series of hex numbers.

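# A minimal sketch, not part of the original test suite, of how a consumer
# might read the format described above, written in Python. The names
# TestCase and parse_cases are hypothetical. The sketch works on raw bytes
# because many test strings are deliberately not valid UTF-8, and it assumes
# that no test string contains a newline byte or starts with '#'.

```python
from typing import Iterator, List, NamedTuple, Optional

# Statuses that are followed by a third line of hex code points.
WITH_CODEPOINTS = {b"VALID", b"NOTUNICODE"}
ALL_STATUSES = WITH_CODEPOINTS | {b"INCOMPLETE", b"OVERLONG", b"MALFORMED"}


class TestCase(NamedTuple):  # hypothetical record type
    raw: bytes                       # the test string, left undecoded
    status: str                      # one of the five status keywords
    codepoints: Optional[List[int]]  # present only for VALID / NOTUNICODE


def parse_cases(data: bytes) -> Iterator[TestCase]:
    """Yield test cases from the file contents (hypothetical helper)."""
    lines = [ln.rstrip(b"\r") for ln in data.split(b"\n")]
    # Comment lines and blank lines are ignored.
    lines = [ln for ln in lines if ln.strip() and not ln.startswith(b"#")]
    i = 0
    while i < len(lines):
        if i + 1 >= len(lines):
            raise ValueError("test string without a status line")
        raw, status = lines[i], lines[i + 1].strip()
        if status not in ALL_STATUSES:
            raise ValueError(f"unknown status {status!r}")
        i += 2
        codepoints = None
        if status in WITH_CODEPOINTS:
            if i >= len(lines):
                raise ValueError("missing code point line")
            codepoints = [int(tok.decode("ascii"), 16) for tok in lines[i].split()]
            i += 1
        yield TestCase(raw, status.decode("ascii"), codepoints)
```
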
# 1  Some correct UTF-8 text
κόσμε
VALID
03ba 1f79 03c3 03bc 03b5

# 2.1  First possible sequence of a certain length
#
# FIXME - handle NULLS?
#
# [ NULL BYTE ]
#VALID
#0000


VALID
0080


VALID
0800

��
VALID
00010000

�����
NOTUNICODE
00200000

������
NOTUNICODE
04000000


VALID
0000007f

߿
VALID
000007ff

￿
NOTUNICODE
0000ffff

����
NOTUNICODE
001fffff

�����
NOTUNICODE
03ffffff

������
NOTUNICODE
7fffffff

# 2.3  Other boundary conditions


VALID
d7ff


VALID
e000


VALID
fffd

��
VALID
0010fffd

��
NOTUNICODE
0010ffff

����
NOTUNICODE
00110000

# 3.1  Unexpected continuation bytes


MALFORMED

MALFORMED
��
MALFORMED
���
MALFORMED
����
MALFORMED
�����
MALFORMED
������
MALFORMED
�������
MALFORMED
����������������������������������������������������������������
MALFORMED

# 3.2  Lonely start characters

� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �
MALFORMED
� � � � � � � � � � � � � � � �
MALFORMED
� � � � � � � �
MALFORMED
� � � �
MALFORMED
� �
MALFORMED

# 3.3  Sequences with last continuation byte missing


INCOMPLETE
��
INCOMPLETE
���
INCOMPLETE
����
INCOMPLETE
�����
INCOMPLETE

INCOMPLETE

INCOMPLETE
���
INCOMPLETE
����
INCOMPLETE
�����
INCOMPLETE

# 3.4  Concatenation of incomplete sequences

�����������������������������
MALFORMED

# 3.5  Impossible bytes


MALFORMED

MALFORMED
����
MALFORMED

#  Examples of an overlong ASCII character

��
OVERLONG
���
OVERLONG
����
OVERLONG
�����
OVERLONG
������
OVERLONG

#  Maximum overlong sequences

��
OVERLONG
���
OVERLONG
����
OVERLONG
�����
OVERLONG
������
OVERLONG

# Overlong representation of the NUL character

��
OVERLONG
���
OVERLONG
����
OVERLONG
�����
OVERLONG
������
OVERLONG

# Illegal code positions

# Single UTF-16 surrogates


NOTUNICODE
d800


NOTUNICODE
db7f


NOTUNICODE
db80


NOTUNICODE
dbff


NOTUNICODE
dc00


NOTUNICODE
df80


NOTUNICODE
dfff

# Paired UTF-16 surrogates

��
NOTUNICODE
d800 dc00

��
NOTUNICODE
d800 dfff

��
NOTUNICODE
db7f dc00

��
NOTUNICODE
db7f dfff

��
NOTUNICODE
db80 dc00

��
NOTUNICODE
db80 dfff

��
NOTUNICODE
dbff dc00

��
NOTUNICODE
dbff dfff

# Other illegal code positions


NOTUNICODE
fffe

￿
NOTUNICODE
ffff

################
#
# Some more tests, not from Markus Kuhn's file
#

# Mixed plane 0 and higher planes

A��B��C
VALID
41 00010000 42 10fffd 43
