# This file is derived from
#
#  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
#
# Which was created by Markus Kuhn <mkuhn@acm.org> - 2000-09-02
#
# Lines beginning with # and blank lines are ignored.
#
# Beyond that, this file consists of a series of test cases. Each test case
# consists of 2 or 3 lines:
#
#  1. A UTF-8 string
#  2. A status
#      VALID      : The string is a valid UTF-8 representation of valid Unicode
#      INCOMPLETE : The string has a partial character at the end
#      NOTUNICODE : The string is valid UTF-8, but the characters represented
#                   are not valid Unicode
#      OVERLONG   : The string includes overlong sequences
#      MALFORMED  : The string is not valid UTF-8
#  3. If the status is VALID or NOTUNICODE, the UCS-4 representation of the
#     string, as a series of hex numbers.

# 1 Some correct UTF-8 text
κόσμε
VALID
03ba 1f79 03c3 03bc 03b5

# 2.1 First possible sequence of a certain length
#
# FIXME - handle NULLS?
#
# [ NULL BYTE ]
#VALID
#0000


VALID
0080

ࠀ
VALID
0800


VALID
00010000

�����
NOTUNICODE
00200000

������
NOTUNICODE
04000000


VALID
0000007f

߿
VALID
000007ff


NOTUNICODE
0000ffff

����
NOTUNICODE
001fffff

�����
NOTUNICODE
03ffffff

������
NOTUNICODE
7fffffff

# 2.3 Other boundary conditions


VALID
d7ff


VALID
e000

�
VALID
fffd


VALID
0010fffd


NOTUNICODE
0010ffff

����
NOTUNICODE
00110000

# 3.1 Unexpected continuation bytes

�
MALFORMED
�
MALFORMED
��
MALFORMED
���
MALFORMED
����
MALFORMED
�����
MALFORMED
������
MALFORMED
�������
MALFORMED
���������������������������������������������������������������
MALFORMED

# 3.2 Lonely start characters

� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 
MALFORMED
� � � � � � � � � � � � � � � � 
MALFORMED
� � � � � � � � 
MALFORMED
� � � � 
MALFORMED
� � 
MALFORMED

# 3.3 Sequences with last continuation byte missing

�
INCOMPLETE
��
INCOMPLETE
���
INCOMPLETE
����
INCOMPLETE
�����
INCOMPLETE
�
INCOMPLETE
�
INCOMPLETE
���
INCOMPLETE
����
INCOMPLETE
�����
INCOMPLETE

# 3.4 Concatenation of incomplete sequences

�����������������������������
MALFORMED

# 3.5 Impossible bytes

�
MALFORMED
�
MALFORMED
����
MALFORMED

# Examples of an overlong ASCII character

��
OVERLONG
���
OVERLONG
����
OVERLONG
�����
OVERLONG
������
OVERLONG

# Maximum overlong sequences

��
OVERLONG
���
OVERLONG
����
OVERLONG
�����
OVERLONG
������
OVERLONG

# Overlong representation of the NUL character

��
OVERLONG
���
OVERLONG
����
OVERLONG
�����
OVERLONG
������
OVERLONG

# Illegal code positions

# Single UTF-16 surrogates

�
NOTUNICODE
d800

�
NOTUNICODE
db7f

�
NOTUNICODE
db80

�
NOTUNICODE
dbff

�
NOTUNICODE
dc00

�
NOTUNICODE
df80

�
NOTUNICODE
dfff

# Paired UTF-16 surrogates

��
NOTUNICODE
d800 dc00

��
NOTUNICODE
d800 dfff

��
NOTUNICODE
db7f dc00

��
NOTUNICODE
db7f dfff

��
NOTUNICODE
db80 dc00

��
NOTUNICODE
db80 dfff

��
NOTUNICODE
dbff dc00

��
NOTUNICODE
dbff dfff

# Other illegal code positions


NOTUNICODE
fffe


NOTUNICODE
ffff

################
#
# Some more tests, not from Markus Kuhn's file
#

# Mixed plane 0 and higher planes

ABC
VALID
41 00010000 42 10fffd 43
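
################
#
# Reading this file: a minimal sketch, not part of the original test data.
#
# The format described in the header (comment and blank lines skipped, then
# a string line, a status line, and a hex code point line only for VALID and
# NOTUNICODE) could be read roughly as below. The names Case and parse_cases
# are hypothetical, chosen for illustration. The sketch works on raw bytes
# because most test strings are deliberately not decodable as UTF-8, and it
# assumes no test string is blank or starts with '#'.
#

```python
from typing import Iterator, List, NamedTuple, Optional

KNOWN_STATUSES = {"VALID", "INCOMPLETE", "NOTUNICODE", "OVERLONG", "MALFORMED"}
# Only these statuses are followed by a line of hex code points.
HAS_CODEPOINTS = {"VALID", "NOTUNICODE"}


class Case(NamedTuple):
    """One test case (hypothetical container, not defined by the file)."""
    raw: bytes                        # the test string, exactly as stored
    status: str                       # one of KNOWN_STATUSES
    codepoints: Optional[List[int]]   # present only for VALID / NOTUNICODE


def parse_cases(data: bytes) -> Iterator[Case]:
    """Yield the test cases found in the raw bytes of a test file."""
    # Drop comment lines and blank lines, as the header specifies.
    lines = [
        line for line in data.split(b"\n")
        if line.strip() and not line.startswith(b"#")
    ]
    pos = 0
    while pos < len(lines):
        if pos + 1 >= len(lines):
            raise ValueError("test string without a status line")
        raw = lines[pos]
        status = lines[pos + 1].strip().decode("ascii", errors="replace")
        pos += 2
        if status not in KNOWN_STATUSES:
            raise ValueError(f"unknown status {status!r}")
        codepoints = None
        if status in HAS_CODEPOINTS:
            if pos >= len(lines):
                raise ValueError("missing code point line")
            codepoints = [
                int(token.decode("ascii"), 16) for token in lines[pos].split()
            ]
            pos += 1
        yield Case(raw, status, codepoints)
```

# A caller would typically read the file in binary mode, pass the bytes to
# parse_cases, and compare each case.raw against the decoder under test.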