# Copyright (C) 2016 and later: Unicode, Inc. and others.
# License & terms of use: http://www.unicode.org/copyright.html
# Copyright (c) 2001-2016 International Business Machines
# Corporation and others. All Rights Reserved.
#
# RBBI Test Data
#
# File: rbbitst.txt
#
# The format of this file looks vaguely like some kind of xml-ish markup,
# but it is NOT. The syntax is:
#
#   <word>               any following data is for word break testing
#   <sent>               any following data is for sentence break testing
#   <line>               any following data is for line break testing
#   <char>               any following data is for char break testing
#   <locale local_name>  Switch to the named locale at the next occurrence of <word>, <sent>, etc.
#   <data> ... </data>   test data. May span multiple lines.
#   <>                   Break position, status == 0
#   •                    Break position, status == 0 (Bullet, \u2022)
#   <nnn>                Break position, status == nnn
#   \                    Escape. Normal ICU unescape applied.
#   \ at end of line ->  Line Continuation. Remove both the backslash and the new line
#
# In ICU4C, this test data is run by intltest, rbbi/RBBITest/TestExtended.
# In ICU4J, this test data is run by com.ibm.icu.dev.test.rbbi.RBBITestExtended
#
# There are two copies of this file in the source repository,
#   [ICU4C] source/test/testdata/rbbitst.txt
#   [ICU4J] main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt
#
# ICU4C's copy is the master. If any changes are made to ICU4J's copy, make sure they
# are merged back into ICU4C's copy of the file, lest they get overwritten later.
# TODO: figure out how to have a single copy of the file for use by both C and Java.


# Temp debugging tests
<locale en>
<word>
<data><0>ク<400>ライアン<400>ト<400>サーバー<400></data>
# <data><0>ク<400>ライアン<400>トサーバー<400></data>

## FILTERED BREAK TESTS

# (William Bradford, public domain. http://catalog.hathitrust.org/Record/008651224 ) - edited.
<locale en>
<sent>
<data>\
•In the meantime Mr. 
•Weston arrived with his small ship, which he had now recovered. •Capt. •Gorges, who informed the Sgt. here that one purpose of his going east was to meet with Mr. •Weston, took this opportunity to call him to account for some abuses he had to lay to his charge.•</data> 50 51<locale en@ss=standard> 52<sent> 53<data>\ 54•In the meantime Mr. Weston arrived with his small ship, which he had now recovered. •Capt. Gorges, who informed the Sgt. here that one purpose of his going east was to meet with Mr. Weston, took this opportunity to call him to account for some abuses he had to lay to his charge.•</data> 55 56# This hits the case where "D." would match the end of "Ph.D.". 57<locale en@ss=standard> 58<sent> 59<data>\ 60•Doctor with a D. •As in, Ph.D., you know.•</data> 61 62# same as root (unless some exceptions are added!) 63<locale tfg@ss=standard> 64<sent> 65<data>\ 66•In the meantime Mr. •Weston arrived with his small ship, which he had now recovered. •Capt. •Gorges, who informed the Sgt. here that one purpose of his going east was to meet with Mr. •Weston, took this opportunity to call him to account for some abuses he had to lay to his charge.•</data> 67 68# same as root (unless some exceptions are added!) 69<locale ja@ss=standard> 70<sent> 71<data>\ 72•In the meantime Mr. •Weston arrived with his small ship, which he had now recovered. •Capt. •Gorges, who informed the Sgt. here that one purpose of his going east was to meet with Mr. 
•Weston, took this opportunity to call him to account for some abuses he had to lay to his charge.•</data> 73 74## END FILTERED BREAK TESTS 75 76 77######################################################################################## 78# 79# 80# G r a p h e m e C l u s t e r T e s t s 81# 82# 83########################################################################################## 84<char> 85 86<data>•a•b•c• •,•\u0666•</data> # Quick Test 87<data>•\r•\r•\r\n•\r\n•\n•\r•</data> # don't break CR/LF 88 89# Always break after controls. Combining chars don't combine with them. 90<data>•\u0003•\N{COMBINING GRAVE ACCENT}•\r•\N{COMBINING GRAVE ACCENT}•</data> 91<data>•\u0085•\N{COMBINING MACRON}•A\N{COMBINING MACRON}•</data> 92 93# Surrogates 94<data>•\U00011000•\U00010020•\U00010000\N{COMBINING MACRON}•</data> 95<data>•\ud800\udc00•\udbff\udfff•a•</data> 96 97# Extend (Combining chars) combine. 98<data>•A\N{COMBINING GRAVE ACCENT}•B•</data> 99<data>•\N{GREEK SMALL LETTER MU}\N{COMBINING LOW LINE}\N{COMBINING HORN}•</data> 100<data>•a\u0301•b\u0302•c\u0303•d\u0304•e\u0305•f\u0306•g\u0307•h\u0308•i\u0309•</data> 101 102<data>•a\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304\u0301\u0302\u0303\u0304•</data> 103 104# Don't break Hangul Syllables 105# L : \u1100 106# V : \u1161 107# T : \u11A8 108# LV : \uAC00 
# LVT : \uAC01

<data>•\u1100\u1161\u11a8•\u1100\u1161\u11a8•</data> #LVT
<data>•\u1100\u1161•\u1100\u1161•</data>
<data>•\u1100\u1161\u11a8•\u1161•\u1100•\u11a8•\u1161\u1161\u1161\u11a8•</data>
<data>•\u1100\u1100\uac01•\u1100\uac01•\u1100\uac01\u0301•\uac01•</data>
<data>•\u1100\u0301•\u1161\u11a8\u0301•\u11a8•</data>



# Hindi combining chars. (An old test)
# TODO: Update these tests for Unicode 5.1 Extended Grapheme clusters
#<data>•भ••ा•\u0930•\u0924• •\u0938\u0941\u0902•\u0926•\u0930•
#•\u0939•\u094c•\u0964•</data>
#<data>•\u0916\u0947•\u0938\u0941\u0902•\u0926•\u0930•\u0939•\u094c•\u0964•</data>


# Bug 1587. Tamil. \u0baa\u0bc1 is an Extended Grapheme Cluster
<data>•\u0baa\u0bc1•\u0baa\u0bc1•</data>

# Regression test for bug 1889
<data>•\u0f40\u0f7d•\u0000•\u0f7e•</data>


# 0xffff is a legal character, and should not stop the break iterator early.
# (Requires special casing in implementation, which is why it gets a test.)
135<data>•\uffff•\uffff• •a•</data> 136 137# Treat Japanese Half Width voicing marks as combining 138<data>•A\uff9e•B\uff9f\uff9e\uff9f•C•</data> 139 140######################################################################################## 141# 142# 143# E x t e n d e d G r a p h e m e C l u s t e r T e s t s 144# 145# 146########################################################################################## 147#<xgc> 148 149# Plain Vanilla grapheme clusters 150#<data>•a•b•c•</data> 151#<data>•a\u0301\u0302• •b\u0303\u0304•</data> 152 153# Assorted Hindi combining marks 154#<data>•\u0904\u0903• •\u0937\u093E• •\u0904\u093F• •\u0937\u0940• •\u0937\u0949• •\u0937\u094A• •\u0937\u094B• •\u0937\u094C•</data> 155 156# Thai Clusters 157# $Prepend $Extend* $PrependBase $Extend*; 158# 159#<data>•\u0e40\u0e01•\u0e44\u0301\u0e23\u0302\u0303•\u0e40•\u0e40\u0e02•\u0e02• •</data> 160 161 162######################################################################################## 163# 164# 165# W o r d B o u n d a r y T e s t s 166# 167# 168########################################################################################## 169 170<word> 171# 172# Quick sanity test 173# 174<data>•hello<200> •there<200> •goodbye<200></data> 175<data>•hello<200> •12345<100> •,•</data> 176 177 178# 179# Test data originally in RBBIAPITest::TestFirstNextFollowing() and TestLastPreviousPreceding() 180# 181 182<word> 183<data>•This<200> •is<200> •a<200> •word<200> •break<200>.• • •Isn't<200> •it<200>?• •2.25<100></data> 184 185 186 187# 188# Data originally from TestDefaultRuleBasedWordIteration() 189# 190<data>•Write<200> •wordrules<200>.• •123.456<100> •alpha\u00adbeta\u00adgamma<200> •\u092f\u0939<200> •</data> 191<data>• •\u0939\u093f\u0928\u094d\u200d\u0926\u0940<200> •\u0939\u0948<200> •\u0905\u093e\u092a<200> •\u0938\u093f\u0916\u094b\u0917\u0947<200>?•</data> 192 193#Hindi Numbers 194<data>• •\u0968\u0966.\u0969\u096f<100> •\u0967\u0966\u0966.\u0966\u0966<100> •\N{RUPEE 
SIGN}•\u0967,\u0967\u0966\u0966.\u0966\u0966<100> • •\u0905\u092e\u091c<200>\n•</data> 195 196<data>•\u0938\u094d\u200d\u0935\u0924\u0902deadTA\u0930<200>\r•It's<200> •$•30.10<100> •12,34<100>¢•£•¤•¥•alpha\u05f3beta\u05f4gamma<200> •</data> 197 198<data>•Badges<200>?• •BADGES<200>!•?•!• •We<200> •don't<200> •need<200> •no<200> •STINKING<200> •BADGES<200>!•!•1000,233,456.000<100> •1,23.322<100>%•123.1222<100>$•123,000.20<100> •179.01<100>%•X<200> •Now<200>\r•is<200>\n•the<200>\r\n•time<200> •</data> 199 200#Hangul 201<data>•\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •Hello<200>,• •how<200> •are<200> •you<200> •</data> 202 203<data>•Hello<200>,• •how<200> •are<200> •you<200> •\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •</data> 204 205# Words containing non-BMP letters 206<data>•abc\U00010300<200> •abc\N{DESERET SMALL LETTER ENG}<200> •abc\N{MATHEMATICAL BOLD SMALL Z}<200> •abc\N{MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL}<200> •</data> 207 208# Unassigned code points 209<data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data> 210 211# Hiragana & Katakana stay together, but separates from each other and Latin. 212# *** what to do about theoretical combos of chars? i.e. 
hiragana + accent 213#<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<400>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<400>\N{HIRAGANA ITERATION MARK}<400>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<400>def<200>#•</data> 214 215# test normalization/dictionary handling of halfwidth katakana: same dictionary phrase in fullwidth and halfwidth 216<data>•芽キャベツ<400>芽キャベツ<400></data> 217 218# more Japanese tests 219# TODO: some script=common characters in the Hiragana and the Katakana block may not be treated correctly 220# (was formerly true for U+30FC); need to check and fix if so. 221#<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data> 222<data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data> 223 224# Testing of word boundary for dictionary word containing both kanji and kana 225<data>•中だるみ<400>蔵王の森<400>ウ離島<400></data> 226 227# Testing of Chinese segmentation (taken from a Chinese news article) 228<data>•400<100>余<400>名<400>中央<400>委员<400>和<400>中央<400>候补<400>委员<400>都<400>领<400>到了<400>“•推荐<400>票<400>”•,•有<400>资格<400>在<400>200<100>多<400>名<400>符合<400>条件<400>的<400>63<100>岁<400>以下<400>中共<400>正<400>部<400>级<400>干部<400>中<400>,•选出<400>他们<400>属意<400>的<400>中央<400>政治局<400>委员<400>以<400>向<400>政治局<400>常委<400>会<400>举荐<400>。•</data> 229 230# Words with interior formatting characters 231<data>•def\N{COMBINING ACUTE ACCENT}\N{SYRIAC ABBREVIATION MARK}ghi<200> •</data> 232 233# to test for bug #4097779 234<data>•aa\N{COMBINING GRAVE ACCENT}a<200> •</data> 235 236# fullwidth numeric, midletter characters etc should be treated like their halfwidth counterparts 237# <data>•ISN'T<200> •19<100>日<400></data> 238# why was this added with the dbbi stuff? 
239 240# to test for bug #4098467 241# What follows is a string of Korean characters (I found it in the Yellow Pages 242# ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed 243# it correctly), first as precomposed syllables, and then as conjoining jamo. 244# Both sequences should be semantically identical and break the same way. 245# precomposed syllables... 246<data>•\uc0c1\ud56d<200> •\ud55c\uc778<200> •\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •\u110b\u1167\u11ab\u1112\u1161\u11b8<200> •\u110c\u1161\u11bc\u1105\u1169\u1100\u116d\u1112\u116c<200> •</data> 247 248# more Korean tests (Jamo not tested here, not counted as dictionary characters) 249# Disable them now because we don't include a Korean dictionary. 250#<data>•\ud55c\uad6d<200>\ub300\ud559\uad50<200>\uc790\uc5f0<200>\uacfc\ud559<200>\ub300\ud559<200>\ubb3c\ub9ac\ud559\uacfc<200></data> 251#<data>•\ud604\uc7ac<200>\ub294<200> •\uac80\ucc30<200>\uc774<200> •\ubd84\uc2dd<200>\ud68c\uacc4<200>\ubb38\uc81c<200>\ub97c<200> •\uc870\uc0ac<200>\ud560<200> •\uac00\ub2a5\uc131<200>\uc740<200> •\uc5c6\ub2e4<200>\u002e•</data> 252 253<data>•abc<200>\u4e01<400>\u4e02<400>\u3005<400>\u4e03\u4e03<400>abc<200> •</data> 254 255<data>•\u06c9<200>\uc799\ufffa•</data> 256 257 258# 259# Try some words from other scripts. 260# 261 262# Try some words from other scripts. 263# Greek, Cyrillic, Hebrew, Arabic, Arabic, Georgian, Latin 264# 265<data>•ΑΒΓ<200> •БВГ<200> •אבג֓<200> •ابت<200> •١٢٣<100> •\u10A0\u10A1\u10A2<200> •ABC<200> •</data> 266 267<data>•\u0301•A<200></data> 268 269 270# 271# Hindi word break tests, imported from the old RBBI tests. 272# An historical note: a much earlier version of ICU break iterators had a number 273# of special case rules for Hindi, which were tested by an earlier version of 274# this test data. 
# The current RBBI rules do not special case Hindi in
# any way, making this test data much less significant.
#
<data>•\u0917\u092a\u00ad\u0936\u092a<200>!•\u092f\u0939<200> •\u0939\u093f\u0928\u094d\u200d\u0926\u0940<200> •\u0939\u0948<200> •\u0905\u093e\u092a<200> •\u0938\u093f\u0916\u094b\u0917\u0947<200>?•\n•:•\u092a\u094d\u0930\u093e\u092f\u0903<200>
•\u0935\u0930\u094d\u0937\u093e<200>\r\n•\u092a\u094d\u0930\u0915\u093e\u0936<200>,•\u0924\u0941\u092e\u093e\u0930\u094b<200> •\u092e\u093f\u0924\u094d\u0930<200> •\u0915\u093e<200> •\u092a\u0924\u094d\u0930<200> •\u092a\u095d\u094b<200> •\u0938\u094d\u0924\u094d\u0930\u093f<200>.• •\u0968\u0966.\u0969\u096f<100> •\u0967\u0966\u0966.\u0966\u0966<100>\u20a8•\u0967,\u0967\u0966\u0966.\u0966\u0966<100> •\u0905\u092e\u091c<200>\n•\u0938\u094d\u200d\u0935\u0924\u0902\u0924\u094d\u0930<200>\r•</data>

#
# Failures from monkey tests
#
<data>•\u8527<400>\u02ba<200>\u0027\u0d42•\u00b7•\u09ea<100></data>

#
# Jitterbug 5276 - treat Japanese half width voicing marks as Grapheme Extend
#
<data>•A\uff9e\uff9fBC<200> •1\uff9e\uff9f23<100></data>

# User guide example:
<data>•Parlez<200>-•vous<200> •français<200> •?•</data>

# Test for #11673
<word>
<data>•ジョージア<400> •</data>

# Test for #11723
<word>
<data>•アレルギー性<400>結膜炎<400></data>
<data>•アテ<400>ローム<400>性<400>動脈硬化<400></data>

# Ticket #11996
<locale en>
<word>
<data>•栃木<400>県<400>足利<400>市<400>で<400>の<400>撮影<400>が<400>公開<400></data>
<data>•栃木<400>県<400>足利<400>市<400>で<400>の<400>撮影<400>が<400>公開<400>さ<400>れ<400>た<400></data>

# Ticket #11999
# Unhandled Break Engine was consuming all characters, not just unhandled.
# \U00011700 is AHOM LETTER KA. There is no dictionary for AHOM, triggering the unhandled engine,
# which then incorrectly also consumed the following Japanese text. 
(ICU4J only) 312<word> 313<locale en> 314<data>•ロ<400>から<400>売却<400>完了<400>時<400>の<400>時価<400>が<400>提示<400>さ<400>れ<400>て<400>いる<400></data> 315<data>•\U00011700<200>ロ<400>から<400>売却<400>完了<400>時<400>の<400>時価<400>が<400>提示<400>さ<400>れ<400>て<400>いる<400></data> 316 317# 318# What Is Unicode in Japanese 319# From http://unicode.org/standard/translations/japanese.html 320 321<locale en> 322<word> 323<data><0>ユニ<400>コード<400>と<400>は<400>何<400>か<400>?<0></data> 324<data><0>ユニ<400>コード<400>は<400>、<0>すべて<400>の<400>文字<400>に<400>固有<400>の<400>番号<400>を<400>付与<400>し<400>ます<400></data> 325<data><0>プラットフォーム<400>に<400>は<400>依存<400>しま<400>せん<400></data> 326<data><0>プログラム<400>に<400>も<400>依存<400>しま<400>せん<400></data> 327<data><0>言語<400>に<400>も<400>依存<400>しま<400>せん<400></data> 328 329<data><0>コンピューター<400>は<400>、<0>本質<400>的<400>に<400>は<400>数字<400>しか<400>扱う<400>こと<400>が<400>でき<400>ま<400>せん<400>。<0>\ 330コンピューター<400>は<400>、<0>文字<400>や<400>記号<400>など<400>の<400>それぞれに<400>番号<400>を<400>割り振る<400>こと<400>によって<400>扱える<400>\ 331よう<400>にし<400>ます<400>。<0>ユニ<400>コード<400>が<400>出来る<400>まで<400>は<400>、<0>これらの<400>番号<400>を<400>割り振る<400>仕組み<400>が<400>\ 332何<400>百<400>種類<400>も<400>存在<400>しま<400>した<400>。<0>どの<400>一つ<400>を<400>とっても<400>、<0>十分<400>な<400>文字<400>を<400>含<400>\ 333んで<400>は<400>いま<400>せん<400>で<400>した<400>。<0>例えば<400>、<0>欧州<400>連合<400>一つ<400>を<400>見<400>て<400>も<400>、<0>その<400>\ 334すべて<400>の<400>言語<400>を<400>カバー<400>する<400>ため<400>に<400>は<400>、<0>いくつか<400>の<400>異なる<400>符号<400>化<400>の<400>仕組み<400>\ 335が<400>必要<400>で<400>した<400>。<0>英語<400>の<400>よう<400>な<400>一つ<400>の<400>言語<400>に<400>限<400>って<400>も<400>、<0>一つ<400>だけ<400>\ 336の<400>符号<400>化<400>の<400>仕組み<400>では<400>、<0>一般<400>的<400>に<400>使<400>われる<400>すべて<400>の<400>文字<400>、<0>句読点<400>、<0>\ 337技術<400>的<400>な<400>記号<400>など<400>を<400>扱う<400>に<400>は<400>不十分<400>で<400>した<400>。<0></data> 338 339<data><0>これらの<400>符号<400>化<400>の<400>仕組み<400>は<400>、<0>相互<400>に<400>矛盾<400>する<400>もの<400>でも<400>ありま<400>した<400>。<0>\ 
340二つ<400>の<400>異なる<400>符号<400>化<400>の<400>仕組み<400>が<400>、<0>二つ<400>の<400>異なる<400>文字<400>に<400>同一<400>の<400>番号<400>\ 341を<400>付ける<400>こと<400>も<400>できる<400>し<400>、<0>同じ<400>文字<400>に<400>異なる<400>番号<400>を<400>付ける<400>こと<400>も<400>できる<400>\ 342の<400>です<400>。<0>どの<400>よう<400>な<400>コンピューター<400>も<400>(<0>特に<400>サーバー<400>は<400>)<0>多く<400>の<400>異<400>な<400>っ<400>\ 343た<400>符号<400>化<400>の<400>仕組み<400>を<400>サポート<400>する<400>必要<400>が<400>あり<400>ます<400>。<0>たとえ<400>データ<400>が<400>異なる<400>\ 344符号<400>化<400>の<400>仕組み<400>や<400>プラットフォーム<400>を<400>通過<400>し<400>て<400>も<400>、<0>いつ<400>どこ<400>で<400>データ<400>が<400>\ 345乱れる<400>か<400>分<400>から<400>ない<400>危険<400>を<400>冒す<400>こと<400>の<400>なる<400>の<400>です<400>。<0></data> 346 347<data><0>ユニ<400>コード<400>は<400>すべて<400>を<400>変<400>え<400>ます<400></data> 348 349<data><0>ユニ<400>コード<400>は<400>、<0>プラットフォーム<400>に<400>係<400>わら<400>ず<400>、<0>プログラム<400>に<400>係<400>わら<400>ず<400>、<0>\ 350言語<400>に<400>係<400>わら<400>ず<400>、<0>すべて<400>の<400>文字<400>に<400>独立<400>した<400>番号<400>を<400>与<400>え<400>ます<400>。<0>\ 351ユニ<400>コード<400>標準<400>は<400>、<0>アップル<400>、<0>ヒュー<400>レット<400>パッ<400>カード<400>、<0>IBM<200>、<0>ジャスト<400>システム<400>\ 352、<0>マイクロ<400>ソフト<400>、<0>オラクル<400>、<0>SAP<200>、<0>サン<400>、<0>サイ<400>ベース<400>など<400>の<400>産業<400>界<400>の<400>\ 353主導<400>的<400>企業<400>と<400>他の<400>多く<400>の<400>企業<400>に<400>採用<400>さ<400>れ<400>てい<400>ます<400>。<0>ユニ<400>コード<400>\ 354は<400>、<0>XML<200>、<0>Java<200>、<0>ECMAScript<200>(<0>JavaScript<200>)<0>、<0>LDAP<200>、<0>CORBA<200> <0>3.0<100>など<400>\ 355の<400>最先端<400>の<400>標準<400>の<400>前提<400>と<400>な<400>って<400>おり<400>、<0>ユニ<400>コード<400>を<400>実装<400>す<400>れ<400>ば<400>\ 356、<0>ISO<200>/<0>IEC<200></data> 357<data><0> <0>10646<100>に<400>適合<400>する<400>ことに<400>なり<400>ます<400>。<0>ユニ<400>コード<400>は<400>、<0>多く<400>の<400>\ 358オペレーティングシステム<400>と<400>すべて<400>の<400>最新<400>の<400>ブラウザー<400>と<400>他の<400>多く<400>の<400>製品<400>で<400>サポート<400>\ 359さ<400>れ<400>てい<400>ます<400>。<0>ユニ<400>コード<400>標準<400>の<400>出現<400>と<400>ユニ<400>コード<400>を<400>サポート<400>する<400>\ 
360ツール<400>類<400>は<400>、<0>昨今<400>顕著<400>に<400>な<400>って<400>いる<400>ソフトウエア<400>技術<400>の<400>グローバル<400>化<400>の<400>\ 361流れ<400>に対して<400>、<0>特に<400>役<400>に<400>立<400>って<400>い<400>ます<400>。<0></data> 362 363<data><0>ユニ<400>コード<400>を<400>ク<400>ライアン<400>ト<400>サーバー<400>型<400>の<400>アプリケーション<400>や<400>、<0>多層<400>構造<400>\ 364を<400>持つ<400>アプリケーション<400>、<0>ウェブサイト<400>など<400>に<400>に<400>組み込む<400>こと<400>で<400>、<0>従来<400>の<400>文字<400>\ 365コードセット<400>を<400>用いる<400>より<400>も<400>明らか<400>な<400>コスト<400>削減<400>が<400>可能<400>です<400>。<0>ユニ<400>コード<400>は<400>\ 366、<0>単一<400>の<400>ソフトウエア<400>製品<400>、<0>単一<400>の<400>ウェブサイト<400>に<400>、<0>何ら<400>手<400>を<400>加える<400>こと<400>なく<400>\ 367、<0>複数<400>の<400>プラットフォーム<400>、<0>複数<400>の<400>言語<400>、<0>複数<400>の<400>国<400>を<400>カバー<400>する<400>こと<400>が<400>\ 368出来る<400>の<400>です<400>。<0>ユニ<400>コード<400>は<400>、<0>データ<400>が<400>多く<400>の<400>異なる<400>システム<400>の<400>間<400>を<400>、<0>\ 369何<400>の<400>乱れ<400>も<400>なし<400>に<400>転送<400>する<400>こと<400>を<400>可能<400>と<400>する<400>の<400>です<400>。<0></data> 370 371<data><0>ユニ<400>コード<400>コンソーシアム<400>について<400></data> 372 373<data><0>ユニ<400>コード<400>コンソーシアム<400>は<400>、<0>最新<400>の<400>ソフトウエア<400>製品<400>と<400>標準<400>において<400>テキスト<400>\ 374を<400>表現<400>する<400>こと<400>を<400>意味<400>する<400>“<0>ユニ<400>コード<400>標準<400>”<0>の<400>構築<400>、<0>発展<400>、<0>普及<400>、<0>\ 375利用<400>促進<400>を<400>目的<400>として<400>設立<400>さ<400>れ<400>た<400>非<400>営利<400>組織<400>です<400>。<0>同<400>コンソーシアム<400>\ 376の<400>会員<400>は<400>、<0>コンピューター<400>と<400>情報処理<400>に<400>係わる<400>広汎<400>な<400>企業<400>や<400>組織<400>から<400>構成<400>\ 377さ<400>れ<400>てい<400>ます<400>。<0>同<400>コンソーシアム<400>は<400>、<0>財政<400>的<400>に<400>は<400>、<0>純粋<400>に<400>会費<400>のみ<400>\ 378によって<400>運営<400>さ<400>れ<400>てい<400>ます<400>。<0>ユニ<400>コード<400>標準<400>を<400>支持<400>し<400>、<0>その<400>拡張<400>と<400>\ 379実装<400>を<400>支援<400>する<400>世界中<400>の<400>組織<400>や<400>個人<400>は<400>、<0>だれ<400>も<400>が<400>ユニ<400>コード<400>\ 380コンソーシアム<400>の<400>会員<400>なる<400>こと<400>が<400>でき<400>ます<400>。<0></data> 381 
382<data><0>より<400>詳しい<400>こと<400>を<400>お<400>知<400>り<400>に<400>なり<400>たい<400>方<400>は<400>、<0>Glossary<200>,<0> <0>\ 383Technical<200> <0>Introduction<200> <0>および<400> <0>Useful<200> <0>Resources<200>を<400>ご<400>参照<400>くだ<400>さい<400>。<0></data> 384 385 386######################################################################################## 387# 388# 389# S e n t e n c e B o u n d a r y T e s t s 390# 391# 392########################################################################################## 393 394 395# 396# Test data originally from RBBI RBBITest::TestDefaultRuleBasedSentenceIteration() 397# 398<sent> 399 400 401<sent> 402<data>•This\n<100></data> 403<data>•Hello! •how are you? •I'am fine. •Thankyou. •How are you \ 404doing? •This\n<100> costs $20,00,000. •</data> 405 406 407# Sentence ending in a quote. 408<data>•"Sentence ending with a quote." •Bye.•</data> 409 410# Sentence, and test data, ending without a period or other terminator. 411<data>•Here is a random sentence, no ending period<100></data> 412 413 414<data>• (This is it). •Testing the sentence iterator. •\ 415"This isn't it." •Hi! \ 416•This is a simple sample sentence. •(This is it.) •This is a simple sample sentence. •\ 417"This isn't it." •\ 418Hi! •This is a simple sample sentence. •It does not have to make any sense as you can see. •Nel mezzo del cammin di nostra vita, mi ritrovai in una selva oscura. •Che la dritta via aveo smarrita. •He said, that I said, that you said!! •Don't rock the boat.\u2029•Because I am the daddy, that is why. 419•Not on my time (el timo.)! •</data> 420 421<data>•Hello. •So what!!\u2029•"But now," he said, \ 422"I know!" •\ 423Harris thumbed down several, including "Away We Go" (which became the huge success Oklahoma!). •One species, B. anthracis, is highly virulent. 424•Wolf said about Sounder:\ 425"Beautifully thought-out and directed." •\ 426Have you ever said, "This is where\tI shall live"? •He answered, \ 427"You may not!" 
•Another popular saying is: "How do you do?". \n•\ 428Yet another popular saying is: \ 429'I'm fine thanks.' •\ 430What is the proper use of the abbreviation pp.? •Yes, I am definatelly 12" tall!!\ 431•Now\r<100>is\n<100>the\r\n<100>time\n<100>\r<100>for\r<100>\r<100></data> 432 433<data>•No breaks when . is surrounded by UPPER.Case letters. •</data> 434<data>•No breaks when . is followed by Numeric .4 a.4 C.4 3.1 .•</data> 435<data>•No breaks when . is followed by a lower, with possible intervening punct .,a .$a .)a. •</data> 436 437# 438# Sentence Breaks: no break at the boundary between CJK and other letters 439# 440<data>•\u5487\u67ff\ue591\u5017\u61b3\u60a1\u9510\u8165:"JAVA\u821c\u8165\u7fc8\u51ce\u306d,\u2494\u56d8\u4ec0\u60b1\u8560\u51ba\u611d\u57b6\u2510\u5d46".\u2029•\u5487\u67ff\ue591\u5017\u61b3\u60a1\u9510\u8165\u9de8\u97e4JAVA\u821c\u8165\u7fc8\u51ce\u306d\ue30b\u2494\u56d8\u4ec0\u60b1\u8560\u51ba\u611d\u57b6\u2510\u5d46\u97e5\u7751\u3002•\u5487\u67ff\ue591\u5017\u61b3\u60a1\u9510\u8165\u9de8\u97e4\u6470\u8790JAVA\u821c\u8165\u7fc8\u51ce\u306d\ue30b\u2494\u56d8\u4ec0\u60b1\u8560\u51ba\u611d\u57b6\u2510\u5d46\u97e5\u7751\u2048•He said, "I can go there."\u2029•Bye, now.•</data> 441 442# 443# Treat fullwidth variants of .!? the same as their 444# normal counterparts 445# 446<data>•I know I'm right\uff0e •Right\uff1f •Right\uff01 •</data> 447 448 449# 450# Don't break sentences at boundary between CJK and digits 451# 452<data>•\u5487\u67ff\ue591\u5017\u61b3\u60a1\u9510\u8165\u9de8\u97e48888\u821c\u8165\u7fc8\u51ce\u306d\ue30b\u2494\u56d8\u4ec0\u60b1\u8560\u51ba\u611d\u57b6\u2510\u5d46\u97e5\u7751\u3002•Bye, now<100></data> 453 454# 455# Breaks around '(' following a sentence TERM. (Rule 9) 456# 457<data>•How do you do?(•Fine). •</data> 458<data>•How do you do? •(Fine). •</data> 459<data>•How do you do?(•fine). •</data> 460<data>•How do you do? •(fine). 
•</data> 461 462# 463<data>•Hello.123<100></data> # Rule 6 464<data>•Hello?•123<100></data> 465 466<data>•HELLO.Bye<100></data> # Rule 7 467<data>•HELLO?•Bye<100></data> 468 469<data>•Hello.goodbye<100></data> #Rule 8 470<data>•Hello. •Goodbye<100></data> 471<data>•Hello. goodbye<100></data> 472 473 474 475# 476# test for bug #4158381: No breaks when there are no terminators around 477# 478<data>•\<P>Provides a set of "lightweight" (all-java\<FONT SIZE="-2">\<SUP>TM\</SUP>\</FONT> language) components that, to the maximum degree possible, work the same on all platforms. •</data> 479<data>•Another test.\u2029•</data> 480 481# test for bug #4143071: Make sure sentences that end with digits 482# work right 483# 484<data>•Today is the 27th of May, 1998. •</data> 485<data>•Tomorrow with be 28 May 1998. •</data> 486<data>•The day after will be the 30th.\u2029•</data> 487 488# test for bug #4152416: Make sure sentences ending with a capital 489# letter are treated correctly 490# 491<data>•The type of all primitive \<code>boolean\</code> values accessed in the target VM. •Calls to xxx will return an implementor of this interface. \u2029•</data> 492 493# test for bug #4152117: Make sure sentence breaking is handling 494# punctuation correctly [COULD NOT REPRODUCE THIS BUG, BUT TEST IS 495# HERE TO MAKE SURE IT DOESN'T CROP UP] 496# 497<data>•Constructs a randomly generated BigInteger, uniformly distributed over the range \<tt>0\</tt> to \<tt>(2\<sup>numBits\</sup> - 1\)\</tt>, inclusive. •The uniformity of the distribution assumes that a fair source of random bits is provided in \<tt>rnd\</tt>. •Note that this constructor always constructs a non-negative biginteger. \n•Ahh abc. 498•</data> 499 500# sentence breaks for hindi which used Devanagari script 501# make sure there is sentence break after ?,danda(hindi phrase separator), 502# fullstop followed by space. 
(VERY old test) 503# 504<data>•\u0928\u092e\u0938\u094d\u200d\u0924\u0947 \u0930\u092e\u0947\u0936\u0905\u093e\u092a\u0915\u0948\u0938\u0947 \u0939\u0948?•\u092e\u0948 \u0905\u091a\u094d\u200d \u091b\u093e \u0939\u0942\u0901\u0964 •\u0905\u093e\u092a\r\n<100>\ 505\u0915\u0948\u0938\u0947 \u0939\u0948?•\u0935\u0939 \u0915\u094d\u200d\u092f\u093e\n\ 506<100>\u0939\u0948?•\u092f\u0939 \u0905\u093e\u092e \u0939\u0948. •\u092f\u0939 means "this". •"\u092a\u095d\u093e\u0908" meaning "education" or "studies". •\u0905\u093e\u091c(\u0938\u094d\u200d\u0935\u0924\u0902\u0924\u094d\u0930 \u0926\u093f\u0935\u093e\u0938) \u0939\u0948\u0964 •Let's end here. •</data> 507 508# Regression test for bug #1984, Sentence break in Arabic text. 509 510<data>\ 511•\u0623\u0633\u0627\u0633\u064b\u0627\u060c\u0020\u062a\u062a\u0639\u0627"\u0645\u0644\u0020\u0627\u0644\u062d\u0648\u0627\u0633\u064a\u0628\u0020"\u0641\u0642\u0637\u0020\u0645\u0639\u0020\u0627\u0644\u0623\u0631\u0642\u0627\u0645\u060c\u0648\u062a\u0642\u0648\u0645\u0020\u0628\u062a\u062e\u0632\u064a\u0646\u0020\u0627\u0644\u0623\u062d\u0631\u0641\u0020\u0648\u0627\u0644\u0645\u062d\u0627\u0631\u0641\u0020\u0627\u0644\u0623\u062e\u0631\u0649\u0020\u0628\u0639\u062f\u0020\u0623\u0646\u062a\u064f\u0639\u0637\u064a\u0020\u0631\u0642\u0645\u0627\u0020\u0645\u0639\u064a\u0646\u0627\u0020\u0644\u0643\u0644\u0020\u0648\u0627\u062d\u062f\u0020\u0645\u0646\u0647\u0627\u002e\u0020•\u0648\u0642\u0628\u0644\u0020\u0627\u062e\u062a\u0631\u0627\u0639\u0022\u064a\u0648\u0646\u0650\u0643\u0648\u062f\u0022\u060c\u0020\u0643\u0627\u0646\u0020\u0647\u0646\u0627\u0643\u0020\u0645\u0626\u0627\u062a\u0020\u0627\u0644\u0623\u0646\u0638\u0645\u0629\u0020\u0644\u0644\u062a\u0634\u0641\u064a\u0631\u0648\u062a\u062e\u0635\u064a\u0635\u0020\u0647\u0630\u0647\u0020\u0627\u0644\u0623\u0631\u0642\u0627\u0645\u0020\u0644\u0644\u0645\u062d\u0627\u0631\u0641\u060c\u0020\u0648\u0644\u0645\u0020\u064a\u0648\u062c\u062f\u0020\u0646\u0638\u0627\u0645\u062a\u0634\u064
1\u064a\u0020\u0639\u0644\u0649\u0020\u062c\u0645\u064a\u0639\u0020\u0627\u0644\u0645\u062d\u0627\u0631\u0641\u0020\u0627\u0644\u0636\u0631\u0648\u0631\u064a\u0629. •</data> 512 513# Try a few more of the less common sentence endings. 514<data>•Hello, world\u3002 •Hello, world\u1803 •Hello, world\u2048 •Hello, world\u203c •Let's end here. •</data> 515 516 517 518 519################################################################ 520# 521# 522# L I N E B R E A K 523# 524# 525################################################################ 526 527<line> 528# 529# Test Character for each of the line break classes. 530# 531# 00A1;AI # INVERTED EXCLAMATION MARK ¡ 532# 0041;AL # LATIN CAPITAL LETTER A 533# 0009;BA # <control> 534# 00B4;BB # ACUTE ACCENT 535# 000C;BK # <control> 536# 2014;B2 # EM DASH 537# FFFC;CB # OBJECT REPLACEMENT CHARACTER 538# 0029;CL # RIGHT PARENTHESIS 539# 0301;CM # COMBINING ACUTE ACCENT 540# 0021;EX # EXCLAMATION MARK 541# 00A0;GL # NO-BREAK SPACE 542# 002D;HY # HYPHEN-MINUS 543# 4E00;ID # <CJK Ideograph, First> 544# 2024;IN # ONE DOT LEADER 545# 002C;IS # COMMA 546# 000A;LF # <control> 547# 0E5A;NS # THAI CHARACTER ANGKHANKHU 548# 0032;NU # DIGIT TWO 549# 0028;OP # LEFT PARENTHESIS 550# 0025;PO # PERCENT SIGN 551# 0024;PR # DOLLAR SIGN 552# 0022;QU # QUOTATION MARK 553# 0E01;SA # THAI CHARACTER KO KAI 554# DB7F;SG # Surrogate 555# 0020;SP # SPACE 556# 002F;SY # SOLIDUS / 557# F8FF;XX # Private Use 558# 200B;ZW # ZERO WIDTH SPACE 559 560 561# 2b Always break at end of text 562 563<data>• •\u00A1•</data> 564<data>• •\u0041•</data> 565<data>• •\u0009•</data> 566<data>• •\u00B4•</data> 567<data>• \u000C<100></data> # LB3C × BK 568<data>• •\u2014•</data> 569<data>• •\uFFFC•</data> 570<data>• \u0029•</data> # LB 8 × CL 571# <data>• • \u0301•</data> # LB 7a Treat SP CM* as if it were ID #TODO: SP CM 572<data>• \u0021•</data> # LB 8 × EX 573#<data>• \u00A0•</data> # LB 11b × GL TODO: fix. 
<data>• •\u002D•</data>
<data>• •\u4E00•</data>
<data>• •\u2024•</data>
<data>• \u002C•</data> # LB 8 × IS
<data>• \u000A<100></data> # LB3C × ( BK | CR | LF | NL )
<data>• •\u0E5A•</data>
<data>• •\u0032•</data>
<data>• •\u0028•</data>
<data>• •\u0025•</data>
<data>• •\u0024•</data>
<data>• •\u0022•</data>
<data>• •\u0E01•</data>
<data>• •\uDB7F•</data>
<data>• \u0020•</data> # LB4 - don't break before space.
<data>• \u002F•</data> # LB 8 × SY
<data>• •\uF8FF•</data>
<data>• \u200B•</data> # LB4 - don't break before ZW


# 3a Always break after hard line breaks.
# 3c Never break before hard line breaks.

<data>• •\u00A1\u2028<100>\u00A1•</data>
<data>• •\u0041\u2028<100>\u0041•</data>
<data>• •\u0009\u2028<100>\u0009•</data>
<data>• •\u00B4\u2028<100>\u00B4•</data>
<data>• \u000C<100>\u2028<100>\u000C<100></data>
<data>• •\u2014\u2028<100>\u2014•</data>
<data>• •\uFFFC\u2028<100>\uFFFC•</data>
<data>• \u0029\u2028<100>\u0029•</data>
#<data>• \u0301\u2028<100>\u0301•</data> # TODO: fix.
<data>• \u0021\u2028<100>\u0021•</data>
#<data>• \u00A0\u2028<100>\u00A0•</data> # TODO: fix
<data>• •\u002D\u2028<100>\u002D•</data>
<data>• •\u4E00\u2028<100>\u4E00•</data>
<data>• •\u2024\u2028<100>\u2024•</data>
<data>• \u002C\u2028<100>\u002C•</data>
<data>• \u000A<100>\u2028<100>\u000A<100></data>
<data>• •\u0E5A\u2028<100>\u0E5A•</data>
<data>• •\u0032\u2028<100>\u0032•</data>
<data>• •\u0028\u2028<100>\u0028•</data>
<data>• •\u0025\u2028<100>\u0025•</data>
<data>• •\u0024\u2028<100>\u0024•</data>
<data>• •\u0022\u2028<100>\u0022•</data>
<data>• •\u0E01\u2028<100>\u0E01•</data>
<data>• •\uDB7F\u2028<100>\uDB7F•</data>
<data>• \u0020\u2028<100>\u0020•</data>
<data>• \u002F\u2028<100>\u002F•</data>
<data>• •\uF8FF\u2028<100>\uF8FF•</data>
<data>• \u200B\u2028<100>\u200B•</data>

# Regional Indicator sequences. 
# They group in pairs. The reverse rules are tricky.
# Sequences are long enough that the non-exhaustive monkey test won't reliably pick up problems.

<data>•\U0001F1E6\U0001F1E6•\U0001F1E6\U0001F1E6•\U0001F1E6\U0001F1E6•\U0001F1E6\U0001F1E6•</data>
<data>•\U0001F1E6\U0001F1E6•\U0001F1E6\U0001F1E6•\U0001F1E6\U0001F1E6•\U0001F1E6\U0001F1E6•\U0001F1E6•</data>

<data>•\U0001F1E6\U0001F1E6•\U0001F1E6\U0001F1E6\u00a0\U0001F1E6\U0001F1E6•\U0001F1E6\U0001F1E6•</data>
<data>•\U0001F1E6\U0001F1E6•\U0001F1E6\U0001F1E6\u00a0\U0001F1E6\U0001F1E6•\U0001F1E6\U0001F1E6•\U0001F1E6•</data>
<data>•\U0001F1E6\U0001F1E6•\U0001F1E6\u00a0\U0001F1E6\U0001F1E6•\U0001F1E6\U0001F1E6•</data>
<data>•\U0001F1E6\U0001F1E6•\U0001F1E6\u00a0\U0001F1E6\U0001F1E6•\U0001F1E6\U0001F1E6•\U0001F1E6•</data>


# User Guide example

<data>•Parlez-•vous •français ?•</data>

#
# Old Line Break Test data. Originally located in RBBITest::TestDefaultRuleBasedLineIteration()
#

<line>

<data>•Multi-•Level •example •of •a •semi-•idiotic •non-•sensical •(non-•important) •sentence.
<100>Hi •Hello •How\n<100>are\r<100>you\u2028<100>fine.\t•good. •Now\r<100>is\n<100>the\r\n<100>time\n<100>\r<100>for\r<100>\r<100>all•</data>

<line>
<data>•Hello! •how\r\n<100> •(are)\r<100> •you? •I'am •fine- •Thankyou. •foo\u00a0bar
<100>How, •are, •you? 
•This, •costs •$20,00,000.•</data>

# test for bug #4068133
#
<data>•\u96f6•\u4e00\u3002•\u4e8c\u3001•\u4e09\u3002\u3001•\u56db\u3001\u3002\u3001•\u4e94,•\u516d.•\u4e03.\u3001,\u3002•\u516b•</data>

# to test for bug #4086052
<data>•foo\u00a0bar•</data>

# to test for bug #4097920
<data>•dog,cat,mouse •(one)•(two)\n<100></data>

# to test for bug #4035266
<data>•The •balance •is •$-23,456.78, •not •-•$32,456.78!\n<100></data>


# to test for bug #4098467
# What follows is a string of Korean characters (I found it in the Yellow Pages
# ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed
# it correctly), first as precomposed syllables, and then as conjoining jamo.
# Both sequences should be semantically identical and break the same way.
# precomposed syllables... (I == Rich Gillam?)
#
<data>•\uc0c1•\ud56d •\ud55c•\uc778 •\uc5f0•\ud569 •\uc7a5•\ub85c•\uad50•\ud68c•</data>

# conjoining jamo...
<data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u11ab •\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u1100\u116d•\u1112\u116c•</data>

# to test for bug #4117554: Fullwidth .!? should be treated as postJwrd
<data>•\u4e01\uff0e•\u4e02\uff01•\u4e03\uff1f•</data>

# Surrogate line break tests.
#
<data>•\u4e01•\ud840\udc01•\u4e02•abc •\ue000 •\udb80\udc01•</data> #This line and the following are equivalent.
<data>•\u4e01•\U00020001•\u4e02•abc •\ue000 •\U000f0001•</data>

# Regression for bug 836
# Note: Unicode 5.1 changed this behavior
# Unicode 5.2 changed it again, there is no break following the '('
<data>•AAA(AAA •</data>

# Try some words from other scripts.
694# Greek, Cyrillic, Hebrew, Arabic, Arabic, Georgian, Latin 695# 696<data>•ΑΒΓ •БВГ •אבג֓ •ابت •١٢٣ •\u10A0\u10A1\u10A2 •ABC •</data> 697 698# 699# ticket #4853: unpaired surrogates should behave like AL 700# 701<data>•abc\ud801xyz•</data> 702 703# 704# Regression tests for failures that originally came from the monkey test. 705# Monkey test failure lines can, with slight reformatting, be copied into this section 706# as test cases. The error display from here is more informative. 707# 708<data>•\ufffc•\u30e3\u000c<100>\u1b39\u300a\u002f\u203a\u200b•\ufffc•\uaf64•\udcfb•</data> 709<data>•\u114d\u31f3•\ube44\u002d•\u0362\u24e2\u276e\u2014\u205f\ufe16•\uc877•\u0fd0\u000a<100>\u20a3•</data> 710<data>•\u080a\u215b\U0001d7d3\u002c•\u2025\U000e012e•\u02df\u118d\u0029\ua8d6\u0085<100>\u6cc4\u2024\u202f\ufffc•</data> 711 712# Test for #10176 (in root) 713<line> 714<data>•abc/•s •def•</data> 715<data>•abc/\u05D9 •def•</data> 716<data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data> 717<data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data> 718 719# Ticket #11556 don't break "R$" or "JP¥" 720<locale en> 721<line> 722<data>•R$ •JP¥ •a9 •3a •H% •CA$ •Travi$ •Scott •Ke$ha •Curren$y •A$AP •Rocky•</data> 723 724 725 726######################################################################################## 727# 728# 729# T i t l e B o u n d a r y T e s t s 730# 731# 732########################################################################################## 733<title> 734<data>•Here •is •a •short •sample •sentence. •And •another.•</data> 735<data>•HERE •IS •A •SHORT •SAMPLE •SENTENCE. 
•AND •ANOTHER.•</data>
<data>• •Start •and •end •with •spaces •</data>
<data>•Include 123 456 ^& •some 54332 •numbers 4445•abc123•abc •ending 1223 •</data>

<data>•Combining\u0301 \u0301•ma\u0306rks •bye •</data>
<data>•123 •Start •with •a •number.•</data>

<data>•'•start •with •a •case-•ignorable •cha'r'a'cter•</data>
<data>•' '' •start •with •case-•ignorable & •case-•insensitive •cha'r'a'cter•</data>
<data>• ''•aaa' •bbb '•ccc' '•ddd''' '''•eee '''•fff''' •ggg ''•</data>
# Note: apostrophe is case-ignorable. space is not cased.

##########################################################################################
#
# Thai Tests
#
##########################################################################################
<locale th>
<word>
#
# Test data originally from the test code source file
# // @suwit -- Thai sample data from GVT Guideline
#
<data>•\u0E2B\u0E19\u0E36\u0E48\u0E07<200>\u0E04\u0E33<200>\u0E44\u0E17\u0E22<200>\
\u0E2A\u0E32\u0E21\u0E32\u0E23\u0E16<200>\u0E1B\u0E23\u0E30\u0E01\u0E2D\u0E1A<200>\
\u0E14\u0E49\u0E27\u0E22<200>\u0e2b\u0e25\u0e32\u0e22<200>\
\u0e1e\u0e22\u0e32\u0e07\u0e04\u0e4c<200></data>

# Test data originally from http://bugs.icu-project.org/trac/search?q=r30327
<data>•กู<200> •กิน<200>กุ้ง<200> •ปิ้่<200>งอ<200>ยู่<200>ใน<200>ถ้ำ<200></data>

<data>•\u0E01\u0E39<200>\u0020•\u0E01\u0E34\u0E19<200>\u0E01\u0E38\u0E49\u0E07<200>\
\u0020•\u0E1B\u0E34\u0E49\u0E48<200>\u0E07\u0E2D<200>\u0E22\u0E39\u0E48<200>\
\u0E43\u0E19<200>\u0E16\u0E49\u0E33<200></data>

<line>
<data>•\u0E01\u0E39\u0020•\u0E01\u0E34\u0E19•\u0E01\u0E38\u0E49\u0E07\
\u0020•\u0E1B\u0E34\u0E49\u0E48•\u0E07\u0E2D•\u0E22\u0E39\u0E48•\
\u0E43\u0E19•\u0E16\u0E49\u0E33•</data>

# Data originally from intltest RBBITest::TestThaiLineBreak()
#
# \u0e2f-- the Thai paiyannoi character-- isn't a letter. 
# It's a symbol that
# represents elided letters at the end of a long word. It should be bound to
# the end of the word and not treated as an independent punctuation mark.
#
# the one time where the paiyannoi occurs somewhere other than at the end
# of a word is in the Thai abbreviation for "etc.", which both begins and
# ends with a paiyannoi
#
<line>
<data>•\u0e2a\u0e16\u0e32\u0e19\u0e35\u0e2f•\
\u0e08\u0e30•\
\u0e23\u0e30\u0e14\u0e21•\
\u0e40\u0e08\u0e49\u0e32•\
\u0e2b\u0e19\u0e49\u0e32\u0e17\u0e35\u0e48•\
\u0e2d\u0e2d\u0e01•\
\u0e21\u0e32•\
\u0e40\u0e23\u0e48\u0e07•\
\u0e23\u0e30\u0e1a\u0e32\u0e22•\
\u0e2d\u0e22\u0e48\u0e32\u0e07•\
\u0e40\u0e15\u0e47\u0e21•\
\u0e2f\u0e25\u0e2f•\
\u0e17\u0e35\u0e48•\
\u0e19\u0e31\u0e49\u0e19•</data>

# Data originally from RBBITest::TestMixedThaiLineBreak()
# @suwit -- Test Arabic numerals, Thai numerals, Punctuation and English characters start
#
<line>
<data>•\u0E1B\u0E35•\
\u0E1E\u0E38\u0E17\u0E18\u0E28\u0E31\u0E01\u0E23\u0E32\u0E0A •\
2545 •\
\u0E40\u0E1B\u0E47\u0E19•\
\u0E1B\u0E35•\
\u0E09\u0E25\u0E2D\u0E07•\
\u0E04\u0E23\u0E1A•\
\u0E23\u0E2D\u0E1A •\
\"\u0E52\u0E52\u0E50 •\
\u0E1b\u0E35\" •\
\u0E02\u0E2d\u0E07•\
\u0E01\u0E23\u0E38\u0E07•\
\u0E23\u0E31\u0E15\u0E19\u0E42\u0E01\u0E2A\u0E34\u0E19\u0E17\u0E23\u0E4C •\
(\u0E01\u0E23\u0E38\u0E07\u0E40\u0E17\u0E1e\u0E2F•\
\u0E2B\u0E23\u0E37\u0E2D •\
Bangkok)•</data>

# Data originally from RBBITest::TestMaiyamok()
# The Thai maiyamok character is a shorthand symbol that means "repeat the previous
# word". Instead of appearing as a word unto itself, however, it's kept together
# with the word before it.
826# 827<line> 828<data>•\u0e44\u0e1b\u0e46•\ 829\u0e21\u0e32\u0e46•\ 830\u0e23\u0e30\u0e2b\u0e27\u0e48\u0e32\u0e07•\ 831\u0e01\u0e23\u0e38\u0e07•\ 832\u0e40\u0e17\u0e1e•\ 833\u0e41\u0e25\u0e30•\ 834\u0e40\u0e03\u0e35•\ 835\u0e22\u0e07•\ 836\u0e43\u0e2b\u0e21\u0e48•</data> 837 838# Test for #10296 839<line> 840<data>•ใช•มั้ย•</data> 841<data>•มั๊ยล่ะ•ที่รัก•</data> 842 843# Test for #10593 844<line> 845<data>•เล่น•ผ่าน•ทาง•บลูทูธ•บน•อุปกรณ์•</data> 846 847# Test for city names #10691 848<line> 849<data>•ไป•ที่•ซานฟรานซิสโก•</data> 850 851# Test for #10630, #10631 852<line> 853<data>•แท็ก•แอปพลิเคชัน•เป็น•พิเศษ•</data> 854 855# Test for #11019 856<line> 857<data>•เบ•เบราว์เซอร์•โพ•โพสต์•โพสท์•</data> 858 859# Test for #11688 860<line> 861<data>•อัปเดต•อีเวนต์•</data> 862 863########################################################################################## 864# 865# Lao Tests 866# 867########################################################################################## 868<locale en> 869# Basic check for #7647 870<line> 871<data>•ສະບາຍດີ•</data> 872<data>•ດີ•ຂອບໃຈ•</data> 873<data>•ເຈົ້າ•ເວົ້າ•ພາສາ•ອັງກິດ•ໄດ້•ບໍ່•</data> 874<data>•ກະລຸນາ•ເວົ້າ•ຊ້າ•ໆ•</data> 875 876########################################################################################## 877# 878# Burmese/Myanmar Tests 879# 880########################################################################################## 881<locale en> 882# Basic sanity check for #10326 (some text from http://www.unicode.org/udhr/d/udhr_mya.txt) 883<line> 884<data>•လူ•တိုင်း•သည် •တူညီ •လွတ်လပ်•သော •ဂုဏ်•သိ•က္•ခါ•ဖြ•င့် •လည်းကောင်း၊ •</data> 885<data>•တူညီ•လွတ်လပ်•သော •အ•ခွ•င့်•အရေး•များ•ဖြ•င့် •လည်းကောင်း၊ •မွေး•ဖွား•လာ•သူများ •ဖြစ်သည်။•</data> 886<data>•ထို•သူ•တို့၌ •ပိုင်းခြား •ဝေဖန်•တတ်•သော •ဉာဏ်•နှ•င့် •ကျ•င့်•ဝတ် •သိတတ်•သော •စိတ်•တို့•ရှိ•ကြ၍ •</data> 887<data>•ထို•သူ•တို့သည် •အချင်းချင်း •မေတ္တာ•ထား၍ •ဆက်ဆံ•ကျ•င့်•သုံး•</data> 888 
889########################################################################################## 890# 891# Khmer Tests 892# 893########################################################################################## 894 895# Test data originally from http://bugs.icu-project.org/trac/search?q=r30327 896# from the file testdata/wordsegments.txt 897<locale en> 898<word> 899 900<data>•តើ<200>លោក<200>មក<200>ពី<200>ប្រទេស<200>ណា<200></data> 901<data>•សណ្ដូក<200>ក<200>បណ្ដែត<200>ខ្លួន<200></data> 902<data>•ពណ៌ស<200>ម្ដេច<200>ថា<200>ខ្មៅ<200></data> 903#ប្រយោគ|ពី|របៀប|រួបរួម|និង|ភាព|ផ្សេងគ្នា|ដែល|អាច|ចូល<200></data> 904<data>•ប្រយោគ<200>ពី<200>របៀប<200>ដែល<200>និង<200>ភាព<200>ផ្សេងគ្នា<200>ដែល<200>អាច<200>ចូល<200></data> 905#ប្រយោគ|ពី|របៀប|ជា|មួយ|និង|ភាព|ផ្សេងគ្នា|ដែល|អាច|ចូល<200></data> 906<data>•សូម<200>ចំណាយពេល<200>បន្តិច<200>ដើម្បី<200>អធិស្ឋាន<200>អរព្រះគុណ<200>ដល់<200>ព្រះអង្គ<200></data> 907<data>•ការ<200>ថោកទាប<200>បរិប្បូណ៌<200>ដោយ<200></data> 908<data>•ប្រើប្រាស់<200>ស្អាត<200>ទាំង<200>ចិត្ត<200>សិស្ស<200>នោះ<200></data> 909<data>•បើ<200>អ្នក<200>ប្រព្រឺត្ត<200>អំពើអាក្រក់<200>មុខ<200>ជា<200>មាន<200></data> 910<data>•ប្រដាប់<200>ប្រដា<200>រ<200>រៀនសូត្រ<200>បន្ទប់<200>រៀន<200></data> 911<data>•ដើរតួ<200>មនុស្សគ<200>ឥត<200>បញ្ចេញ<200>យោបល់<200>សោះ<200>ឡើយ<200></data> 912<data>•មិន<200>អាច<200>ឲ្យ<200>យើង<200>ធ្វើ<200>កសិកម្ម<200>បាន<200>ឡើយ<200></data> 913<data>•បន្ត<200>សេចក្ត<200>ទៅទៀត<200></data> 914<data>•ក្រុម<200>ប៉ូលិស<200>បណ្តាក់<200>គ្នា<200></data> 915<data>•គ្មាន<200>សុខ<200>សំរាន្ត<200>ដង<200>ណា<200></data> 916<data>•បាន<200>សុខភាព<200>បរិប្បូណ៌<200></data> 917<data>•ជា<200>មេចោរ<200>ខ្ញុំ<200>នឹង<200>ស្លាប់<200>ទៅវិញ<200>ជា<200>មេចោរ<200></data> 918<data>•ឯ<200>ការ<200>វាយ<200>ផ្ចាល<200>ដែល<200>នាំ<200></data> 919<data>•គេ<200>ដឹក<200>ទៅ<200>សំឡាប់<200></data> 920#អ្នក|ដែល|ជា|មន្ត្រី|ធំ|លើ|គាត់|ទេ<200></data> 921<data>•យក<200>ទៅ<200>សម្លាប់ចោល<200>ស្ងាត់<200></data> 922<data>•ត្រូវ<200>បាន<200>គេ<200>សម្លាប់<200></data> 
923<data>•នៅក្នុង<200>ស្រុក<200>ខ្ល<200>ងហ្ស៊ុន<200></data> 924 925 926# 927# Jitterbug 3671 Test Case 928# 929<data>•สวัสดี<200>ครับ<200>สบาย<200>ดี<200>ไหม<200> •ครับ<200></data> 930 931# 932# Trac ticket 5595 Test Case 933<data>•บท<200>ที่๑พายุ<200>ไซโคลน<200>โด<200>โรธี<200>อาศัย<200>อยู่<200>ท่ามกลาง<200>\ 934ทุ่งใหญ่<200>ใน<200>แคนซัส<200>กับ<200>ลุง<200>เฮ<200>นรี<200>ชาวไร่<200>และ<200>ป้า<200>เอ็ม<200>\ 935ภรรยา<200>ชาวไร่<200>บ้าน<200>ของ<200>พวก<200>เขา<200>หลัง<200>เล็ก<200>เพราะ<200>ไม้<200>\ 936สร้าง<200>บ้าน<200>ต้อง<200>ขน<200>มา<200>ด้วย<200>เกวียน<200>เป็น<200>ระยะ<200>ทาง<200>หลาย<200>\ 937ไมล์<200></data> 938 939#################################################################################### 940# 941# Tailored (locale specific) breaking. 942# 943#################################################################################### 944 945# Japanese line break tailoring test 946 947<locale ja> 948<line> 949<data>•\u3041•\u3043•\u3045•\u31f1•</data> 950<locale en> 951<line> 952<data>•\u3041\u3043\u3045\u31f1•</data> 953 954# The following data was originally in RBBITest::TestJapaneseWordBreak() 955<locale ja> 956<word> 957<data>•\u4ECA\u65E5<400>\u306F<400>\u3044\u3044<400>\u5929\u6C17<400>\u3067\u3059<400>\u306D<400>\u3002•\u000D\u000A•</data> 958 959# UBreakIteratorType UBRK_WORD, Locale "ja" 960# Don't break in runs of hiragana or runs of ideograph, where the latter includes \u3005 \u3007 \u303B (cldrbug #2009). 
961# \u79C1\u9054\u306B\u4E00\u3007\u3007\u3007\u306E\u30B3\u30F3\u30D4\u30E5\u30FC\u30BF\u304C\u3042\u308B\u3002\u5948\u3005\u306F\u30EF\u30FC\u30C9\u3067\u3042\u308B\u3002 962# modified to work with dbbi code - should verify 963 964<locale ja> 965<word> 966<data>•私<400>達<400>に<400>一<400>〇<400>〇〇<400>の<400>コンピュータ<400>が<400>ある<400>。<0>奈々<400>は<400>ワード<400>で<400>ある<400>。•</data> 967 968# Test for #10176 (in ja) 969<line> 970<data>•abc/•s •def•</data> 971<data>•abc/\u05D9 •def•</data> 972<data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data> 973<data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data> 974 975 976<locale root> 977<word> 978<data>•私<400>達<400>に<400>一<400>〇<400>〇〇<400>の<400>コンピュータ<400>が<400>ある<400>。<0>奈々<400>は<400>ワード<400>で<400>ある<400>。•</data> 979# The following test is for #10300 980<data>•例えば<400>オーストラリア<400>。•</data> 981# The following test is for #10571 982<data>•一部<400>の<400>地域<400>では<400>、<0>ブラジル<400>、<0>インドネシア<400>、<0>オーストリア<400>、<0>ニュージーランド<400>で<400>ある<400>。•</data> 983 984# UBreakIteratorType UBRK_SENTENCE, Locale "el" 985# Add break after Greek question mark (cldrbug #2069). 986# "\u0391\u03B2, \u03B3\u03B4; \u0395 \u03B6\u03B7\u037E \u0398 \u03B9\u03BA. " 987# "\u039B\u03BC \u03BD\u03BE! \u039F\u03C0, \u03A1\u03C2? \u03A3" 988# which is "Αβ, γδ; Ε ζη; Θ ικ. Λμ νξ! Οπ, Ρς? Σ" 989 990<locale root> 991<sent> 992<data>•Αβ, γδ; Ε ζη; Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data> 993 994<locale el> 995<sent> 996<data>•Αβ, γδ; •Ε ζη; •Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data> 997 998# UBreakIteratorType UBRK_WORD, Locale "en_US_POSIX" 999# Words don't include colon or period (cldrbug #1969). 

<locale en_US>
<word>
<data>•Can't<200> •have<200> •breaks<200> •in<200> •xx:yy<200> •or<200> •struct.field<200> \
•for<200> •CS<200>-•types<200>.•</data>
<data>•\uFF92\uFF76\uFF9E<400> •</data>

<locale en_US_POSIX>
<word>
<data>•Can't<200> •have<200> •breaks<200> •in<200> •xx<200>:•yy<200> •or<200> •struct<200>.•field<200> \
•for<200> •CS<200>-•types<200>.•</data>
<data>•\u06c9<200>\uc799\ufffa•</data>
<data>•\uFF92\uFF76\uFF9E<400> •</data>


# UBreakIteratorType UBRK_CHARACTER, Locale "th"
# Clusters should not include spacing Thai/Lao vowels (prefix or postfix), except for [SARA] AM (cldrbug #2161).
# Update: As of Unicode 6.1 root has same behavior as th for this.
#
# "\u0E01\u0E23\u0E30\u0E17\u0E48\u0E2D\u0E21\u0E23\u0E08\u0E19\u0E32 "
# "(\u0E2A\u0E38\u0E0A\u0E32\u0E15\u0E34-\u0E08\u0E38\u0E11\u0E32\u0E21\u0E32\u0E28) "
# "\u0E40\u0E14\u0E47\u0E01\u0E21\u0E35\u0E1B\u0E31\u0E0D\u0E2B\u0E32 "
# which is "กระท่อมรจนา (สุชาติ-จุฑามาศ) เด็กมีปัญหา "

<locale th>
<char>
<data>•\u0E01•\u0E23•\u0E30•\u0E17\u0E48•\u0E2D•\u0E21•\u0E23•\u0E08•\u0E19•\u0E32• •\
(•\u0E2A\u0E38•\u0E0A•\u0E32•\u0E15\u0E34•-•\u0E08\u0E38•\u0E11•\u0E32•\u0E21•\u0E32•\u0E28•)• •\
\u0E40•\u0E14\u0E47•\u0E01•\u0E21\u0E35•\u0E1B\u0E31•\u0E0D•\u0E2B•\u0E32• •</data>

# Finnish line breaking
#
# These rules deal with hyphens when there is a space on the leading side.
# There should be a break opportunity between the space and the hyphen, and not after the hyphen.
# See CLDR ticket 3029.
# See ICU ticket 8151

<locale root>
<line>
<data>•abc •- •def •abc •-•def •abc- •def •abc-•def•</data> # With ASCII hyphen
<data>•abc •‐ •def •abc •‐•def •abc‐ •def •abc‐•def•</data> # With Unicode u2010 hyphen

<locale fi>
<line>
# TODO: problems with Finnish line break rules cause these two lines to fail.
#<data>•abc •- •def •abc •-def •abc- •def •abc-•def•</data> # With ASCII hyphen
#<data>•abc •‐ •def •abc •‐def •abc‐ •def •abc‐•def•</data> # With Unicode u2010 hyphen

<data>•abc •- •def •abc •-def •abc- •def •</data> # With ASCII hyphen
<data>•abc •‐ •def •abc •‐def •abc‐ •def •</data> # With Unicode u2010 hyphen

# Test for #10176 (in fi)
<line>
<data>•abc/•s •def•</data>
<data>•abc/\u05D9 •def•</data>
<data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data>
<data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data>

####################################################################################
#
# Test CSS line break variants: strict, normal, loose
#
####################################################################################

<locale ja@lb=strict>
<line>
# •no brk before 3063 •no brk before 301C•no brk btw 2026 •no brk before FF01•
<data>•\u3084\u3063•\u3071•\u308A\u0020•\u0031\u301C\u0020•\u2026\u2026\u0020•\u30A2\uFF01\u0020•</data>

<locale ja@lb=normal>
<line>
# •brk OK before 3063 •brk OK before 301C •no brk btw 2026 •no brk before FF01•
<data>•\u3084•\u3063•\u3071•\u308A\u0020•\u0031•\u301C\u0020•\u2026\u2026\u0020•\u30A2\uFF01\u0020•</data>

<locale ja@lb=loose>
<line>
# •brk OK before 3063 •brk OK before 301C •brk OK btw 2026 •brk OK before FF01•
<data>•\u3084•\u3063•\u3071•\u308A\u0020•\u0031•\u301C\u0020•\u2026•\u2026\u0020•\u30A2•\uFF01\u0020•</data>

<locale en@lb=strict>
<line>
# •no brk before 3063 •no brk before 301C•no brk btw 2026 •no brk before FF01•
<data>•\u3084\u3063•\u3071•\u308A\u0020•\u0031\u301C\u0020•\u2026\u2026\u0020•\u30A2\uFF01\u0020•</data>

<locale en@lb=normal>
<line>
# •brk OK before 3063 •no brk before 301C •no brk btw 2026 •no brk before FF01•
<data>•\u3084•\u3063•\u3071•\u308A\u0020•\u0031\u301C\u0020•\u2026\u2026\u0020•\u30A2\uFF01\u0020•</data>

<locale en@lb=loose>
<line>
# •brk OK before 3063 •no brk before 301C •brk OK btw 2026 •no brk before FF01•
<data>•\u3084•\u3063•\u3071•\u308A\u0020•\u0031\u301C\u0020•\u2026•\u2026\u0020•\u30A2\uFF01\u0020•</data>