// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen

//
// Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
//
// Distributed under the Boost Software License, Version 1.0. (See
// accompanying file LICENSE_1_0.txt or copy at
// http://www.boost.org/LICENSE_1_0.txt)
//

/*!
\page boundary_analysys Boundary analysis

- \ref boundary_analysys_basics
- \ref boundary_analysys_segments
    - \ref boundary_analysys_segments_basics
    - \ref boundary_analysys_segments_rules
    - \ref boundary_analysys_segments_search
- \ref boundary_analysys_break
    - \ref boundary_analysys_break_basics
    - \ref boundary_analysys_break_rules
    - \ref boundary_analysys_break_search


\section boundary_analysys_basics Basics

Boost.Locale provides a boundary analysis tool that allows you to split text into characters,
words, or sentences, and to find appropriate places for line breaks.

\note This task is not trivial.
\par
A Unicode code point and a character are not equivalent. For example, the
Hebrew word Shalom - "שָלוֹם" - consists of 4 characters and 6 code points (4 base letters and 2 diacritical marks).
\par
In some languages, such as Japanese or Chinese, words may not be separated by space characters.

Boost.Locale provides 2 major classes for boundary analysis:

- \ref boost::locale::boundary::segment_index - an object that holds an index of segments in the text (like words, characters,
  sentences). It provides access to \ref boost::locale::boundary::segment "segment" objects via iterators.
- \ref boost::locale::boundary::boundary_point_index - an object that holds an index of boundary points in the text.
  It allows you to iterate over the \ref boost::locale::boundary::boundary_point "boundary_point" objects.

Each of the classes above takes an iterator type as a template parameter.
Both of these classes accept in their constructor:

- A flag that defines the type of boundary analysis: \ref boost::locale::boundary::boundary_type "boundary_type".
- A pair of iterators that define the text range to be analysed.
- A locale parameter (if not given, the global one is used).

For example:
\code
namespace ba=boost::locale::boundary;
std::string text= ... ;
std::locale loc = ... ;
ba::segment_index<std::string::const_iterator> map(ba::word,text.begin(),text.end(),loc);
\endcode

Each of them provides the member functions \c begin(), \c end() and \c find() that allow you to iterate
over the selected segments or boundaries in the text, or to find the location of a segment or
boundary for a given iterator.


Convenience typedefs like \ref boost::locale::boundary::ssegment_index "ssegment_index"
or \ref boost::locale::boundary::wcboundary_point_index "wcboundary_point_index" are provided as well,
where the "w", "u16" and "u32" prefixes define the character types \c wchar_t,
\c char16_t and \c char32_t, and the "s" and "c" prefixes define whether <tt>std::basic_string<CharType>::const_iterator</tt>
or <tt>CharType const *</tt> is used.

\section boundary_analysys_segments Iterating Over Segments
\subsection boundary_analysys_segments_basics Basic Iteration

Text segment analysis is done using the \ref boost::locale::boundary::segment_index "segment_index" class.

It provides a bidirectional iterator that returns \ref boost::locale::boundary::segment "segment" objects.
A segment object represents a pair of iterators that define the segment and the rule according to which it was selected.
It can be automatically converted to a \c std::basic_string object.

To perform boundary analysis, we first create an index object and then iterate over it.

For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
std::string text="To be or not to be, that is the question.";
// Create mapping of text for token iterator using the en_US.UTF-8 locale.
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
// Print all "words" -- chunks of word boundary
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< * it << "\", ";
std::cout << std::endl;
\endcode

Would print:

\verbatim
"To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",
\endverbatim

The sentence "生きるか死ぬか、それが問題だ。" (<a href="http://tatoeba.org/eng/sentences/show/868189">from the Tatoeba database</a>)
would be split into the following segments in the \c ja_JP.UTF-8 (Japanese) locale:

\verbatim
"生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。",
\endverbatim

The boundary analysis done by Boost.Locale
is much more complicated than just splitting the text according
to white space characters, even though it is not perfect.


\subsection boundary_analysys_segments_rules Using Rules

Segment selection can be customized using the \ref boost::locale::boundary::segment_index::rule(rule_type) "rule()" and
\ref boost::locale::boundary::segment_index::full_select(bool) "full_select()" member functions.

By default, segment_index's iterator returns each text segment defined by two boundary points regardless
of the way they were selected. Thus in the example above we could see text segments like "." or " "
that were selected as words.

Using the \c rule() member function we can specify a binary mask of the rules we want to use for selecting
boundary points, using the \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
and \ref bl_boundary_sentence_rules "sentence" boundary rules.
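
As a rough, standalone illustration of what such a mask does, the sketch below filters a list of tagged
text chunks by testing each chunk's rule against a mask. It does not use Boost.Locale: the \c demo_* constants,
the \c demo_segment structure and the \c select_segments() helper are hypothetical names with made-up values,
chosen only to show the bit test.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for rule flags. The numeric values are illustrative
// assumptions and are not taken from Boost.Locale.
typedef std::uint32_t rule_type;
const rule_type demo_none   = 0x1; // white space or punctuation
const rule_type demo_number = 0x2; // the chunk looks like a number
const rule_type demo_letter = 0x4; // the chunk contains letters
const rule_type demo_any    = demo_number | demo_letter;

// A text chunk tagged with the rule of the boundary that closes it.
struct demo_segment {
    std::string text;
    rule_type rule;
};

// Keep only the chunks whose rule shares at least one bit with the mask.
std::vector<std::string> select_segments(const std::vector<demo_segment>& segments,
                                         rule_type mask)
{
    std::vector<std::string> out;
    for (std::size_t i = 0; i < segments.size(); ++i) {
        if ((segments[i].rule & mask) != 0)
            out.push_back(segments[i].text);
    }
    return out;
}
```

With the chunks "To" (tagged \c demo_letter), " " (tagged \c demo_none) and "42" (tagged \c demo_number),
calling <tt>select_segments(chunks, demo_any)</tt> keeps "To" and "42" and drops the white space chunk.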

For example, calling

\code
map.rule(word_any);
\endcode

before starting the iteration process specifies a selection mask that fetches numbers, letters, Kana letters and
ideographic characters, ignoring all non-word characters like white space or punctuation marks.

So the code:

\code
using namespace boost::locale::boundary;
std::string text="To be or not to be, that is the question.";
// Create mapping of text for token iterator using global locale.
ssegment_index map(word,text.begin(),text.end());
// Define a rule
map.rule(word_any);
// Print all "words" -- chunks of word boundary
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< * it << "\", ";
std::cout << std::endl;
\endcode

Would print:

\verbatim
"To", "be", "or", "not", "to", "be", "that", "is", "the", "question",
\endverbatim

And for the text "生きるか死ぬか、それが問題だ。" and rule(\ref boost::locale::boundary::word_ideo "word_ideo"), the example above would print:

\verbatim
"生", "死", "問題",
\endverbatim

You can find out which specific rules a segment was selected by using the \ref boost::locale::boundary::segment::rule() "segment::rule()" member
function, which returns a bit-mask of rules.

For example:

\code
boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text="生きるか死ぬか、それが問題だ。";
ssegment_index map(word,text.begin(),text.end(),gen("ja_JP.UTF-8"));
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it) {
    std::cout << "Segment " << *it << " contains: ";
    if(it->rule() & word_none)
        std::cout << "white space or punctuation marks ";
    if(it->rule() & word_kana)
        std::cout << "kana characters ";
    if(it->rule() & word_ideo)
        std::cout << "ideographic characters";
    std::cout<< std::endl;
}
\endcode

Would print:

\verbatim
Segment 生 contains: ideographic characters
Segment きるか contains: kana characters
Segment 死 contains: ideographic characters
Segment ぬか contains: kana characters
Segment 、 contains: white space or punctuation marks
Segment それが contains: kana characters
Segment 問題 contains: ideographic characters
Segment だ contains: kana characters
Segment 。 contains: white space or punctuation marks
\endverbatim

One important thing to note is that each segment is defined
by a pair of boundaries, and the rule of its ending point determines
whether it is selected or not.

In some cases this may not be what we actually want.

For example, we have the text:

\verbatim
Hello! How
are you?
\endverbatim

And we want to fetch all sentences from the text.

The \ref bl_boundary_sentence_rules "sentence rules" have two options:

- Split the text at the point where a sentence terminator like ".!?"
  is detected: \ref boost::locale::boundary::sentence_term "sentence_term"
- Split the text at the point where a sentence separator like "line feed" is detected: \ref boost::locale::boundary::sentence_sep "sentence_sep"

Naturally, to ignore sentence separators we would call \ref boost::locale::boundary::segment_index::rule(rule_type v) "segment_index::rule(rule_type v)"
with the sentence_term parameter and then run the iterator.

\code
boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text=   "Hello! How\n"
                    "are you?\n";
ssegment_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));
map.rule(sentence_term);
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout << "Sentence [" << *it << "]" << std::endl;
\endcode

However, we would not get the expected segments:
\verbatim
Sentence [Hello! ]
Sentence [are you?
]
\endverbatim

The reason is that "How\n" is still considered a sentence but is selected by a different
rule.

This behavior can be changed by setting \ref boost::locale::boundary::segment_index::full_select(bool) "segment_index::full_select(bool)"
to \c true. It forces the iterator to join the current segment with all preceding segments that do not fit the required rule.

So we add this line:

\code
map.full_select(true);
\endcode

right after "map.rule(sentence_term);" and get the expected output:

\verbatim
Sentence [Hello! ]
Sentence [How
are you?
]
\endverbatim

\subsection boundary_analysys_segments_search Locating Segments

Sometimes it is useful to find the segment that a specific iterator points to.

For example, when a user clicks at a specific point, we want to select the word at this
location.

\ref boost::locale::boundary::segment_index "segment_index" provides the
\ref boost::locale::boundary::segment_index::find() "find(base_iterator p)"
member function for this purpose.

This function returns an iterator to the segment that \a p points to.


For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
std::string text="to be or ";
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
ssegment_index::iterator p = map.find(text.begin() + 4);
if(p!=map.end())
    std::cout << *p << std::endl;
\endcode

Would print:

\verbatim
be
\endverbatim

\note

If the iterator lies inside a segment, this segment is returned. If the segment does
not fit the selection rules, then the segment following the requested position
is returned.

For example, for \ref boost::locale::boundary::word "word" boundary analysis with the \ref boost::locale::boundary::word_any "word_any" rule:

- "t|o be or ", would point to "to" - the iterator is in the middle of the segment "to".
- "to |be or ", would point to "be" - the iterator is at the beginning of the segment "be".
- "to| be or ", would point to "be" - the iterator does not point to a segment with the required rule, so the next valid segment, "be", is selected.
- "to be or| ", would point to the end, as no valid segment is found.


\section boundary_analysys_break Iterating Over Boundary Points
\subsection boundary_analysys_break_basics Basic Iteration

The \ref boost::locale::boundary::boundary_point_index "boundary_point_index" is similar to
\ref boost::locale::boundary::segment_index "segment_index" in its interface but has a different role.
Instead of returning text chunks (\ref boost::locale::boundary::segment "segment"s), it returns
\ref boost::locale::boundary::boundary_point "boundary_point" objects that
represent a position in the text - a base iterator that is used for
iterating over the C++ characters of the source text.
The \ref boost::locale::boundary::boundary_point "boundary_point" object
also provides a \ref boost::locale::boundary::boundary_point::rule() "rule()" member
function that returns the rule according to which this boundary was selected.

\note The beginning and the end of the text are considered boundary points, so even
an empty text consists of at least one boundary point.

Let's see an example of selecting the first two sentences from a text:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;

// our text sample
std::string const text="First sentence. Second sentence! Third one?";
// Create an index
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

// Count two boundary points
sboundary_point_index::iterator p = map.begin(),e=map.end();
int count = 0;
while(p!=e && count < 2) {
    ++count;
    ++p;
}

if(p!=e) {
    std::cout   << "First two sentences are: "
                << std::string(text.begin(),p->iterator())
                << std::endl;
}
else {
    std::cout   <<"There are fewer than two sentences in this "
                <<"text: " << text << std::endl;
}
\endcode

Would print:

\verbatim
First two sentences are: First sentence. Second sentence!
\endverbatim

\subsection boundary_analysys_break_rules Using Rules

Similarly to \ref boost::locale::boundary::segment_index "segment_index", the
\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
a \ref boost::locale::boundary::boundary_point_index::rule(rule_type r) "rule(rule_type mask)"
member function to filter the boundary points that interest us.

It allows you to set \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
and \ref bl_boundary_sentence_rules "sentence" rules for filtering boundary points.

Let's change the example above a little:

\code
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
\endcode

If we run our program as is on the sample above, we would get:
\verbatim
First two sentences are: First sentence. Second
\endverbatim

This is not what we really expected, because "Second\n"
is considered an independent sentence that was separated by
a line separator, "Line Feed".

However, we can set the rule \ref boost::locale::boundary::sentence_term "sentence_term"
so that the iterator uses only boundary points that are created
by sentence terminators like ".!?".

So by adding:
\code
map.rule(sentence_term);
\endcode

right after the generation of the index, we would get the desired output:

\verbatim
First two sentences are: First sentence. Second
sentence!
\endverbatim

You can also use the \ref boost::locale::boundary::boundary_point::rule() "boundary_point::rule()" member
function to learn why a boundary point was created, by comparing it with an appropriate
mask.

For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

for(sboundary_point_index::iterator p = map.begin(),e=map.end();p!=e;++p) {
    if(p->rule() & sentence_term)
        std::cout << "There is a sentence terminator: ";
    else if(p->rule() & sentence_sep)
        std::cout << "There is a sentence separator: ";
    if(p->rule()!=0) // print if some rule exists
        std::cout   << "[" << std::string(text.begin(),p->iterator())
                    << "|" << std::string(p->iterator(),text.end())
                    << "]\n";
}
\endcode

Would give the following output:
\verbatim
There is a sentence terminator: [First sentence. |Second
sentence! Third one?]
There is a sentence separator: [First sentence. Second
|sentence! Third one?]
There is a sentence terminator: [First sentence. Second
sentence! |Third one?]
There is a sentence terminator: [First sentence. Second
sentence! Third one?|]
\endverbatim

\subsection boundary_analysys_break_search Locating Boundary Points

Sometimes it is useful to find a specific boundary point according to a given
iterator.

\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
an \ref boost::locale::boundary::boundary_point_index::find() "iterator find(base_iterator p)" member
function.

It returns an iterator to the boundary point at \a p's location, or at the
location following it if \a p does not point to an appropriate position.

For example, for word boundary analysis:

- If a base iterator points to "to |be", then the returned boundary point would be "to |be" (same position)
- If a base iterator points to "t|o be", then the returned boundary point would be "to| be" (next valid position)

For example, if we want to select 6 words around a specific boundary point, we can use the following code:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "To be or not to be, that is the question.";

// Create a mapping
sboundary_point_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
// Ignore white space
map.rule(word_any);

// define our arbitrary point
std::string::const_iterator pos = text.begin() + 12; // "not| to";

// Get the search range
sboundary_point_index::iterator
    begin =map.begin(),
    end = map.end(),
    it = map.find(pos); // find a boundary

// go 3 words backward
for(int count = 0;count <3 && it!=begin; count ++)
    --it;

// Save the start
std::string::const_iterator start = *it;

// go 6 words forward
for(int count = 0;count < 6 && it!=end; count ++)
    ++it;

// make sure we are at a valid position
if(it==end)
    --it;

// print the text
std::cout << std::string(start,it->iterator()) << std::endl;
\endcode

That would print:

\verbatim
 be or not to be, that
\endverbatim


*/
