// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen

//
//  Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
//
//  Distributed under the Boost Software License, Version 1.0. (See
//  accompanying file LICENSE_1_0.txt or copy at
//  http://www.boost.org/LICENSE_1_0.txt)
//

/*!
\page boundary_analysys Boundary analysis

- \ref boundary_analysys_basics
- \ref boundary_analysys_segments
    - \ref boundary_analysys_segments_basics
    - \ref boundary_analysys_segments_rules
    - \ref boundary_analysys_segments_search
- \ref boundary_analysys_break
    - \ref boundary_analysys_break_basics
    - \ref boundary_analysys_break_rules
    - \ref boundary_analysys_break_search


\section boundary_analysys_basics Basics

Boost.Locale provides a boundary analysis tool that lets you split text into characters,
words, or sentences, and find appropriate places for line breaks.

\note This is not a trivial task.
\par
A Unicode code point and a character are not equivalent. For example, the
Hebrew word Shalom, "שָלוֹם", consists of 4 characters and 6 code points (4 base letters and 2 diacritical marks).
\par
In some languages, such as Japanese or Chinese, words may not be separated by space characters.

Boost.Locale provides two major classes for boundary analysis:

-   \ref boost::locale::boundary::segment_index - an object that holds an index of segments in the text (such as words, characters,
    or sentences). It provides access to \ref boost::locale::boundary::segment "segment" objects via iterators.
-   \ref boost::locale::boundary::boundary_point_index - an object that holds an index of boundary points in the text.
    It allows iteration over \ref boost::locale::boundary::boundary_point "boundary_point" objects.

Each of the classes above takes an iterator type as its template parameter.
The constructors of both classes accept:

- A flag that defines the kind of boundary analysis: \ref boost::locale::boundary::boundary_type "boundary_type".
- A pair of iterators that define the text range to be analysed.
- A locale parameter (if not given, the global one is used).

For example:
\code
namespace ba=boost::locale::boundary;
std::string text= ... ;
std::locale loc = ... ;
ba::segment_index<std::string::const_iterator> map(ba::word,text.begin(),text.end(),loc);
\endcode

Each of them provides the member functions \c begin(), \c end() and \c find(), which allow you to iterate
over the selected segments or boundaries in the text, or to find the location of the segment or
boundary for a given iterator.


Convenience typedefs like \ref boost::locale::boundary::ssegment_index "ssegment_index"
or \ref boost::locale::boundary::wcboundary_point_index "wcboundary_point_index" are provided as well.
The "w", "u16" and "u32" prefixes select the character types \c wchar_t,
\c char16_t and \c char32_t respectively, and the "s" and "c" prefixes select whether <tt>std::basic_string<CharType>::const_iterator</tt>
or <tt>CharType const *</tt> is used as the iterator type.

\section boundary_analysys_segments Iterating Over Segments
\subsection boundary_analysys_segments_basics Basic Iteration

Text segment analysis is done using the \ref boost::locale::boundary::segment_index "segment_index" class.

It provides a bidirectional iterator that returns a \ref boost::locale::boundary::segment "segment" object.
The segment object represents a pair of iterators that define the segment, together with the rule according to which it was selected.
It can be automatically converted to a \c std::basic_string object.

To perform boundary analysis, we first create an index object and then iterate over it.

For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
std::string text="To be or not to be, that is the question.";
// Create mapping of text for token iterator using the en_US.UTF-8 locale.
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
// Print all "words" -- chunks of text between word boundaries
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< * it << "\", ";
std::cout << std::endl;
\endcode

Would print:

\verbatim
"To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",
\endverbatim

The sentence "生きるか死ぬか、それが問題だ。" (<a href="http://tatoeba.org/eng/sentences/show/868189">from the Tatoeba database</a>)
would be split into the following segments in the \c ja_JP.UTF-8 (Japanese) locale:

\verbatim
"生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。",
\endverbatim

The boundary analysis done by Boost.Locale
is much more complicated than just splitting the text at
white space characters, even though it is not perfect.


\subsection boundary_analysys_segments_rules Using Rules

The selection of segments can be customized using the \ref boost::locale::boundary::segment_index::rule(rule_type) "rule()" and
\ref boost::locale::boundary::segment_index::full_select(bool) "full_select()" member functions.

By default, segment_index's iterator returns each text segment defined by two boundary points regardless
of the way they were selected. Thus, in the example above we could see text segments like "." or " "
that were selected as words.

Using the \c rule() member function we can specify a binary mask of the rules we want to use for selecting
the boundary points, using \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
and \ref bl_boundary_sentence_rules "sentence" boundary rules.

For example, calling

\code
map.rule(word_any);
\endcode

before starting the iteration process specifies a selection mask that fetches numbers, letters, Kana letters and
ideographic characters, ignoring all non-word characters like white space or punctuation marks.

So the code:

\code
using namespace boost::locale::boundary;
std::string text="To be or not to be, that is the question.";
// Create mapping of text for token iterator using the global locale
// (assumed here to be a Boost.Locale generated locale installed via std::locale::global).
ssegment_index map(word,text.begin(),text.end());
// Define a rule
map.rule(word_any);
// Print all "words" -- chunks of text between word boundaries
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< * it << "\", ";
std::cout << std::endl;
\endcode

Would print:

\verbatim
"To", "be", "or", "not", "to", "be", "that", "is", "the", "question",
\endverbatim

And for the given text="生きるか死ぬか、それが問題だ。" and rule(\ref boost::locale::boundary::word_ideo "word_ideo"), the example above would print:

\verbatim
"生", "死", "問題",
\endverbatim

You can find out which specific rules selected a segment by using the \ref boost::locale::boundary::segment::rule() "segment::rule()" member
function and testing its result against a bit-mask of rules.

For example:

\code
boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text="生きるか死ぬか、それが問題だ。";
ssegment_index map(word,text.begin(),text.end(),gen("ja_JP.UTF-8"));
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it) {
    std::cout << "Segment " << *it << " contains: ";
    if(it->rule() & word_none)
        std::cout << "white space or punctuation marks ";
    if(it->rule() & word_kana)
        std::cout << "kana characters ";
    if(it->rule() & word_ideo)
        std::cout << "ideographic characters";
    std::cout<< std::endl;
}
\endcode

Would print:

\verbatim
Segment 生 contains: ideographic characters
Segment きるか contains: kana characters
Segment 死 contains: ideographic characters
Segment ぬか contains: kana characters
Segment 、 contains: white space or punctuation marks
Segment それが contains: kana characters
Segment 問題 contains: ideographic characters
Segment だ contains: kana characters
Segment 。 contains: white space or punctuation marks
\endverbatim

One important thing to note is that each segment is defined
by a pair of boundaries, and the rule of its ending point determines
whether it is selected or not.

In some cases this may not be what we are actually looking for.

For example, suppose we have the text:

\verbatim
Hello! How
are you?
\endverbatim

And we want to fetch all sentences from the text.

The \ref bl_boundary_sentence_rules "sentence rules" have two options:

- Split the text at the point where a sentence terminator like ".!?" is detected: \ref boost::locale::boundary::sentence_term "sentence_term"
- Split the text at the point where a sentence separator like "line feed" is detected: \ref boost::locale::boundary::sentence_sep "sentence_sep"

Naturally, to ignore sentence separators we would call \ref boost::locale::boundary::segment_index::rule(rule_type v) "segment_index::rule(rule_type v)"
with the sentence_term parameter and then run the iterator.

\code
boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text=   "Hello! How\n"
                    "are you?\n";
ssegment_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));
map.rule(sentence_term);
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout << "Sentence [" << *it << "]" << std::endl;
\endcode

However, we would not get the expected segments:
\verbatim
Sentence [Hello! ]
Sentence [are you?
]
\endverbatim

The reason is that "How\n" is still considered a sentence, but one selected by a different
rule.

This behavior can be changed by setting \ref boost::locale::boundary::segment_index::full_select(bool) "segment_index::full_select(bool)"
to \c true. This forces the iterator to join the current segment with all preceding segments that do not fit the required rule.

So we add this line:

\code
map.full_select(true);
\endcode

right after "map.rule(sentence_term);" and get the expected output:

\verbatim
Sentence [Hello! ]
Sentence [How
are you?
]
\endverbatim

\subsection boundary_analysys_segments_search Locating Segments

Sometimes it is useful to find the segment that a specific iterator points to.

For example, when a user has clicked at a specific point, we may want to select the word at this
location.

\ref boost::locale::boundary::segment_index "segment_index" provides the
\ref boost::locale::boundary::segment_index::find() "find(base_iterator p)"
member function for this purpose.

This function returns an iterator to the segment that \a p points to.


For example:

\code
text="to be or ";
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
ssegment_index::iterator  p = map.find(text.begin() + 4);
if(p!=map.end())
    std::cout << *p << std::endl;
\endcode

Would print:

\verbatim
be
\endverbatim

\note

If the iterator lies inside a segment, this segment is returned. If the segment does
not fit the selection rules, then the segment following the requested position
is returned.

For example, for \ref boost::locale::boundary::word "word" boundary analysis with the \ref boost::locale::boundary::word_any "word_any" rule:

- "t|o be or " would point to "to" - the iterator is in the middle of the segment "to".
- "to |be or " would point to "be" - the iterator is at the beginning of the segment "be".
- "to| be or " would point to "be" - the iterator does not point to a segment with the required rule, so the next valid segment, "be", is selected.
- "to be or| " would point to end, as no valid segment is found.


\section boundary_analysys_break Iterating Over Boundary Points
\subsection boundary_analysys_break_basics Basic Iteration

The \ref boost::locale::boundary::boundary_point_index "boundary_point_index" is similar to
\ref boost::locale::boundary::segment_index "segment_index" in its interface but has a different role.
Instead of returning text chunks (\ref boost::locale::boundary::segment "segment"s), it returns a
\ref boost::locale::boundary::boundary_point "boundary_point" object that
represents a position in the text - a base iterator that is used for
iterating over the C++ characters of the source text.
The \ref boost::locale::boundary::boundary_point "boundary_point" object
also provides a \ref boost::locale::boundary::boundary_point::rule() "rule()" member
function that returns the rule according to which this boundary was selected.

\note The beginning and the end of the text are considered boundary points, so even
an empty text contains at least one boundary point.

Let's see an example of selecting the first two sentences from a text:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;

// our text sample
std::string const text="First sentence. Second sentence! Third one?";
// Create an index
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

// Advance over two boundary points
sboundary_point_index::iterator p = map.begin(),e=map.end();
int count = 0;
while(p!=e && count < 2) {
    ++count;
    ++p;
}

if(p!=e) {
    std::cout   << "First two sentences are: "
                << std::string(text.begin(),p->iterator())
                << std::endl;
}
else {
    std::cout   <<"There are fewer than two sentences in this "
                <<"text: " << text << std::endl;
}
\endcode

Would print:

\verbatim
First two sentences are: First sentence. Second sentence!
\endverbatim

\subsection boundary_analysys_break_rules Using Rules

Similarly to the \ref boost::locale::boundary::segment_index "segment_index", the
\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
a \ref boost::locale::boundary::boundary_point_index::rule(rule_type r) "rule(rule_type mask)"
member function to filter the boundary points that interest us.

It allows setting \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
and \ref bl_boundary_sentence_rules "sentence" rules for filtering boundary points.

Let's change the example above a little:

\code
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
\endcode

If we run our program as is on the sample above, we would get:
\verbatim
First two sentences are: First sentence. Second
\endverbatim

This is not what we really expected: "Second\n"
is considered an independent sentence because it is followed by
a line separator, "Line Feed".

However, we can set the rule \ref boost::locale::boundary::sentence_term "sentence_term"
so that the iterator uses only boundary points that are created
by sentence terminators like ".!?".

So by adding:
\code
map.rule(sentence_term);
\endcode

right after the creation of the index, we would get the desired output:

\verbatim
First two sentences are: First sentence. Second
sentence!
\endverbatim

You can also use the \ref boost::locale::boundary::boundary_point::rule() "boundary_point::rule()" member
function to learn why a boundary point was created, by comparing it with an appropriate
mask.

For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

for(sboundary_point_index::iterator p = map.begin(),e=map.end();p!=e;++p) {
    if(p->rule() & sentence_term)
        std::cout << "There is a sentence terminator: ";
    else if(p->rule() & sentence_sep)
        std::cout << "There is a sentence separator: ";
    if(p->rule()!=0) // print if some rule exists
        std::cout   << "[" << std::string(text.begin(),p->iterator())
                    << "|" << std::string(p->iterator(),text.end())
                    << "]\n";
}
\endcode

Would give the following output:
\verbatim
There is a sentence terminator: [First sentence. |Second
sentence! Third one?]
There is a sentence separator: [First sentence. Second
|sentence! Third one?]
There is a sentence terminator: [First sentence. Second
sentence! |Third one?]
There is a sentence terminator: [First sentence. Second
sentence! Third one?|]
\endverbatim

\subsection boundary_analysys_break_search Locating Boundary Points

Sometimes it is useful to find a specific boundary point for a given
iterator.

\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
an \ref boost::locale::boundary::boundary_point_index::find() "iterator find(base_iterator p)" member
function.

It returns an iterator to the boundary point at \a p's location, or at the
location following it if \a p does not point to an appropriate position.

For example, for word boundary analysis:

- If a base iterator points to "to |be", then the returned boundary point would be "to |be" (same position)
- If a base iterator points to "t|o be", then the returned boundary point would be "to| be" (next valid position)

For example, if we want to select 6 words around a specific boundary point, we can use the following code:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "To be or not to be, that is the question.";

// Create a mapping
sboundary_point_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
// Ignore white space
map.rule(word_any);

// define our arbitrary point
std::string::const_iterator pos = text.begin() + 12; // "not| to"

// Get the search range
sboundary_point_index::iterator
    begin =map.begin(),
    end = map.end(),
    it = map.find(pos); // find a boundary

// go 3 words backward
for(int count = 0;count <3 && it!=begin; count ++)
    --it;

// Save the start
std::string::const_iterator start = *it;

// go 6 words forward
for(int count = 0;count < 6 && it!=end; count ++)
    ++it;

// make sure we are at a valid position
if(it==end)
    --it;

// print the text
std::cout << std::string(start,it->iterator()) << std::endl;
\endcode

That would print:

\verbatim
 be or not to be, that
\endverbatim


*/
