• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1#!/bin/sh
2
3# vim: indentexpr= nosmartindent autoindent
4# vim: tabstop=2 shiftwidth=2 softtabstop=2
5
6# This is a regex that I reverse engineered from the sentence boundary chain
7# rules in UAX #29. Unlike the grapheme regex, which is essentially provided
8# for us in UAX #29, no such sentence regex exists.
9#
10# I looked into how ICU achieves this, since UAX #29 hints that producing
11# finite state machines for grapheme/sentence/word/line breaking is possible,
12# but only easy to do for graphemes. ICU does this by implementing their own
13# DSL for describing the break algorithms in terms of the chaining rules
14# directly. You can see an example for sentences in
15# icu4c/source/data/brkitr/rules/sent.txt. ICU then builds a finite state
16# machine from those rules in a mostly standard way, but implements the
17# "chaining" aspect of the rules by connecting overlapping end and start
18# states. For example, given SB7:
19#
20#     (Upper | Lower) ATerm x Upper
21#
22# Then the naive way to convert this into a regex would be something like
23#
24#     [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper}
25#
26# Unfortunately, this is incorrect. Why? Well, consider an example like so:
27#
28#     U.S.A.
29#
30# A correct implementation of the sentence breaking algorithm should not insert
31# any breaks here, exactly in accordance with repeatedly applying rule SB7 as
32# given above. Our regex fails to do this because it will first match `U.S`
33# without breaking them---which is correct---but will then start looking for
34# its next rule beginning with a full stop (in ATerm) and followed by an
35# uppercase letter (A). This will wind up triggering rule SB11 (without
36# matching `A`), which inserts a break.
37#
38# The reason why this happens is because our initial application of rule SB7
39# "consumes" the next uppercase letter (S), which we want to reuse as a prefix
40# in the next rule application. A natural way to express this would be with
41# look-around, although it's not clear that works in every case since you
42# ultimately might want to consume that ending uppercase letter. In any case,
43# we can't use look-around in our truly regular regexes, so we must fix this.
44# The approach we take is to explicitly repeat rules when a suffix of a rule
45# is a prefix of another rule. In the case of SB7, the end of the rule, an
46# uppercase letter, also happens to match the beginning of the rule. This can
47# in turn be repeated indefinitely. Thus, our actual translation to a regex is:
48#
49#     [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper}(\p{sb=ATerm}\p{sb=Upper}*
50#
51# It turns out that this is exactly what ICU does, but in their case, they do
52# it automatically. In our case, we connect the chaining rules manually. It's
53# tedious. With that said, we do no implement Unicode line breaking with this
54# approach, which is a far scarier beast. In that case, it would probably be
55# worth writing the code to do what ICU does.
56#
57# In the case of sentence breaks, there aren't *too* many overlaps of this
58# nature. We list them out exhaustively to make this clear, because it's
59# essentially impossible to easily observe this in the regex. (It took me a
60# full day to figure all of this out.) Rules marked with N/A mean that they
61# specify a break, and this strategy only really applies to stringing together
62# non-breaks.
63#
64#     SB1   - N/A
65#     SB2   - N/A
66#     SB3   - None
67#     SB4   - N/A
68#     SB5   - None
69#     SB6   - None
70#     SB7   - End overlaps with beginning of SB7
71#     SB8   - End overlaps with beginning of SB7
72#     SB8a  - End overlaps with beginning of SB6, SB8, SB8a, SB9, SB10, SB11
73#     SB9   - None
74#     SB10  - None
75#     SB11  - None
76#     SB998 - N/A
77#
78# SB8a is in particular quite tricky to get right without look-ahead, since it
79# allows ping-ponging between match rules SB8a and SB9-11, where SB9-11
80# otherwise indicate that a break has been found. In the regex below, we tackle
81# this by only permitting part of SB8a to match inside our core non-breaking
82# repetition. In particular, we only allow the parts of SB8a to match that
83# permit the non-breaking components to continue. If a part of SB8a matches
84# that guarantees a pop out to SB9-11, (like `STerm STerm`), then we let it
85# happen. This still isn't correct because an SContinue might be seen which
86# would allow moving back into SB998 and thus the non-breaking repetition, so
87# we handle that case as well.
88#
89# Finally, the last complication here is the sprinkling of $Ex* everywhere.
90# This essentially corresponds to the implementation of SB5 by following
91# UAX #29's recommendation in S6.2. Essentially, we use it avoid ever breaking
92# in the middle of a grapheme cluster.
93
94CR="\p{sb=CR}"
95LF="\p{sb=LF}"
96Sep="\p{sb=Sep}"
97Close="\p{sb=Close}"
98Sp="\p{sb=Sp}"
99STerm="\p{sb=STerm}"
100ATerm="\p{sb=ATerm}"
101SContinue="\p{sb=SContinue}"
102Numeric="\p{sb=Numeric}"
103Upper="\p{sb=Upper}"
104Lower="\p{sb=Lower}"
105OLetter="\p{sb=OLetter}"
106
107Ex="[\p{sb=Extend}\p{sb=Format}]"
108ParaSep="[$Sep $CR $LF]"
109SATerm="[$STerm $ATerm]"
110
111LetterSepTerm="[$OLetter $Upper $Lower $ParaSep $SATerm]"
112
113echo "(?x)
114(
115  # SB6
116  $ATerm $Ex*
117    $Numeric
118  |
119  # SB7
120  [$Upper $Lower] $Ex* $ATerm $Ex*
121    $Upper $Ex*
122    # overlap with SB7
123    ($ATerm $Ex* $Upper $Ex*)*
124  |
125  # SB8
126  $ATerm $Ex* $Close* $Ex* $Sp* $Ex*
127    ([^$LetterSepTerm] $Ex*)* $Lower $Ex*
128    # overlap with SB7
129    ($ATerm $Ex* $Upper $Ex*)*
130  |
131  # SB8a
132  $SATerm $Ex* $Close* $Ex* $Sp* $Ex*
133  (
134    $SContinue
135    |
136    $ATerm $Ex*
137      # Permit repetition of SB8a
138      (($Close $Ex*)* ($Sp $Ex*)* $SATerm)*
139      # In order to continue non-breaking matching, we now must observe
140      # a match with a rule that keeps us in SB6-8a. Otherwise, we've entered
141      # one of SB9-11 and know that a break must follow.
142      (
143        # overlap with SB6
144        $Numeric
145        |
146        # overlap with SB8
147        ($Close $Ex*)* ($Sp $Ex*)*
148          ([^$LetterSepTerm] $Ex*)* $Lower $Ex*
149          # overlap with SB7
150          ($ATerm $Ex* $Upper $Ex*)*
151        |
152        # overlap with SB8a
153        ($Close $Ex*)* ($Sp $Ex*)* $SContinue
154      )
155    |
156    $STerm $Ex*
157      # Permit repetition of SB8a
158      (($Close $Ex*)* ($Sp $Ex*)* $SATerm)*
159      # As with ATerm above, in order to continue non-breaking matching, we
160      # must now observe a match with a rule that keeps us out of SB9-11.
161      # For STerm, the only such possibility is to see an SContinue. Anything
162      # else will result in a break.
163      ($Close $Ex*)* ($Sp $Ex*)* $SContinue
164  )
165  |
166  # SB998
167  # The logic behind this catch-all is that if we get to this point and
168  # see a Sep, CR, LF, STerm or ATerm, then it has to fall into one of
169  # SB9, SB10 or SB11. In the cases of SB9-11, we always find a break since
170  # SB11 acts as a catch-all to induce a break following a SATerm that isn't
171  # handled by rules SB6-SB8a.
172  [^$ParaSep $SATerm]
173)*
174# The following collapses rules SB3, SB4, part of SB8a, SB9, SB10 and SB11.
175($SATerm $Ex* ($Close $Ex*)* ($Sp $Ex*)*)* ($CR $LF | $ParaSep)?
176"
177