1#!/bin/sh 2 3# vim: indentexpr= nosmartindent autoindent 4# vim: tabstop=2 shiftwidth=2 softtabstop=2 5 6# This is a regex that I reverse engineered from the sentence boundary chain 7# rules in UAX #29. Unlike the grapheme regex, which is essentially provided 8# for us in UAX #29, no such sentence regex exists. 9# 10# I looked into how ICU achieves this, since UAX #29 hints that producing 11# finite state machines for grapheme/sentence/word/line breaking is possible, 12# but only easy to do for graphemes. ICU does this by implementing their own 13# DSL for describing the break algorithms in terms of the chaining rules 14# directly. You can see an example for sentences in 15# icu4c/source/data/brkitr/rules/sent.txt. ICU then builds a finite state 16# machine from those rules in a mostly standard way, but implements the 17# "chaining" aspect of the rules by connecting overlapping end and start 18# states. For example, given SB7: 19# 20# (Upper | Lower) ATerm x Upper 21# 22# Then the naive way to convert this into a regex would be something like 23# 24# [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper} 25# 26# Unfortunately, this is incorrect. Why? Well, consider an example like so: 27# 28# U.S.A. 29# 30# A correct implementation of the sentence breaking algorithm should not insert 31# any breaks here, exactly in accordance with repeatedly applying rule SB7 as 32# given above. Our regex fails to do this because it will first match `U.S` 33# without breaking them---which is correct---but will then start looking for 34# its next rule beginning with a full stop (in ATerm) and followed by an 35# uppercase letter (A). This will wind up triggering rule SB11 (without 36# matching `A`), which inserts a break. 37# 38# The reason why this happens is because our initial application of rule SB7 39# "consumes" the next uppercase letter (S), which we want to reuse as a prefix 40# in the next rule application. A natural way to express this would be with 41# look-around, although it's not clear that works in every case since you 42# ultimately might want to consume that ending uppercase letter. In any case, 43# we can't use look-around in our truly regular regexes, so we must fix this. 44# The approach we take is to explicitly repeat rules when a suffix of a rule 45# is a prefix of another rule. In the case of SB7, the end of the rule, an 46# uppercase letter, also happens to match the beginning of the rule. This can 47# in turn be repeated indefinitely. Thus, our actual translation to a regex is: 48# 49# [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper}(\p{sb=ATerm}\p{sb=Upper}* 50# 51# It turns out that this is exactly what ICU does, but in their case, they do 52# it automatically. In our case, we connect the chaining rules manually. It's 53# tedious. With that said, we do no implement Unicode line breaking with this 54# approach, which is a far scarier beast. In that case, it would probably be 55# worth writing the code to do what ICU does. 56# 57# In the case of sentence breaks, there aren't *too* many overlaps of this 58# nature. We list them out exhaustively to make this clear, because it's 59# essentially impossible to easily observe this in the regex. (It took me a 60# full day to figure all of this out.) Rules marked with N/A mean that they 61# specify a break, and this strategy only really applies to stringing together 62# non-breaks. 63# 64# SB1 - N/A 65# SB2 - N/A 66# SB3 - None 67# SB4 - N/A 68# SB5 - None 69# SB6 - None 70# SB7 - End overlaps with beginning of SB7 71# SB8 - End overlaps with beginning of SB7 72# SB8a - End overlaps with beginning of SB6, SB8, SB8a, SB9, SB10, SB11 73# SB9 - None 74# SB10 - None 75# SB11 - None 76# SB998 - N/A 77# 78# SB8a is in particular quite tricky to get right without look-ahead, since it 79# allows ping-ponging between match rules SB8a and SB9-11, where SB9-11 80# otherwise indicate that a break has been found. In the regex below, we tackle 81# this by only permitting part of SB8a to match inside our core non-breaking 82# repetition. In particular, we only allow the parts of SB8a to match that 83# permit the non-breaking components to continue. If a part of SB8a matches 84# that guarantees a pop out to SB9-11, (like `STerm STerm`), then we let it 85# happen. This still isn't correct because an SContinue might be seen which 86# would allow moving back into SB998 and thus the non-breaking repetition, so 87# we handle that case as well. 88# 89# Finally, the last complication here is the sprinkling of $Ex* everywhere. 90# This essentially corresponds to the implementation of SB5 by following 91# UAX #29's recommendation in S6.2. Essentially, we use it avoid ever breaking 92# in the middle of a grapheme cluster. 93 94CR="\p{sb=CR}" 95LF="\p{sb=LF}" 96Sep="\p{sb=Sep}" 97Close="\p{sb=Close}" 98Sp="\p{sb=Sp}" 99STerm="\p{sb=STerm}" 100ATerm="\p{sb=ATerm}" 101SContinue="\p{sb=SContinue}" 102Numeric="\p{sb=Numeric}" 103Upper="\p{sb=Upper}" 104Lower="\p{sb=Lower}" 105OLetter="\p{sb=OLetter}" 106 107Ex="[\p{sb=Extend}\p{sb=Format}]" 108ParaSep="[$Sep $CR $LF]" 109SATerm="[$STerm $ATerm]" 110 111LetterSepTerm="[$OLetter $Upper $Lower $ParaSep $SATerm]" 112 113echo "(?x) 114( 115 # SB6 116 $ATerm $Ex* 117 $Numeric 118 | 119 # SB7 120 [$Upper $Lower] $Ex* $ATerm $Ex* 121 $Upper $Ex* 122 # overlap with SB7 123 ($ATerm $Ex* $Upper $Ex*)* 124 | 125 # SB8 126 $ATerm $Ex* $Close* $Ex* $Sp* $Ex* 127 ([^$LetterSepTerm] $Ex*)* $Lower $Ex* 128 # overlap with SB7 129 ($ATerm $Ex* $Upper $Ex*)* 130 | 131 # SB8a 132 $SATerm $Ex* $Close* $Ex* $Sp* $Ex* 133 ( 134 $SContinue 135 | 136 $ATerm $Ex* 137 # Permit repetition of SB8a 138 (($Close $Ex*)* ($Sp $Ex*)* $SATerm)* 139 # In order to continue non-breaking matching, we now must observe 140 # a match with a rule that keeps us in SB6-8a. Otherwise, we've entered 141 # one of SB9-11 and know that a break must follow. 142 ( 143 # overlap with SB6 144 $Numeric 145 | 146 # overlap with SB8 147 ($Close $Ex*)* ($Sp $Ex*)* 148 ([^$LetterSepTerm] $Ex*)* $Lower $Ex* 149 # overlap with SB7 150 ($ATerm $Ex* $Upper $Ex*)* 151 | 152 # overlap with SB8a 153 ($Close $Ex*)* ($Sp $Ex*)* $SContinue 154 ) 155 | 156 $STerm $Ex* 157 # Permit repetition of SB8a 158 (($Close $Ex*)* ($Sp $Ex*)* $SATerm)* 159 # As with ATerm above, in order to continue non-breaking matching, we 160 # must now observe a match with a rule that keeps us out of SB9-11. 161 # For STerm, the only such possibility is to see an SContinue. Anything 162 # else will result in a break. 163 ($Close $Ex*)* ($Sp $Ex*)* $SContinue 164 ) 165 | 166 # SB998 167 # The logic behind this catch-all is that if we get to this point and 168 # see a Sep, CR, LF, STerm or ATerm, then it has to fall into one of 169 # SB9, SB10 or SB11. In the cases of SB9-11, we always find a break since 170 # SB11 acts as a catch-all to induce a break following a SATerm that isn't 171 # handled by rules SB6-SB8a. 172 [^$ParaSep $SATerm] 173)* 174# The following collapses rules SB3, SB4, part of SB8a, SB9, SB10 and SB11. 175($SATerm $Ex* ($Close $Ex*)* ($Sp $Ex*)*)* ($CR $LF | $ParaSep)? 176" 177