pcre2unicode.3 - OpenGrok cross reference for /third_party/pcre2/pcre2/doc/pcre2unicode.3

3 PCRE - Perl-compatible regular expressions (revised API)
10 strings of text in UTF-8, UTF-16, and UTF-32 format (depending on the code unit
33 it are treated as UTF strings instead of strings of individual one-code-unit
46 Any and LC (synonym L&), the Unicode script names such as Arabic or Han,
73 in non-UTF mode.
85 but its use can lead to some strange effects because it breaks up multi-unit
90 documentation). For this reason, there is a build-time option that disables
91 support for \eC completely. There is also a less draconian compile-time option
95 \fBpcre2_dfa_match()\fP when in UTF-8 or UTF-16 mode, that is, when a character
97 a match-time error. Also, the JIT optimization does not support \eC in these
98 modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
105 non-UTF mode, all with code points less than 256. This remains true even when
125 all low-valued characters unless the PCRE2_UCP option is set, but there is an
133 .SH "UNICODE CASE-EQUIVALENCE"
138 and that have at most two case-equivalent values. For these, a direct table
140 more than two code points that are case-equivalent, and these are treated
141 specially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
142 processing for non-UTF character encodings such as UCS-2.
145 case equivalents, have a non-ASCII one as well (long S and Kelvin sign).
146 Recognition of these non-ASCII characters as case-equivalent to their ASCII
149 ASCII or non-ASCII; there can be no mixing.
153 .SH "SCRIPT RUNS"
158 parentheses is a script run. In concept, a script run is a sequence of
159 characters that are all from the same Unicode script. However, because some
163 Every Unicode character has a Script property, mostly with a value
164 corresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
168 surrogate code points. In the PCRE2 32-bit library, characters whose code
170 only in non-UTF mode, are assigned the Unknown script.
177 previous character. These are considered to take on the script of the character
183 possible to check this, a Unicode property called Script Extension exists. Its
185 characters, the list contains just one script, the same one as the Script
186 property. However, for characters such as U+102E0 more than one Script is
187 listed. There are also some Common characters that have a single, non-Common
188 script in their Script Extension list.
191 of characters is a script run. Note, however, that there are some special cases
192 involving the Chinese Han script, and an additional constraint for decimal
196 .SS "Basic script run rules"
199 A string that is less than two characters long is a script run. This is the
200 only case in which an Unknown character can be part of a script run. Longer
201 strings are checked using only the Script Extensions property, not the basic
202 Script property.
204 If a character's Script Extension property is the single value "Inherited", it
205 is always accepted as part of a script run. This is also true for the property
207 remaining characters in a script run must have at least one script in common in
208 their Script Extension lists. In set-theoretic terminology, the intersection of
212 in the Latin script, and the dot is Common, so this string is a script run.
214 string that looks the same, but with Cyrillic "o"s is not a script run.
216 More interesting examples involve characters with more than one script in their
217 Script Extension. Consider the following characters:
222 The first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and
224 appear in script runs of either Arabic or Hanifi Rohingya. The first could also
225 appear in Syriac or Thaana script runs, but the second could not.
228 .SS "The Chinese Han script"
231 The Chinese Han script is commonly used in conjunction with other scripts for
235 script runs and are, in effect, "virtual scripts". Thus, a script run may
247 scripts (including the Common script) contain more than one set. Some of these
249 digits. In addition to the script checking described above, if a script run
283 UTF-16 and UTF-32 strings can indicate their endianness by a special code known
284 as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
287 Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
289 \fBpcre2_dfa_match()\fP calls with a non-zero starting offset, the check is
294 Otherwise, it starts at the length of the longest lookbehind before the
296 characters before the starting offset. Note that the sequences \eb and \eB are
297 one-character lookbehinds.
301 area. The so-called "non-character" code points are not excluded because
304 Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
306 0xFFFF. The code points that are encoded by UTF-16 pairs are available
307 independently in the UTF-8 and UTF-32 encodings. (In other words, the whole
308 surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
309 UTF-32.)
315 option. However, this is possible only in UTF-8 and UTF-32 modes, because these
316 values are not representable in UTF-16.
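
Why surrogate values cannot be represented in UTF-16 can be seen from the encoding itself. The following is a minimal sketch, not PCRE2 code: the function name utf16_encode is hypothetical and chosen only for illustration. It assumes the standard UTF-16 rules, in which code points below 0x10000 take one code unit and higher code points are split into a high and a low surrogate.

```c
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): encode one code point as UTF-16.
 * Writes 1 or 2 code units to out[] and returns how many were written.
 * Returns 0 when the value has no UTF-16 representation. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* 0xd800..0xdfff are the surrogate code units themselves, so a single
     * unit with such a value would be read as half of a pair. */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;

    /* Values above 0x10ffff are outside the Unicode range. */
    if (cp > 0x10FFFF) return 0;

    /* Everything else below 0x10000 is a single code unit. */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Supplementary code points: 20 bits split across two surrogates. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
    return 2;
}
```

In this sketch, U+1F600 yields the pair 0xd83d 0xde00, whereas a request to encode 0xd800 is rejected. UTF-8 and UTF-32 have no such collision, which is consistent with the statement above that surrogate values are possible only in those modes.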
320 .SS "Errors in UTF-8 strings"
323 The following negative error codes are given for invalid UTF-8 strings:
331 The string ends with a truncated UTF-8 character; the code specifies how many
332 bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 characters to be
355 A 4-byte character has a value greater than 0x10ffff; these code points are
360 A 3-byte character has a value in the range 0xd800 to 0xdfff; this range of
361 code points is reserved by RFC 3629 for use with UTF-16, and so is excluded
362 from UTF-8.
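
The surrogate check for a 3-byte sequence can be illustrated as follows. This is a sketch under stated assumptions, not the validation code PCRE2 uses: the function names decode_3byte and is_surrogate_3byte are hypothetical, and the caller is assumed to supply exactly three bytes.

```c
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): decode a 3-byte UTF-8 sequence.
 * Returns the code point, or -1 if the bytes do not have the shape
 * 1110xxxx 10xxxxxx 10xxxxxx. Overlong forms are not rejected here. */
static int32_t decode_3byte(uint8_t b0, uint8_t b1, uint8_t b2)
{
    if ((b0 & 0xF0) != 0xE0) return -1;   /* lead byte must be 1110xxxx */
    if ((b1 & 0xC0) != 0x80) return -1;   /* continuation must be 10xxxxxx */
    if ((b2 & 0xC0) != 0x80) return -1;
    return ((int32_t)(b0 & 0x0F) << 12) |
           ((int32_t)(b1 & 0x3F) << 6)  |
            (int32_t)(b2 & 0x3F);
}

/* Returns 1 if the sequence decodes to a value in 0xd800..0xdfff,
 * the range excluded from UTF-8, and 0 otherwise. */
static int is_surrogate_3byte(uint8_t b0, uint8_t b1, uint8_t b2)
{
    int32_t cp = decode_3byte(b0, b1, b2);
    return cp >= 0xD800 && cp <= 0xDFFF;
}
```

For example, the bytes 0xed 0xa0 0x80 decode to 0xd800 and are flagged, while 0xed 0x9f 0xbf (0xd7ff) and 0xee 0x80 0x80 (0xe000) lie just outside the range and are not.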
370 A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes for a
379 byte can only validly occur as the second or subsequent byte of a multi-byte
385 never occur in a valid UTF-8 string.
389 .SS "Errors in UTF-16 strings"
392 The following negative error codes are given for invalid UTF-16 strings:
401 .SS "Errors in UTF-32 strings"
404 The following negative error codes are given for invalid UTF-32 strings:
452 UTF-sequence, that sequence is skipped, and the match starts at the next valid
463 Note, however, that the 16-bit and 32-bit PCRE2 libraries process strings as
484 Copyright (c) 1997-2023 University of Cambridge.