pcre2unicode.html - OpenGrok cross reference for /external/pcre/doc/html/pcre2unicode.html

Lines Matching full:are
27 There are two ways of telling PCRE2 to switch to UTF mode, where characters may
42 In UTF mode, both the pattern and any subject strings that are matched against
43 it are treated as UTF strings instead of strings of individual one-code-unit
44 characters. There are also some other changes to the way characters are
53 The Unicode properties that can be tested are a subset of those that Perl
54 supports. Currently they are limited to the general category properties such as
57 properties Any and LC (synonym L&). Full lists are given in the
61 documentation. In general, only the short names for properties are supported.
72 values have to use braced sequences. Unbraced octal code points up to \777 are
89 In UTF mode, capture group names are not restricted to ASCII, and may contain
117 and \B, because they are defined in terms of \w and \W. If you want
121 are used to determine which characters match. There are more details in the
129 Similarly, characters that match the POSIX named character classes are all
142 of Unicode properties except for characters whose code points are less than 128
145 more than two code points that are case-equivalent, and these are treated
156 characters that are all from the same Unicode script. However, because some
157 scripts are commonly used together, and because some diacritical and other
158 marks are used with multiple scripts, it is not that simple.
163 are also three special values:
168 points are greater than the Unicode maximum (U+10FFFF), which are accessible
169 only in non-UTF mode, are assigned the Unknown script.
172 "Common" is used for characters that are used with many scripts. These include
178 previous character. These are considered to take on the script of the character
182 Some Inherited characters are used with many scripts, but many of them are only
189 listed. There are also some Common characters that have a single, non-Common
194 of characters is a script run. Note, however, that there are some special cases
196 digits. These are covered in subsequent sections.
204 strings are checked using only the Script Extensions property, not the basic
216 A simple example is an Internet name such as "google.com". The letters are all
240 and Han. These three combinations are treated as special cases when checking
241 script runs and are, in effect, "virtual scripts". Thus, a script run may
254 decimal digits are visually indistinguishable from the common ASCII
264 are (by default) checked for validity on entry to the relevant functions. If an
271 In some situations, you may already know that your strings are valid, and
303 code unit of a character or to the end of the subject. If there are no
306 starting offset, or at the start of the subject if there are not that many
307 characters before the starting offset. Note that the sequences \b and \B are
313 area. The so-called "non-character" code points are not excluded because
317 Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
318 where they are used in pairs to encode code points with values greater than
319 0xFFFF. The code points that are encoded by UTF-16 pairs are available
330 values are not representable in UTF-16.
336 The following negative error codes are given for invalid UTF-8 strings:
345 bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 characters to be
364 these code points are excluded by RFC 3629.
368 A 4-byte character has a value greater than 0x10ffff; these code points are
374 code points are reserved by RFC 3629 for use with UTF-16, and so are excluded
404 The following negative error codes are given for invalid UTF-16 strings:
416 The following negative error codes are given for invalid UTF-32 strings:
454 are a few points to consider:
457 The internal boundaries are not interpreted as the beginnings or ends of lines
472 data, knowing that any matched strings that are returned are valid UTF. This
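
As a rough illustration of the surrogate-pair encoding mentioned in the hits at lines 317-319, the following Python sketch splits a code point greater than 0xFFFF into its two UTF-16 code units. The function name `to_surrogate_pair` is hypothetical and is not part of PCRE2. The arithmetic follows the standard UTF-16 encoding rules and is not taken from the PCRE2 sources.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper for illustration only; not a PCRE2 function.
    """
    # Only code points above 0xFFFF, up to the Unicode maximum, need a pair.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points in 0x10000..0x10FFFF use surrogate pairs")

    offset = code_point - 0x10000      # fits in 20 bits
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(f"{high:04X} {low:04X}")
```

Both returned values fall in the range 0xD800 to 0xDFFF, the reserved Surrogate Area that the hits at lines 317-319 refer to.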