pcre2unicode.3 - OpenGrok cross reference for /external/pcre/doc/pcre2unicode.3

Lines Matching full:are
14 There are two ways of telling PCRE2 to switch to UTF mode, where characters may
32 In UTF mode, both the pattern and any subject strings that are matched against
33 it are treated as UTF strings instead of strings of individual one-code-unit
34 characters. There are also some other changes to the way characters are
43 The Unicode properties that can be tested are a subset of those that Perl
44 supports. Currently they are limited to the general category properties such as
47 properties Any and LC (synonym L&). Full lists are given in the
55 documentation. In general, only the short names for properties are supported.
66 values have to use braced sequences. Unbraced octal code points up to \e777 are
79 In UTF mode, capture group names are not restricted to ASCII, and may contain
106 and \eB, because they are defined in terms of \ew and \eW. If you want
110 are used to determine which characters match. There are more details in the
122 Similarly, characters that match the POSIX named character classes are all
134 of Unicode properties except for characters whose code points are less than 128
137 more than two code points that are case-equivalent, and these are treated
149 characters that are all from the same Unicode script. However, because some
150 scripts are commonly used together, and because some diacritical and other
151 marks are used with multiple scripts, it is not that simple.
159 points are greater than the Unicode maximum (U+10FFFF), which are accessible
160 only in non-UTF mode, are assigned the Unknown script.
162 "Common" is used for characters that are used with many scripts. These include
167 previous character. These are considered to take on the script of the character
170 Some Inherited characters are used with many scripts, but many of them are only
177 listed. There are also some Common characters that have a single, non-Common
181 of characters is a script run. Note, however, that there are some special cases
183 digits. These are covered in subsequent sections.
191 strings are checked using only the Script Extensions property, not the basic
201 A simple example is an Internet name such as "google.com". The letters are all
224 and Han. These three combinations are treated as special cases when checking
225 script runs and are, in effect, "virtual scripts". Thus, a script run may
238 decimal digits are visually indistinguishable from the common ASCII
254 In some situations, you may already know that your strings are valid, and
282 code unit of a character or to the end of the subject. If there are no
285 starting offset, or at the start of the subject if there are not that many
286 characters before the starting offset. Note that the sequences \eb and \eB are
291 area. The so-called "non-character" code points are not excluded because
294 Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
295 where they are used in pairs to encode code points with values greater than
296 0xFFFF. The code points that are encoded by UTF-16 pairs are available
306 values are not representable in UTF-16.
313 The following negative error codes are given for invalid UTF-8 strings:
322 bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 characters to be
341 these code points are excluded by RFC 3629.
345 A 4-byte character has a value greater than 0x10ffff; these code points are
351 code points are reserved by RFC 3629 for use with UTF-16, and so are excluded
382 The following negative error codes are given for invalid UTF-16 strings:
394 The following negative error codes are given for invalid UTF-32 strings:
431 The internal boundaries are not interpreted as the beginnings or ends of lines
443 data, knowing that any matched strings that are returned are valid UTF. This
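
The hits at source lines 294-296 say that surrogate code points are reserved for UTF-16, where they are used in pairs to encode code points greater than 0xFFFF. The arithmetic behind that can be sketched in Python as follows. This is an illustration only, not PCRE2 code; the function name `to_utf16_units` is hypothetical, and the range checks follow the general Unicode rules (maximum U+10FFFF, surrogates D800-DFFF) rather than anything specific to PCRE2's validity checker.

```python
def to_utf16_units(cp):
    """Return the list of UTF-16 code units for one Unicode code point.

    Hypothetical helper for illustration; it is not part of PCRE2.
    """
    # Code points above the Unicode maximum cannot be represented in UTF-16.
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    # Surrogates are reserved for pairing and are not valid on their own.
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are reserved for UTF-16 pairs")
    # Values up to 0xFFFF fit in a single 16-bit code unit.
    if cp <= 0xFFFF:
        return [cp]
    # Larger values are split into two 10-bit halves and stored as a
    # high surrogate followed by a low surrogate.
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


# U+0041 needs one code unit; U+1F600 needs a surrogate pair.
single = to_utf16_units(0x41)
pair = to_utf16_units(0x1F600)
```

The result can be cross-checked against Python's own encoder: `"\U0001F600".encode("utf-16-be")` yields the same two code units in big-endian byte order.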