unicode.rst - OpenGrok cross reference for /third_party/python/Doc/howto/unicode.rst

Lines Matching full:unicode
4   Unicode HOWTO
9 This HOWTO discusses Python's support for the Unicode specification
11 people commonly encounter when trying to work with Unicode.
14 Introduction to Unicode
26 Python's string type uses the Unicode Standard for representing
30 Unicode (https://www.unicode.org/) is a specification that aims to
32 its own unique code.  The Unicode specifications are continually
42 The Unicode standard describes how characters are represented by
45 `actual number assigned <https://www.unicode.org/versions/latest/#Summary>`_
50 The Unicode standard contains a lot of tables listing characters and
89 To summarize the previous section: a Unicode string is a sequence of
93 to 8-bit bytes.  The rules for translating a Unicode string into a
127 defaults to using it.  UTF stands for "Unicode Transformation Format",
138 1. It can handle any Unicode code point.
139 2. A Unicode string is turned into a sequence of bytes that contains embedded
160 The `Unicode Consortium site <https://www.unicode.org>`_ has character charts, a
161 glossary, and PDF versions of the Unicode specification.  Be prepared for some
162 difficult reading.  `A chronology <https://www.unicode.org/history/>`_ of the
163 origin and development of Unicode is also available on the site.
166 `discusses the history of Unicode and UTF-8 <https://www.youtube.com/watch?v=MijmeoH9LT4>`_
170 guide <https://jkorpela.fi/unicode/guide.html>`_ to reading the
171 Unicode character tables.
173 …e-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-set…
183 Python's Unicode Support
186 Now that you've learned the rudiments of Unicode, we can look at Python's
187 Unicode features.
192 Since Python 3.0, the language's :class:`str` type contains Unicode
193 characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
194 rocks!'``, or the triple-quoted string syntax is stored as Unicode.
197 include a Unicode character in a string literal::
206 Side note: Python 3 also supports using Unicode characters in identifiers::
232 character out of the Unicode result), or ``'backslashreplace'`` (inserts a
254 One-character Unicode strings can also be created with the :func:`chr`
255 built-in function, which takes integers and returns a Unicode string of length 1
257 built-in :func:`ord` function that takes a one-character Unicode string and
269 which returns a :class:`bytes` representation of the Unicode string, encoded in the
309 Unicode Literals in Python Source Code
312 In Python source code, specific Unicode code points can be written using the
319     ... #         ^^^^^^ four-digit Unicode escape
320     ... #                     ^^^^^^^^^^ eight-digit Unicode escape
355 Unicode Properties
358 The Unicode specification includes a database of information about
397 `the General Category Values section of the Unicode Character Database documentation <https://www.u…
404 Unicode adds some complication to comparing strings, because the same
414 case-insensitive form following an algorithm described by the Unicode
461 The Unicode Standard also specifies how to do caseless comparisons::
480 section 3.13 of the Unicode Standard for a discussion and an example.)
483 Unicode Regular Expressions
507 Similarly, ``\w`` matches a wide variety of Unicode characters but
509 and ``\s`` will match either Unicode whitespace characters or
516 .. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" fi…
518 Some good alternative discussions of Python's Unicode support are:
521 * `Pragmatic Unicode <https://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by …
530 Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides)
531 <https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
533 2's Unicode features (where the Unicode string type is called ``unicode`` and
537 Reading and Writing Unicode Data
540 Once you've written some code that works with Unicode data, the next problem is
541 input/output.  How do you get Unicode strings into your program, and how do you
542 convert Unicode into a form suitable for storage or transmission?
546 your application support Unicode natively.  XML parsers often return Unicode
547 data, for example.  Many relational databases also support Unicode-valued
548 columns and can return Unicode values from an SQL query.
550 Unicode data is usually converted to a particular encoding before it gets
555 One problem is the multi-byte nature of encodings; one Unicode character can be
558 where only part of the bytes encoding a single Unicode character are read at the
563 string and its Unicode version in memory.)
568 that assumes the file's contents are in a specified encoding and accepts Unicode
574 Reading Unicode from a file is therefore simple::
576     with open('unicode.txt', encoding='utf-8') as f:
588 The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
603 Unicode filenames
607 that contain arbitrary Unicode characters.  Usually this is
608 implemented by converting the Unicode string into some encoding that
619 usually just provide the Unicode string as the filename, and it will be
626 Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
630 the Unicode version of filenames, or should it return bytes containing
632 provided the directory path as bytes or a Unicode string.  If you pass a
633 Unicode string as the path, filenames will be decoded using the filesystem's
634 encoding and a list of Unicode strings will be returned, while passing a byte
656 the Unicode versions.
659 Unicode with these APIs.  The bytes APIs should only be used on
664 Tips for Writing Unicode-aware Programs
668 Unicode.
672     Software should only work with Unicode strings internally, decoding the input
675 If you attempt to write processing functions that accept both Unicode and byte
741 The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
743 <https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
747 `The Guts of Unicode in Python
748 <https://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
749 is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
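
The matched lines at 232-269 and 574-634 mention ``chr``, ``ord``, the ``'backslashreplace'`` error handler, encoding a string to :class:`bytes`, opening a file with ``encoding='utf-8'``, and a directory listing that returns Unicode strings or bytes depending on the type of the path. In the full HOWTO that listing function is ``os.listdir``. The following is a minimal sketch of how those pieces fit together, assuming Python 3. The EURO SIGN code point, the temporary directory, and all variable names are illustrative choices made here and are not taken from the listing.

```python
import os
import tempfile

# chr() takes an integer code point and returns a one-character string;
# ord() performs the reverse mapping.
ch = chr(0x20AC)              # U+20AC EURO SIGN, chosen only for illustration
code_point = ord(ch)

# str.encode() returns a bytes representation in the requested encoding,
# and bytes.decode() turns it back into a str.
encoded = ch.encode("utf-8")
decoded = encoded.decode("utf-8")

# With errors='backslashreplace', an unencodable character becomes an
# escape sequence instead of raising UnicodeEncodeError.
ascii_safe = ch.encode("ascii", errors="backslashreplace")

with tempfile.TemporaryDirectory() as tmp:
    # Write and then read text, naming the encoding explicitly in both cases.
    path = os.path.join(tmp, "unicode.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(ch + "\n")
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

    # A str path yields str filenames; a bytes path yields bytes filenames.
    names_str = os.listdir(tmp)
    names_bytes = os.listdir(os.fsencode(tmp))
```

Passing the encoding explicitly to ``open`` keeps the result independent of the platform's default encoding, which is why the sketch does so for both the write and the read.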