unicode.rst - OpenGrok cross reference for /external/python/cpython2/Doc/howto/unicode.rst

Lines Matching full:unicode
2   Unicode HOWTO
7 This HOWTO discusses Python 2.x's support for Unicode, and explains
9 with Unicode.  For the Python 3 version, see
10 <https://docs.python.org/3/howto/unicode.html>.
12 Introduction to Unicode
58 Unicode standardization effort began.
60 Unicode started out using 16-bit characters instead of 8-bit characters.  16
63 goal was to have Unicode contain the alphabets for every single human language.
65 Unicode specification uses a wider range of codes, 0--1,114,111 (0x10ffff in
68 There's a related ISO standard, ISO 10646.  Unicode and ISO 10646 were
70 revision of Unicode.
72 (This discussion of Unicode's history is highly simplified.  I don't think the
74 the Unicode consortium site listed in the References for more information.)
88 The Unicode standard describes how characters are represented by **code
91 character with value 0x12ca (4810 decimal).  The Unicode standard contains a lot
117 To summarize the previous section: a Unicode string is a sequence of code
120 for translating a Unicode string into a sequence of bytes are called an
153 Encodings don't have to handle every possible Unicode character, and most
155 encoding.  The rules for converting a Unicode string into the ASCII encoding are
161 2. If the code point is 128 or greater, the Unicode string can't be represented
165 Latin-1, also known as ISO-8859-1, is a similar encoding.  Unicode code points
177 UTF-8 is one of the most commonly used encodings.  UTF stands for "Unicode
190 1. It can handle any Unicode code point.
191 2. A Unicode string is turned into a string of bytes containing no embedded zero
207 The Unicode Consortium site at <http://www.unicode.org> has character charts, a
208 glossary, and PDF versions of the Unicode specification.  Be prepared for some
209 difficult reading.  <http://www.unicode.org/history/> is a chronology of the
210 origin and development of Unicode.
213 to reading the Unicode character tables, available at
214 <https://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
217 <http://www.joelonsoftware.com/articles/Unicode.html>.
221 .. Jason Orendorff XXX http://www.jorendorff.com/articles/unicode/ is broken
228 Python 2.x's Unicode Support
231 Now that you've learned the rudiments of Unicode, we can look at Python's
232 Unicode features.
235 The Unicode Type
238 Unicode strings are expressed as instances of the :class:`unicode` type, one of
242 basestring)``.  Under the hood, Python represents Unicode strings as either 16-
245 The :func:`unicode` constructor has the signature ``unicode(string[, encoding,
247 is converted to Unicode using the specified encoding; if you leave off the
251     >>> unicode('abcdef')
253     >>> s = unicode('abcdef')
255     <type 'unicode'>
256     >>> unicode('abcdef' + chr(255))    #doctest: +NORMALIZE_WHITESPACE
266 Unicode result).  The following examples show the differences::
268     >>> unicode('\x80abc', errors='strict')     #doctest: +NORMALIZE_WHITESPACE
273     >>> unicode('\x80abc', errors='replace')
275     >>> unicode('\x80abc', errors='ignore')
284 One-character Unicode strings can also be created with the :func:`unichr`
285 built-in function, which takes integers and returns a Unicode string of length 1
287 built-in :func:`ord` function that takes a one-character Unicode string and
295 Instances of the :class:`unicode` type have many of the same methods as the
310 Note that the arguments to these methods can be Unicode strings or 8-bit
311 strings.  8-bit strings will be converted to Unicode before carrying out the
323 Much Python code that operates on strings will therefore work with Unicode
325 more updating for Unicode; more on this later.)
328 returns an 8-bit string version of the Unicode string, encoded in the requested
330 ``unicode()`` constructor, with one additional possibility; as well as 'strict',
373 Unicode Literals in Python Source Code
376 In Python source code, Unicode literals are written as strings prefixed with the
382 Unicode literals can also use the same escape sequences as 8-bit strings,
390     ... #          ^^^^^^ four-digit Unicode escape
391     ... #                      ^^^^^^^^^^ eight-digit Unicode escape
407 Python supports writing Unicode literals in any encoding, but you have to
449 Unicode Properties
452 The Unicode specification includes a database of information about code points.
454 name, its category, the numeric value if applicable (Unicode has characters
488 <http://www.unicode.org/reports/tr44/#General_Category_Values> for a
494 The Unicode and 8-bit string types are described in the Python library reference
502 Unicode".  A PDF version of his slides is available at
503 <https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>, and is an
504 excellent overview of the design of Python's Unicode features.
507 Reading and Writing Unicode Data
510 Once you've written some code that works with Unicode data, the next problem is
511 input/output.  How do you get Unicode strings into your program, and how do you
512 convert Unicode into a form suitable for storage or transmission?
516 your application support Unicode natively.  XML parsers often return Unicode
517 data, for example.  Many relational databases also support Unicode-valued
518 columns and can return Unicode values from an SQL query.
520 Unicode data is usually converted to a particular encoding before it gets
523 ``unicode(str, encoding)``.  However, the manual approach is not recommended.
525 One problem is the multi-byte nature of encodings; one Unicode character can be
528 where only part of the bytes encoding a single Unicode character are read at the
533 string and its Unicode version in memory.)
539 a specified encoding and accepts Unicode parameters for methods such as
553 Reading Unicode from a file is therefore simple::
556     f = codecs.open('unicode.rst', encoding='utf-8')
569 Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
579 Unicode filenames
583 arbitrary Unicode characters.  Usually this is implemented by converting the
584 Unicode string into some encoding that varies depending on the system.  For
594 usually just provide the Unicode string as the filename, and it will be
602 Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
606 the Unicode version of filenames, or should it return 8-bit strings containing
608 provided the directory path as an 8-bit string or a Unicode string.  If you pass
609 a Unicode string as the path, filenames will be decoded using the filesystem's
610 encoding and a list of Unicode strings will be returned, while passing an 8-bit
631 the Unicode versions.
635 Tips for Writing Unicode-aware Programs
639 Unicode.
643     Software should only work with Unicode strings internally, converting to a
646 If you attempt to write processing functions that accept both Unicode and 8-bit
670 For example, let's say you have a content management system that takes a Unicode
690 The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
692 <https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
723    - [ ] Unicode introduction
730        - [ ] Unicode Python type
731            - [ ] Writing unicode literals
736                - [ ] unicode() constructor
737            - [ ] Unicode type
741            - [ ] Reading/writing Unicode data into files
743            - [ ] Unicode filenames
744        - [ ] Writing Unicode programs
745            - [ ] Do everything in Unicode